Message ID | 20250406024010.1177927-3-longman@redhat.com |
---|---|
State | New |
Series | memcg: Fix test_memcg_min/low test failures |
On Sat, Apr 05, 2025 at 10:40:10PM -0400, Waiman Long wrote:
> The test_memcg_protection() function is used for the test_memcg_min and
> test_memcg_low sub-tests. This function generates a set of parent/child
> cgroups like:
>
>   parent:  memory.min/low = 50M
>   child 0: memory.min/low = 75M,  memory.current = 50M
>   child 1: memory.min/low = 25M,  memory.current = 50M
>   child 2: memory.min/low = 0,    memory.current = 50M
>
> After applying memory pressure, the function expects the following
> actual memory usages.
>
>   parent:  memory.current ~= 50M
>   child 0: memory.current ~= 29M
>   child 1: memory.current ~= 21M
>   child 2: memory.current ~= 0
>
> In reality, the actual memory usages can differ quite a bit from the
> expected values. The test uses an error tolerance of 10% with the
> values_close() helper.
>
> Both the test_memcg_min and test_memcg_low sub-tests can fail
> sporadically because the actual memory usage exceeds the 10% error
> tolerance. Below is a sample of the usage data from the test runs
> that fail.
>
>   Child   Actual usage   Expected usage   %err
>   -----   ------------   --------------   ------
>     1       16990208        22020096      -12.9%
>     1       17252352        22020096      -12.1%
>     0       37699584        30408704      +10.7%
>     1       14368768        22020096      -21.0%
>     1       16871424        22020096      -13.2%
>
> The current 10% error tolerance might have been right at the time
> test_memcontrol.c was first introduced in the v4.18 kernel, but memory
> reclaim has certainly evolved quite a bit since then, which may result
> in a bit more run-to-run variation than previously expected.
>
> Increase the error tolerance to 15% for child 0 and 20% for child 1 to
> minimize the chance of this type of failure. The tolerance is bigger
> for child 1 because an upswing in child 0 corresponds to a smaller
> %err than a similar downswing in child 1 due to the way %err is used
> in values_close().
>
> Before this patch, 100 test runs of test_memcontrol produced the
> following results:
>
>      17 not ok 1 test_memcg_min
>      22 not ok 2 test_memcg_low
>
> After applying this patch, there were no test failures for test_memcg_min
> and test_memcg_low in 100 test runs.

Ideally we want to calculate these values dynamically based on the machine
size (number of CPUs and total memory size). We can calculate the memcg error
margin and scale the memcg sizes if necessary. It's the only way to make the
test pass on both a 2-CPU VM and a 512-CPU physical server.

Not a blocker for this patch, just an idea for the future.

Thanks!
diff --git a/tools/testing/selftests/cgroup/test_memcontrol.c b/tools/testing/selftests/cgroup/test_memcontrol.c
index bab826b6b7b0..8f4f2479650e 100644
--- a/tools/testing/selftests/cgroup/test_memcontrol.c
+++ b/tools/testing/selftests/cgroup/test_memcontrol.c
@@ -495,10 +495,10 @@ static int test_memcg_protection(const char *root, bool min)
 	for (i = 0; i < ARRAY_SIZE(children); i++)
 		c[i] = cg_read_long(children[i], "memory.current");
 
-	if (!values_close(c[0], MB(29), 10))
+	if (!values_close(c[0], MB(29), 15))
 		goto cleanup;
 
-	if (!values_close(c[1], MB(21), 10))
+	if (!values_close(c[1], MB(21), 20))
 		goto cleanup;
 
 	if (c[3] != 0)
The test_memcg_protection() function is used for the test_memcg_min and
test_memcg_low sub-tests. This function generates a set of parent/child
cgroups like:

  parent:  memory.min/low = 50M
  child 0: memory.min/low = 75M,  memory.current = 50M
  child 1: memory.min/low = 25M,  memory.current = 50M
  child 2: memory.min/low = 0,    memory.current = 50M

After applying memory pressure, the function expects the following
actual memory usages.

  parent:  memory.current ~= 50M
  child 0: memory.current ~= 29M
  child 1: memory.current ~= 21M
  child 2: memory.current ~= 0

In reality, the actual memory usages can differ quite a bit from the
expected values. The test uses an error tolerance of 10% with the
values_close() helper.

Both the test_memcg_min and test_memcg_low sub-tests can fail
sporadically because the actual memory usage exceeds the 10% error
tolerance. Below is a sample of the usage data from the test runs
that fail.

  Child   Actual usage   Expected usage   %err
  -----   ------------   --------------   ------
    1       16990208        22020096      -12.9%
    1       17252352        22020096      -12.1%
    0       37699584        30408704      +10.7%
    1       14368768        22020096      -21.0%
    1       16871424        22020096      -13.2%

The current 10% error tolerance might have been right at the time
test_memcontrol.c was first introduced in the v4.18 kernel, but memory
reclaim has certainly evolved quite a bit since then, which may result
in a bit more run-to-run variation than previously expected.

Increase the error tolerance to 15% for child 0 and 20% for child 1 to
minimize the chance of this type of failure. The tolerance is bigger
for child 1 because an upswing in child 0 corresponds to a smaller
%err than a similar downswing in child 1 due to the way %err is used
in values_close().

Before this patch, 100 test runs of test_memcontrol produced the
following results:

     17 not ok 1 test_memcg_min
     22 not ok 2 test_memcg_low

After applying this patch, there were no test failures for test_memcg_min
and test_memcg_low in 100 test runs.
Signed-off-by: Waiman Long <longman@redhat.com>
---
 tools/testing/selftests/cgroup/test_memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
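
For readers checking the %err column in the commit message, here is a minimal sketch of the arithmetic. It assumes values_close() compares the absolute difference against a percentage of the sum of the two values, which is consistent with the figures in the table. The names values_close_sketch(), pct_err() and print_sample() are hypothetical and are not part of the selftest.

```c
#include <stdio.h>
#include <stdlib.h>

/* Megabytes to bytes, in the spirit of the MB() macro used by the test. */
#define MB(x) ((long)(x) << 20)

/*
 * Assumed form of the tolerance check: the absolute difference is compared
 * with err percent of the SUM of both values, not of the expected value.
 */
int values_close_sketch(long a, long b, int err)
{
	return labs(a - b) <= (a + b) / 100 * err;
}

/* Hypothetical helper: signed %err on the same scale as the table. */
double pct_err(long actual, long expected)
{
	return 100.0 * (double)(actual - expected) / (double)(actual + expected);
}

/* Hypothetical helper: print one sample and whether it passes at tol percent. */
void print_sample(int child, long actual, long expected, int tol)
{
	printf("child %d: %+.1f%% -> %s at %d%%\n", child,
	       pct_err(actual, expected),
	       values_close_sketch(actual, expected, tol) ? "pass" : "fail", tol);
}
```

Called from a main() of your own, print_sample(1, 16990208, MB(21), 10) reports a failure at about -12.9%, and the same sample passes at 20%. Because the denominator is the sum, an equal 5M swing gives roughly 7.9% for child 0 (29M to 34M) but roughly 13.5% for child 1 (21M to 16M), which is the asymmetry the commit message describes. Under this assumed formula the -21.0% sample would still fall just outside a 20% tolerance.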