[RFC] Add mempressure cgroup

Message ID	20121128102908.GA15415@lizard
State	New
Headers	show Return-Path: <patch+caf_=linaro-patchwork=canonical.com@linaro.org> Received-SPF: neutral (google.com: 209.85.160.52 is neither permitted nor denied by best guess record for domain of anton.vorontsov@linaro.org) client-ip=209.85.160.52; Date: Wed, 28 Nov 2012 02:29:08 -0800 From: Anton Vorontsov <anton.vorontsov@linaro.org> To: David Rientjes <rientjes@google.com> Cc: Pekka Enberg <penberg@kernel.org>, Mel Gorman <mgorman@suse.de>, Glauber Costa <glommer@parallels.com>, Michal Hocko <mhocko@suse.cz>, "Kirill A. Shutemov" <kirill@shutemov.name>, Luiz Capitulino <lcapitulino@redhat.com>, Andrew Morton <akpm@linux-foundation.org>, Greg Thelen <gthelen@google.com>, Leonid Moiseichuk <leonid.moiseichuk@nokia.com>, KOSAKI Motohiro <kosaki.motohiro@gmail.com>, Minchan Kim <minchan@kernel.org>, Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>, John Stultz <john.stultz@linaro.org>, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linaro-kernel@lists.linaro.org, patches@linaro.org, kernel-team@android.com Subject: [RFC] Add mempressure cgroup Message-ID: <20121128102908.GA15415@lizard> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline User-Agent: Mutt/1.5.21 (2010-09-15)

Anton Vorontsov Nov. 28, 2012, 10:29 a.m. UTC

This is an attempt to implement David Rientjes' idea of mempressure
cgroup.

The main characteristics are the same to what I've tried to add to vmevent
API:

  Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
  pressure index calculation. But we don't expose the index to the
  userland. Instead, there are three levels of the pressure:

  o low (just reclaiming, e.g. caches are draining);
  o medium (allocation cost becomes high, e.g. swapping);
  o oom (about to oom very soon).

  The rationale behind exposing levels and not the raw pressure index
  described here: http://lkml.org/lkml/2012/11/16/675

The API uses standard cgroups eventfd notifications:

  $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
	cgroup_event_listener
  $ cd /sys/fs/cgroup/
  $ mkdir mempressure
  $ mount -t cgroup cgroup ./mempressure -o mempressure
  $ cd mempressure
  $ cgroup_event_listener ./mempressure.level low
  ("low", "medium", "oom" are permitted values.)

  Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
  low: crossed" messages.

To test that it actually works on per-cgroup basis, I did a small trick: I
moved all kswapd into a separate cgroup, and hooked the listener onto
another (non-root) cgroup. The listener no longer received global reclaim
pressure, which is expected.

For a task it is possible to be in both cpusets, memcg and mempressure
cgroups, so by rearranging the tasks it should be possible to watch a
specific pressure.

Note that while this adds the cgroups support, the code is well separated
and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
But this is another story.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 include/linux/cgroup_subsys.h |   6 +
 include/linux/vmstat.h        |   8 ++
 init/Kconfig                  |   5 +
 mm/Makefile                   |   1 +
 mm/mempressure.c              | 287 ++++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   3 +
 6 files changed, 310 insertions(+)
 create mode 100644 mm/mempressure.c

Michal Hocko Nov. 28, 2012, 4:29 p.m. UTC | #1

On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> This is an attempt to implement David Rientjes' idea of mempressure
> cgroup.
> 
> The main characteristics are the same to what I've tried to add to vmevent
> API:
> 
>   Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>   pressure index calculation. But we don't expose the index to the
>   userland. Instead, there are three levels of the pressure:
> 
>   o low (just reclaiming, e.g. caches are draining);
>   o medium (allocation cost becomes high, e.g. swapping);
>   o oom (about to oom very soon).
> 
>   The rationale behind exposing levels and not the raw pressure index
>   described here: http://lkml.org/lkml/2012/11/16/675
> 
> The API uses standard cgroups eventfd notifications:
> 
>   $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
> 	cgroup_event_listener
>   $ cd /sys/fs/cgroup/
>   $ mkdir mempressure
>   $ mount -t cgroup cgroup ./mempressure -o mempressure
>   $ cd mempressure
>   $ cgroup_event_listener ./mempressure.level low
>   ("low", "medium", "oom" are permitted values.)
> 
>   Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
>   low: crossed" messages.
> 
> To test that it actually works on per-cgroup basis, I did a small trick: I
> moved all kswapd into a separate cgroup, and hooked the listener onto
> another (non-root) cgroup. The listener no longer received global reclaim
> pressure, which is expected.

Is this really expected? So you want to be notified only about the
direct reclaim?
I am not sure how much useful is that. If you co-mount with e.g. memcg then
the picture is different because even global memory pressure is spread
among groups so it would be just a matter of the proper accounting
(which can be handled similar to lruvec when your code doesn't have to
care about memcg internally).
Co-mounting with cpusets makes sense as well because then you get a
pressure notification based on the placement policy.

So does it make much sense to mount mempressure on its own without
co-mounting with other controllers?

> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it should be possible to watch a
> specific pressure.

Could you be more specific what you mean by rearranging? Creating a same
hierarchy? Co-mounting?

> Note that while this adds the cgroups support, the code is well separated
> and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> But this is another story.

I think it would be nice to follow freezer and split this into 2 files.
Generic and cgroup spefici.

> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> ---
[...]
> +/* These are defaults. Might make them configurable one day. */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;

I realize this is just an RFC but could you be more specific what is the
meaning of vmpressure_win?

> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_OOM,
> +	VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char const *vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_OOM] = "oom",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(uint pressure)
> +{
> +	if (pressure >= vmpressure_level_oom)
> +		return VMPRESSURE_OOM;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> +{
> +	ulong p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;

Do we need the win at all?
	p = 100 - (100 * r / s);
> +
> +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
[...]
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int ret = 0;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	if (mpc->eventfd)
> +		ret = -EBUSY;

The current cgroup's core doesn't allow pre_destroy to fail anymore. The
code is marked for 3.8

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..430d8a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1877,6 +1877,8 @@ restart:
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>  
> +	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> +

I think this should already report to a proper group otherwise all the
global reclaim would go to a group where kswapd sits rather than to the
target group as I mentioned above (so it at least wouldn't work with a
co-mounted cases).

>  	/* reclaim/compaction might need reclaim to continue */
>  	if (should_continue_reclaim(lruvec, nr_reclaimed,
>  				    sc->nr_scanned - nr_scanned, sc))
> @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  		count_vm_event(ALLOCSTALL);
>  
>  	do {
> +		vmpressure_prio(sc->priority);

Shouldn't this go into shrink_lruvec or somewhere at that level to catch
also kswapd low priorities? If you insist on the direct reclaim then you
should hook into __zone_reclaim as well.

>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);

Andrew Morton Nov. 28, 2012, 11:14 p.m. UTC | #2

On Wed, 28 Nov 2012 02:29:08 -0800
Anton Vorontsov <anton.vorontsov@linaro.org> wrote:

> The main characteristics are the same to what I've tried to add to vmevent
> API:
> 
>   Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
>   pressure index calculation. But we don't expose the index to the
>   userland. Instead, there are three levels of the pressure:
> 
>   o low (just reclaiming, e.g. caches are draining);
>   o medium (allocation cost becomes high, e.g. swapping);
>   o oom (about to oom very soon).
> 
>   The rationale behind exposing levels and not the raw pressure index
>   described here: http://lkml.org/lkml/2012/11/16/675

This rationale is central to the overall design (and is hence central
to the review).  It would be better to include it in the changelogs
where it can be maintained, understood and discussed.

I see a problem with it:

It blurs the question of "who is in control".  We tell userspace "hey,
we're getting a bit tight here, please do something".  And userspace
makes the decision about what "something" is.  So userspace is in
control of part of the reclaim function and the kernel is in control of
another part.  Strange interactions are likely.

Also, the system as a whole is untestable by kernel developers - it
puts the onus onto each and every userspace developer to develop, test
and tune his application against a particular kernel version.

And the more carefully the userspace developer tunes his application,
the more vulnerable he becomes to regressions which were caused by
subtle changes in the kernel's behaviour.

Compare this with the shrink_slab() shrinkers.  With these, the VM can
query and then control the clients.  If something goes wrong or is out
of balance, it's the VM's problem to solve.

So I'm thinking that a better design would be one which puts the kernel
VM in control of userspace scanning and freeing.  Presumably with a
query-and-control interface similar to the slab shrinkers.

IOW, we make the kernel smarter and make userspace dumber.  Userspace
just sits there and does what the kernel tells it to do.

This gives the kernel developers the ability to tune and tweak (ie:
alter) userspace's behaviour *years* after that userspace code was
written.

Probably most significantly, this approach has a really big advantage:
we can test it.  Once we have defined that userspace query/control
interface we can write a compliant userspace test application then fire
it up and observe the overall system behaviour.  We can fix bugs and we
can tune it.  This cannot be done with your proposed interface because
we just don't know what userspace will do in response to changes in the
exposed metric.

Anton Vorontsov Nov. 29, 2012, 1:27 a.m. UTC | #3

On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
[...]
> Compare this with the shrink_slab() shrinkers.  With these, the VM can
> query and then control the clients.  If something goes wrong or is out
> of balance, it's the VM's problem to solve.
> 
> So I'm thinking that a better design would be one which puts the kernel
> VM in control of userspace scanning and freeing.  Presumably with a
> query-and-control interface similar to the slab shrinkers.

Thanks for the ideas, Andrew.

Query-and-control scheme looks very attractive, and that's actually
resembles my "balance" level idea, when userland tells the kernel how much
reclaimable memory it has. Except the your scheme works in the reverse
direction, i.e. the kernel becomes in charge.

But there is one, rather major issue: we're crossing kernel-userspace
boundary. And with the scheme we'll have to cross the boundary four times:
query / reply-available / control / reply-shrunk / (and repeat if
necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
synchronously (all the four stages), and/or we have to make a "userspace
shrinker" thread working in parallel with the normal shrinker, and here,
I'm afraid, we'll see more strange interactions. :)

But there is a good news: for these kind of fine-grained control we have a
better interface, where we don't have to communicate [very often] w/ the
kernel. These are "volatile ranges", where userland itself marks chunks of
data as "I might need it, but I won't cry if you recycle it; but when I
access it next time, let me know if you actually recycled it". Yes,
userland no longer able to decide which exact page it permits to recycle,
but we don't have use-cases when we actually care that much. And if we do,
we'd rather introduce volatile LRUs with different priorities, or
something alike.

So, we really don't need the full-fledged userland shrinker, since we can
just let the in-kernel shrinker do its job. If we work with the
bytes/pages granularity it is just easier (and more efficient in terms of
communication) to do the volatile ranges.

For the pressure notifications use-cases, we don't even know bytes/pages
information: "activity managers" are separate processes looking after
overall system performance.

So, we're not trying to make userland too smart, quite the contrary: we
realized that for this interface we don't want to mess with the bytes and
pages, and that's why we cut this stuff down to only three levels. Before
this, we were actually trying to count bytes, we did not like it and we
ran away screaming.

OTOH, your scheme makes volatile ranges unneeded, since a thread might
register a shrinker hook and free stuff by itself. But again, I believe
this involves more communication with the kernel.

Thanks,
Anton.

Anton Vorontsov Nov. 29, 2012, 3:32 a.m. UTC | #4

On Wed, Nov 28, 2012 at 05:27:51PM -0800, Anton Vorontsov wrote:
> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers.  With these, the VM can
> > query and then control the clients.  If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> > 
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing.  Presumably with a
> > query-and-control interface similar to the slab shrinkers.
> 
> Thanks for the ideas, Andrew.
> 
> Query-and-control scheme looks very attractive, and that's actually
> resembles my "balance" level idea, when userland tells the kernel how much
> reclaimable memory it has. Except the your scheme works in the reverse
> direction, i.e. the kernel becomes in charge.
> 
> But there is one, rather major issue: we're crossing kernel-userspace
> boundary. And with the scheme we'll have to cross the boundary four times:
> query / reply-available / control / reply-shrunk / (and repeat if
> necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> synchronously (all the four stages), and/or we have to make a "userspace
> shrinker" thread working in parallel with the normal shrinker, and here,
> I'm afraid, we'll see more strange interactions. :)
> 
> But there is a good news: for these kind of fine-grained control we have a
> better interface, where we don't have to communicate [very often] w/ the
> kernel. These are "volatile ranges", where userland itself marks chunks of
> data as "I might need it, but I won't cry if you recycle it; but when I
> access it next time, let me know if you actually recycled it". Yes,
> userland no longer able to decide which exact page it permits to recycle,
> but we don't have use-cases when we actually care that much. And if we do,
> we'd rather introduce volatile LRUs with different priorities, or
> something alike.
> 
> So, we really don't need the full-fledged userland shrinker, since we can
> just let the in-kernel shrinker do its job. If we work with the
> bytes/pages granularity it is just easier (and more efficient in terms of
> communication) to do the volatile ranges.
> 
> For the pressure notifications use-cases, we don't even know bytes/pages
> information: "activity managers" are separate processes looking after
> overall system performance.
> 
> So, we're not trying to make userland too smart, quite the contrary: we
> realized that for this interface we don't want to mess with the bytes and
> pages, and that's why we cut this stuff down to only three levels. Before
> this, we were actually trying to count bytes, we did not like it and we
> ran away screaming.
> 
> OTOH, your scheme makes volatile ranges unneeded, since a thread might
> register a shrinker hook and free stuff by itself. But again, I believe
> this involves more communication with the kernel.

Btw, I believe your idea is something completely new, and I surely cannot
fully evaluate it on my own -- I might be wrong here. So I invite folks to
express their opinions too.

Guys, it's about Andrew's idea of exposing shrinker-alike logic to the
userland (and I made it 'vs. volatile ranges'):

	http://lkml.org/lkml/2012/11/28/607

Thanks,
Anton.

Anton Vorontsov Nov. 29, 2012, 4:17 a.m. UTC | #5

Hello Michal,

Thanks a lot for taking a look into this!

On Wed, Nov 28, 2012 at 05:29:24PM +0100, Michal Hocko wrote:
> On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> > This is an attempt to implement David Rientjes' idea of mempressure
> > cgroup.
> > 
> > The main characteristics are the same to what I've tried to add to vmevent
> > API:
> > 
> >   Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> >   pressure index calculation. But we don't expose the index to the
> >   userland. Instead, there are three levels of the pressure:
> > 
> >   o low (just reclaiming, e.g. caches are draining);
> >   o medium (allocation cost becomes high, e.g. swapping);
> >   o oom (about to oom very soon).
> > 
> >   The rationale behind exposing levels and not the raw pressure index
> >   described here: http://lkml.org/lkml/2012/11/16/675
> > 
> > The API uses standard cgroups eventfd notifications:
> > 
> >   $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
> > 	cgroup_event_listener
> >   $ cd /sys/fs/cgroup/
> >   $ mkdir mempressure
> >   $ mount -t cgroup cgroup ./mempressure -o mempressure
> >   $ cd mempressure
> >   $ cgroup_event_listener ./mempressure.level low
> >   ("low", "medium", "oom" are permitted values.)
> > 
> >   Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
> >   low: crossed" messages.
> > 
> > To test that it actually works on per-cgroup basis, I did a small trick: I
> > moved all kswapd into a separate cgroup, and hooked the listener onto
> > another (non-root) cgroup. The listener no longer received global reclaim
> > pressure, which is expected.
> 
> Is this really expected? So you want to be notified only about the
> direct reclaim?

I didn't try to put much meaning into assinging a task to a non-global
reclaim watchers, I just mentioned this as an easiest way to test that we
actually can account things on per-thread basis. :)

> I am not sure how much useful is that. If you co-mount with e.g. memcg then
> the picture is different because even global memory pressure is spread
> among groups so it would be just a matter of the proper accounting
> (which can be handled similar to lruvec when your code doesn't have to
> care about memcg internally).
> Co-mounting with cpusets makes sense as well because then you get a
> pressure notification based on the placement policy.
> 
> So does it make much sense to mount mempressure on its own without
> co-mounting with other controllers?

Android does not actually need any of these (memcg or cpusets), but we
still want to get notifications (for a root cgroup would be enough for us
-- but I'm trying to make things generic, of course).

> > For a task it is possible to be in both cpusets, memcg and mempressure
> > cgroups, so by rearranging the tasks it should be possible to watch a
> > specific pressure.
> 
> Could you be more specific what you mean by rearranging? Creating a same
> hierarchy? Co-mounting?
> 
> > Note that while this adds the cgroups support, the code is well separated
> > and eventually we might add a lightweight, non-cgroups API, i.e. vmevent.
> > But this is another story.
> 
> I think it would be nice to follow freezer and split this into 2 files.
> Generic and cgroup spefici.

Yeah, this is surely an option, but so far it's only a few hundrends lines
of code, plus we don't have any other users for the "internals". So, for
the time being, I'd rather keep it in one file.

> > Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> > ---
> [...]
> > +/* These are defaults. Might make them configurable one day. */
> > +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
> 
> I realize this is just an RFC but could you be more specific what is the
> meaning of vmpressure_win?

Sure, let me just copy the text from the previous RFC, to which you were
not Cc'ed:

 When the system is short on idle pages, the new memory is allocated by
 reclaiming least recently used resources: kernel scans pages to be
 reclaimed (e.g. from file caches, mmap(2) volatile ranges, etc.; and
 potentially swapping some pages out). The index shows the relative time
 spent by the kernel uselessly scanning pages, or, in other words, the
 percentage of scans of pages (vmpressure_window) that were not reclaimed.
 ...
 Window size is used as a rate-limit tunable for VMPRESSURE_LOW
 notifications and for averaging for VMPRESSURE_{MEDIUM,OOM} levels. So,
 using small window sizes can cause lot of false positives for _MEDIUM and
 _OOM levels, but too big window size may delay notifications. By default
 the window size equals to 256 pages (1MB).

You can find more about the tunables in the previus RFC:

	http://lkml.org/lkml/2012/11/7/169 

> > +static const uint vmpressure_level_med = 60;
> > +static const uint vmpressure_level_oom = 99;
> > +static const uint vmpressure_level_oom_prio = 4;
> > +
> > +enum vmpressure_levels {
> > +	VMPRESSURE_LOW = 0,
> > +	VMPRESSURE_MEDIUM,
> > +	VMPRESSURE_OOM,
> > +	VMPRESSURE_NUM_LEVELS,
> > +};
> > +
> > +static const char const *vmpressure_str_levels[] = {
> > +	[VMPRESSURE_LOW] = "low",
> > +	[VMPRESSURE_MEDIUM] = "medium",
> > +	[VMPRESSURE_OOM] = "oom",
> > +};
> > +
> > +static enum vmpressure_levels vmpressure_level(uint pressure)
> > +{
> > +	if (pressure >= vmpressure_level_oom)
> > +		return VMPRESSURE_OOM;
> > +	else if (pressure >= vmpressure_level_med)
> > +		return VMPRESSURE_MEDIUM;
> > +	return VMPRESSURE_LOW;
> > +}
> > +
> > +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> > +{
> > +	ulong p;
> > +
> > +	if (!s)
> > +		return 0;
> > +
> > +	/*
> > +	 * We calculate the ratio (in percents) of how many pages were
> > +	 * scanned vs. reclaimed in a given time frame (window). Note that
> > +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> > +	 * scanned. This makes it possible to set desired reaction time
> > +	 * and serves as a ratelimit.
> > +	 */
> > +	p = win - (r * win / s);
> > +	p = p * 100 / win;
> 
> Do we need the win at all?
> 	p = 100 - (100 * r / s);

Other than for me being pedant, pretty much no. :) It's just less
"precise" (try s=1000, r=9). (But in return, my version is prone to
misbehave when window is too large.)

> > +
> > +	pr_debug("%s: %3lu  (s: %6u  r: %6u)\n", __func__, p, s, r);
> > +
> > +	return vmpressure_level(p);
> > +}
> > +
> [...]
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int ret = 0;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	if (mpc->eventfd)
> > +		ret = -EBUSY;
> 
> The current cgroup's core doesn't allow pre_destroy to fail anymore. The
> code is marked for 3.8

Sure, I can rebase. (Currently, the code is based on the v3.7-rc6, which
isn't even released but seems way too old already, heh. :)

> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 48550c6..430d8a5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1877,6 +1877,8 @@ restart:
> >  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> >  				   sc, LRU_ACTIVE_ANON);
> >  
> > +	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> > +
> 
> I think this should already report to a proper group otherwise all the
> global reclaim would go to a group where kswapd sits rather than to the
> target group as I mentioned above (so it at least wouldn't work with a
> co-mounted cases).

Um. Yeah, I guess I was too optimistic here, relying on the things to
"just work". I guess I still need to pass memcg pointer to the
vmpressure() and check if the process is also part of the
sc->target_mem_cgroup.

> >  	/* reclaim/compaction might need reclaim to continue */
> >  	if (should_continue_reclaim(lruvec, nr_reclaimed,
> >  				    sc->nr_scanned - nr_scanned, sc))
> > @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  		count_vm_event(ALLOCSTALL);
> >  
> >  	do {
> > +		vmpressure_prio(sc->priority);
> 
> Shouldn't this go into shrink_lruvec or somewhere at that level to catch
> also kswapd low priorities? If you insist on the direct reclaim then you
> should hook into __zone_reclaim as well.

Probably... Thanks for pointing out, I'll take a closer look once we
resolve the global/design issues.

Thanks!
Anton.

Kirill A. Shutemov Nov. 29, 2012, 6:14 a.m. UTC | #6

On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int ret = 0;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	if (mpc->eventfd)
> +		ret = -EBUSY;

cgroup_rmdir() will unregister all events for you. No need to handle it
here.

> +
> +	mutex_unlock(&mpc->lock);
> +
> +	return ret;
> +}

> +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> +				    struct eventfd_ctx *eventfd,
> +				    const char *args)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int i;
> +	int ret;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	/*
> +	 * It's easy to implement multiple thresholds, but so far we don't
> +	 * need it.
> +	 */
> +	if (mpc->eventfd) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}

One user which listen for one threashold per cgroup?
I think it's wrong. It's essensial for API to serve multiple users.

> +
> +	ret = -EINVAL;
> +	for (i = 0; i < VMPRESSURE_NUM_LEVELS; i++) {
> +		if (strcmp(vmpressure_str_levels[i], args))
> +			continue;
> +		mpc->eventfd = eventfd;
> +		mpc->thres = i;
> +		ret = 0;
> +		break;
> +	}
> +out_unlock:
> +	mutex_unlock(&mpc->lock);
> +
> +	return ret;
> +}

Anton Vorontsov Nov. 29, 2012, 6:21 a.m. UTC | #7

On Thu, Nov 29, 2012 at 08:14:13AM +0200, Kirill A. Shutemov wrote:
> On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int ret = 0;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	if (mpc->eventfd)
> > +		ret = -EBUSY;
> 
> cgroup_rmdir() will unregister all events for you. No need to handle it
> here.

Okie, thanks!

[...]
> > +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> > +				    struct eventfd_ctx *eventfd,
> > +				    const char *args)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int i;
> > +	int ret;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	/*
> > +	 * It's easy to implement multiple thresholds, but so far we don't
> > +	 * need it.
> > +	 */
> > +	if (mpc->eventfd) {
> > +		ret = -EBUSY;
> > +		goto out_unlock;
> > +	}
> 
> One user which listen for one threashold per cgroup?
> I think it's wrong. It's essensial for API to serve multiple users.

Yea, if we'll consider merging this, I'll definitely fix this. Just didn't
want to bring the complexity into the code.

Thanks,
Anton.

Luiz Capitulino Nov. 30, 2012, 5:47 p.m. UTC | #8

On Wed, 28 Nov 2012 17:27:51 -0800
Anton Vorontsov <anton.vorontsov@linaro.org> wrote:

> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers.  With these, the VM can
> > query and then control the clients.  If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> > 
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing.  Presumably with a
> > query-and-control interface similar to the slab shrinkers.
> 
> Thanks for the ideas, Andrew.
> 
> Query-and-control scheme looks very attractive, and that's actually
> resembles my "balance" level idea, when userland tells the kernel how much
> reclaimable memory it has. Except the your scheme works in the reverse
> direction, i.e. the kernel becomes in charge.
> 
> But there is one, rather major issue: we're crossing kernel-userspace
> boundary. And with the scheme we'll have to cross the boundary four times:
> query / reply-available / control / reply-shrunk / (and repeat if
> necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> synchronously (all the four stages), and/or we have to make a "userspace
> shrinker" thread working in parallel with the normal shrinker, and here,
> I'm afraid, we'll see more strange interactions. :)

Wouldn't this be just like kswapd?

> But there is a good news: for these kind of fine-grained control we have a
> better interface, where we don't have to communicate [very often] w/ the
> kernel. These are "volatile ranges", where userland itself marks chunks of
> data as "I might need it, but I won't cry if you recycle it; but when I
> access it next time, let me know if you actually recycled it". Yes,
> userland no longer able to decide which exact page it permits to recycle,
> but we don't have use-cases when we actually care that much. And if we do,
> we'd rather introduce volatile LRUs with different priorities, or
> something alike.

I'm new to this stuff so please take this with a grain of salt, but I'm
not sure volatile ranges would be a good fit for our use case: we want to
make (kvm) guests reduce their memory when the host is getting memory
pressure.

Having a notification seems just fine for this purpose, but I'm not sure
how this would work with volatile ranges, as we'd have to mark pages volatile
in advance.

Andrew's idea seems to give a lot more freedom to apps, IMHO.

Anton Vorontsov Dec. 1, 2012, 8:01 a.m. UTC | #9

Hi Luiz,

Thanks for your email!

On Fri, Nov 30, 2012 at 03:47:25PM -0200, Luiz Capitulino wrote:
[...]
> > But there is one, rather major issue: we're crossing kernel-userspace
> > boundary. And with the scheme we'll have to cross the boundary four times:
> > query / reply-available / control / reply-shrunk / (and repeat if
> > necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> > synchronously (all the four stages), and/or we have to make a "userspace
> > shrinker" thread working in parallel with the normal shrinker, and here,
> > I'm afraid, we'll see more strange interactions. :)
> 
> Wouldn't this be just like kswapd?

Sure, this is similar, but only for indirect reclaim (obviously).

How we'd do this for the direct reclaim I have no idea, honestly, with
Andrew's idea it must be all synchronous, so playing ping-pong with
userland during the direct reclaim will be hard.

So, the best thing to do with the direct recaim, IMHO, is just send a
notification.

> > But there is a good news: for these kind of fine-grained control we have a
> > better interface, where we don't have to communicate [very often] w/ the
> > kernel. These are "volatile ranges", where userland itself marks chunks of
> > data as "I might need it, but I won't cry if you recycle it; but when I
> > access it next time, let me know if you actually recycled it". Yes,
> > userland no longer able to decide which exact page it permits to recycle,
> > but we don't have use-cases when we actually care that much. And if we do,
> > we'd rather introduce volatile LRUs with different priorities, or
> > something alike.
> 
> I'm new to this stuff so please take this with a grain of salt, but I'm
> not sure volatile ranges would be a good fit for our use case: we want to
> make (kvm) guests reduce their memory when the host is getting memory
> pressure.

Yes, for this kind of things you want a simple notification.

I wasn't saying that volatile ranges must be a substitute for
notifications, quite the opposite: I was saying that you can do volatile
ranges in userland by using "userland-shrinker".

It can be even wrapped into a library, with the same mmap() libc
interface. But it will be inefficient.

Thanks,
Anton.

Anton Vorontsov Dec. 1, 2012, 11:18 a.m. UTC | #10

On Fri, Nov 30, 2012 at 03:47:25PM -0200, Luiz Capitulino wrote:
[...]
> > Query-and-control scheme looks very attractive, and that's actually
> > resembles my "balance" level idea, when userland tells the kernel how much
> > reclaimable memory it has. Except the your scheme works in the reverse
> > direction, i.e. the kernel becomes in charge.
> > 
> > But there is one, rather major issue: we're crossing kernel-userspace
> > boundary. And with the scheme we'll have to cross the boundary four times:
> > query / reply-available / control / reply-shrunk / (and repeat if
> > necessary, every SHRINK_BATCH pages). Plus, it has to be done somewhat
> > synchronously (all the four stages), and/or we have to make a "userspace
> > shrinker" thread working in parallel with the normal shrinker, and here,
> > I'm afraid, we'll see more strange interactions. :)
[...]
> Andrew's idea seems to give a lot more freedom to apps, IMHO.

OK, thinking about it some more...

===
=== Long explanations below, scroll to 'END' for the short version. :)
===

The typical query-control shrinker interaction would look like this:

   Kernel: "Can you please free <Y> pages?"
 Userland: "Here you go, <Z> pages freed."

Now let's assume that we are the Activity Manager, so we know that we have
<N> reclaimable pages in total (it's not always possible to know, but
let's pretend we do know). And assume that we are the only source of
reclaimable pages (this is important). OK, the kernel asks us to reclaim
<Y> pages.

Now, what if we divide <Y> (needed pages) by <N> (total reclaimable
pages)? :)

This will be the memory pressure factor, what a coincidence. E.g. if Y >=
N, the factor would be >= 1, which was our definition of OOM. If no pages
needed, the factor is 0.

Okay, let's see how our current vmpressure notification works inside:

- The notification comes every 'window size' (<W>) pages scanned;

- Alongside with the notification itself we can also receive the pressure
  factor <F> (it is 1 - reclaimed/scanned). (We use levels nowadays, but
  internally it is still the factor.)

So, by doing <W> * <F> we can find out the amount of memory that the
kernel was missing this round (scanned - reclaimed), which pretty much the
same meaning as "Please free <Y> pages" in the "userland-shrinker" scheme
above.

Except that in the notifications case the "<Y>" was is in the past
already, so we should read "the kernel had difficulty with reclaiming <Y>
pages", and userland just received the notification about this past event.
The <Y> pages were probably reclaimed already.

Now, can we assume that in the next second, the system will need the same
<Y> pages reclaimed? Well, if the window size was small enough, it's OK to
assume that the workload didn't change much. So, yes, we can assume this,
the only "bad" thing that can happen, we can free a little bit more than
it was needed.

Let's look how we'd use the raw factor in the imaginary userland shrinker:

	while (1) {
		/* blocking, triggers every "window size" pages, <W> */
		factor = get_pressure();

		/* Finds the smallest chunk(s) w/ size >= <W> * <F> */
		resource = get_resource(factor);

		free(resource);
	}

So, in the each round we'd free at least <W> * <F> pages. Again, the
product just tells how much memory it is best to free at this time, which
by definition is 'scanned - reclaimed' (<F> = 1 - reclaimed/scanned; <W> =
scanned). That is, we don't need the factor, we need the scanned and
reclaimed difference.

In sum:

- Reporting the 'scanned - reclaimed' seems like an option for
  implementing the userland shrinker;

- B using small 'window size' we can mitigate effect of async nature of
  our shrinker.

Although, the shrinker is not a substitution to the pressure factor (or
levels). The plain "I need <Y> pages" still does not tell how bad things
there are in the system, how much scanning there are. So, the
reclaimed/scanned ratio is important, too.

===
=== END
===

The lengthy text above boils down to this:

Yes, I tend to agree that Andrew's idea gives some freedom to the apps,
and that with the three levels it is not possible to implement a good,
predictable "userland shrinker". Even though we don't need it just now.

Based on the above, I think I have a solution for this. For the next RFC,
I'd like to keep the pressure levels, but I will also add a file that will
report 'scanned - reclaimed' difference. I'll call it something like
nr_to_reclaim. Since the 'scanned - reclaimed' is still an approximation
(although I believe a good one), we may want to tune it without breaking
things.

And with the nr_to_reclaim, implementing a predictable userland shrinker
will be a piece of cake: apps will blindly free the given amount of pages,
nothing more.

Thanks,
Anton.

[RFC] Add mempressure cgroup

Commit Message

Comments

Patch