Message ID: 20121128102908.GA15415@lizard
State: New

On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> This is an attempt to implement David Rientjes' idea of mempressure
> cgroup.
>
> The main characteristics are the same as what I've tried to add to
> vmevent API:
>
> Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the
> userland. Instead, there are three levels of pressure:
>
>  o low (just reclaiming, e.g. caches are draining);
>  o medium (allocation cost becomes high, e.g. swapping);
>  o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index is
> described here: http://lkml.org/lkml/2012/11/16/675
>
> The API uses standard cgroups eventfd notifications:
>
>  $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
>        cgroup_event_listener
>  $ cd /sys/fs/cgroup/
>  $ mkdir mempressure
>  $ mount -t cgroup cgroup ./mempressure -o mempressure
>  $ cd mempressure
>  $ cgroup_event_listener ./mempressure.level low
>  ("low", "medium", "oom" are permitted values.)
>
> Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
> low: crossed" messages.
>
> To test that it actually works on a per-cgroup basis, I did a small
> trick: I moved all kswapd threads into a separate cgroup, and hooked
> the listener onto another (non-root) cgroup. The listener no longer
> received global reclaim pressure, which is expected.

Is this really expected? So you want to be notified only about the
direct reclaim?

I am not sure how useful that is. If you co-mount with e.g. memcg then
the picture is different, because even global memory pressure is spread
among groups, so it would be just a matter of the proper accounting
(which can be handled similarly to lruvec, so that your code doesn't
have to care about memcg internally).
Co-mounting with cpusets makes sense as well, because then you get a
pressure notification based on the placement policy.

So does it make much sense to mount mempressure on its own, without
co-mounting with other controllers?

> For a task it is possible to be in both cpusets, memcg and mempressure
> cgroups, so by rearranging the tasks it should be possible to watch a
> specific pressure.

Could you be more specific about what you mean by rearranging? Creating
the same hierarchy? Co-mounting?

> Note that while this adds the cgroups support, the code is well
> separated and eventually we might add a lightweight, non-cgroups API,
> i.e. vmevent. But this is another story.

I think it would be nice to follow freezer and split this into 2 files:
generic and cgroup specific.

> Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> ---
[...]
> +/* These are defaults. Might make them configurable one day. */
> +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;

I realize this is just an RFC, but could you be more specific about
what the meaning of vmpressure_win is?

> +static const uint vmpressure_level_med = 60;
> +static const uint vmpressure_level_oom = 99;
> +static const uint vmpressure_level_oom_prio = 4;
> +
> +enum vmpressure_levels {
> +	VMPRESSURE_LOW = 0,
> +	VMPRESSURE_MEDIUM,
> +	VMPRESSURE_OOM,
> +	VMPRESSURE_NUM_LEVELS,
> +};
> +
> +static const char const *vmpressure_str_levels[] = {
> +	[VMPRESSURE_LOW] = "low",
> +	[VMPRESSURE_MEDIUM] = "medium",
> +	[VMPRESSURE_OOM] = "oom",
> +};
> +
> +static enum vmpressure_levels vmpressure_level(uint pressure)
> +{
> +	if (pressure >= vmpressure_level_oom)
> +		return VMPRESSURE_OOM;
> +	else if (pressure >= vmpressure_level_med)
> +		return VMPRESSURE_MEDIUM;
> +	return VMPRESSURE_LOW;
> +}
> +
> +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> +{
> +	ulong p;
> +
> +	if (!s)
> +		return 0;
> +
> +	/*
> +	 * We calculate the ratio (in percents) of how many pages were
> +	 * scanned vs. reclaimed in a given time frame (window). Note that
> +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> +	 * scanned. This makes it possible to set desired reaction time
> +	 * and serves as a ratelimit.
> +	 */
> +	p = win - (r * win / s);
> +	p = p * 100 / win;

Do we need the win at all?

	p = 100 - (100 * r / s);

> +
> +	pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> +
> +	return vmpressure_level(p);
> +}
> +
[...]
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int ret = 0;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	if (mpc->eventfd)
> +		ret = -EBUSY;

The current cgroup's core doesn't allow pre_destroy to fail anymore.
The code is marked for 3.8.

[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 48550c6..430d8a5 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1877,6 +1877,8 @@ restart:
>  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
>  				   sc, LRU_ACTIVE_ANON);
>
> +	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> +

I think this should already report to a proper group, otherwise all the
global reclaim would go to a group where kswapd sits rather than to the
target group, as I mentioned above (so it at least wouldn't work with
the co-mounted cases).

>  	/* reclaim/compaction might need reclaim to continue */
>  	if (should_continue_reclaim(lruvec, nr_reclaimed,
>  				    sc->nr_scanned - nr_scanned, sc))
> @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  	count_vm_event(ALLOCSTALL);
>
>  	do {
> +		vmpressure_prio(sc->priority);

Shouldn't this go into shrink_lruvec or somewhere at that level to also
catch kswapd's low priorities? If you insist on the direct reclaim,
then you should hook into __zone_reclaim as well.

>  		sc->nr_scanned = 0;
>  		aborted_reclaim = shrink_zones(zonelist, sc);

On Wed, 28 Nov 2012 02:29:08 -0800 Anton Vorontsov <anton.vorontsov@linaro.org> wrote:

> The main characteristics are the same as what I've tried to add to
> vmevent API:
>
> Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> pressure index calculation. But we don't expose the index to the
> userland. Instead, there are three levels of pressure:
>
>  o low (just reclaiming, e.g. caches are draining);
>  o medium (allocation cost becomes high, e.g. swapping);
>  o oom (about to oom very soon).
>
> The rationale behind exposing levels and not the raw pressure index is
> described here: http://lkml.org/lkml/2012/11/16/675

This rationale is central to the overall design (and is hence central
to the review). It would be better to include it in the changelogs
where it can be maintained, understood and discussed.

I see a problem with it: it blurs the question of "who is in control".
We tell userspace "hey, we're getting a bit tight here, please do
something". And userspace makes the decision about what "something" is.
So userspace is in control of part of the reclaim function and the
kernel is in control of another part. Strange interactions are likely.

Also, the system as a whole is untestable by kernel developers - it
puts the onus onto each and every userspace developer to develop, test
and tune his application against a particular kernel version. And the
more carefully the userspace developer tunes his application, the more
vulnerable he becomes to regressions which were caused by subtle
changes in the kernel's behaviour.

Compare this with the shrink_slab() shrinkers. With these, the VM can
query and then control the clients. If something goes wrong or is out
of balance, it's the VM's problem to solve.

So I'm thinking that a better design would be one which puts the kernel
VM in control of userspace scanning and freeing. Presumably with a
query-and-control interface similar to the slab shrinkers.

IOW, we make the kernel smarter and make userspace dumber. Userspace
just sits there and does what the kernel tells it to do. This gives the
kernel developers the ability to tune and tweak (ie: alter) userspace's
behaviour *years* after that userspace code was written.

Probably most significantly, this approach has a really big advantage:
we can test it. Once we have defined that userspace query/control
interface, we can write a compliant userspace test application, then
fire it up and observe the overall system behaviour. We can fix bugs
and we can tune it. This cannot be done with your proposed interface,
because we just don't know what userspace will do in response to
changes in the exposed metric.
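
For reference, the shrink_slab() interface Andrew is comparing against
(as it existed around v3.7) follows this query/control pattern. This is
a minimal sketch of an in-kernel client; my_cache_count() and
my_cache_evict() are hypothetical stand-ins for a real object cache:

	#include <linux/shrinker.h>

	static int my_cache_shrink(struct shrinker *s,
				   struct shrink_control *sc)
	{
		/* Control phase: the VM asks us to free objects. */
		if (sc->nr_to_scan)
			my_cache_evict(sc->nr_to_scan);

		/* Query phase: report how many freeable objects remain. */
		return my_cache_count();
	}

	static struct shrinker my_cache_shrinker = {
		.shrink	= my_cache_shrink,
		.seeks	= DEFAULT_SEEKS, /* relative cost to recreate objects */
	};

	/* Somewhere in init code: register_shrinker(&my_cache_shrinker); */

The "userspace shrinker" proposal would extend the same two-phase
conversation (query how much is freeable, then command how much to
free) across the kernel/userspace boundary.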

On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
[...]
> Compare this with the shrink_slab() shrinkers. With these, the VM can
> query and then control the clients. If something goes wrong or is out
> of balance, it's the VM's problem to solve.
>
> So I'm thinking that a better design would be one which puts the kernel
> VM in control of userspace scanning and freeing. Presumably with a
> query-and-control interface similar to the slab shrinkers.

Thanks for the ideas, Andrew.

The query-and-control scheme looks very attractive, and it actually
resembles my "balance" level idea, where userland tells the kernel how
much reclaimable memory it has. Except that your scheme works in the
reverse direction, i.e. the kernel is in charge.

But there is one rather major issue: we're crossing the
kernel-userspace boundary. And with this scheme we'll have to cross the
boundary four times: query / reply-available / control / reply-shrunk /
(and repeat if necessary, every SHRINK_BATCH pages). Plus, it has to be
done somewhat synchronously (all four stages), and/or we have to make a
"userspace shrinker" thread working in parallel with the normal
shrinker, and here, I'm afraid, we'll see more strange interactions. :)

But there is good news: for this kind of fine-grained control we have a
better interface, where we don't have to communicate [very often] w/
the kernel. These are "volatile ranges", where userland itself marks
chunks of data as "I might need it, but I won't cry if you recycle it;
but when I access it next time, let me know if you actually recycled
it". Yes, userland is no longer able to decide which exact page it
permits to recycle, but we don't have use-cases where we actually care
that much. And if we do, we'd rather introduce volatile LRUs with
different priorities, or something alike.

So, we really don't need the full-fledged userland shrinker, since we
can just let the in-kernel shrinker do its job. If we work with the
bytes/pages granularity it is just easier (and more efficient in terms
of communication) to do the volatile ranges.

For the pressure notification use-cases, we don't even know bytes/pages
information: "activity managers" are separate processes looking after
overall system performance.

So, we're not trying to make userland too smart; quite the contrary: we
realized that for this interface we don't want to mess with the bytes
and pages, and that's why we cut this stuff down to only three levels.
Before this, we were actually trying to count bytes, we did not like
it, and we ran away screaming.

OTOH, your scheme makes volatile ranges unneeded, since a thread might
register a shrinker hook and free stuff by itself. But again, I believe
this involves more communication with the kernel.

Thanks,
Anton.
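
No volatile-ranges API had been merged at the time of this thread, so
the usage pattern Anton describes can only be sketched hypothetically.
In the sketch below, MADV_VOLATILE and MADV_NONVOLATILE are invented
flags (not a real madvise(2) interface), and fill_cache() is a
stand-in for regenerating the data:

	#include <sys/mman.h>

	static char *cache;

	static void cache_init(size_t len)
	{
		cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		fill_cache(cache, len);		/* hypothetical helper */

		/* "I might need it, but I won't cry if you recycle it." */
		madvise(cache, len, MADV_VOLATILE);	/* hypothetical flag */
	}

	static void cache_use(size_t len)
	{
		/* Hypothetically, a positive return would mean "the
		 * kernel recycled some of these pages". */
		if (madvise(cache, len, MADV_NONVOLATILE) > 0)
			fill_cache(cache, len);

		/* ... now safe to read the cache ... */
	}

The point is that after the initial marking, no kernel round-trips are
needed until the data is actually touched again.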

On Wed, Nov 28, 2012 at 05:27:51PM -0800, Anton Vorontsov wrote:
> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers. With these, the VM can
> > query and then control the clients. If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> >
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing. Presumably with a
> > query-and-control interface similar to the slab shrinkers.
>
> Thanks for the ideas, Andrew.
>
> The query-and-control scheme looks very attractive, and it actually
> resembles my "balance" level idea, where userland tells the kernel how
> much reclaimable memory it has. Except that your scheme works in the
> reverse direction, i.e. the kernel is in charge.
>
> But there is one rather major issue: we're crossing the
> kernel-userspace boundary. And with this scheme we'll have to cross the
> boundary four times: query / reply-available / control / reply-shrunk /
> (and repeat if necessary, every SHRINK_BATCH pages). Plus, it has to be
> done somewhat synchronously (all four stages), and/or we have to make a
> "userspace shrinker" thread working in parallel with the normal
> shrinker, and here, I'm afraid, we'll see more strange interactions. :)
>
> But there is good news: for this kind of fine-grained control we have a
> better interface, where we don't have to communicate [very often] w/
> the kernel. These are "volatile ranges", where userland itself marks
> chunks of data as "I might need it, but I won't cry if you recycle it;
> but when I access it next time, let me know if you actually recycled
> it". Yes, userland is no longer able to decide which exact page it
> permits to recycle, but we don't have use-cases where we actually care
> that much. And if we do, we'd rather introduce volatile LRUs with
> different priorities, or something alike.
>
> So, we really don't need the full-fledged userland shrinker, since we
> can just let the in-kernel shrinker do its job. If we work with the
> bytes/pages granularity it is just easier (and more efficient in terms
> of communication) to do the volatile ranges.
>
> For the pressure notification use-cases, we don't even know bytes/pages
> information: "activity managers" are separate processes looking after
> overall system performance.
>
> So, we're not trying to make userland too smart; quite the contrary: we
> realized that for this interface we don't want to mess with the bytes
> and pages, and that's why we cut this stuff down to only three levels.
> Before this, we were actually trying to count bytes, we did not like
> it, and we ran away screaming.
>
> OTOH, your scheme makes volatile ranges unneeded, since a thread might
> register a shrinker hook and free stuff by itself. But again, I believe
> this involves more communication with the kernel.

Btw, I believe your idea is something completely new, and I surely
cannot fully evaluate it on my own -- I might be wrong here. So I
invite folks to express their opinions too.

Guys, it's about Andrew's idea of exposing shrinker-alike logic to the
userland (and I made it 'vs. volatile ranges'):

http://lkml.org/lkml/2012/11/28/607

Thanks,
Anton.

Hello Michal,

Thanks a lot for taking a look into this!

On Wed, Nov 28, 2012 at 05:29:24PM +0100, Michal Hocko wrote:
> On Wed 28-11-12 02:29:08, Anton Vorontsov wrote:
> > This is an attempt to implement David Rientjes' idea of mempressure
> > cgroup.
> >
> > The main characteristics are the same as what I've tried to add to
> > vmevent API:
> >
> > Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
> > pressure index calculation. But we don't expose the index to the
> > userland. Instead, there are three levels of pressure:
> >
> >  o low (just reclaiming, e.g. caches are draining);
> >  o medium (allocation cost becomes high, e.g. swapping);
> >  o oom (about to oom very soon).
> >
> > The rationale behind exposing levels and not the raw pressure index
> > is described here: http://lkml.org/lkml/2012/11/16/675
> >
> > The API uses standard cgroups eventfd notifications:
> >
> >  $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
> >        cgroup_event_listener
> >  $ cd /sys/fs/cgroup/
> >  $ mkdir mempressure
> >  $ mount -t cgroup cgroup ./mempressure -o mempressure
> >  $ cd mempressure
> >  $ cgroup_event_listener ./mempressure.level low
> >  ("low", "medium", "oom" are permitted values.)
> >
> > Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
> > low: crossed" messages.
> >
> > To test that it actually works on a per-cgroup basis, I did a small
> > trick: I moved all kswapd threads into a separate cgroup, and hooked
> > the listener onto another (non-root) cgroup. The listener no longer
> > received global reclaim pressure, which is expected.
>
> Is this really expected? So you want to be notified only about the
> direct reclaim?

I didn't try to put much meaning into assigning a task to non-global
reclaim watchers, I just mentioned this as the easiest way to test that
we actually can account things on a per-thread basis. :)

> I am not sure how useful that is. If you co-mount with e.g. memcg then
> the picture is different, because even global memory pressure is spread
> among groups, so it would be just a matter of the proper accounting
> (which can be handled similarly to lruvec, so that your code doesn't
> have to care about memcg internally).
> Co-mounting with cpusets makes sense as well, because then you get a
> pressure notification based on the placement policy.
>
> So does it make much sense to mount mempressure on its own, without
> co-mounting with other controllers?

Android does not actually need any of these (memcg or cpusets), but we
still want to get notifications (getting them for the root cgroup would
be enough for us -- but I'm trying to make things generic, of course).

> > For a task it is possible to be in both cpusets, memcg and
> > mempressure cgroups, so by rearranging the tasks it should be
> > possible to watch a specific pressure.
>
> Could you be more specific about what you mean by rearranging? Creating
> the same hierarchy? Co-mounting?
>
> > Note that while this adds the cgroups support, the code is well
> > separated and eventually we might add a lightweight, non-cgroups
> > API, i.e. vmevent. But this is another story.
>
> I think it would be nice to follow freezer and split this into 2 files:
> generic and cgroup specific.

Yeah, this is surely an option, but so far it's only a few hundred
lines of code, plus we don't have any other users for the "internals".
So, for the time being, I'd rather keep it in one file.

> > Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
> > ---
> [...]
> > +/* These are defaults. Might make them configurable one day. */
> > +static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
>
> I realize this is just an RFC, but could you be more specific about
> what the meaning of vmpressure_win is?

Sure, let me just copy the text from the previous RFC, to which you
were not Cc'ed:

  When the system is short on idle pages, the new memory is allocated
  by reclaiming least recently used resources: the kernel scans pages
  to be reclaimed (e.g. from file caches, mmap(2) volatile ranges,
  etc.; and potentially swapping some pages out). The index shows the
  relative time spent by the kernel uselessly scanning pages, or, in
  other words, the percentage of scans of pages (vmpressure_window)
  that were not reclaimed.

  ...

  Window size is used as a rate-limit tunable for VMPRESSURE_LOW
  notifications and for averaging for the VMPRESSURE_{MEDIUM,OOM}
  levels. So, using small window sizes can cause lots of false
  positives for the _MEDIUM and _OOM levels, but too big a window size
  may delay notifications. By default the window size equals 256 pages
  (1MB).

You can find more about the tunables in the previous RFC:

http://lkml.org/lkml/2012/11/7/169

> > +static const uint vmpressure_level_med = 60;
> > +static const uint vmpressure_level_oom = 99;
> > +static const uint vmpressure_level_oom_prio = 4;
> > +
> > +enum vmpressure_levels {
> > +	VMPRESSURE_LOW = 0,
> > +	VMPRESSURE_MEDIUM,
> > +	VMPRESSURE_OOM,
> > +	VMPRESSURE_NUM_LEVELS,
> > +};
> > +
> > +static const char const *vmpressure_str_levels[] = {
> > +	[VMPRESSURE_LOW] = "low",
> > +	[VMPRESSURE_MEDIUM] = "medium",
> > +	[VMPRESSURE_OOM] = "oom",
> > +};
> > +
> > +static enum vmpressure_levels vmpressure_level(uint pressure)
> > +{
> > +	if (pressure >= vmpressure_level_oom)
> > +		return VMPRESSURE_OOM;
> > +	else if (pressure >= vmpressure_level_med)
> > +		return VMPRESSURE_MEDIUM;
> > +	return VMPRESSURE_LOW;
> > +}
> > +
> > +static ulong vmpressure_calc_level(uint win, uint s, uint r)
> > +{
> > +	ulong p;
> > +
> > +	if (!s)
> > +		return 0;
> > +
> > +	/*
> > +	 * We calculate the ratio (in percents) of how many pages were
> > +	 * scanned vs. reclaimed in a given time frame (window). Note that
> > +	 * time is in VM reclaimer's "ticks", i.e. number of pages
> > +	 * scanned. This makes it possible to set desired reaction time
> > +	 * and serves as a ratelimit.
> > +	 */
> > +	p = win - (r * win / s);
> > +	p = p * 100 / win;
>
> Do we need the win at all?
>
> 	p = 100 - (100 * r / s);

Other than me being a pedant, pretty much no. :) Your version is just
less "precise" (try s=1000, r=9). (But in return, my version is prone
to misbehave when the window is too large.)

> > +
> > +	pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
> > +
> > +	return vmpressure_level(p);
> > +}
> > +
> [...]
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int ret = 0;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	if (mpc->eventfd)
> > +		ret = -EBUSY;
>
> The current cgroup's core doesn't allow pre_destroy to fail anymore.
> The code is marked for 3.8.

Sure, I can rebase. (Currently, the code is based on v3.7-rc6, which
isn't even released but seems way too old already, heh. :)

> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 48550c6..430d8a5 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1877,6 +1877,8 @@ restart:
> >  		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
> >  				   sc, LRU_ACTIVE_ANON);
> >
> > +	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
> > +
>
> I think this should already report to a proper group, otherwise all the
> global reclaim would go to a group where kswapd sits rather than to the
> target group, as I mentioned above (so it at least wouldn't work with
> the co-mounted cases).

Um. Yeah, I guess I was too optimistic here, relying on things to "just
work". I guess I still need to pass a memcg pointer to vmpressure() and
check whether the process is also part of sc->target_mem_cgroup.

> >  	/* reclaim/compaction might need reclaim to continue */
> >  	if (should_continue_reclaim(lruvec, nr_reclaimed,
> >  				    sc->nr_scanned - nr_scanned, sc))
> > @@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  	count_vm_event(ALLOCSTALL);
> >
> >  	do {
> > +		vmpressure_prio(sc->priority);
>
> Shouldn't this go into shrink_lruvec or somewhere at that level to also
> catch kswapd's low priorities? If you insist on the direct reclaim,
> then you should hook into __zone_reclaim as well.

Probably... Thanks for pointing out; I'll take a closer look once we
resolve the global/design issues.

Thanks!

Anton.
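
Anton's s=1000, r=9 precision point can be checked with integer
arithmetic. A small standalone demo, assuming the patch's default
window of SWAP_CLUSTER_MAX * 16 = 512 pages (SWAP_CLUSTER_MAX is 32):

	#include <stdio.h>

	int main(void)
	{
		unsigned int win = 512, s = 1000, r = 9;

		/* Patch version: scale by the window first. */
		unsigned long p1 = (win - (r * win / s)) * 100 / win;

		/* Michal's simplification: 100 * r / s truncates to 0. */
		unsigned long p2 = 100 - (100 * r / s);

		/* The exact value is 100 - 0.9 = 99.1 */
		printf("patch: %lu, simplified: %lu\n", p1, p2);
		return 0;
	}

This prints "patch: 99, simplified: 100": the simplified formula loses
the sub-percent reclaim entirely, while the window-scaled one keeps it.
Conversely, for a very large window, r * win risks integer overflow,
which is the misbehaviour Anton alludes to.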

On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> +static int mpc_pre_destroy(struct cgroup *cg)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int ret = 0;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	if (mpc->eventfd)
> +		ret = -EBUSY;

cgroup_rmdir() will unregister all events for you. No need to handle it
here.

> +
> +	mutex_unlock(&mpc->lock);
> +
> +	return ret;
> +}

> +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> +				    struct eventfd_ctx *eventfd,
> +				    const char *args)
> +{
> +	struct mpc_state *mpc = cg2mpc(cg);
> +	int i;
> +	int ret;
> +
> +	mutex_lock(&mpc->lock);
> +
> +	/*
> +	 * It's easy to implement multiple thresholds, but so far we don't
> +	 * need it.
> +	 */
> +	if (mpc->eventfd) {
> +		ret = -EBUSY;
> +		goto out_unlock;
> +	}

One user listening for one threshold per cgroup? I think it's wrong.
It's essential for the API to serve multiple users.

> +
> +	ret = -EINVAL;
> +	for (i = 0; i < VMPRESSURE_NUM_LEVELS; i++) {
> +		if (strcmp(vmpressure_str_levels[i], args))
> +			continue;
> +		mpc->eventfd = eventfd;
> +		mpc->thres = i;
> +		ret = 0;
> +		break;
> +	}
> +out_unlock:
> +	mutex_unlock(&mpc->lock);
> +
> +	return ret;
> +}

On Thu, Nov 29, 2012 at 08:14:13AM +0200, Kirill A. Shutemov wrote:
> On Wed, Nov 28, 2012 at 02:29:08AM -0800, Anton Vorontsov wrote:
> > +static int mpc_pre_destroy(struct cgroup *cg)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int ret = 0;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	if (mpc->eventfd)
> > +		ret = -EBUSY;
>
> cgroup_rmdir() will unregister all events for you. No need to handle it
> here.

Okie, thanks!

[...]
> > +static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
> > +				    struct eventfd_ctx *eventfd,
> > +				    const char *args)
> > +{
> > +	struct mpc_state *mpc = cg2mpc(cg);
> > +	int i;
> > +	int ret;
> > +
> > +	mutex_lock(&mpc->lock);
> > +
> > +	/*
> > +	 * It's easy to implement multiple thresholds, but so far we don't
> > +	 * need it.
> > +	 */
> > +	if (mpc->eventfd) {
> > +		ret = -EBUSY;
> > +		goto out_unlock;
> > +	}
>
> One user listening for one threshold per cgroup? I think it's wrong.
> It's essential for the API to serve multiple users.

Yeah, if we consider merging this, I'll definitely fix it. I just
didn't want to bring that complexity into the code yet.

Thanks,
Anton.
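
One way to serve multiple listeners, roughly along the lines of
memcg's oom_control notification list, would be to keep one record per
registered eventfd instead of the single pointer. A sketch (the struct,
the mpc->events list head and the names are illustrative additions,
not part of the posted patch):

	/* One registered listener; hangs off a list_head in mpc_state. */
	struct mpc_event {
		struct list_head	node;
		struct eventfd_ctx	*efd;
		enum vmpressure_levels	level;
	};

	/* Would replace the single eventfd_signal() in __mpc_vmpressure(). */
	static void mpc_signal_all(struct mpc_state *mpc, int level)
	{
		struct mpc_event *ev;

		mutex_lock(&mpc->lock);
		list_for_each_entry(ev, &mpc->events, node)
			if (level >= ev->level)
				eventfd_signal(ev->efd, 1);
		mutex_unlock(&mpc->lock);
	}

Registration would then append to the list rather than returning
-EBUSY, and unregistration would remove only the matching entry.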

On Wed, 28 Nov 2012 17:27:51 -0800 Anton Vorontsov <anton.vorontsov@linaro.org> wrote:

> On Wed, Nov 28, 2012 at 03:14:32PM -0800, Andrew Morton wrote:
> [...]
> > Compare this with the shrink_slab() shrinkers. With these, the VM can
> > query and then control the clients. If something goes wrong or is out
> > of balance, it's the VM's problem to solve.
> >
> > So I'm thinking that a better design would be one which puts the kernel
> > VM in control of userspace scanning and freeing. Presumably with a
> > query-and-control interface similar to the slab shrinkers.
>
> Thanks for the ideas, Andrew.
>
> The query-and-control scheme looks very attractive, and it actually
> resembles my "balance" level idea, where userland tells the kernel how
> much reclaimable memory it has. Except that your scheme works in the
> reverse direction, i.e. the kernel is in charge.
>
> But there is one rather major issue: we're crossing the
> kernel-userspace boundary. And with this scheme we'll have to cross the
> boundary four times: query / reply-available / control / reply-shrunk /
> (and repeat if necessary, every SHRINK_BATCH pages). Plus, it has to be
> done somewhat synchronously (all four stages), and/or we have to make a
> "userspace shrinker" thread working in parallel with the normal
> shrinker, and here, I'm afraid, we'll see more strange interactions. :)

Wouldn't this be just like kswapd?

> But there is good news: for this kind of fine-grained control we have a
> better interface, where we don't have to communicate [very often] w/
> the kernel. These are "volatile ranges", where userland itself marks
> chunks of data as "I might need it, but I won't cry if you recycle it;
> but when I access it next time, let me know if you actually recycled
> it". Yes, userland is no longer able to decide which exact page it
> permits to recycle, but we don't have use-cases where we actually care
> that much. And if we do, we'd rather introduce volatile LRUs with
> different priorities, or something alike.

I'm new to this stuff so please take this with a grain of salt, but I'm
not sure volatile ranges would be a good fit for our use case: we want
to make (kvm) guests reduce their memory when the host is getting
memory pressure. Having a notification seems just fine for this
purpose, but I'm not sure how this would work with volatile ranges, as
we'd have to mark pages volatile in advance.

Andrew's idea seems to give a lot more freedom to apps, IMHO.

Hi Luiz,

Thanks for your email!

On Fri, Nov 30, 2012 at 03:47:25PM -0200, Luiz Capitulino wrote:
[...]
> > But there is one rather major issue: we're crossing the
> > kernel-userspace boundary. And with this scheme we'll have to cross
> > the boundary four times: query / reply-available / control /
> > reply-shrunk / (and repeat if necessary, every SHRINK_BATCH pages).
> > Plus, it has to be done somewhat synchronously (all four stages),
> > and/or we have to make a "userspace shrinker" thread working in
> > parallel with the normal shrinker, and here, I'm afraid, we'll see
> > more strange interactions. :)
>
> Wouldn't this be just like kswapd?

Sure, this is similar, but only for indirect reclaim (obviously). How
we'd do this for direct reclaim I honestly have no idea: with Andrew's
idea it must be all synchronous, so playing ping-pong with userland
during direct reclaim will be hard. So, the best thing to do for direct
reclaim, IMHO, is to just send a notification.

> > But there is good news: for this kind of fine-grained control we
> > have a better interface, where we don't have to communicate [very
> > often] w/ the kernel. These are "volatile ranges", where userland
> > itself marks chunks of data as "I might need it, but I won't cry if
> > you recycle it; but when I access it next time, let me know if you
> > actually recycled it". Yes, userland is no longer able to decide
> > which exact page it permits to recycle, but we don't have use-cases
> > where we actually care that much. And if we do, we'd rather
> > introduce volatile LRUs with different priorities, or something
> > alike.
>
> I'm new to this stuff so please take this with a grain of salt, but I'm
> not sure volatile ranges would be a good fit for our use case: we want
> to make (kvm) guests reduce their memory when the host is getting
> memory pressure.

Yes, for this kind of thing you want a simple notification. I wasn't
saying that volatile ranges must be a substitute for notifications;
quite the opposite: I was saying that you can do volatile ranges in
userland by using the "userland shrinker". It can even be wrapped into
a library, with the same mmap() libc interface. But it will be
inefficient.

Thanks,
Anton.

On Fri, Nov 30, 2012 at 03:47:25PM -0200, Luiz Capitulino wrote:
[...]
> > The query-and-control scheme looks very attractive, and it actually
> > resembles my "balance" level idea, where userland tells the kernel
> > how much reclaimable memory it has. Except that your scheme works in
> > the reverse direction, i.e. the kernel is in charge.
> >
> > But there is one rather major issue: we're crossing the
> > kernel-userspace boundary. And with this scheme we'll have to cross
> > the boundary four times: query / reply-available / control /
> > reply-shrunk / (and repeat if necessary, every SHRINK_BATCH pages).
> > Plus, it has to be done somewhat synchronously (all four stages),
> > and/or we have to make a "userspace shrinker" thread working in
> > parallel with the normal shrinker, and here, I'm afraid, we'll see
> > more strange interactions. :)
[...]
> Andrew's idea seems to give a lot more freedom to apps, IMHO.

OK, thinking about it some more...

===
===  Long explanations below, scroll to 'END' for the short version. :)
===

The typical query-control shrinker interaction would look like this:

  Kernel:   "Can you please free <Y> pages?"
  Userland: "Here you go, <Z> pages freed."

Now let's assume that we are the Activity Manager, so we know that we
have <N> reclaimable pages in total (it's not always possible to know,
but let's pretend we do know). And assume that we are the only source
of reclaimable pages (this is important).

OK, the kernel asks us to reclaim <Y> pages. Now, what if we divide <Y>
(needed pages) by <N> (total reclaimable pages)? :) This will be the
memory pressure factor, what a coincidence. E.g. if Y >= N, the factor
would be >= 1, which was our definition of OOM. If no pages are needed,
the factor is 0.

Okay, let's see how our current vmpressure notification works inside:

 - The notification comes every 'window size' (<W>) pages scanned;
 - Alongside the notification itself we can also receive the pressure
   factor <F> (it is 1 - reclaimed/scanned). (We use levels nowadays,
   but internally it is still the factor.)

So, by computing <W> * <F> we can find out the amount of memory that
the kernel was missing in this round (scanned - reclaimed), which has
pretty much the same meaning as "Please free <Y> pages" in the
"userland shrinker" scheme above. Except that in the notifications case
the "<Y>" is in the past already, so we should read it as "the kernel
had difficulty with reclaiming <Y> pages", and userland just received
the notification about this past event. The <Y> pages were probably
reclaimed already.

Now, can we assume that in the next second the system will need the
same <Y> pages reclaimed? Well, if the window size was small enough,
it's OK to assume that the workload didn't change much. So, yes, we can
assume this; the only "bad" thing that can happen is that we free a
little bit more than was needed.

Let's look at how we'd use the raw factor in the imaginary userland
shrinker:

	while (1) {
		/* Blocking; triggers every "window size" pages, <W>. */
		factor = get_pressure();

		/* Finds the smallest chunk(s) w/ size >= <W> * <F>. */
		resource = get_resource(factor);

		free(resource);
	}

So, in each round we'd free at least <W> * <F> pages. Again, the
product just tells how much memory it is best to free at this time,
which by definition is 'scanned - reclaimed' (<F> = 1 -
reclaimed/scanned; <W> = scanned). That is, we don't need the factor,
we need the scanned and reclaimed difference.
In sum:

 - Reporting 'scanned - reclaimed' seems like an option for
   implementing the userland shrinker;
 - By using a small 'window size' we can mitigate the effect of the
   async nature of our shrinker.

Although, the shrinker is not a substitute for the pressure factor (or
levels). The plain "I need <Y> pages" still does not tell how bad
things are in the system, how much scanning there is. So, the
reclaimed/scanned ratio is important, too.

===
=== END
===

The lengthy text above boils down to this: yes, I tend to agree that
Andrew's idea gives some freedom to the apps, and that with the three
levels it is not possible to implement a good, predictable "userland
shrinker" -- even though we don't need it just now.

Based on the above, I think I have a solution for this. For the next
RFC, I'd like to keep the pressure levels, but I will also add a file
that will report the 'scanned - reclaimed' difference. I'll call it
something like nr_to_reclaim.

Since 'scanned - reclaimed' is still an approximation (although I
believe a good one), we may want to tune it without breaking things.
And with nr_to_reclaim, implementing a predictable userland shrinker
will be a piece of cake: apps will blindly free the given amount of
pages, nothing more.

Thanks,
Anton.
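
Put together, the loop a userland shrinker could run against this
*proposed* interface might look like the sketch below. Note that
mempressure.nr_to_reclaim does not exist in the posted patch, and
wait_for_pressure_event(), read_long() and shrink_app_caches() are
hypothetical helpers:

	/* Hypothetical userland shrinker against the proposed interface. */
	static void userland_shrinker(void)
	{
		for (;;) {
			/* Block on the eventfd until a level event fires. */
			wait_for_pressure_event();

			/*
			 * Proposed (not yet existing) file reporting how
			 * many pages the kernel failed to reclaim in the
			 * last window (scanned - reclaimed).
			 */
			long nr = read_long("/sys/fs/cgroup/mempressure/"
					    "mempressure.nr_to_reclaim");

			/* Blindly free that many pages worth of caches. */
			while (nr > 0)
				nr -= shrink_app_caches(nr);
		}
	}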
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index f204a7a..b9802e2 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -37,6 +37,12 @@ SUBSYS(mem_cgroup)
 
 /* */
 
+#if IS_SUBSYS_ENABLED(CONFIG_CGROUP_MEMPRESSURE)
+SUBSYS(mpc_cgroup)
+#endif
+
+/* */
+
 #if IS_SUBSYS_ENABLED(CONFIG_CGROUP_DEVICE)
 SUBSYS(devices)
 #endif
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 92a86b2..7698341 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -10,6 +10,14 @@
 
 extern int sysctl_stat_interval;
 
+#ifdef CONFIG_CGROUP_MEMPRESSURE
+extern void vmpressure(ulong scanned, ulong reclaimed);
+extern void vmpressure_prio(int prio);
+#else
+static inline void vmpressure(ulong scanned, ulong reclaimed) {}
+static inline void vmpressure_prio(int prio) {}
+#endif
+
 #ifdef CONFIG_VM_EVENT_COUNTERS
 /*
  * Light weight per cpu counter implementation.
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..7065e44 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -826,6 +826,11 @@ config MEMCG_KMEM
 	  the kmem extension can use it to guarantee that no group of processes
 	  will ever exhaust kernel resources alone.
 
+config CGROUP_MEMPRESSURE
+	bool "Memory pressure monitor for Control Groups"
+	help
+	  TODO
+
 config CGROUP_HUGETLB
 	bool "HugeTLB Resource Controller for Control Groups"
 	depends on RESOURCE_COUNTERS && HUGETLB_PAGE && EXPERIMENTAL
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..40cee19 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_MIGRATION) += migrate.o
 obj-$(CONFIG_QUICKLIST) += quicklist.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o
 obj-$(CONFIG_MEMCG) += memcontrol.o page_cgroup.o
+obj-$(CONFIG_CGROUP_MEMPRESSURE) += mempressure.o
 obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
diff --git a/mm/mempressure.c b/mm/mempressure.c
new file mode 100644
index 0000000..5c85bbe
--- /dev/null
+++ b/mm/mempressure.c
@@ -0,0 +1,287 @@
+/*
+ * Linux VM pressure notifications
+ *
+ * Copyright 2012 Linaro Ltd.
+ * Anton Vorontsov <anton.vorontsov@linaro.org>
+ *
+ * Based on ideas from David Rientjes, KOSAKI Motohiro, Leonid Moiseichuk,
+ * Mel Gorman, Minchan Kim and Pekka Enberg.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#include <linux/cgroup.h>
+#include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/vmstat.h>
+#include <linux/atomic.h>
+#include <linux/eventfd.h>
+#include <linux/swap.h>
+#include <linux/printk.h>
+
+static void mpc_vmpressure(ulong scanned, ulong reclaimed);
+
+/*
+ * Generic VM Pressure routines (no cgroups or any other API details)
+ */
+
+/* These are defaults. Might make them configurable one day. */
+static const uint vmpressure_win = SWAP_CLUSTER_MAX * 16;
+static const uint vmpressure_level_med = 60;
+static const uint vmpressure_level_oom = 99;
+static const uint vmpressure_level_oom_prio = 4;
+
+enum vmpressure_levels {
+	VMPRESSURE_LOW = 0,
+	VMPRESSURE_MEDIUM,
+	VMPRESSURE_OOM,
+	VMPRESSURE_NUM_LEVELS,
+};
+
+static const char const *vmpressure_str_levels[] = {
+	[VMPRESSURE_LOW] = "low",
+	[VMPRESSURE_MEDIUM] = "medium",
+	[VMPRESSURE_OOM] = "oom",
+};
+
+static enum vmpressure_levels vmpressure_level(uint pressure)
+{
+	if (pressure >= vmpressure_level_oom)
+		return VMPRESSURE_OOM;
+	else if (pressure >= vmpressure_level_med)
+		return VMPRESSURE_MEDIUM;
+	return VMPRESSURE_LOW;
+}
+
+static ulong vmpressure_calc_level(uint win, uint s, uint r)
+{
+	ulong p;
+
+	if (!s)
+		return 0;
+
+	/*
+	 * We calculate the ratio (in percents) of how many pages were
+	 * scanned vs. reclaimed in a given time frame (window). Note that
+	 * time is in VM reclaimer's "ticks", i.e. number of pages
+	 * scanned. This makes it possible to set desired reaction time
+	 * and serves as a ratelimit.
+	 */
+	p = win - (r * win / s);
+	p = p * 100 / win;
+
+	pr_debug("%s: %3lu (s: %6u r: %6u)\n", __func__, p, s, r);
+
+	return vmpressure_level(p);
+}
+
+void vmpressure(ulong scanned, ulong reclaimed)
+{
+	if (!scanned)
+		return;
+	mpc_vmpressure(scanned, reclaimed);
+}
+
+void vmpressure_prio(int prio)
+{
+	if (prio > vmpressure_level_oom_prio)
+		return;
+
+	/* OK, the prio is below the threshold, send the pre-OOM event. */
+	vmpressure(vmpressure_win, 0);
+}
+
+/*
+ * Memory pressure cgroup code
+ */
+
+struct mpc_state {
+	struct cgroup_subsys_state css;
+	uint scanned;
+	uint reclaimed;
+	struct mutex lock;
+	struct eventfd_ctx *eventfd;
+	enum vmpressure_levels thres;
+};
+
+static struct mpc_state *css2mpc(struct cgroup_subsys_state *css)
+{
+	return container_of(css, struct mpc_state, css);
+}
+
+static struct mpc_state *tsk2mpc(struct task_struct *tsk)
+{
+	return css2mpc(task_subsys_state(tsk, mpc_cgroup_subsys_id));
+}
+
+static struct mpc_state *cg2mpc(struct cgroup *cg)
+{
+	return css2mpc(cgroup_subsys_state(cg, mpc_cgroup_subsys_id));
+}
+
+static void __mpc_vmpressure(ulong scanned, ulong reclaimed)
+{
+	struct mpc_state *mpc = tsk2mpc(current);
+	int level;
+
+	mpc->scanned += scanned;
+	mpc->reclaimed += reclaimed;
+
+	if (mpc->scanned < vmpressure_win)
+		return;
+
+	level = vmpressure_calc_level(vmpressure_win,
+				      mpc->scanned, mpc->reclaimed);
+	if (level >= mpc->thres) {
+		mutex_lock(&mpc->lock);
+		if (mpc->eventfd)
+			eventfd_signal(mpc->eventfd, 1);
+		mutex_unlock(&mpc->lock);
+	}
+}
+
+static void mpc_vmpressure(ulong scanned, ulong reclaimed)
+{
+	/*
+	 * There are two options for implementing cgroup pressure
+	 * notifications:
+	 *
+	 * - Store pressure counter atomically in the task struct. Upon
+	 *   hitting 'window' wake up a workqueue that will walk every
+	 *   task and sum per-thread pressure into cgroup pressure (to
+	 *   which the task belongs). The cons are obvious: bloats task
+	 *   struct, have to walk all processes and makes pressue less
+	 *   accurate (the window becomes per-thread);
+	 *
	 * - Store pressure counters in per-cgroup state. This is easy and
+	 *   straighforward, and that's how we do things here. But this
+	 *   requires us to not put the vmpressure hooks into hotpath,
+	 *   since we have to grab some locks.
+	 */
+	task_lock(current);
+	__mpc_vmpressure(scanned, reclaimed);
+	task_unlock(current);
+}
+
+static struct cgroup_subsys_state *mpc_create(struct cgroup *cg)
+{
+	struct mpc_state *mpc;
+
+	mpc = kzalloc(sizeof(*mpc), GFP_KERNEL);
+	if (!mpc)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_init(&mpc->lock);
+
+	return &mpc->css;
+}
+
+static int mpc_pre_destroy(struct cgroup *cg)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+	int ret = 0;
+
+	mutex_lock(&mpc->lock);
+
+	if (mpc->eventfd)
+		ret = -EBUSY;
+
+	mutex_unlock(&mpc->lock);
+
+	return ret;
+}
+
+static void mpc_destroy(struct cgroup *cg)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+
+	kfree(mpc);
+}
+
+static ssize_t mpc_read_level(struct cgroup *cg, struct cftype *cft,
+			      struct file *file, char __user *buf,
+			      size_t sz, loff_t *ppos)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+	uint level;
+	const char *str;
+
+	mutex_lock(&mpc->lock);
+
+	level = vmpressure_calc_level(vmpressure_win,
+				      mpc->scanned, mpc->reclaimed);
+	mpc->scanned = 0;
+	mpc->reclaimed = 0;
+
+	mutex_unlock(&mpc->lock);
+
+	str = vmpressure_str_levels[level];
+	return simple_read_from_buffer(buf, sz, ppos, str, strlen(str));
+}
+
+static int mpc_register_level_event(struct cgroup *cg, struct cftype *cft,
+				    struct eventfd_ctx *eventfd,
+				    const char *args)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+	int i;
+	int ret;
+
+	mutex_lock(&mpc->lock);
+
+	/*
+	 * It's easy to implement multiple thresholds, but so far we don't
+	 * need it.
+	 */
+	if (mpc->eventfd) {
+		ret = -EBUSY;
+		goto out_unlock;
+	}
+
+	ret = -EINVAL;
+	for (i = 0; i < VMPRESSURE_NUM_LEVELS; i++) {
+		if (strcmp(vmpressure_str_levels[i], args))
+			continue;
+		mpc->eventfd = eventfd;
+		mpc->thres = i;
+		ret = 0;
+		break;
+	}
+out_unlock:
+	mutex_unlock(&mpc->lock);
+
+	return ret;
+}
+
+static void mpc_unregister_level_event(struct cgroup *cg, struct cftype *cft,
+				       struct eventfd_ctx *eventfd)
+{
+	struct mpc_state *mpc = cg2mpc(cg);
+
+	mutex_lock(&mpc->lock);
+	BUG_ON(mpc->eventfd != eventfd);
+	mpc->eventfd = NULL;
+	mutex_unlock(&mpc->lock);
+}
+
+static struct cftype mpc_files[] = {
+	{
+		.name = "level",
+		.read = mpc_read_level,
+		.register_event = mpc_register_level_event,
+		.unregister_event = mpc_unregister_level_event,
+	},
+	{},
+};
+
+struct cgroup_subsys mpc_cgroup_subsys = {
+	.name = "mempressure",
+	.subsys_id = mpc_cgroup_subsys_id,
+	.create = mpc_create,
+	.pre_destroy = mpc_pre_destroy,
+	.destroy = mpc_destroy,
+	.base_cftypes = mpc_files,
+};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 48550c6..430d8a5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1877,6 +1877,8 @@ restart:
 		shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
 				   sc, LRU_ACTIVE_ANON);
 
+	vmpressure(sc->nr_scanned - nr_scanned, nr_reclaimed);
+
 	/* reclaim/compaction might need reclaim to continue */
 	if (should_continue_reclaim(lruvec, nr_reclaimed,
 				    sc->nr_scanned - nr_scanned, sc))
@@ -2099,6 +2101,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	count_vm_event(ALLOCSTALL);
 
 	do {
+		vmpressure_prio(sc->priority);
 		sc->nr_scanned = 0;
 		aborted_reclaim = shrink_zones(zonelist, sc);

This is an attempt to implement David Rientjes' idea of mempressure
cgroup.

The main characteristics are the same as what I've tried to add to
vmevent API:

Internally, it uses Mel Gorman's idea of scanned/reclaimed ratio for
pressure index calculation. But we don't expose the index to the
userland. Instead, there are three levels of pressure:

 o low (just reclaiming, e.g. caches are draining);
 o medium (allocation cost becomes high, e.g. swapping);
 o oom (about to oom very soon).

The rationale behind exposing levels and not the raw pressure index is
described here: http://lkml.org/lkml/2012/11/16/675

The API uses standard cgroups eventfd notifications:

 $ gcc Documentation/cgroups/cgroup_event_listener.c -o \
       cgroup_event_listener
 $ cd /sys/fs/cgroup/
 $ mkdir mempressure
 $ mount -t cgroup cgroup ./mempressure -o mempressure
 $ cd mempressure
 $ cgroup_event_listener ./mempressure.level low
 ("low", "medium", "oom" are permitted values.)

Upon hitting the threshold, you should see "/sys/fs/cgroup/mempressure
low: crossed" messages.

To test that it actually works on a per-cgroup basis, I did a small
trick: I moved all kswapd threads into a separate cgroup, and hooked
the listener onto another (non-root) cgroup. The listener no longer
received global reclaim pressure, which is expected.

For a task it is possible to be in both cpusets, memcg and mempressure
cgroups, so by rearranging the tasks it should be possible to watch a
specific pressure.

Note that while this adds the cgroups support, the code is well
separated and eventually we might add a lightweight, non-cgroups API,
i.e. vmevent. But this is another story.

Signed-off-by: Anton Vorontsov <anton.vorontsov@linaro.org>
---
 include/linux/cgroup_subsys.h |   6 +
 include/linux/vmstat.h        |   8 ++
 init/Kconfig                  |   5 +
 mm/Makefile                   |   1 +
 mm/mempressure.c              | 287 ++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c                   |   3 +
 6 files changed, 310 insertions(+)
 create mode 100644 mm/mempressure.c
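
For readers unfamiliar with the cgroups eventfd mechanism the changelog
relies on: Documentation/cgroups/cgroup_event_listener.c essentially
does the following. This is a rough sketch with error handling omitted,
and the paths assume the mount commands shown above:

	#include <stdio.h>
	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <sys/eventfd.h>

	int main(void)
	{
		uint64_t cnt;
		char buf[64];
		int efd = eventfd(0, 0);
		int lfd = open("/sys/fs/cgroup/mempressure/mempressure.level",
			       O_RDONLY);
		int cfd = open("/sys/fs/cgroup/mempressure/cgroup.event_control",
			       O_WRONLY);

		/* "<event_fd> <target_fd> <args>" registers the listener. */
		snprintf(buf, sizeof(buf), "%d %d %s", efd, lfd, "low");
		write(cfd, buf, strlen(buf));

		for (;;) {
			/* Blocks until the threshold is crossed. */
			read(efd, &cnt, sizeof(cnt));
			puts("low: crossed");
		}
	}

The only mempressure-specific parts are the file names and the "low"
argument; the registration protocol itself is the standard cgroup v1
eventfd interface that memcg thresholds already use.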