From patchwork Tue Jan 14 00:29:10 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 233991
Date: Mon, 13 Jan 2020 16:29:10 -0800
From: Andrew Morton
To: thomas.willhalm@intel.com, stable@vger.kernel.org, otto.g.bruggeman@intel.com,
    kirill.shutemov@linux.intel.com, dan.j.williams@intel.com,
    aneesh.kumar@linux.vnet.ibm.com, kirill@shutemov.name, akpm@linux-foundation.org,
    linux-mm@kvack.org, mm-commits@vger.kernel.org, torvalds@linux-foundation.org
Subject: [patch 03/11] mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment
Message-ID: <20200114002910.rzSOD%akpm@linux-foundation.org>
In-Reply-To: <20200113162831.f7d69e11e9e673c40005c9b0@linux-foundation.org>
X-Mailing-List: stable@vger.kernel.org

From: "Kirill A. Shutemov"
Subject: mm/huge_memory.c: thp: fix conflict of above-47bit hint address and PMD alignment

Patch series "Fix two above-47bit hint address vs. THP bugs".

The two get_unmapped_area() implementations have to be fixed to provide
THP-friendly mappings if an above-47bit hint address is specified.

This patch (of 2):

Filesystems use thp_get_unmapped_area() to provide THP-friendly mappings,
for DAX in particular.

Normally, the kernel doesn't create userspace mappings above 47 bits, even
if the machine allows this (such as with 5-level paging on x86-64).
Not all user space is ready to handle wide addresses.  It's known that at
least some JIT compilers use higher bits in pointers to encode their
information.

Userspace can ask for an allocation from the full address space by
specifying a hint address (with or without MAP_FIXED) above 47 bits.  If
the application doesn't need a particular address, but wants to allocate
from the whole address space, it can pass -1 as the hint address.

Unfortunately, this trick breaks thp_get_unmapped_area(): the function
would not try to allocate a PMD-aligned area if *any* hint address was
specified.

Modify the routine to handle it correctly:

 - Try to allocate the space at the specified hint address with length
   padding required for PMD alignment.
 - If that fails, retry without length padding (but with the same hint
   address).
 - If the returned address matches the hint address, return it.
 - Otherwise, align the address as required for THP and return it.

The user-specified hint address is passed down to get_unmapped_area(), so
an above-47bit hint address will be taken into account without breaking
alignment requirements.
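For illustration only (not part of the patch): a minimal userspace sketch
of the opt-in described above.  It assumes an x86-64 kernel with 5-level
paging and a file on an FS-DAX mount so that the mapping goes through
thp_get_unmapped_area(); the path and the 1UL << 52 hint are made-up
example values.

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Placeholder path; a file on an FS-DAX mount exercises this code path. */
	int fd = open("/mnt/dax/file", O_RDWR);
	size_t len = 4UL << 20;			/* two PMD-sized (2MB) units */
	/* Any hint above 47 bits opts this mapping in to the wide address space. */
	void *hint = (void *)(1UL << 52);
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	p = mmap(hint, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Before the fix, supplying any hint made thp_get_unmapped_area()
	 * skip PMD alignment entirely; with the fix, the hint is honoured
	 * and the returned address is still PMD-aligned when possible.
	 */
	printf("mapped at %p (PMD-aligned: %s)\n", p,
	       ((unsigned long)p & ((2UL << 20) - 1)) ? "no" : "yes");

	munmap(p, len);
	close(fd);
	return 0;
}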
Link: http://lkml.kernel.org/r/20191220142548.7118-2-kirill.shutemov@linux.intel.com
Fixes: b569bab78d8d ("x86/mm: Prepare to expose larger address space to userspace")
Signed-off-by: Kirill A. Shutemov
Reported-by: Thomas Willhalm
Tested-by: Dan Williams
Cc: "Aneesh Kumar K . V"
Cc: "Bruggeman, Otto G"
Cc:
Signed-off-by: Andrew Morton
---

 mm/huge_memory.c |   38 ++++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 14 deletions(-)

--- a/mm/huge_memory.c~thp-fix-conflict-of-above-47bit-hint-address-and-pmd-alignment
+++ a/mm/huge_memory.c
@@ -527,13 +527,13 @@ void prep_transhuge_page(struct page *pa
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
-static unsigned long __thp_get_unmapped_area(struct file *filp, unsigned long len,
+static unsigned long __thp_get_unmapped_area(struct file *filp,
+		unsigned long addr, unsigned long len,
 		loff_t off, unsigned long flags, unsigned long size)
 {
-	unsigned long addr;
 	loff_t off_end = off + len;
 	loff_t off_align = round_up(off, size);
-	unsigned long len_pad;
+	unsigned long len_pad, ret;
 
 	if (off_end <= off_align || (off_end - off_align) < size)
 		return 0;
@@ -542,30 +542,40 @@ static unsigned long __thp_get_unmapped_
 	if (len_pad < len || (off + len_pad) < off)
 		return 0;
 
-	addr = current->mm->get_unmapped_area(filp, 0, len_pad,
+	ret = current->mm->get_unmapped_area(filp, addr, len_pad,
 					      off >> PAGE_SHIFT, flags);
-	if (IS_ERR_VALUE(addr))
+
+	/*
+	 * The failure might be due to length padding. The caller will retry
+	 * without the padding.
+	 */
+	if (IS_ERR_VALUE(ret))
 		return 0;
 
-	addr += (off - addr) & (size - 1);
-	return addr;
+	/*
+	 * Do not try to align to THP boundary if allocation at the address
+	 * hint succeeds.
+	 */
+	if (ret == addr)
+		return addr;
+
+	ret += (off - ret) & (size - 1);
+	return ret;
 }
 
 unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags)
 {
+	unsigned long ret;
 	loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
-	if (addr)
-		goto out;
 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
 		goto out;
 
-	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
-	if (addr)
-		return addr;
-
- out:
+	ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PMD_SIZE);
+	if (ret)
+		return ret;
+out:
 	return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags);
 }
 EXPORT_SYMBOL_GPL(thp_get_unmapped_area);

From patchwork Tue Jan 14 00:29:16 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 233990
Date: Mon, 13 Jan 2020 16:29:16 -0800
From: Andrew Morton
To: stable@vger.kernel.org, mhocko@suse.com, hannes@cmpxchg.org, guro@fb.com,
    akpm@linux-foundation.org, linux-mm@kvack.org, mm-commits@vger.kernel.org,
    torvalds@linux-foundation.org
Subject: [patch 05/11] mm: memcg/slab: fix percpu slab vmstats flushing
Message-ID: <20200114002916.-zlNE%akpm@linux-foundation.org>
In-Reply-To: <20200113162831.f7d69e11e9e673c40005c9b0@linux-foundation.org>
X-Mailing-List: stable@vger.kernel.org

From: Roman Gushchin
Subject: mm: memcg/slab: fix percpu slab vmstats flushing
Currently slab percpu vmstats are flushed twice: during memcg offlining
and just before freeing the memcg structure.  Each time the percpu
counters are summed, added to the atomic counterparts and propagated up
the cgroup tree.

The second flushing is required because of how recursive vmstats are
implemented: counters are batched in percpu variables on the local level,
and once a percpu value crosses some predefined threshold, it spills over
to the atomic values on the local and each ancestor level.  Without
flushing, some numbers cached in the percpu variables would be dropped on
the floor each time a cgroup is destroyed, and with uptime the error on
upper levels might become noticeable.

The first flushing aims to make counters on ancestor levels more precise.
Dying cgroups may remain in the dying state for a long time.  After
kmem_cache reparenting, which is performed during offlining, slab counters
of the dying cgroup have no chance to be updated, because any slab
operations are performed on the parent level.  It means that the
inaccuracy caused by percpu batching will not decrease up to the final
destruction of the cgroup.  The original idea was that flushing slab
counters during offlining should minimize the visible inaccuracy of slab
counters on the parent level.

The problem is that percpu counters are not zeroed after the first
flushing, so every cached percpu value is summed twice.  This creates a
small error (up to 32 pages per cpu, but usually less) which accumulates
on the parent cgroup level.  After creating and destroying thousands of
child cgroups, the slab counter on the parent level can be way off the
real value.

For now, let's just stop flushing slab counters on memcg offlining.  It
can't be done correctly without scheduling a work on each cpu: reading and
zeroing a counter during css offlining can race with an asynchronous
update, which doesn't expect the values to be changed underneath.

With this change, slab counters on the parent level become eventually
consistent.  Once all dying children are gone, the values are correct.
And if not, the error is capped by 32 * NR_CPUS pages per dying cgroup.

It's not perfect, as slabs are reparented, so any updates after the
reparenting happen on the parent level.  It means that if a slab page was
allocated, a counter on the child level was bumped, then the page was
reparented and freed, the annihilation of the positive and negative
counter values will not happen until the child cgroup is released.  This
makes slab counters different from the others, and it might prompt us to
implement flushing in a correct form again.  But it's also a question of
performance: scheduling a work on each cpu isn't free, and it's an open
question whether the benefit of having more accurate counters is worth it.

We might also consider flushing all counters on offlining, not only the
slab counters.

So let's fix the main problem now: make the slab counters eventually
consistent, so at least the error won't grow with uptime (or more
precisely, with the number of created and destroyed cgroups).  And think
about the accuracy of the counters separately.
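For illustration only (not part of the patch): a minimal userspace model
of the percpu batching and double-flush problem described above.  The
names, the 32-unit threshold and the cpu count are illustrative stand-ins
for the kernel's percpu vmstat machinery, not kernel code.

#include <stdio.h>

#define NR_CPUS		4
#define THRESHOLD	32	/* batching threshold before spilling */

static long percpu[NR_CPUS];	/* per-cpu cached deltas */
static long atomic_total;	/* stands in for the atomic_long_t counter */

/* Batched update: only spill to the shared counter past the threshold. */
static void mod_stat(int cpu, long delta)
{
	percpu[cpu] += delta;
	if (percpu[cpu] > THRESHOLD || percpu[cpu] < -THRESHOLD) {
		atomic_total += percpu[cpu];
		percpu[cpu] = 0;
	}
}

/* Flush without zeroing: summing the cached values twice double-counts them. */
static void flush(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		atomic_total += percpu[cpu];	/* note: percpu[cpu] is kept */
}

int main(void)
{
	for (int i = 0; i < 100; i++)
		mod_stat(i % NR_CPUS, 1);

	flush();	/* first flush: offlining (pre-patch behaviour) */
	flush();	/* second flush: final freeing of the memcg */
	printf("total = %ld (real value is 100)\n", atomic_total);
	return 0;
}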
Link: http://lkml.kernel.org/r/20191220042728.1045881-1-guro@fb.com
Fixes: bee07b33db78 ("mm: memcontrol: flush percpu slab vmstats on kmem offlining")
Signed-off-by: Roman Gushchin
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc:
Signed-off-by: Andrew Morton
---

 include/linux/mmzone.h |    5 ++---
 mm/memcontrol.c        |   37 +++++++++----------------------------
 2 files changed, 11 insertions(+), 31 deletions(-)

--- a/include/linux/mmzone.h~mm-memcg-slab-fix-percpu-slab-vmstats-flushing
+++ a/include/linux/mmzone.h
@@ -215,9 +215,8 @@ enum node_stat_item {
 	NR_INACTIVE_FILE,	/*  "     "     "   "       "         */
 	NR_ACTIVE_FILE,		/*  "     "     "   "       "         */
 	NR_UNEVICTABLE,		/*  "     "     "   "       "         */
-	NR_SLAB_RECLAIMABLE,	/* Please do not reorder this item */
-	NR_SLAB_UNRECLAIMABLE,	/* and this one without looking at
-				 * memcg_flush_percpu_vmstats() first. */
+	NR_SLAB_RECLAIMABLE,
+	NR_SLAB_UNRECLAIMABLE,
 	NR_ISOLATED_ANON,	/* Temporary isolated pages from anon lru */
 	NR_ISOLATED_FILE,	/* Temporary isolated pages from file lru */
 	WORKINGSET_NODES,
--- a/mm/memcontrol.c~mm-memcg-slab-fix-percpu-slab-vmstats-flushing
+++ a/mm/memcontrol.c
@@ -3287,49 +3287,34 @@ static u64 mem_cgroup_read_u64(struct cg
 	}
 }
 
-static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg, bool slab_only)
+static void memcg_flush_percpu_vmstats(struct mem_cgroup *memcg)
 {
-	unsigned long stat[MEMCG_NR_STAT];
+	unsigned long stat[MEMCG_NR_STAT] = {0};
 	struct mem_cgroup *mi;
 	int node, cpu, i;
-	int min_idx, max_idx;
-
-	if (slab_only) {
-		min_idx = NR_SLAB_RECLAIMABLE;
-		max_idx = NR_SLAB_UNRECLAIMABLE;
-	} else {
-		min_idx = 0;
-		max_idx = MEMCG_NR_STAT;
-	}
-
-	for (i = min_idx; i < max_idx; i++)
-		stat[i] = 0;
 
 	for_each_online_cpu(cpu)
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < MEMCG_NR_STAT; i++)
 			stat[i] += per_cpu(memcg->vmstats_percpu->stat[i], cpu);
 
 	for (mi = memcg; mi; mi = parent_mem_cgroup(mi))
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < MEMCG_NR_STAT; i++)
 			atomic_long_add(stat[i], &mi->vmstats[i]);
 
-	if (!slab_only)
-		max_idx = NR_VM_NODE_STAT_ITEMS;
-
 	for_each_node(node) {
 		struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];
 		struct mem_cgroup_per_node *pi;
 
-		for (i = min_idx; i < max_idx; i++)
+		for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 			stat[i] = 0;
 
 		for_each_online_cpu(cpu)
-			for (i = min_idx; i < max_idx; i++)
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 				stat[i] += per_cpu(
 					pn->lruvec_stat_cpu->count[i], cpu);
 
 		for (pi = pn; pi; pi = parent_nodeinfo(pi, node))
-			for (i = min_idx; i < max_idx; i++)
+			for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
 				atomic_long_add(stat[i], &pi->lruvec_stat[i]);
 	}
 }
@@ -3403,13 +3388,9 @@ static void memcg_offline_kmem(struct me
 	parent = root_mem_cgroup;
 
 	/*
-	 * Deactivate and reparent kmem_caches. Then flush percpu
-	 * slab statistics to have precise values at the parent and
-	 * all ancestor levels. It's required to keep slab stats
-	 * accurate after the reparenting of kmem_caches.
+	 * Deactivate and reparent kmem_caches.
 	 */
 	memcg_deactivate_kmem_caches(memcg, parent);
-	memcg_flush_percpu_vmstats(memcg, true);
 
 	kmemcg_id = memcg->kmemcg_id;
 	BUG_ON(kmemcg_id < 0);
@@ -4913,7 +4894,7 @@ static void mem_cgroup_free(struct mem_c
 	 * Flush percpu vmstats and vmevents to guarantee the value correctness
 	 * on parent's and all ancestor levels.
 	 */
-	memcg_flush_percpu_vmstats(memcg, false);
+	memcg_flush_percpu_vmstats(memcg);
 	memcg_flush_percpu_vmevents(memcg);
 	__mem_cgroup_free(memcg);
 }

From patchwork Tue Jan 14 00:29:32 2020
X-Patchwork-Submitter: Andrew Morton
X-Patchwork-Id: 233989
Date: Mon, 13 Jan 2020 16:29:32 -0800
From: Andrew Morton
To: stable@vger.kernel.org, shakeelb@google.com, rientjes@google.com,
    penberg@kernel.org, mhocko@kernel.org, lixc17@lenovo.com, jroedel@suse.de,
    iamjoonsoo.kim@lge.com, hannes@cmpxchg.org, cl@linux.com, ahuang12@lenovo.com,
    akpm@linux-foundation.org, linux-mm@kvack.org, mm-commits@vger.kernel.org,
    torvalds@linux-foundation.org
Subject: [patch 10/11] mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid
Message-ID: <20200114002932.ZMvXk%akpm@linux-foundation.org>
In-Reply-To: <20200113162831.f7d69e11e9e673c40005c9b0@linux-foundation.org>
X-Mailing-List: stable@vger.kernel.org

From: Adrian Huang
Subject: mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid

When booting with amd_iommu=off, the following WARNING message appears:

  AMD-Vi: AMD IOMMU disabled on kernel command-line
  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
  Modules linked in:
  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
  Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
  RIP: 0010:flush_workqueue+0x42e/0x450
  Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
  RSP: 0000:ffffffff96203d80 EFLAGS: 00010246
  RAX: ffffffff96203dc8 RBX: 0000000000000000 RCX: 0000000000000000
  RDX: ffffffff96a63120 RSI: ffffffff95efcba2 RDI: ffffffff96203dc0
  RBP: ffffffff96203e08 R08: 0000000000000000 R09: ffffffff962a1828
  R10: 00000000f0000080 R11: dead000000000100 R12: ffff8d8a87c0a770
  R13: dead000000000100 R14: 0000000000000456 R15: ffffffff96203da0
  FS:  0000000000000000(0000) GS:ffff8d8dbd000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffff8d91cfbff000 CR3: 000000078920a000 CR4: 00000000000406b0
  Call Trace:
   ? wait_for_completion+0x51/0x180
   kmem_cache_destroy+0x69/0x260
   iommu_go_to_state+0x40c/0x5ab
   amd_iommu_prepare+0x16/0x2a
   irq_remapping_prepare+0x36/0x5f
   enable_IR_x2apic+0x21/0x172
   default_setup_apic_routing+0x12/0x6f
   apic_intr_mode_init+0x1a1/0x1f1
   x86_late_time_init+0x17/0x1c
   start_kernel+0x480/0x53f
   secondary_startup_64+0xb6/0xc0
  ---[ end trace 30894107c3749449 ]---
  x2apic: IRQ remapping doesn't support X2APIC mode
  x2apic disabled

The warning is caused by the call to kmem_cache_destroy() in
free_iommu_resources().  Here is the call path:

  free_iommu_resources
    kmem_cache_destroy
      flush_memcg_workqueue
        flush_workqueue

The root cause is that the IOMMU subsystem runs before the workqueue
subsystem, so the variable 'wq_online' is still 'false'.  This makes the
'if (WARN_ON(!wq_online))' check in flush_workqueue() trigger.

Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
point, it is unnecessary to call flush_memcg_workqueue().  Skipping the
call prevents the WARNING triggered by flush_workqueue().
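For illustration only (not part of the patch): a small userspace model of
the guard added below, showing why a teardown path that can run before the
subsystem creating the queue has been initialized must check the pointer
first.  All names here are illustrative; the real fix is the one-line
check in the diff that follows.

#include <stdio.h>
#include <stdlib.h>

struct workqueue {
	const char *name;
};

static struct workqueue *memcg_wq;	/* NULL until late init, like memcg_kmem_cache_wq */

static void flush_queue(struct workqueue *wq)
{
	/* Stand-in for flush_workqueue(); would WARN if called too early. */
	printf("flushing %s\n", wq->name);
}

static void cache_destroy(void)
{
	/* The fix: only flush if the workqueue actually exists yet. */
	if (memcg_wq)
		flush_queue(memcg_wq);
	printf("cache destroyed\n");
}

static void late_init(void)
{
	memcg_wq = malloc(sizeof(*memcg_wq));
	memcg_wq->name = "memcg_kmem_cache";
}

int main(void)
{
	cache_destroy();	/* early caller (the amd_iommu=off path): no warning */
	late_init();
	cache_destroy();	/* normal case: the queue exists and is flushed */
	free(memcg_wq);
	return 0;
}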
Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
Fixes: 92ee383f6daab ("mm: fix race between kmem_cache destroy, create and deactivate")
Signed-off-by: Adrian Huang
Reported-by: Xiaochun Lee
Reviewed-by: Shakeel Butt
Cc: Joerg Roedel
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Johannes Weiner
Cc:
Signed-off-by: Andrew Morton
---

 mm/slab_common.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/slab_common.c~mm-memcg-slab-call-flush_memcg_workqueue-only-if-memcg-workqueue-is-valid
+++ a/mm/slab_common.c
@@ -903,7 +903,8 @@ static void flush_memcg_workqueue(struct
 	 * deactivates the memcg kmem_caches through workqueue. Make sure all
 	 * previous workitems on workqueue are processed.
 	 */
-	flush_workqueue(memcg_kmem_cache_wq);
+	if (likely(memcg_kmem_cache_wq))
+		flush_workqueue(memcg_kmem_cache_wq);
 
 	/*
 	 * If we're racing with children kmem_cache deactivation, it might