[RFC,v1,00/13] zswap IAA compress batching

Message ID 20241018064101.336232-1-kanchana.p.sridhar@intel.com

Kanchana P Sridhar Oct. 18, 2024, 6:40 a.m. UTC
IAA Compression Batching:
=========================

This RFC patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel compression of pages in a folio, and for batched reclaim
of hybrid any-order batches of folios in shrink_folio_list().

The patch-series is organized as follows:

 1) iaa_crypto driver enablers for batching: Relevant patches are tagged
    with "crypto:" in the subject:

    a) async poll crypto_acomp interface without interrupts.
    b) crypto testmgr acomp poll support.
    c) Modifying the default sync_mode to "async" and disabling
       verify_compress by default, so that users can easily run IAA for
       comparison with software compressors.
    d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
       devices.
    e) Addition of a "global_wq" per IAA, which can be used as a global
       resource for the socket. If the user configures 2WQs per IAA device,
       the driver will distribute compress jobs from all cores on the
       socket to the "global_wqs" of all the IAA devices on that socket, in
       a round-robin manner. This can be used to improve compression
       throughput for workloads that see a lot of swapout activity.

 2) Migrating zswap to use async poll in zswap_compress()/decompress().
 3) A centralized batch compression API that can be used by swap modules.
 4) IAA compress batching within large folio zswap stores.
 5) IAA compress batching of any-order hybrid folios in
    shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
    parameter can be used to configure the number of folios in [1, 32] to
    be reclaimed using compress batching.

IAA compress batching can be enabled only on platforms that have IAA, by
setting this config variable:

 CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
 
The performance testing data with usemem 30 instances shows throughput
gains of up to 40%, elapsed time reduction of up to 22% and sys time
reduction of up to 30% with IAA compression batching.

Our internal validation of IAA compress/decompress batching in highly
contended Sapphire Rapids server setups with workloads running on 72 cores
for ~25 minutes under stringent memory limit constraints has shown up to
50% reduction in sys time and 3.5% reduction in workload run time as
compared to software compressors.


System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 10-16-2024,
commit 817952b8be34, without and with this patch-series.
Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
partition swap. Core frequency was fixed at 2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:

usemem --init-time -w -O -s 10 -n 30 10g

Other kernel configuration parameters:

    zswap compressor : deflate-iaa
    zswap allocator   : zsmalloc
    vm.page-cluster   : 2,4

IAA "compression verification" is disabled and the async poll acomp
interface is used in the iaa_crypto driver (the defaults with this
series).


Performance testing (usemem30):
===============================

 4K folios: deflate-iaa:
 =======================

 -------------------------------------------------------------------------------
                mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
                                         batching of folios   batching of folios
 -------------------------------------------------------------------------------
 zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
 vm.compress-batchsize             n/a                    1                   32
 vm.page-cluster                     2                    2                    2
 -------------------------------------------------------------------------------
 Total throughput            4,470,466            5,770,824            6,363,045
           (KB/s)
 Average throughput            149,015              192,360              212,101
           (KB/s)
 elapsed time                   119.24               100.96                92.99
        (sec)
 sys time (sec)               2,819.29             2,168.08             1,970.79

 -------------------------------------------------------------------------------
 memcg_high                    668,185              646,357              613,421
 memcg_swap_fail                     0                    0                    0
 zswpout                    62,991,796           58,275,673           53,070,201
 zswpin                            431                  415                  396
 pswpout                             0                    0                    0
 pswpin                              0                    0                    0
 thp_swpout                          0                    0                    0
 thp_swpout_fallback                 0                    0                    0
 pgmajfault                      3,137                3,085                3,440
 swap_ra                            99                  100                   95
 swap_ra_hit                        42                   44                   45
 -------------------------------------------------------------------------------
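
As a quick cross-check of the headline percentages against the 4K-folio
table above, the relative changes between the mm-unstable column and the
vm.compress-batchsize=32 column can be computed as sketched below. This is
only an illustration; the helper names pct_gain() and pct_reduction() are
made up for this example and are not part of the series.

```c
#include <assert.h>

/* Relative gain of `after` over `before`, in percent (positive = higher). */
static double pct_gain(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (after - before) / before;
}

/* Relative reduction from `before` to `after`, in percent (positive = lower). */
static double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return 100.0 * (before - after) / before;
}
```

With the table values, pct_gain(4470466, 6363045) comes to about 42.3,
pct_reduction(119.24, 92.99) to about 22.0, and
pct_reduction(2819.29, 1970.79) to about 30.1, which is in line with the
"up to 40% / 22% / 30%" summary.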


 16k/32k/64k folios: deflate-iaa:
 ================================
 All three large folio sizes (16k, 32k and 64k) were set to "always".

 -------------------------------------------------------------------------------
                mm-unstable-  zswap_store()      + shrink_folio_list()
                  10-16-2024    batching of         batching of folios
                                   pages in
                               large folios
 -------------------------------------------------------------------------------
 zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
 vm.compress-            n/a             n/a         4          8             16
 batchsize
 vm.page-                  2               2         2          2              2
  cluster
 -------------------------------------------------------------------------------
 Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
           (KB/s)             
 Avg throughput       239,406     281,633      286,157      290,988      292,531
         (KB/s)               
 elapsed time           85.04       77.84        77.03        75.18        74.98
         (sec)                
 sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97

 -------------------------------------------------------------------------------
 memcg_high           648,125     694,188      696,004      699,728      724,887
 memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
 zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
 zswpin                   421         406          422          400          437
 pswpout                    0           0            0            0            0
 pswpin                     0           0            0            0            0
 thp_swpout                 0           0            0            0            0
 thp_swpout_fallback        0           0            0            0            0
 16kB-mthp_swpout_          0           0            0            0            0
          fallback
 32kB-mthp_swpout_          0           0            0            0            0
          fallback
 64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
          fallback
 pgmajfault             3,102       3,126        3,473        3,454        3,134
 swap_ra                  107         144          109          124          181
 swap_ra_hit               51          88           45           66          107
 ZSWPOUT-16kB               2           3            4            4            3
 ZSWPOUT-32kB               0           2            1            1            0
 ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
 SWPOUT-16kB                0           0            0            0            0
 SWPOUT-32kB                0           0            0            0            0
 SWPOUT-64kB                0           0            0            0            0
 -------------------------------------------------------------------------------


 2M folios: deflate-iaa:
 =======================

 -------------------------------------------------------------------------------
                   mm-unstable-10-16-2024    zswap_store() batching of pages
                                                      in pmd-mappable folios
 -------------------------------------------------------------------------------
 zswap compressor             deflate-iaa                deflate-iaa
 vm.compress-batchsize                n/a                        n/a
 vm.page-cluster                        2                          2
 -------------------------------------------------------------------------------
 Total throughput               7,444,592                 8,916,349     
           (KB/s)                                                  
 Average throughput               248,153                   297,211     
           (KB/s)                                                  
 elapsed time                       86.29                     73.44     
        (sec)                                                      
 sys time (sec)                  1,833.21                  1,418.58     
                                                                   
 -------------------------------------------------------------------------------
 memcg_high                        81,786                    89,905     
 memcg_swap_fail                       82                       395     
 zswpout                       58,874,092                57,721,884     
 zswpin                               422                       458     
 pswpout                                0                         0     
 pswpin                                 0                         0     
 thp_swpout                             0                         0     
 thp_swpout_fallback                   82                       394     
 pgmajfault                        14,864                    21,544     
 swap_ra                           34,953                    53,751     
 swap_ra_hit                       34,895                    53,660     
 ZSWPOUT-2048kB                   114,815                   112,269     
 SWPOUT-2048kB                          0                         0     
 -------------------------------------------------------------------------------

Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
are enabled for usemem30, we cannot expect much improvement from reclaim
batching.
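
One way to arrive at a figure of this size from the 2M table is sketched
below. It rests on two assumptions that are not stated in the text: that
zswpout counts 4K pages, and that each ZSWPOUT-2048kB event accounts for
512 of those pages. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one pmd-mappable (2M) folio spans 512 4K pages. */
#define PAGES_PER_2M_FOLIO 512ULL

/*
 * Share (in percent) of zswpout pages that did not come from 2M folios,
 * assuming zswpout counts 4K pages and the 2M counter counts whole folios.
 */
static double non_2m_share_pct(uint64_t zswpout_pages, uint64_t zswpout_2m_folios)
{
	uint64_t pages_from_2m = zswpout_2m_folios * PAGES_PER_2M_FOLIO;

	assert(pages_from_2m <= zswpout_pages);
	return 100.0 * (double)(zswpout_pages - pages_from_2m) /
	       (double)zswpout_pages;
}
```

Under those assumptions, non_2m_share_pct(57721884, 112269) for the
batching column gives roughly 0.42, and non_2m_share_pct(58874092, 114815)
for the mm-unstable column gives roughly 0.15.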


Performance testing (Kernel compilation):
=========================================

As mentioned earlier, for workloads that see a lot of swapout activity, we
can benefit from configuring 2 WQs per IAA device, with compress jobs from
all same-socket cores being distributed to the wq.1 of all IAAs on the
socket, with the "global_wq" developed in this patch-series.

Although this data includes IAA decompress batching, which will be
submitted as a separate RFC patch-series, I am listing it here to quantify
the benefit of distributing compress jobs among all IAAs. The kernel
compilation test with "allmodconfig" is able to quantify this well:


 4K folios: deflate-iaa: kernel compilation to quantify crypto patches
 =====================================================================


 ------------------------------------------------------------------------------
                   IAA shrink_folio_list() compress batching and
                       swapin_readahead() decompress batching

                                      1WQ      2WQ (distribute compress jobs)

                        1 local WQ (wq.0)    1 local WQ (wq.0) +
                                  per IAA    1 global WQ (wq.1) per IAA
                        
 ------------------------------------------------------------------------------
 zswap compressor             deflate-iaa         deflate-iaa
 vm.compress-batchsize                 32                  32
 vm.page-cluster                        4                   4
 ------------------------------------------------------------------------------
 real_sec                          746.77              745.42  
 user_sec                       15,732.66           15,738.85
 sys_sec                         5,384.14            5,247.86
 Max_Res_Set_Size_KB            1,874,432           1,872,640

 ------------------------------------------------------------------------------
 zswpout                      101,648,460         104,882,982
 zswpin                        27,418,319          29,428,515
 pswpout                              213                  22
 pswpin                               207                   6
 pgmajfault                    21,896,616          23,629,768
 swap_ra                        6,054,409           6,385,080
 swap_ra_hit                    3,791,628           3,985,141
 ------------------------------------------------------------------------------

The iaa_crypto wq stats will show almost the same number of compress calls
for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
We see a sys time reduction of 2.5% by distributing compress jobs among all
IAA devices on the socket.
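
The round-robin distribution described above can be pictured with the
following single-threaded sketch. It is not the iaa_crypto code: the
struct, the function name and the plain (non-atomic) counter are all
hypothetical, and 4 devices per socket is simply the test system's
configuration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IAA_PER_SOCKET 4	/* IAA devices per socket on the test system */

/* Hypothetical per-socket view of the "global" WQs (wq.1 of each device). */
struct socket_wq_table {
	unsigned long compress_calls[NR_IAA_PER_SOCKET]; /* per-device wq.1 count */
	size_t next;					 /* next device to use */
};

/* Pick the wq.1 of the next IAA device in round-robin order. */
static size_t pick_global_wq(struct socket_wq_table *t)
{
	size_t dev;

	assert(t != NULL);
	dev = t->next;
	t->next = (t->next + 1) % NR_IAA_PER_SOCKET;
	t->compress_calls[dev]++;
	return dev;
}
```

Calling pick_global_wq() for every compress job from any core on the
socket leaves the per-device counters within one of each other, which is
the "almost the same number of compress calls" pattern mentioned above. A
real driver would need a per-CPU or atomic cursor instead of the plain
field used here.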

I would greatly appreciate code review comments for the iaa_crypto driver
and mm patches included in this series!

Thanks,
Kanchana



Kanchana P Sridhar (13):
  crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
  crypto: iaa - Add support for irq-less crypto async interface
  crypto: testmgr - Add crypto testmgr acomp poll support.
  mm: zswap: zswap_compress()/decompress() can submit, then poll an
    acomp_req.
  crypto: iaa - Make async mode the default.
  crypto: iaa - Disable iaa_verify_compress by default.
  crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
    IAAs.
  crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
    node.
  mm: zswap: Config variable to enable compress batching in
    zswap_store().
  mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
    platform has IAA.
  mm: swap: Add IAA batch compression API
    swap_crypto_acomp_compress_batch().
  mm: zswap: Compress batching with Intel IAA in zswap_store() of large
    folios.
  mm: vmscan, swap, zswap: Compress batching of folios in
    shrink_folio_list().

 crypto/acompress.c                         |   1 +
 crypto/testmgr.c                           |  70 +-
 drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
 include/crypto/acompress.h                 |  18 +
 include/crypto/internal/acompress.h        |   1 +
 include/linux/fs.h                         |   2 +
 include/linux/mm.h                         |   8 +
 include/linux/writeback.h                  |   5 +
 include/linux/zswap.h                      | 106 +++
 kernel/sysctl.c                            |   9 +
 mm/Kconfig                                 |  12 +
 mm/page_io.c                               | 152 +++-
 mm/swap.c                                  |  15 +
 mm/swap.h                                  |  96 +++
 mm/swap_state.c                            | 115 +++
 mm/vmscan.c                                | 154 +++-
 mm/zswap.c                                 | 771 +++++++++++++++++++--
 17 files changed, 1870 insertions(+), 132 deletions(-)


base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550

Comments

Kanchana P Sridhar Oct. 18, 2024, 6:41 a.m. UTC | #1
This patch enables the use of Intel IAA hardware compression acceleration
to reclaim a batch of folios in shrink_folio_list(). This results in
reclaim throughput and workload/sys performance improvements.

The earlier patches on compress batching deployed multiple IAA compress
engines for compressing up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a
large folio that is being stored in zswap_store(). This patch extends the
efficiency improvements demonstrated with IAA "batching within folios" to
vmscan "batching of folios". The latter also uses batching within folios,
through the extensible __zswap_store_batch_core() procedure added earlier,
which accepts an array of folios.

A plug mechanism is introduced in swap_writepage() to aggregate a batch of
up to vm.compress-batchsize ([1, 32]) folios before processing the plug.
The plug will be processed if any of the following is true:

 1) The plug has vm.compress-batchsize folios. If the system has Intel IAA,
    "sysctl vm.compress-batchsize" can be configured to be in [1, 32]. On
    systems without IAA, or if CONFIG_ZSWAP_STORE_BATCHING_ENABLED is not
    set, "sysctl vm.compress-batchsize" can only be 1.
 2) A folio whose swap type or folio_nid differs from that of the folios
    currently in the plug needs to be added to the plug.
 3) A pmd-mappable folio needs to be swapped out. In this case, the
    existing folios in the plug are processed first. The pmd-mappable
    folio is then swapped out in a batch of its own (zswap_store() will
    batch compress SWAP_CRYPTO_SUB_BATCH_SIZE pages in the pmd-mappable
    folio if the system has IAA).
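
The three conditions above can be read as a single predicate evaluated
when the next folio arrives. The sketch below is a simplified user-space
model, not the code in this patch: struct folio_desc, struct batch_plug
and plug_needs_processing() are hypothetical names, and only the [1, 32]
range of vm.compress-batchsize is taken from the description.

```c
#include <assert.h>
#include <stdbool.h>

/* Upper bound of vm.compress-batchsize, per the description above. */
#define COMPRESS_BATCHSIZE_MAX 32u

/* Hypothetical, simplified stand-ins for a folio and the plug. */
struct folio_desc {
	int swap_type;		/* swap type of the folio's swap entry */
	int nid;		/* NUMA node id (folio_nid) */
	bool pmd_mappable;	/* true for a pmd-mappable folio */
};

struct batch_plug {
	unsigned int nr;	/* folios currently held in the plug */
	unsigned int batchsize;	/* vm.compress-batchsize, in [1, 32] */
	int swap_type;		/* swap type shared by the plugged folios */
	int nid;		/* node id shared by the plugged folios */
};

/* Returns true if the plug must be processed before `next` is handled. */
static bool plug_needs_processing(const struct batch_plug *plug,
				  const struct folio_desc *next)
{
	assert(plug->batchsize >= 1 &&
	       plug->batchsize <= COMPRESS_BATCHSIZE_MAX);

	if (plug->nr == 0)
		return false;		/* nothing aggregated yet */
	if (plug->nr >= plug->batchsize)
		return true;		/* 1) the plug is full */
	if (next->swap_type != plug->swap_type || next->nid != plug->nid)
		return true;		/* 2) swap type or node mismatch */
	return next->pmd_mappable;	/* 3) pmd-mappable folio goes alone */
}
```

In this model, a plug holding three folios with batchsize 32 keeps
aggregating same-type, same-node small folios, and is processed as soon
as a folio from another node or a pmd-mappable folio shows up.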
Herbert Xu Oct. 18, 2024, 7:55 a.m. UTC | #2
On Thu, Oct 17, 2024 at 11:40:49PM -0700, Kanchana P Sridhar wrote:
> For async compress/decompress, provide a way for the caller to poll
> for compress/decompress completion, rather than wait for an interrupt
> to signal completion.
> 
> Callers can submit a compress/decompress using crypto_acomp_compress
> and decompress and rather than wait on a completion, call
> crypto_acomp_poll() to check for completion.
> 
> This is useful for hardware accelerators where the overhead of
> interrupts and waiting for completions is too expensive.  Typically
> the compress/decompress hw operations complete very quickly and in the
> vast majority of cases, adding the overhead of interrupt handling and
> waiting for completions simply adds unnecessary delays and cancels the
> gains of using the hw acceleration.
> 
> Signed-off-by: Tom Zanussi <tom.zanussi@linux.intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  crypto/acompress.c                  |  1 +
>  include/crypto/acompress.h          | 18 ++++++++++++++++++
>  include/crypto/internal/acompress.h |  1 +
>  3 files changed, 20 insertions(+)

How about just adding a request flag that tells the driver to
make the request synchronous if possible?

Something like

#define CRYPTO_ACOMP_REQ_POLL	0x00000001

Cheers,
Herbert Xu Oct. 19, 2024, 12:19 a.m. UTC | #3
On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
>
> Thanks for your code review comments. Are you referring to how the
> async/poll interface is enabled at the level of say zswap (by setting a
> flag in the acomp_req), followed by the iaa_crypto driver testing for
> the flag and submitting the request and returning -EINPROGRESS.
> Wouldn't we still need a separate API to do the polling?

Correct me if I'm wrong, but I think what you want to do is this:

	crypto_acomp_compress(req)
	crypto_acomp_poll(req)

So instead of adding this interface, where the poll essentially
turns the request synchronous, just move this logic into the driver,
based on a flag bit in req.
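
A user-space mock of that idea might look like the sketch below. Only the
flag name CRYPTO_ACOMP_REQ_POLL and the -EINPROGRESS return value come
from this thread; struct mock_acomp_req, mock_hw_poll() and
mock_driver_compress() are hypothetical stand-ins, not kernel or
iaa_crypto interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define CRYPTO_ACOMP_REQ_POLL 0x00000001u	/* flag name from the suggestion */

/* Hypothetical stand-in for an acomp request; not the kernel struct. */
struct mock_acomp_req {
	unsigned int flags;
	int hw_cycles_left;	/* simulated: polls needed until hw finishes */
	bool done;
};

/* Simulated completion check: one call models one look at the hw status. */
static bool mock_hw_poll(struct mock_acomp_req *req)
{
	if (req->hw_cycles_left > 0)
		req->hw_cycles_left--;
	req->done = (req->hw_cycles_left == 0);
	return req->done;
}

/* Mock driver entry: submit, then poll internally only if the flag is set. */
static int mock_driver_compress(struct mock_acomp_req *req)
{
	assert(req->hw_cycles_left >= 0);
	req->done = false;		/* "submit" to the simulated hardware */

	if (!(req->flags & CRYPTO_ACOMP_REQ_POLL))
		return -EINPROGRESS;	/* async path: completion comes later */

	while (!mock_hw_poll(req))
		;			/* kernel code would relax the CPU here */
	return 0;			/* completed synchronously */
}
```

With the flag set, the caller sees a plain synchronous return of 0 and
needs no separate poll call; without it, the request behaves as a normal
asynchronous submission.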

Cheers,
Kanchana P Sridhar Oct. 19, 2024, 7:10 p.m. UTC | #4
> On Fri, Oct 18, 2024 at 11:01:10PM +0000, Sridhar, Kanchana P wrote:
> >
> > Thanks for your code review comments. Are you referring to how the
> > async/poll interface is enabled at the level of say zswap (by setting a
> > flag in the acomp_req), followed by the iaa_crypto driver testing for
> > the flag and submitting the request and returning -EINPROGRESS.
> > Wouldn't we still need a separate API to do the polling?
> 
> Correct me if I'm wrong, but I think what you want to do is this:
> 
> 	crypto_acomp_compress(req)
> 	crypto_acomp_poll(req)
> 
> So instead of adding this interface, where the poll essentially
> turns the request synchronous, just move this logic into the driver,
> based on a flag bit in req.

Thanks Herbert, for this suggestion. I understand this better now,
and will work with Kristen for addressing this in v2.

Thanks,
Kanchana

Yosry Ahmed Oct. 23, 2024, 12:56 a.m. UTC | #5
On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
>
> IAA Compression Batching:
> =========================
>
> This RFC patch-series introduces the use of the Intel Analytics Accelerator
> (IAA) for parallel compression of pages in a folio, and for batched reclaim
> of hybrid any-order batches of folios in shrink_folio_list().
>
> The patch-series is organized as follows:
>
>  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
>     with "crypto:" in the subject:
>
>     a) async poll crypto_acomp interface without interrupts.
>     b) crypto testmgr acomp poll support.
>     c) Modifying the default sync_mode to "async" and disabling
>        verify_compress by default, to facilitate users to run IAA easily for
>        comparison with software compressors.
>     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
>        devices.
>     e) Addition of a "global_wq" per IAA, which can be used as a global
>        resource for the socket. If the user configures 2WQs per IAA device,
>        the driver will distribute compress jobs from all cores on the
>        socket to the "global_wqs" of all the IAA devices on that socket, in
>        a round-robin manner. This can be used to improve compression
>        throughput for workloads that see a lot of swapout activity.
>
>  2) Migrating zswap to use async poll in zswap_compress()/decompress().
>  3) A centralized batch compression API that can be used by swap modules.
>  4) IAA compress batching within large folio zswap stores.
>  5) IAA compress batching of any-order hybrid folios in
>     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
>     parameter can be used to configure the number of folios in [1, 32] to
>     be reclaimed using compress batching.

I am still digesting this series but I have some high level questions
that I left on some patches. My intuition though is that we should
drop (5) from the initial proposal as it's most controversial.
Batching reclaim of unrelated folios through zswap *might* make sense,
but it needs a broader conversation and it needs justification on its
own merit, without the rest of the series.

>
> IAA compress batching can be enabled only on platforms that have IAA, by
> setting this config variable:
>
>  CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
>
> The performance testing data with usemem 30 instances shows throughput
> gains of up to 40%, elapsed time reduction of up to 22% and sys time
> reduction of up to 30% with IAA compression batching.
>
> Our internal validation of IAA compress/decompress batching in highly
> contended Sapphire Rapids server setups with workloads running on 72 cores
> for ~25 minutes under stringent memory limit constraints has shown up to
> 50% reduction in sys time and 3.5% reduction in workload run time as
> compared to software compressors.
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> commit 817952b8be34, without and with this patch-series.
> Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
> per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
> partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
>     zswap compressor : deflate-iaa
>     zswap allocator   : zsmalloc
>     vm.page-cluster   : 2,4
>
> IAA "compression verification" is disabled and the async poll acomp
> interface is used in the iaa_crypto driver (the defaults with this
> series).
>
>
> Performance testing (usemem30):
> ===============================
>
>  4K folios: deflate-iaa:
>  =======================
>
>  -------------------------------------------------------------------------------
>                 mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
>                                          batching of folios   batching of folios
>  -------------------------------------------------------------------------------
>  zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
>  vm.compress-batchsize             n/a                    1                   32
>  vm.page-cluster                     2                    2                    2
>  -------------------------------------------------------------------------------
>  Total throughput            4,470,466            5,770,824            6,363,045
>            (KB/s)
>  Average throughput            149,015              192,360              212,101
>            (KB/s)
>  elapsed time                   119.24               100.96                92.99
>         (sec)
>  sys time (sec)               2,819.29             2,168.08             1,970.79
>
>  -------------------------------------------------------------------------------
>  memcg_high                    668,185              646,357              613,421
>  memcg_swap_fail                     0                    0                    0
>  zswpout                    62,991,796           58,275,673           53,070,201
>  zswpin                            431                  415                  396
>  pswpout                             0                    0                    0
>  pswpin                              0                    0                    0
>  thp_swpout                          0                    0                    0
>  thp_swpout_fallback                 0                    0                    0
>  pgmajfault                      3,137                3,085                3,440
>  swap_ra                            99                  100                   95
>  swap_ra_hit                        42                   44                   45
>  -------------------------------------------------------------------------------
>
>
>  16k/32k/64k folios: deflate-iaa:
>  ================================
>  All three large folio sizes (16k, 32k and 64k) were set to "always".
>
>  -------------------------------------------------------------------------------
>                 mm-unstable-  zswap_store()      + shrink_folio_list()
>                   10-16-2024    batching of         batching of folios
>                                    pages in
>                                large folios
>  -------------------------------------------------------------------------------
>  zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
>  vm.compress-            n/a             n/a         4          8             16
>  batchsize
>  vm.page-                  2               2         2          2              2
>   cluster
>  -------------------------------------------------------------------------------
>  Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
>            (KB/s)
>  Avg throughput       239,406     281,633      286,157      290,988      292,531
>          (KB/s)
>  elapsed time           85.04       77.84        77.03        75.18        74.98
>          (sec)
>  sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97
>
>  -------------------------------------------------------------------------------
>  memcg_high           648,125     694,188      696,004      699,728      724,887
>  memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
>  zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
>  zswpin                   421         406          422          400          437
>  pswpout                    0           0            0            0            0
>  pswpin                     0           0            0            0            0
>  thp_swpout                 0           0            0            0            0
>  thp_swpout_fallback        0           0            0            0            0
>  16kB-mthp_swpout_          0           0            0            0            0
>           fallback
>  32kB-mthp_swpout_          0           0            0            0            0
>           fallback
>  64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
>           fallback
>  pgmajfault             3,102       3,126        3,473        3,454        3,134
>  swap_ra                  107         144          109          124          181
>  swap_ra_hit               51          88           45           66          107
>  ZSWPOUT-16kB               2           3            4            4            3
>  ZSWPOUT-32kB               0           2            1            1            0
>  ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
>  SWPOUT-16kB                0           0            0            0            0
>  SWPOUT-32kB                0           0            0            0            0
>  SWPOUT-64kB                0           0            0            0            0
>  -------------------------------------------------------------------------------
>
>
>  2M folios: deflate-iaa:
>  =======================
>
>  -------------------------------------------------------------------------------
>                    mm-unstable-10-16-2024    zswap_store() batching of pages
>                                                       in pmd-mappable folios
>  -------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa                deflate-iaa
>  vm.compress-batchsize                n/a                        n/a
>  vm.page-cluster                        2                          2
>  -------------------------------------------------------------------------------
>  Total throughput               7,444,592                 8,916,349
>            (KB/s)
>  Average throughput               248,153                   297,211
>            (KB/s)
>  elapsed time                       86.29                     73.44
>         (sec)
>  sys time (sec)                  1,833.21                  1,418.58
>
>  -------------------------------------------------------------------------------
>  memcg_high                        81,786                    89,905
>  memcg_swap_fail                       82                       395
>  zswpout                       58,874,092                57,721,884
>  zswpin                               422                       458
>  pswpout                                0                         0
>  pswpin                                 0                         0
>  thp_swpout                             0                         0
>  thp_swpout_fallback                   82                       394
>  pgmajfault                        14,864                    21,544
>  swap_ra                           34,953                    53,751
>  swap_ra_hit                       34,895                    53,660
>  ZSWPOUT-2048kB                   114,815                   112,269
>  SWPOUT-2048kB                          0                         0
>  -------------------------------------------------------------------------------
>
> Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> are enabled for usemem30, we cannot expect much improvement from reclaim
> batching.
>
>
> Performance testing (Kernel compilation):
> =========================================
>
> As mentioned earlier, for workloads that see a lot of swapout activity, we
> can benefit from configuring 2 WQs per IAA device, with compress jobs from
> all same-socket cores being distributed to the wq.1 of all IAAs on the
> socket, with the "global_wq" developed in this patch-series.
>
> Although this data includes IAA decompress batching, which will be
> submitted as a separate RFC patch-series, I am listing it here to quantify
> the benefit of distributing compress jobs among all IAAs. The kernel
> compilation test with "allmodconfig" is able to quantify this well:
>
>
>  4K folios: deflate-iaa: kernel compilation to quantify crypto patches
>  =====================================================================
>
>
>  ------------------------------------------------------------------------------
>                    IAA shrink_folio_list() compress batching and
>                        swapin_readahead() decompress batching
>
>                                       1WQ      2WQ (distribute compress jobs)
>
>                         1 local WQ (wq.0)    1 local WQ (wq.0) +
>                                   per IAA    1 global WQ (wq.1) per IAA
>
>  ------------------------------------------------------------------------------
>  zswap compressor             deflate-iaa         deflate-iaa
>  vm.compress-batchsize                 32                  32
>  vm.page-cluster                        4                   4
>  ------------------------------------------------------------------------------
>  real_sec                          746.77              745.42
>  user_sec                       15,732.66           15,738.85
>  sys_sec                         5,384.14            5,247.86
>  Max_Res_Set_Size_KB            1,874,432           1,872,640
>
>  ------------------------------------------------------------------------------
>  zswpout                      101,648,460         104,882,982
>  zswpin                        27,418,319          29,428,515
>  pswpout                              213                  22
>  pswpin                               207                   6
>  pgmajfault                    21,896,616          23,629,768
>  swap_ra                        6,054,409           6,385,080
>  swap_ra_hit                    3,791,628           3,985,141
>  ------------------------------------------------------------------------------
>
> The iaa_crypto wq stats will show almost the same number of compress calls
> for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> We see a sys time reduction of about 2.5% (5,384.14 to 5,247.86 sec) by
> distributing compress jobs among all IAA devices on the socket.
>
> I would greatly appreciate code review comments for the iaa_crypto driver
> and mm patches included in this series!
>
> Thanks,
> Kanchana
>
>
>
> Kanchana P Sridhar (13):
>   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
>   crypto: iaa - Add support for irq-less crypto async interface
>   crypto: testmgr - Add crypto testmgr acomp poll support.
>   mm: zswap: zswap_compress()/decompress() can submit, then poll an
>     acomp_req.
>   crypto: iaa - Make async mode the default.
>   crypto: iaa - Disable iaa_verify_compress by default.
>   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
>     IAAs.
>   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
>     node.
>   mm: zswap: Config variable to enable compress batching in
>     zswap_store().
>   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
>     platform has IAA.
>   mm: swap: Add IAA batch compression API
>     swap_crypto_acomp_compress_batch().
>   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
>     folios.
>   mm: vmscan, swap, zswap: Compress batching of folios in
>     shrink_folio_list().
>
>  crypto/acompress.c                         |   1 +
>  crypto/testmgr.c                           |  70 +-
>  drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
>  include/crypto/acompress.h                 |  18 +
>  include/crypto/internal/acompress.h        |   1 +
>  include/linux/fs.h                         |   2 +
>  include/linux/mm.h                         |   8 +
>  include/linux/writeback.h                  |   5 +
>  include/linux/zswap.h                      | 106 +++
>  kernel/sysctl.c                            |   9 +
>  mm/Kconfig                                 |  12 +
>  mm/page_io.c                               | 152 +++-
>  mm/swap.c                                  |  15 +
>  mm/swap.h                                  |  96 +++
>  mm/swap_state.c                            | 115 +++
>  mm/vmscan.c                                | 154 +++-
>  mm/zswap.c                                 | 771 +++++++++++++++++++--
>  17 files changed, 1870 insertions(+), 132 deletions(-)
>
>
> base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> --
> 2.27.0
>
>
Kanchana P Sridhar Oct. 23, 2024, 2:53 a.m. UTC | #6
Hi Yosry,

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 22, 2024 5:57 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> 
> On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > IAA Compression Batching:
> > =========================
> >
> > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > of hybrid any-order batches of folios in shrink_folio_list().
> >
> > The patch-series is organized as follows:
> >
> >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> >     with "crypto:" in the subject:
> >
> >     a) async poll crypto_acomp interface without interrupts.
> >     b) crypto testmgr acomp poll support.
> >     c) Modifying the default sync_mode to "async" and disabling
> >        verify_compress by default, to facilitate users to run IAA easily for
> >        comparison with software compressors.
> >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
> >        devices.
> >     e) Addition of a "global_wq" per IAA, which can be used as a global
> >        resource for the socket. If the user configures 2WQs per IAA device,
> >        the driver will distribute compress jobs from all cores on the
> >        socket to the "global_wqs" of all the IAA devices on that socket, in
> >        a round-robin manner. This can be used to improve compression
> >        throughput for workloads that see a lot of swapout activity.
> >
> >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> >  3) A centralized batch compression API that can be used by swap modules.
> >  4) IAA compress batching within large folio zswap stores.
> >  5) IAA compress batching of any-order hybrid folios in
> >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> >     parameter can be used to configure the number of folios in [1, 32] to
> >     be reclaimed using compress batching.
> 
> I am still digesting this series but I have some high level questions
> that I left on some patches. My intuition though is that we should
> drop (5) from the initial proposal as it's most controversial.
> Batching reclaim of unrelated folios through zswap *might* make sense,
> but it needs a broader conversation and it needs justification on its
> own merit, without the rest of the series.

Thanks for these suggestions! Sure, I can drop (5) from the initial patch-set.
I also agree that this needs a broader discussion.

I believe the 4K folios usemem30 data in this patchset does demonstrate the
benefits of reclaim batching, and helps justify it on its own merit. I added
the data on batching reclaim with kernel compilation as part of the 4K folios
experiments in the IAA decompression batching patch-series [1].
I am listing it here as well, and will make sure to add this data in subsequent revs.

--------------------------------------------------------------------------
 Kernel compilation in tmpfs/allmodconfig, 2G max memory:
 
 No large folios          mm-unstable-10-16-2024       shrink_folio_list()        
                                                       batching of folios     
 --------------------------------------------------------------------------
 zswap compressor         zstd       deflate-iaa       deflate-iaa   
 vm.compress-batchsize     n/a               n/a                32   
 vm.page-cluster             3                 3                 3   
 --------------------------------------------------------------------------
 real_sec               783.87            761.69            747.32   
 user_sec            15,750.07         15,716.69         15,728.39   
 sys_sec              6,522.32          5,725.28          5,399.44   
 Max_RSS_KB          1,872,640         1,870,848         1,874,432   
                                                                            
 zswpout            82,364,991        97,739,600       102,780,612   
 zswpin             21,303,393        27,684,166        29,016,252   
 pswpout                    13               222               213   
 pswpin                     12               209               202   
 pgmajfault         17,114,339        22,421,211        23,378,161   
 swap_ra             4,596,035         5,840,082         6,231,646   
 swap_ra_hit         2,903,249         3,682,444         3,940,420   
 --------------------------------------------------------------------------
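
As a reading aid (not part of the patch series), the relative changes implied
by the table can be worked out with a few lines of Python. The figures are
copied from the real_sec and sys_sec rows above; the dictionary keys and
helper names are made up for this illustration.

```python
# Seconds copied from the kernel compilation table above.
# The labels are illustrative names for the three columns.
RESULTS = {
    "zstd": {"real_sec": 783.87, "sys_sec": 6522.32},
    "deflate-iaa": {"real_sec": 761.69, "sys_sec": 5725.28},
    "deflate-iaa, batchsize 32": {"real_sec": 747.32, "sys_sec": 5399.44},
}


def pct_reduction(baseline, value):
    """Return the percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


def compare(baseline_name, candidate_name, results=RESULTS):
    """Per-metric percentage reduction of the candidate relative to the baseline."""
    base = results[baseline_name]
    cand = results[candidate_name]
    return {metric: pct_reduction(base[metric], cand[metric]) for metric in base}


if __name__ == "__main__":
    for baseline in ("zstd", "deflate-iaa"):
        deltas = compare(baseline, "deflate-iaa, batchsize 32")
        for metric, pct in deltas.items():
            print(f"batchsize 32 vs {baseline}: {metric} reduced by {pct:.2f}%")
```

This only restates the table as percentages; it says nothing about
run-to-run variance, which the table does not report.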

The performance improvements seen do depend on compression batching in
the swap modules (zswap). The implementation in patch 12 of the compress
batching series sets up this zswap compression pipeline, which takes an array of
folios and processes them in batches of 8 pages compressed in parallel in hardware.
That being said, we do see latency improvements even when reclaim batching is
combined with zswap compress batching with zstd/lzo-rle/etc. I haven't done a
lot of analysis of this, but I am guessing that fewer calls from the swap layer
(swap_writepage()) into zswap could have something to do with it. If we believe
that batching can be the right thing to do even for the software compressors,
I can gather batching data with zstd for v2.
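
The "array of folios processed in batches of 8 pages" idea can be sketched in
user space as follows. This is a minimal illustration, not the kernel code
from patch 12: the function name, the representation of a folio as a page
count, and the assumption that a batch may span folio boundaries are all
hypothetical.

```python
from typing import Iterator, List, Tuple

BATCH_SIZE = 8  # pages per compress batch, as described in the text above


def page_batches(
    folio_nr_pages: List[int], batch_size: int = BATCH_SIZE
) -> Iterator[List[Tuple[int, int]]]:
    """Yield batches of (folio index, page index within folio) pairs.

    Each folio is modelled only by its page count. Pages are taken in order,
    and a batch is emitted whenever `batch_size` pages have been collected.
    Assumption for this sketch: a batch may contain pages of several folios.
    """
    batch: List[Tuple[int, int]] = []
    for folio_idx, nr_pages in enumerate(folio_nr_pages):
        for page_idx in range(nr_pages):
            batch.append((folio_idx, page_idx))
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        # Final partial batch, smaller than batch_size.
        yield batch


if __name__ == "__main__":
    # One 4K folio, one 64K folio (16 pages) and one 16K folio (4 pages).
    for i, b in enumerate(page_batches([1, 16, 4])):
        print(f"batch {i}: {len(b)} pages, first={b[0]}, last={b[-1]}")
```

In the sketch, each yielded batch stands in for one group of pages that would
be handed to the compressor together; the actual submission and polling are
left out.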


[1] https://patchwork.kernel.org/project/linux-mm/cover/20241018064805.336490-1-kanchana.p.sridhar@intel.com/

Thanks,
Kanchana

> 
> >
> > IAA compress batching can be enabled only on platforms that have IAA, by
> > setting this config variable:
> >
> >  CONFIG_ZSWAP_STORE_BATCHING_ENABLED="y"
> >
> > The performance testing data with usemem 30 instances shows throughput
> > gains of up to 40%, elapsed time reduction of up to 22% and sys time
> > reduction of up to 30% with IAA compression batching.
> >
> > Our internal validation of IAA compress/decompress batching in highly
> > contended Sapphire Rapids server setups with workloads running on 72 cores
> > for ~25 minutes under stringent memory limit constraints has shown up to
> > 50% reduction in sys time and 3.5% reduction in workload run time as
> > compared to software compressors.
> >
> >
> > System setup for testing:
> > =========================
> > Testing of this patch-series was done with mm-unstable as of 10-16-2024,
> > commit 817952b8be34, without and with this patch-series.
> > Data was gathered on an Intel Sapphire Rapids server, dual-socket 56 cores
> > per socket, 4 IAA devices per socket, 503 GiB RAM and 525G SSD disk
> > partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> > processes were run, each allocating and writing 10G of memory, and sleeping
> > for 10 sec before exiting:
> >
> > usemem --init-time -w -O -s 10 -n 30 10g
> >
> > Other kernel configuration parameters:
> >
> >     zswap compressor : deflate-iaa
> >     zswap allocator   : zsmalloc
> >     vm.page-cluster   : 2,4
> >
> > IAA "compression verification" is disabled and the async poll acomp
> > interface is used in the iaa_crypto driver (the defaults with this
> > series).
> >
> >
> > Performance testing (usemem30):
> > ===============================
> >
> >  4K folios: deflate-iaa:
> >  =======================
> >
> >  -------------------------------------------------------------------------------
> >                 mm-unstable-10-16-2024  shrink_folio_list()  shrink_folio_list()
> >                                          batching of folios   batching of folios
> >  -------------------------------------------------------------------------------
> >  zswap compressor          deflate-iaa          deflate-iaa          deflate-iaa
> >  vm.compress-batchsize             n/a                    1                   32
> >  vm.page-cluster                     2                    2                    2
> >  -------------------------------------------------------------------------------
> >  Total throughput            4,470,466            5,770,824            6,363,045
> >            (KB/s)
> >  Average throughput            149,015              192,360              212,101
> >            (KB/s)
> >  elapsed time                   119.24               100.96                92.99
> >         (sec)
> >  sys time (sec)               2,819.29             2,168.08             1,970.79
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high                    668,185              646,357              613,421
> >  memcg_swap_fail                     0                    0                    0
> >  zswpout                    62,991,796           58,275,673           53,070,201
> >  zswpin                            431                  415                  396
> >  pswpout                             0                    0                    0
> >  pswpin                              0                    0                    0
> >  thp_swpout                          0                    0                    0
> >  thp_swpout_fallback                 0                    0                    0
> >  pgmajfault                      3,137                3,085                3,440
> >  swap_ra                            99                  100                   95
> >  swap_ra_hit                        42                   44                   45
> >  -------------------------------------------------------------------------------
> >
> >
> >  16k/32/64k folios: deflate-iaa:
> >  ===============================
> >  All three large folio sizes 16k/32/64k were enabled to "always".
> >
> >  -------------------------------------------------------------------------------
> >                 mm-unstable-  zswap_store()      + shrink_folio_list()
> >                   10-16-2024    batching of         batching of folios
> >                                    pages in
> >                                large folios
> >  -------------------------------------------------------------------------------
> >  zswap compr     deflate-iaa     deflate-iaa          deflate-iaa
> >  vm.compress-            n/a             n/a         4          8             16
> >  batchsize
> >  vm.page-                  2               2         2          2              2
> >   cluster
> >  -------------------------------------------------------------------------------
> >  Total throughput   7,182,198   8,448,994    8,584,728    8,729,643    8,775,944
> >            (KB/s)
> >  Avg throughput       239,406     281,633      286,157      290,988      292,531
> >          (KB/s)
> >  elapsed time           85.04       77.84        77.03        75.18        74.98
> >          (sec)
> >  sys time (sec)      1,730.77    1,527.40     1,528.52     1,473.76     1,465.97
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high           648,125     694,188      696,004      699,728      724,887
> >  memcg_swap_fail        1,550       2,540        1,627        1,577        1,517
> >  zswpout           57,606,876  56,624,450   56,125,082    55,999,42   57,352,204
> >  zswpin                   421         406          422          400          437
> >  pswpout                    0           0            0            0            0
> >  pswpin                     0           0            0            0            0
> >  thp_swpout                 0           0            0            0            0
> >  thp_swpout_fallback        0           0            0            0            0
> >  16kB-mthp_swpout_          0           0            0            0            0
> >           fallback
> >  32kB-mthp_swpout_          0           0            0            0            0
> >           fallback
> >  64kB-mthp_swpout_      1,550       2,539        1,627        1,577        1,517
> >           fallback
> >  pgmajfault             3,102       3,126        3,473        3,454        3,134
> >  swap_ra                  107         144          109          124          181
> >  swap_ra_hit               51          88           45           66          107
> >  ZSWPOUT-16kB               2           3            4            4            3
> >  ZSWPOUT-32kB               0           2            1            1            0
> >  ZSWPOUT-64kB       3,598,889   3,536,556    3,506,134    3,498,324    3,582,921
> >  SWPOUT-16kB                0           0            0            0            0
> >  SWPOUT-32kB                0           0            0            0            0
> >  SWPOUT-64kB                0           0            0            0            0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2M folios: deflate-iaa:
> >  =======================
> >
> >  -------------------------------------------------------------------------------
> >                    mm-unstable-10-16-2024    zswap_store() batching of pages
> >                                                       in pmd-mappable folios
> >  -------------------------------------------------------------------------------
> >  zswap compressor             deflate-iaa                deflate-iaa
> >  vm.compress-batchsize                n/a                        n/a
> >  vm.page-cluster                        2                          2
> >  -------------------------------------------------------------------------------
> >  Total throughput               7,444,592                 8,916,349
> >            (KB/s)
> >  Average throughput               248,153                   297,211
> >            (KB/s)
> >  elapsed time                       86.29                     73.44
> >         (sec)
> >  sys time (sec)                  1,833.21                  1,418.58
> >
> >  -------------------------------------------------------------------------------
> >  memcg_high                        81,786                    89,905
> >  memcg_swap_fail                       82                       395
> >  zswpout                       58,874,092                57,721,884
> >  zswpin                               422                       458
> >  pswpout                                0                         0
> >  pswpin                                 0                         0
> >  thp_swpout                             0                         0
> >  thp_swpout_fallback                   82                       394
> >  pgmajfault                        14,864                    21,544
> >  swap_ra                           34,953                    53,751
> >  swap_ra_hit                       34,895                    53,660
> >  ZSWPOUT-2048kB                   114,815                   112,269
> >  SWPOUT-2048kB                          0                         0
> >  -------------------------------------------------------------------------------
> >
> > Since 4K folios account for ~0.4% of all zswapouts when pmd-mappable folios
> > are enabled for usemem30, we cannot expect much improvement from reclaim
> > batching.
> >
> >
> > Performance testing (Kernel compilation):
> > =========================================
> >
> > As mentioned earlier, for workloads that see a lot of swapout activity, we
> > can benefit from configuring 2 WQs per IAA device, with compress jobs from
> > all same-socket cores being distributed to the wq.1 of all IAAs on the
> > socket, with the "global_wq" developed in this patch-series.
> >
> > Although this data includes IAA decompress batching, which will be
> > submitted as a separate RFC patch-series, I am listing it here to quantify
> > the benefit of distributing compress jobs among all IAAs. The kernel
> > compilation test with "allmodconfig" is able to quantify this well:
> >
> >
> >  4K folios: deflate-iaa: kernel compilation to quantify crypto patches
> >  =====================================================================
> >
> >
> >  ------------------------------------------------------------------------------
> >                    IAA shrink_folio_list() compress batching and
> >                        swapin_readahead() decompress batching
> >
> >                                       1WQ      2WQ (distribute compress jobs)
> >
> >                         1 local WQ (wq.0)    1 local WQ (wq.0) +
> >                                   per IAA    1 global WQ (wq.1) per IAA
> >
> >  ------------------------------------------------------------------------------
> >  zswap compressor             deflate-iaa         deflate-iaa
> >  vm.compress-batchsize                 32                  32
> >  vm.page-cluster                        4                   4
> >  ------------------------------------------------------------------------------
> >  real_sec                          746.77              745.42
> >  user_sec                       15,732.66           15,738.85
> >  sys_sec                         5,384.14            5,247.86
> >  Max_Res_Set_Size_KB            1,874,432           1,872,640
> >
> >  ------------------------------------------------------------------------------
> >  zswpout                      101,648,460         104,882,982
> >  zswpin                        27,418,319          29,428,515
> >  pswpout                              213                  22
> >  pswpin                               207                   6
> >  pgmajfault                    21,896,616          23,629,768
> >  swap_ra                        6,054,409           6,385,080
> >  swap_ra_hit                    3,791,628           3,985,141
> >  ------------------------------------------------------------------------------
> >
> > The iaa_crypto wq stats will show almost the same number of compress calls
> > for wq.1 of all IAA devices. wq.0 will handle decompress calls exclusively.
> > We see a sys time reduction of about 2.5% (5,384.14 to 5,247.86 sec) by
> > distributing compress jobs among all IAA devices on the socket.
> >
> > I would greatly appreciate code review comments for the iaa_crypto driver
> > and mm patches included in this series!
> >
> > Thanks,
> > Kanchana
> >
> >
> >
> > Kanchana P Sridhar (13):
> >   crypto: acomp - Add a poll() operation to acomp_alg and acomp_req
> >   crypto: iaa - Add support for irq-less crypto async interface
> >   crypto: testmgr - Add crypto testmgr acomp poll support.
> >   mm: zswap: zswap_compress()/decompress() can submit, then poll an
> >     acomp_req.
> >   crypto: iaa - Make async mode the default.
> >   crypto: iaa - Disable iaa_verify_compress by default.
> >   crypto: iaa - Change cpu-to-iaa mappings to evenly balance cores to
> >     IAAs.
> >   crypto: iaa - Distribute compress jobs to all IAA devices on a NUMA
> >     node.
> >   mm: zswap: Config variable to enable compress batching in
> >     zswap_store().
> >   mm: zswap: Create multiple reqs/buffers in crypto_acomp_ctx if
> >     platform has IAA.
> >   mm: swap: Add IAA batch compression API
> >     swap_crypto_acomp_compress_batch().
> >   mm: zswap: Compress batching with Intel IAA in zswap_store() of large
> >     folios.
> >   mm: vmscan, swap, zswap: Compress batching of folios in
> >     shrink_folio_list().
> >
> >  crypto/acompress.c                         |   1 +
> >  crypto/testmgr.c                           |  70 +-
> >  drivers/crypto/intel/iaa/iaa_crypto_main.c | 467 +++++++++++--
> >  include/crypto/acompress.h                 |  18 +
> >  include/crypto/internal/acompress.h        |   1 +
> >  include/linux/fs.h                         |   2 +
> >  include/linux/mm.h                         |   8 +
> >  include/linux/writeback.h                  |   5 +
> >  include/linux/zswap.h                      | 106 +++
> >  kernel/sysctl.c                            |   9 +
> >  mm/Kconfig                                 |  12 +
> >  mm/page_io.c                               | 152 +++-
> >  mm/swap.c                                  |  15 +
> >  mm/swap.h                                  |  96 +++
> >  mm/swap_state.c                            | 115 +++
> >  mm/vmscan.c                                | 154 +++-
> >  mm/zswap.c                                 | 771 +++++++++++++++++++--
> >  17 files changed, 1870 insertions(+), 132 deletions(-)
> >
> >
> > base-commit: 817952b8be34aad40e07f6832fb9d1fc08961550
> > --
> > 2.27.0
> >
> >
Yosry Ahmed Oct. 23, 2024, 6:15 p.m. UTC | #7
On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi Yosry,
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, October 22, 2024 5:57 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> > brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> > joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org; linux-
> > fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal,
> > Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> >
> > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > IAA Compression Batching:
> > > =========================
> > >
> > > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > > of hybrid any-order batches of folios in shrink_folio_list().
> > >
> > > The patch-series is organized as follows:
> > >
> > >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > >     with "crypto:" in the subject:
> > >
> > >     a) async poll crypto_acomp interface without interrupts.
> > >     b) crypto testmgr acomp poll support.
> > >     c) Modifying the default sync_mode to "async" and disabling
> > >        verify_compress by default, to facilitate users to run IAA easily for
> > >        comparison with software compressors.
> > >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
> > >        devices.
> > >     e) Addition of a "global_wq" per IAA, which can be used as a global
> > >        resource for the socket. If the user configures 2WQs per IAA device,
> > >        the driver will distribute compress jobs from all cores on the
> > >        socket to the "global_wqs" of all the IAA devices on that socket, in
> > >        a round-robin manner. This can be used to improve compression
> > >        throughput for workloads that see a lot of swapout activity.
> > >
> > >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> > >  3) A centralized batch compression API that can be used by swap modules.
> > >  4) IAA compress batching within large folio zswap stores.
> > >  5) IAA compress batching of any-order hybrid folios in
> > >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > >     parameter can be used to configure the number of folios in [1, 32] to
> > >     be reclaimed using compress batching.
> >
> > I am still digesting this series but I have some high level questions
> > that I left on some patches. My intuition though is that we should
> > drop (5) from the initial proposal as it's most controversial.
> > Batching reclaim of unrelated folios through zswap *might* make sense,
> > but it needs a broader conversation and it needs justification on its
> > own merit, without the rest of the series.
>
> Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
> Agree also, this needs a broader discussion.
>
> I believe the 4K folios usemem30 data in this patchset does bring across
> the batching reclaim benefits to provide justification on its own merit. I added
> the data on batching reclaim with kernel compilation as part of the 4K folios
> experiments in the IAA decompression batching patch-series [1].
> Listing it here as well. I will make sure to add this data in subsequent revs.
>
> --------------------------------------------------------------------------
>  Kernel compilation in tmpfs/allmodconfig, 2G max memory:
>
>  No large folios          mm-unstable-10-16-2024       shrink_folio_list()
>                                                        batching of folios
>  --------------------------------------------------------------------------
>  zswap compressor         zstd       deflate-iaa       deflate-iaa
>  vm.compress-batchsize     n/a               n/a                32
>  vm.page-cluster             3                 3                 3
>  --------------------------------------------------------------------------
>  real_sec               783.87            761.69            747.32
>  user_sec            15,750.07         15,716.69         15,728.39
>  sys_sec              6,522.32          5,725.28          5,399.44
>  Max_RSS_KB          1,872,640         1,870,848         1,874,432
>
>  zswpout            82,364,991        97,739,600       102,780,612
>  zswpin             21,303,393        27,684,166        29,016,252
>  pswpout                    13               222               213
>  pswpin                     12               209               202
>  pgmajfault         17,114,339        22,421,211        23,378,161
>  swap_ra             4,596,035         5,840,082         6,231,646
>  swap_ra_hit         2,903,249         3,682,444         3,940,420
>  --------------------------------------------------------------------------
>
> The performance improvements seen do depend on compression batching in
> the swap modules (zswap). The implementation in patch 12 in the compress
> batching series sets up this zswap compression pipeline, which takes an array of
> folios and processes them in batches of 8 pages compressed in parallel in hardware.
> That being said, we do see latency improvements even with reclaim batching
> combined with zswap compress batching with zstd/lzo-rle/etc. I haven't done a
> lot of analysis of this, but I am guessing fewer calls from the swap layer
> (swap_writepage()) into zswap could have something to do with this. If we believe
> that batching can be the right thing to do even for the software compressors,
> I can gather batching data with zstd for v2.
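
For readers following the thread, the "batches of 8 pages" pipeline described above can be pictured with a small userspace sketch. It is an illustration under stated assumptions, not code from the series: SUB_BATCH_SIZE and count_sub_batches() are hypothetical names, and the sketch assumes pages from consecutive folios may be packed into the same sub-batch.

```c
#include <stddef.h>

/* Assumed from the "batches of 8 pages" wording; hypothetical name. */
#define SUB_BATCH_SIZE 8

/*
 * Given the page count of each folio in a reclaim batch, return how many
 * sub-batches of up to SUB_BATCH_SIZE pages would be submitted if pages
 * are packed back to back across folio boundaries (an assumption).
 */
static size_t count_sub_batches(const size_t *folio_nr_pages, size_t nr_folios)
{
	size_t total_pages = 0;

	for (size_t i = 0; i < nr_folios; i++)
		total_pages += folio_nr_pages[i];

	/* Round up so that a partially filled final sub-batch is counted. */
	return (total_pages + SUB_BATCH_SIZE - 1) / SUB_BATCH_SIZE;
}
```

Under these assumptions, 32 order-0 folios would form 4 sub-batches, while a single 16-page folio would form 2.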

Thanks for sharing the data. What I meant is, I think we should focus
on supporting large folio compression batching for this series, and
only present figures for this support to avoid confusion.

Once this lands, we can discuss support for batching the compression
of different unrelated folios separately, as it spans areas beyond
just zswap and will need broader discussion.
Kanchana P Sridhar Oct. 23, 2024, 8:34 p.m. UTC | #8
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, October 23, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org;
> linux-fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> 
> On Tue, Oct 22, 2024 at 7:53 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi Yosry,
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Tuesday, October 22, 2024 5:57 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; ryan.roberts@arm.com; Huang, Ying
> > > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> > > linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> > > davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> > > ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> > > <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> > > brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> > > joel.granados@kernel.org; bfoster@redhat.com; willy@infradead.org;
> > > linux-fsdevel@vger.kernel.org; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> > > Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [RFC PATCH v1 00/13] zswap IAA compress batching
> > >
> > > On Thu, Oct 17, 2024 at 11:41 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > >
> > > > IAA Compression Batching:
> > > > =========================
> > > >
> > > > This RFC patch-series introduces the use of the Intel Analytics Accelerator
> > > > (IAA) for parallel compression of pages in a folio, and for batched reclaim
> > > > of hybrid any-order batches of folios in shrink_folio_list().
> > > >
> > > > The patch-series is organized as follows:
> > > >
> > > >  1) iaa_crypto driver enablers for batching: Relevant patches are tagged
> > > >     with "crypto:" in the subject:
> > > >
> > > >     a) async poll crypto_acomp interface without interrupts.
> > > >     b) crypto testmgr acomp poll support.
> > > >     c) Modifying the default sync_mode to "async" and disabling
> > > >        verify_compress by default, to facilitate users to run IAA easily for
> > > >        comparison with software compressors.
> > > >     d) Changing the cpu-to-iaa mappings to more evenly balance cores to IAA
> > > >        devices.
> > > >     e) Addition of a "global_wq" per IAA, which can be used as a global
> > > >        resource for the socket. If the user configures 2WQs per IAA device,
> > > >        the driver will distribute compress jobs from all cores on the
> > > >        socket to the "global_wqs" of all the IAA devices on that socket, in
> > > >        a round-robin manner. This can be used to improve compression
> > > >        throughput for workloads that see a lot of swapout activity.
> > > >
> > > >  2) Migrating zswap to use async poll in zswap_compress()/decompress().
> > > >  3) A centralized batch compression API that can be used by swap modules.
> > > >  4) IAA compress batching within large folio zswap stores.
> > > >  5) IAA compress batching of any-order hybrid folios in
> > > >     shrink_folio_list(). The newly added "sysctl vm.compress-batchsize"
> > > >     parameter can be used to configure the number of folios in [1, 32] to
> > > >     be reclaimed using compress batching.
> > >
> > > I am still digesting this series but I have some high level questions
> > > that I left on some patches. My intuition though is that we should
> > > drop (5) from the initial proposal as it's most controversial.
> > > Batching reclaim of unrelated folios through zswap *might* make sense,
> > > but it needs a broader conversation and it needs justification on its
> > > own merit, without the rest of the series.
> >
> > Thanks for these suggestions!  Sure, I can drop (5) from the initial patch-set.
> > Agree also, this needs a broader discussion.
> >
> > I believe the 4K folios usemem30 data in this patchset does bring across
> > the batching reclaim benefits to provide justification on its own merit. I added
> > the data on batching reclaim with kernel compilation as part of the 4K folios
> > experiments in the IAA decompression batching patch-series [1].
> > Listing it here as well. I will make sure to add this data in subsequent revs.
> >
> > --------------------------------------------------------------------------
> >  Kernel compilation in tmpfs/allmodconfig, 2G max memory:
> >
> >  No large folios          mm-unstable-10-16-2024       shrink_folio_list()
> >                                                        batching of folios
> >  --------------------------------------------------------------------------
> >  zswap compressor         zstd       deflate-iaa       deflate-iaa
> >  vm.compress-batchsize     n/a               n/a                32
> >  vm.page-cluster             3                 3                 3
> >  --------------------------------------------------------------------------
> >  real_sec               783.87            761.69            747.32
> >  user_sec            15,750.07         15,716.69         15,728.39
> >  sys_sec              6,522.32          5,725.28          5,399.44
> >  Max_RSS_KB          1,872,640         1,870,848         1,874,432
> >
> >  zswpout            82,364,991        97,739,600       102,780,612
> >  zswpin             21,303,393        27,684,166        29,016,252
> >  pswpout                    13               222               213
> >  pswpin                     12               209               202
> >  pgmajfault         17,114,339        22,421,211        23,378,161
> >  swap_ra             4,596,035         5,840,082         6,231,646
> >  swap_ra_hit         2,903,249         3,682,444         3,940,420
> >  --------------------------------------------------------------------------
> >
> > The performance improvements seen do depend on compression batching in
> > the swap modules (zswap). The implementation in patch 12 in the compress
> > batching series sets up this zswap compression pipeline, which takes an array of
> > folios and processes them in batches of 8 pages compressed in parallel in hardware.
> > That being said, we do see latency improvements even with reclaim batching
> > combined with zswap compress batching with zstd/lzo-rle/etc. I haven't done a
> > lot of analysis of this, but I am guessing fewer calls from the swap layer
> > (swap_writepage()) into zswap could have something to do with this. If we believe
> > that batching can be the right thing to do even for the software compressors,
> > I can gather batching data with zstd for v2.
> 
> Thanks for sharing the data. What I meant is, I think we should focus
> on supporting large folio compression batching for this series, and
> only present figures for this support to avoid confusion.
> 
> Once this lands, we can discuss support for batching the compression
> of different unrelated folios separately, as it spans areas beyond
> just zswap and will need broader discussion.

Absolutely, this makes sense, thanks Yosry! I will address this in v2.

Thanks,
Kanchana
Joel Granados Oct. 28, 2024, 2:41 p.m. UTC | #9
On Thu, Oct 17, 2024 at 11:41:01PM -0700, Kanchana P Sridhar wrote:
> This patch enables the use of Intel IAA hardware compression acceleration
> to reclaim a batch of folios in shrink_folio_list(). This results in
> reclaim throughput and workload/sys performance improvements.
> 
> The earlier patches on compress batching deployed multiple IAA compress
> engines for compressing up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a
> large folio that is being stored in zswap_store(). This patch further
> propagates the efficiency improvements demonstrated with IAA "batching
> within folios", to vmscan "batching of folios" which will also use
> batching within folios using the extensible architecture of
> the __zswap_store_batch_core() procedure added earlier, that accepts
> an array of folios.

...

> +static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
> +{
> +}
> +
>  static inline bool zswap_store(struct folio *folio)
>  {
>  	return false;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 79e6cb1d5c48..b8d6b599e9ae 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
>  		.extra1		= SYSCTL_ZERO,
>  		.extra2		= (void *)&page_cluster_max,
>  	},
> +	{
> +		.procname	= "compress-batchsize",
> +		.data		= &compress_batchsize,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec_minmax,
Why not use proc_douintvec_minmax? These are the reasons I think you
should use that (please correct me if I misread your patch):

1. Your range is [1,32] -> so no negative values
2. You are using the value to compare with an unsigned int
   (simc->nr_folios) in your `struct swap_in_memory_cache_cb`. So
   instead of going from int to uint, you should just do uint all
   around. No?
3. Using proc_douintvec_minmax will automatically error out on negative
   input without even considering your range, so there is less code
   executed at the end.
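
A userspace sketch of points 2 and 3, for illustration only. The helpers batch_full_int(), batch_full_uint() and parse_uint_minmax() are hypothetical and are not the kernel's sysctl code; they only mimic the behaviour described above, assuming a plain decimal string as input.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Point 2: with an int limit, the comparison converts it to unsigned int. */
static bool batch_full_int(unsigned int nr_folios, int limit)
{
	/* The cast spells out the conversion C would apply implicitly. */
	return nr_folios >= (unsigned int)limit;
}

/* Same check with an unsigned limit: no conversion is involved. */
static bool batch_full_uint(unsigned int nr_folios, unsigned int limit)
{
	return nr_folios >= limit;
}

/*
 * Point 3: parse an unsigned value in [min, max]. Input starting with '-'
 * is rejected before the range is consulted. Returns 0 or -EINVAL.
 */
static int parse_uint_minmax(const char *s, unsigned int min,
			     unsigned int max, unsigned int *out)
{
	char *end;
	unsigned long v;

	assert(s && out);

	if (*s == '-')
		return -EINVAL;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || end == s || *end != '\0' || v > UINT_MAX)
		return -EINVAL;

	if (v < min || v > max)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}
```

In this sketch, a stray negative int limit turns into a very large unsigned value, so the batch would never be reported as full; keeping the limit unsigned removes that conversion entirely.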

> +		.extra1		= SYSCTL_ONE,
> +		.extra2		= (void *)&compress_batchsize_max,
> +	},
>  	{
>  		.procname	= "dirtytime_expire_seconds",
>  		.data		= &dirtytime_expire_interval,
> diff --git a/mm/page_io.c b/mm/page_io.c
> index a28d28b6b3ce..065db25309b8 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -226,6 +226,131 @@ static void swap_zeromap_folio_clear(struct folio *folio)
>  	}
>  }

...

Best
Kanchana P Sridhar Oct. 28, 2024, 6:53 p.m. UTC | #10
Hi Joel,

> -----Original Message-----
> From: Joel Granados <joel.granados@kernel.org>
> Sent: Monday, October 28, 2024 7:42 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; Huang, Ying <ying.huang@intel.com>;
> 21cnbao@gmail.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; zanussi@kernel.org; viro@zeniv.linux.org.uk;
> brauner@kernel.org; jack@suse.cz; mcgrof@kernel.org; kees@kernel.org;
> bfoster@redhat.com; willy@infradead.org; linux-fsdevel@vger.kernel.org;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 13/13] mm: vmscan, swap, zswap: Compress
> batching of folios in shrink_folio_list().
> 
> On Thu, Oct 17, 2024 at 11:41:01PM -0700, Kanchana P Sridhar wrote:
> > This patch enables the use of Intel IAA hardware compression acceleration
> > to reclaim a batch of folios in shrink_folio_list(). This results in
> > reclaim throughput and workload/sys performance improvements.
> >
> > The earlier patches on compress batching deployed multiple IAA compress
> > engines for compressing up to SWAP_CRYPTO_SUB_BATCH_SIZE pages within a
> > large folio that is being stored in zswap_store(). This patch further
> > propagates the efficiency improvements demonstrated with IAA "batching
> > within folios", to vmscan "batching of folios" which will also use
> > batching within folios using the extensible architecture of
> > the __zswap_store_batch_core() procedure added earlier, that accepts
> > an array of folios.
> 
> ...
> 
> > +static inline void zswap_store_batch(struct swap_in_memory_cache_cb *simc)
> > +{
> > +}
> > +
> >  static inline bool zswap_store(struct folio *folio)
> >  {
> >  	return false;
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index 79e6cb1d5c48..b8d6b599e9ae 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -2064,6 +2064,15 @@ static struct ctl_table vm_table[] = {
> >  		.extra1		= SYSCTL_ZERO,
> >  		.extra2		= (void *)&page_cluster_max,
> >  	},
> > +	{
> > +		.procname	= "compress-batchsize",
> > +		.data		= &compress_batchsize,
> > +		.maxlen		= sizeof(int),
> > +		.mode		= 0644,
> > +		.proc_handler	= proc_dointvec_minmax,
> Why not use proc_douintvec_minmax? These are the reasons I think you
> should use that (please correct me if I misread your patch):
> 
> 1. Your range is [1,32] -> so no negative values
> 2. You are using the value to compare with an unsigned int
>    (simc->nr_folios) in your `struct swap_in_memory_cache_cb`. So
>    instead of going from int to uint, you should just do uint all
>    around. No?
> 3. Using proc_douintvec_minmax will automatically error out on negative
>    input without even considering your range, so there is less code
>    executed at the end.

Thanks for your code review comments! Sure, what you suggest makes
sense. Based on Yosry's suggestions, I plan to separate out the
batching reclaim shrink_folio_list() changes into a separate series, and
focus on just the zswap modifications to support large folio compression
batching in the initial series. I will make sure to incorporate your comments
in the shrink_folio_list() batching reclaim series.

Thanks,
Kanchana

> 
> > +		.extra1		= SYSCTL_ONE,
> > +		.extra2		= (void *)&compress_batchsize_max,
> > +	},
> >  	{
> >  		.procname	= "dirtytime_expire_seconds",
> >  		.data		= &dirtytime_expire_interval,
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index a28d28b6b3ce..065db25309b8 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -226,6 +226,131 @@ static void swap_zeromap_folio_clear(struct folio *folio)
> >  	}
> >  }
> 
> ...
> 
> Best
> 
> --
> 
> Joel Granados