mbox series

[v6,0/6] RISC-V non-coherent function pointer based cache management operations + non-coherent DMA support for AX45MP

Message ID 20230106185526.260163-1-prabhakar.mahadev-lad.rj@bp.renesas.com
Headers show
Series RISC-V non-coherent function pointer based cache management operations + non-coherent DMA support for AX45MP | expand

Message

Lad, Prabhakar Jan. 6, 2023, 6:55 p.m. UTC
From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>

Hi All,

RISC-V non-coherent function pointer based cache management operations
=======================================================================
As based on the previous discussion [0]. [1], It was suggested upon the below:
* Move out cache drivers from drivers/soc/folder
* Implement function pointers for CMO instead of ALTERNATIVE_x() macro
* Add cache drivers to drivers/cache folder

[0] https://lore.kernel.org/all/43aee000-5b89-4d94-98d2-b37b1a18a83e@app.fastmail.com/
[1] https://lore.kernel.org/linux-riscv/Y62nOqzyuUKqYDpq@spud/#R

This v6 version of the patch series add support to use function pointers for CMO
and switches the current CMO implementations for zicbom and T-HEAD to use function
pointers.

Note, with the above approach we might see some performance degradation due to additional
checks.

----------x---------------------x--------------------x---------------x--------------


non-coherent DMA support for AX45MP
====================================

On the Andes AX45MP core, cache coherency is a specification option so it
may not be supported. In this case DMA will fail. To get around with this
issue this patch series does the below:

1] Andes alternative ports is implemented as errata which checks if the IOCP
is missing and only then applies to CMO errata. One vendor specific SBI EXT
(RZFIVE_SBI_EXT_IOCP_SW_WORKAROUND) is implemented as part of errata. If we could
access the DTB in errata I can get rid of this EXT ID from OpenSBI. Is there any
approach we can access the DTB in patch callback?

Below are the configs which Andes port provides (and are selected by RZ/Five):
      - ERRATA_ANDES
      - ERRATA_ANDES_CMO

2] Andes AX45MP core has a Programmable Physical Memory Attributes (PMA)
block that allows dynamic adjustment of memory attributes in the runtime.
It contains a configurable amount of PMA entries implemented as CSR
registers to control the attributes of memory locations in interest.
OpenSBI configures the PMA regions as required and creates a reserve memory
node and propagates it to the higher boot stack.

An RFC patch for OpenSBI is posted here:
https://patchwork.ozlabs.org/project/opensbi/patch/20221212094421.14556-1-prabhakar.mahadev-lad.rj@bp.renesas.com/

    reserved-memory {
        #address-cells = <2>;
        #size-cells = <2>;
        ranges;

        pma_resv0@58000000 {
            compatible = "shared-dma-pool";
            reg = <0x0 0x58000000 0x0 0x08000000>;
            no-map;
            linux,dma-default;
        };
    };

The above shared DMA pool gets appended to Linux DTB so the DMA memory
requests go through this region.

3] We provide callbacks to synchronize specific content between memory and
cache.

        - arch_sync_dma_for_device()
        - arch_sync_dma_for_cpu()

4] RZ/Five SoC selects the below configs
        - AX45MP_L2_CACHE
        - DMA_GLOBAL_POOL
        - ERRATA_ANDES
        - ERRATA_ANDES_CMO

OpenSBI implementation patches can be found here:
1] https://patchwork.ozlabs.org/project/opensbi/cover/20221210103011.7814-1-prabhakar.mahadev-lad.rj@bp.renesas.com/
2]https://patchwork.ozlabs.org/project/opensbi/patch/20221212094421.14556-1-prabhakar.mahadev-lad.rj@bp.renesas.com/
----------x---------------------x--------------------x---------------x--------------

Note,
- This series requires testing on Cores with zicbom and T-Head SoCs
- Ive used GCC 12.2.0 for compilation (tested with allmodconfig)
- Tested all the IP blocks on RZ/Five which use DMA

v5.1 -> v6
* Dropped use of ALTERNATIVE_x() macro
* Now switched to used function pointers for CMO
* Moved driver to drivers/cache folder

v5 -> v5.1
* https://patchwork.kernel.org/project/linux-riscv/list/?series=708610&state=%2A&archive=both

v4 -> v5
* Rebased ALTERNATIVE_3() macro on top of Andrew's patches
* Rebased the changes on top of Heiko's alternative call patches
* Dropped configuring the PMA from Linux
* Dropped configuring the L2 cache from Linux and dropped the binding for same
* Now using runtime patching mechanism instead of compile time config

RFC v3 -> v4
* Implemented ALTERNATIVE_3() macro 
* Now using runtime patching mechanism instead of compile time config
* Added Andes CMO as and errata
* Fixed comments pointed by Geert

RFC v2-> RFC v3
* Fixed review comments pointed by Conor
* Move DT binding into cache folder
* Fixed DT binding check issue
* Added andestech,ax45mp-cache.h header file
* Now passing the flags for the PMA setup as part of andestech,pma-regions
  property.
* Added andestech,inst/data-prefetch and andestech,tag/data-ram-ctl
  properties to configure the L2 cache.
* Registered the cache driver as platform driver

RFC v1-> RFC v2
* Moved out the code from arc/riscv to drivers/soc/renesas
* Now handling the PMA setup as part of the L2 cache
* Now making use of dma-noncoherent.c instead SoC specific implementation.
* Dropped arch_dma_alloc() and arch_dma_free()
* Switched to RISCV_DMA_NONCOHERENT
* Included DT binding doc

RFC v2: https://patchwork.kernel.org/project/linux-renesas-soc/cover/20221003223222.448551-1-prabhakar.mahadev-lad.rj@bp.renesas.com/
RFC v1: https://patchwork.kernel.org/project/linux-renesas-soc/cover/20220906102154.32526-1-prabhakar.mahadev-lad.rj@bp.renesas.com/

Cheers,
Prabhakar

Lad Prabhakar (6):
  riscv: mm: dma-noncoherent: Switch using function pointers for cache
    management
  riscv: asm: vendorid_list: Add Andes Technology to the vendors list
  riscv: errata: Add Andes alternative ports
  dt-bindings: cache: r9a07g043f-l2-cache: Add DT binding documentation
    for L2 cache controller
  cache: Add L2 cache management for Andes AX45MP RISC-V core
  soc: renesas: Kconfig: Select the required configs for RZ/Five SoC

 .../cache/andestech,ax45mp-cache.yaml         |  81 +++++
 arch/riscv/Kconfig.erratas                    |  22 ++
 arch/riscv/errata/Makefile                    |   1 +
 arch/riscv/errata/andes/Makefile              |   1 +
 arch/riscv/errata/andes/errata.c              |  71 +++++
 arch/riscv/errata/thead/errata.c              |  71 +++++
 arch/riscv/include/asm/alternative.h          |   3 +
 arch/riscv/include/asm/dma-noncoherent.h      |  83 ++++++
 arch/riscv/include/asm/errata_list.h          |  53 ----
 arch/riscv/include/asm/vendorid_list.h        |   1 +
 arch/riscv/kernel/alternative.c               |   5 +
 arch/riscv/kernel/cpufeature.c                |   2 +
 arch/riscv/mm/dma-noncoherent.c               |  94 +++++-
 arch/riscv/mm/pmem.c                          |  18 +-
 drivers/Kconfig                               |   2 +
 drivers/Makefile                              |   1 +
 drivers/cache/Kconfig                         |  10 +
 drivers/cache/Makefile                        |   3 +
 drivers/cache/ax45mp_cache.c                  | 279 ++++++++++++++++++
 drivers/soc/renesas/Kconfig                   |   4 +
 20 files changed, 744 insertions(+), 61 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/cache/andestech,ax45mp-cache.yaml
 create mode 100644 arch/riscv/errata/andes/Makefile
 create mode 100644 arch/riscv/errata/andes/errata.c
 create mode 100644 arch/riscv/include/asm/dma-noncoherent.h
 create mode 100644 drivers/cache/Kconfig
 create mode 100644 drivers/cache/Makefile
 create mode 100644 drivers/cache/ax45mp_cache.c

Comments

Conor Dooley Jan. 6, 2023, 9:44 p.m. UTC | #1
On Fri, Jan 06, 2023 at 06:55:23PM +0000, Prabhakar wrote:
> From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> 
> Add required ports of the Alternative scheme for Andes CPU cores.
> 
> I/O Coherence Port (IOCP) provides an AXI interface for connecting external
> non-caching masters, such as DMA controllers. IOCP is a specification
> option and is disabled on the Renesas RZ/Five SoC due to this reason cache
> management needs a software workaround.
> 
> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> ---
> v5 -> v6
> * Dropped patching alternative and now just probing IOCP
> 
> v4 -> v5
> * Sorted the Kconfig/Makefile/Switch based on Core name
> * Added a comments
> * Introduced RZFIVE_SBI_EXT_IOCP_SW_WORKAROUND SBI EXT ID to check if
>   CMO needs to be applied. Is there a way we can access the DTB while patching
>   as we can drop this SBI EXT ID and add a DT property instead for cmo?
> 
> RFC v3 -> v4
> * New patch
> ---
>  arch/riscv/Kconfig.erratas           | 22 +++++++++
>  arch/riscv/errata/Makefile           |  1 +
>  arch/riscv/errata/andes/Makefile     |  1 +
>  arch/riscv/errata/andes/errata.c     | 71 ++++++++++++++++++++++++++++
>  arch/riscv/include/asm/alternative.h |  3 ++
>  arch/riscv/kernel/alternative.c      |  5 ++
>  6 files changed, 103 insertions(+)
>  create mode 100644 arch/riscv/errata/andes/Makefile
>  create mode 100644 arch/riscv/errata/andes/errata.c
> 
> diff --git a/arch/riscv/Kconfig.erratas b/arch/riscv/Kconfig.erratas
> index 69621ae6d647..f0f0c1abd52b 100644
> --- a/arch/riscv/Kconfig.erratas
> +++ b/arch/riscv/Kconfig.erratas
> @@ -1,5 +1,27 @@
>  menu "CPU errata selection"
>  
> +config ERRATA_ANDES
> +	bool "Andes AX45MP errata"
> +	depends on !XIP_KERNEL
> +	select RISCV_ALTERNATIVE
> +	help
> +	  All Andes errata Kconfig depend on this Kconfig. Disabling
> +	  this Kconfig will disable all Andes errata. Please say "Y"
> +	  here if your platform uses Andes CPU cores.
> +
> +	  Otherwise, please say "N" here to avoid unnecessary overhead.
> +
> +config ERRATA_ANDES_CMO
> +	bool "Apply Andes cache management errata"
> +	depends on ERRATA_ANDES && MMU && ARCH_R9A07G043
> +	select RISCV_DMA_NONCOHERENT
> +	default y
> +	help
> +	  This will apply the cache management errata to handle the
> +	  non-standard handling on non-coherent operations on Andes cores.
> +
> +	  If you don't know what to do here, say "Y".

Ideally we would not need errata to turn this stuff on at all, but, as
you pointed out to me off-list, arch_setup_dma_ops() complains if we
have not set up.

I'm happy to commit to trying to sort that out in follow on work w/ MPFS,
since in that case it really isn't errata, and not require it for this
series as you do fit that particular bill IMO.

Reviewed-by: Conor Dooley <conor.dooley@microchip.com>

Thanks,
Conor.
Arnd Bergmann Jan. 6, 2023, 10:31 p.m. UTC | #2
On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>
> The current implementation of CMO was handled using the ALTERNATIVE_X()
> macro; this was manageable when there were a limited number of platforms
> using this. Now that we are having more and more platforms coming through
> with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.
>
> To avoid such issues this patch switches to use of function pointers
> instead of ALTERNATIVE_X() macro for cache management (the only draw being
> performance over the previous approach).
>
> void (*clean_range)(unsigned long addr, unsigned long size);
> void (*inv_range)(unsigned long addr, unsigned long size);
> void (*flush_range)(unsigned long addr, unsigned long size);
>
> The above function pointers are provided to be overridden where platforms
> using standard approach and for platforms who want handle the operation
> based on the operation the below function pointer is provided:
>
> void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
>                                       enum dma_data_direction dir,
>                                       enum dma_noncoherent_ops ops);
>
> In the current patch we have moved the ZICBOM and T-Head CMO to use
> function pointers.
>
> Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>

This looks like a nice improvement! I have a few suggestions
for improvements, but no objections here.

> diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> index fac5742d1c1e..826b2ba3e61e 100644
> --- a/arch/riscv/errata/thead/errata.c
> +++ b/arch/riscv/errata/thead/errata.c
...
> @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
> 
>  	riscv_cbom_block_size = L1_CACHE_BYTES;
>  	riscv_noncoherent_supported();
> +
> +	memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> +	if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
> +		thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> +		thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> +		thead_cmo_ops.flush_range = &thead_cmo_flush_range;
> +		riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> +	}

The implementation here looks reasonable, just wonder whether
the classification as an 'errata' makes sense. I would probably
consider this a 'driver' at this point, but that's just
a question of personal preference.

For the operations structure, I think a 'static const struct
riscv_cache_ops' is more intuitive than assigning the
members individually.
> +
> +enum dma_noncoherent_ops {
> +	NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> +	NON_COHERENT_SYNC_DMA_FOR_CPU,
> +	NON_COHERENT_DMA_PREP,
> +	NON_COHERENT_DMA_PMEM,
> +};
> +
> +/*
> + * struct riscv_cache_ops - Structure for CMO function pointers
> + * @clean_range: Function pointer for clean cache
> + * @inv_range: Function pointer for invalidate cache
> + * @flush_range: Function pointer for flushing the cache
> + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who 
> want
> + *  to handle CMO themselves. If this function pointer is set rest of 
> the
> + *  function pointers will be NULL.
> + */
> +struct riscv_cache_ops {
> +	void (*clean_range)(unsigned long addr, unsigned long size);
> +	void (*inv_range)(unsigned long addr, unsigned long size);
> +	void (*flush_range)(unsigned long addr, unsigned long size);
> +	void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> +					      enum dma_data_direction dir,
> +					      enum dma_noncoherent_ops ops);
> +};

I don't quite see how the fourth operation is used here.
Are there cache controllers that need something beyond
clean/inv/flush?

> +
> +#else
> +
> +static void riscv_noncoherent_register_cache_ops(struct 
> riscv_cache_ops *ops) {}
> +
> +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t 
> size) {}
> +
> +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t 
> size) {}
> +
> +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t 
> size) {}

I think you can drop the #else path here: if there is no
noncoherent DMA, then nothing should ever call these
functions, right?

> +
> +#ifdef CONFIG_RISCV_ISA_ZICBOM
...
> +struct riscv_cache_ops zicbom_cmo_ops = {
> +	.clean_range = &zicbom_cmo_clean_range,
> +	.inv_range = &zicbom_cmo_inval_range,
> +	.flush_range = &zicbom_cmo_flush_range,
> +};
> +#else
> +struct riscv_cache_ops zicbom_cmo_ops = {
> +	.clean_range = NULL,
> +	.inv_range = NULL,
> +	.flush_range = NULL,
> +	.riscv_dma_noncoherent_cmo_ops = NULL,
> +};
> +#endif
> +EXPORT_SYMBOL(zicbom_cmo_ops);

Same here: If the ZICBOM ISA is disabled, nothing should
reference zicbom_cmo_ops. Also, since ZICBOM is a standard
extension, I think it makes sense to always have it enabled,
at least whenever noncoherent DMA is supported, that way
it can be the default that gets used in the absence of any
nonstandard cache controller.

   Arnd
Conor Dooley Jan. 6, 2023, 11:29 p.m. UTC | #3
On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +	.clean_range = &zicbom_cmo_clean_range,
> > +	.inv_range = &zicbom_cmo_inval_range,
> > +	.flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +	.clean_range = NULL,
> > +	.inv_range = NULL,
> > +	.flush_range = NULL,
> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> 
> Same here: If the ZICBOM ISA is disabled, nothing should
> reference zicbom_cmo_ops.

> Also, since ZICBOM is a standard
> extension, I think it makes sense to always have it enabled,
> at least whenever noncoherent DMA is supported, that way
> it can be the default that gets used in the absence of any
> nonstandard cache controller.

While I think of it, this is not possible as Zicbom requires toolchain
support whereas the alternative methods for non-coherent DMA do not.

Thanks,
Conor.
Conor Dooley Jan. 6, 2023, 11:49 p.m. UTC | #4
FWIW:
Reviewed-by: Conor Dooley <conor.dooley@microchip.com>
Arnd Bergmann Jan. 7, 2023, 9:52 p.m. UTC | #5
On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
>> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
>> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>> > +struct riscv_cache_ops zicbom_cmo_ops = {
>> > +	.clean_range = &zicbom_cmo_clean_range,
>> > +	.inv_range = &zicbom_cmo_inval_range,
>> > +	.flush_range = &zicbom_cmo_flush_range,
>> > +};
>> > +#else
>> > +struct riscv_cache_ops zicbom_cmo_ops = {
>> > +	.clean_range = NULL,
>> > +	.inv_range = NULL,
>> > +	.flush_range = NULL,
>> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
>> > +};
>> > +#endif
>> > +EXPORT_SYMBOL(zicbom_cmo_ops);
>> 
>> Same here: If the ZICBOM ISA is disabled, nothing should
>> reference zicbom_cmo_ops.
>
>> Also, since ZICBOM is a standard
>> extension, I think it makes sense to always have it enabled,
>> at least whenever noncoherent DMA is supported, that way
>> it can be the default that gets used in the absence of any
>> nonstandard cache controller.
>
> While I think of it, this is not possible as Zicbom requires toolchain
> support whereas the alternative methods for non-coherent DMA do not.

Ah, I see. Would it be possible to use the same .long trick
as in the other ones though? Something like

#if CONFIG_AS_VERSION >= 23600 /* or whichever version */
     /* proper inline asm */
#else
     /* .long hack */
#endif

That way everyone can use it, and the hack would automatically
go away in a few years after linux requires a newer toolchain.

Alternatively, the entire noncoherent-dma support could be
made to depend on whichever toolchain introduced Zicbom.

     Arnd
Lad, Prabhakar Jan. 7, 2023, 10:10 p.m. UTC | #6
Hi Arnd,

Thank you for the review.

On Fri, Jan 6, 2023 at 10:31 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >
> > The current implementation of CMO was handled using the ALTERNATIVE_X()
> > macro; this was manageable when there were a limited number of platforms
> > using this. Now that we are having more and more platforms coming through
> > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.
> >
> > To avoid such issues this patch switches to use of function pointers
> > instead of ALTERNATIVE_X() macro for cache management (the only draw being
> > performance over the previous approach).
> >
> > void (*clean_range)(unsigned long addr, unsigned long size);
> > void (*inv_range)(unsigned long addr, unsigned long size);
> > void (*flush_range)(unsigned long addr, unsigned long size);
> >
> > The above function pointers are provided to be overridden where platforms
> > using standard approach and for platforms who want handle the operation
> > based on the operation the below function pointer is provided:
> >
> > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >                                       enum dma_data_direction dir,
> >                                       enum dma_noncoherent_ops ops);
> >
> > In the current patch we have moved the ZICBOM and T-Head CMO to use
> > function pointers.
> >
> > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
>
> This looks like a nice improvement! I have a few suggestions
> for improvements, but no objections here.
>
> > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> > index fac5742d1c1e..826b2ba3e61e 100644
> > --- a/arch/riscv/errata/thead/errata.c
> > +++ b/arch/riscv/errata/thead/errata.c
> ...
> > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
> >
> >       riscv_cbom_block_size = L1_CACHE_BYTES;
> >       riscv_noncoherent_supported();
> > +
> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> > +     }
>
> The implementation here looks reasonable, just wonder whether
> the classification as an 'errata' makes sense. I would probably
> consider this a 'driver' at this point, but that's just
> a question of personal preference.
>
zicbom is a CPU feature that doesn't have any DT node and hence no
driver and similarly for T-HEAD SoC. Also the arch_setup_dma_ops()
happens quite early before driver probing due to which we get WARN()
messages during bootup hence I have implemented it as errata; as
errata patching happens quite early.

> For the operations structure, I think a 'static const struct
> riscv_cache_ops' is more intuitive than assigning the
> members individually.
OK.

> > +
> > +enum dma_noncoherent_ops {
> > +     NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> > +     NON_COHERENT_SYNC_DMA_FOR_CPU,
> > +     NON_COHERENT_DMA_PREP,
> > +     NON_COHERENT_DMA_PMEM,
> > +};
> > +
> > +/*
> > + * struct riscv_cache_ops - Structure for CMO function pointers
> > + * @clean_range: Function pointer for clean cache
> > + * @inv_range: Function pointer for invalidate cache
> > + * @flush_range: Function pointer for flushing the cache
> > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who
> > want
> > + *  to handle CMO themselves. If this function pointer is set rest of
> > the
> > + *  function pointers will be NULL.
> > + */
> > +struct riscv_cache_ops {
> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> > +                                           enum dma_data_direction dir,
> > +                                           enum dma_noncoherent_ops ops);
> > +};
>
> I don't quite see how the fourth operation is used here.
> Are there cache controllers that need something beyond
> clean/inv/flush?
>
This is for platforms that dont follow standard cache operations (like
done in patch 5/6) and there drivers decide on the operations
depending on the ops and dir.

> > +
> > +#else
> > +
> > +static void riscv_noncoherent_register_cache_ops(struct
> > riscv_cache_ops *ops) {}
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t
> > size) {}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t
> > size) {}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t
> > size) {}
>
> I think you can drop the #else path here: if there is no
> noncoherent DMA, then nothing should ever call these
> functions, right?
>
Yes it shouldn't be called.

> > +
> > +#ifdef CONFIG_RISCV_ISA_ZICBOM
> ...
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = &zicbom_cmo_clean_range,
> > +     .inv_range = &zicbom_cmo_inval_range,
> > +     .flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
>
> Same here: If the ZICBOM ISA is disabled, nothing should
> reference zicbom_cmo_ops.
OK.

Cheers,
Prabhakar
Conor Dooley Jan. 7, 2023, 10:21 p.m. UTC | #7
On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote:
> On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> >> > +	.clean_range = &zicbom_cmo_clean_range,
> >> > +	.inv_range = &zicbom_cmo_inval_range,
> >> > +	.flush_range = &zicbom_cmo_flush_range,
> >> > +};
> >> > +#else
> >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> >> > +	.clean_range = NULL,
> >> > +	.inv_range = NULL,
> >> > +	.flush_range = NULL,
> >> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> >> > +};
> >> > +#endif
> >> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> >> 
> >> Same here: If the ZICBOM ISA is disabled, nothing should
> >> reference zicbom_cmo_ops.
> >
> >> Also, since ZICBOM is a standard
> >> extension, I think it makes sense to always have it enabled,
> >> at least whenever noncoherent DMA is supported, that way
> >> it can be the default that gets used in the absence of any
> >> nonstandard cache controller.
> >
> > While I think of it, this is not possible as Zicbom requires toolchain
> > support whereas the alternative methods for non-coherent DMA do not.
> 
> Ah, I see. Would it be possible to use the same .long trick
> as in the other ones though? Something like
> 
> #if CONFIG_AS_VERSION >= 23600 /* or whichever version */


>      /* proper inline asm */
> #else
>      /* .long hack */
> #endif
> 
> That way everyone can use it, and the hack would automatically
> go away in a few years after linux requires a newer toolchain.

> Alternatively, the entire noncoherent-dma support could be
> made to depend on whichever toolchain introduced Zicbom.

Ehh, I don't think that's a great idea. It'd require far too recent a
toolchain IMO.

Ideally, in my opinion, we'd just do something like what Drew has
proposed for Zicboz, negating the need for a check at all:
https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/

Been waiting for that to be re-spun and Palmer to accept it before doing
the same thing for Zicbom. At present we have this in the arch Kconfig:

config TOOLCHAIN_HAS_ZICBOM
	bool
	default y
	depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom)
	depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom)
	depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800

config RISCV_ISA_ZICBOM
	bool "Zicbom extension support for non-coherent DMA operation"
	depends on TOOLCHAIN_HAS_ZICBOM

The linker version check is entirely due to the linker having issues if
it sees zicbom in the ISA string in object files.

I'd been intending to do that for Zicbom anyway, so I guess I'll just go
do it & Prabhakar can attach it to his v7..

Thanks,
Conor.
Lad, Prabhakar Jan. 7, 2023, 10:36 p.m. UTC | #8
Hi Conor,

Thank you for the review.

On Fri, Jan 6, 2023 at 11:48 PM Conor Dooley <conor@kernel.org> wrote:
>
> +CC Icenowy
>
> Hey Prabhakar,
>
> On Fri, Jan 06, 2023 at 06:55:21PM +0000, Prabhakar wrote:
> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> >
> > The current implementation of CMO was handled using the ALTERNATIVE_X()
> > macro; this was manageable when there were a limited number of platforms
> > using this. Now that we are having more and more platforms coming through
> > with the CMO the use of the ALTERNATIVE_X() macro becomes unmanageable.
>
> Nitpick time! This opening paragraph is all a little oddly worded IMO.
> How about:
>
> Currently, selecting which CMOs to use on a given platform is done using
> and ALTERNATIVE_X() macro. This was manageable when there were just two
> CMO implementations, but now that there are more and more platforms coming
> needing custom CMOs, the use of the ALTERNATIVE_X() macro is unmanageable.
>
sounds good.

> > To avoid such issues this patch switches to use of function pointers
> > instead of ALTERNATIVE_X() macro for cache management (the only draw being
>
> s/draw/drawback
>
oops.

> > performance over the previous approach).
>
> Did you notice a performance decrease or are you speculating?
> Perhaps Icenowy - who I know benched their implmentation of a new CMO
> macro for THEAD stuff can do so again for this.
>
I actually haven't benchmarked the speeds to tbh I am speculating it
because of the additional checks. Maybe If I export the functions
instead of having them in the header and have static keys for each
callback maybe that will reduce some lag if any. Does that sound good?

> > void (*clean_range)(unsigned long addr, unsigned long size);
> > void (*inv_range)(unsigned long addr, unsigned long size);
> > void (*flush_range)(unsigned long addr, unsigned long size);
> >
> > The above function pointers are provided to be overridden where platforms
> > using standard approach and for platforms who want handle the operation
> > based on the operation the below function pointer is provided:
>
> Think you've left out some words here chief!
> Perhaps s/and for platforms who/. For platforms that/ & on the line
> after, s/operation/direction/ ?
>
I'll re-word it.

> > void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> >                                       enum dma_data_direction dir,
> >                                       enum dma_noncoherent_ops ops);
> >
> > In the current patch we have moved the ZICBOM and T-Head CMO to use
> > function pointers.
>
> "Convert ZICBOM..."
>
OK.

> > Signed-off-by: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > ---
> > v5->v6
> > * New patch
> > ---
> >  arch/riscv/errata/thead/errata.c         | 71 ++++++++++++++++++
> >  arch/riscv/include/asm/dma-noncoherent.h | 83 +++++++++++++++++++++
> >  arch/riscv/include/asm/errata_list.h     | 53 -------------
> >  arch/riscv/kernel/cpufeature.c           |  2 +
> >  arch/riscv/mm/dma-noncoherent.c          | 94 ++++++++++++++++++++++--
> >  arch/riscv/mm/pmem.c                     | 18 ++++-
> >  6 files changed, 260 insertions(+), 61 deletions(-)
> >  create mode 100644 arch/riscv/include/asm/dma-noncoherent.h
> >
> > diff --git a/arch/riscv/errata/thead/errata.c b/arch/riscv/errata/thead/errata.c
> > index fac5742d1c1e..826b2ba3e61e 100644
> > --- a/arch/riscv/errata/thead/errata.c
> > +++ b/arch/riscv/errata/thead/errata.c
> > @@ -8,12 +8,72 @@
> >  #include <linux/module.h>
> >  #include <linux/string.h>
> >  #include <linux/uaccess.h>
> > +#include <asm/dma-noncoherent.h>
> >  #include <asm/alternative.h>
> >  #include <asm/cacheflush.h>
> >  #include <asm/errata_list.h>
> >  #include <asm/patch.h>
> >  #include <asm/vendorid_list.h>
> >
> > +#ifdef CONFIG_ERRATA_THEAD_CMO
> > +/*
> > + * dcache.ipa rs1 (invalidate, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01010      rs1       000      00000  0001011
> > + * dache.iva rs1 (invalida, virtual address)
> > + *   0000001    00110      rs1       000      00000  0001011
> > + *
> > + * dcache.cpa rs1 (clean, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01001      rs1       000      00000  0001011
> > + * dcache.cva rs1 (clean, virtual address)
> > + *   0000001    00100      rs1       000      00000  0001011
> > + *
> > + * dcache.cipa rs1 (clean then invalidate, physical address)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000001    01011      rs1       000      00000  0001011
> > + * dcache.civa rs1 (... virtual address)
> > + *   0000001    00111      rs1       000      00000  0001011
> > + *
> > + * sync.s (make sure all cache operations finished)
> > + * | 31 - 25 | 24 - 20 | 19 - 15 | 14 - 12 | 11 - 7 | 6 - 0 |
> > + *   0000000    11001     00000      000      00000  0001011
> > + */
> > +#define THEAD_inval_A0       ".long 0x0265000b"
> > +#define THEAD_clean_A0       ".long 0x0245000b"
> > +#define THEAD_flush_A0       ".long 0x0275000b"
> > +#define THEAD_SYNC_S ".long 0x0190000b"
> > +
> > +#define THEAD_CMO_OP(_op, _start, _size, _cachesize)                         \
> > +     asm volatile("mv a0, %1\n\t"                                            \
> > +                  "j 2f\n\t"                                                 \
> > +                  "3:\n\t"                                                   \
> > +                  THEAD_##_op##_A0 "\n\t"                                    \
> > +                  "add a0, a0, %0\n\t"                                       \
> > +                  "2:\n\t"                                                   \
> > +                  "bltu a0, %2, 3b\n\t"                                      \
> > +                  THEAD_SYNC_S                                               \
> > +                  : : "r"(_cachesize),                                       \
> > +                      "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),  \
> > +                      "r"((unsigned long)(_start) + (_size))                 \
> > +                  : "a0")
>
> I'm not going to provide a tested-by for the THEAD stuff, as I have no
> metrics - but I booted my usual NFS and did some tyre kicking. Seemed
> grand.
>
\o/

> > +static void thead_cmo_clean_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void thead_cmo_flush_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void thead_cmo_inval_range(unsigned long addr, unsigned long size)
> > +{
> > +     THEAD_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> > +}
> > +#endif
> > +
> >  static bool errata_probe_pbmt(unsigned int stage,
> >                             unsigned long arch_id, unsigned long impid)
> >  {
> > @@ -33,6 +93,8 @@ static bool errata_probe_pbmt(unsigned int stage,
> >  static bool errata_probe_cmo(unsigned int stage,
> >                            unsigned long arch_id, unsigned long impid)
> >  {
> > +     struct riscv_cache_ops thead_cmo_ops;
> > +
> >       if (!IS_ENABLED(CONFIG_ERRATA_THEAD_CMO))
> >               return false;
> >
> > @@ -44,6 +106,15 @@ static bool errata_probe_cmo(unsigned int stage,
> >
> >       riscv_cbom_block_size = L1_CACHE_BYTES;
> >       riscv_noncoherent_supported();
> > +
> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
>
> With CONFIG_ERRATA_THEAD_CMO=n:
>
> /stuff/linux/arch/riscv/errata/thead/errata.c: In function 'errata_probe_cmo':
> /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: error: 'thead_cmo_clean_range' undeclared (first use in this function)
>   112 |                 thead_cmo_ops.clean_range = &thead_cmo_clean_range;
>       |                                              ^~~~~~~~~~~~~~~~~~~~~
> /stuff/linux/arch/riscv/errata/thead/errata.c:112:46: note: each undeclared identifier is reported only once for each function it appears in
> /stuff/linux/arch/riscv/errata/thead/errata.c:113:44: error: 'thead_cmo_inval_range' undeclared (first use in this function)
>   113 |                 thead_cmo_ops.inv_range = &thead_cmo_inval_range;
>       |                                            ^~~~~~~~~~~~~~~~~~~~~
> /stuff/linux/arch/riscv/errata/thead/errata.c:114:46: error: 'thead_cmo_flush_range' undeclared (first use in this function)
>   114 |                 thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>       |                                              ^~~~~~~~~~~~~~~~~~~~~
>
> These functions are only present if CONFIG_ERRATA_THEAD_CMO is set,
> so this cannot be an IS_ENABLED() as things stand.
>
OK, I'll fix this.

> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>
> As Arnd suggested, I'd rather see a `static const struct
> riscv_cache_ops` type deal too, so you can just call register()
> directly.
>
Sure will do that.

> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
> > +     }
> > +
> >       return true;
> >  }
> >
> > diff --git a/arch/riscv/include/asm/dma-noncoherent.h b/arch/riscv/include/asm/dma-noncoherent.h
> > new file mode 100644
> > index 000000000000..a2af863d2608
> > --- /dev/null
> > +++ b/arch/riscv/include/asm/dma-noncoherent.h
> > @@ -0,0 +1,83 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright (C) 2022 Renesas Electronics Corp.
> > + */
> > +
> > +#ifndef __ASM_DMA_NONCOHERENT_H
> > +#define __ASM_DMA_NONCOHERENT_H
> > +
> > +#include <linux/dma-direct.h>
> > +
> > +enum dma_noncoherent_ops {
> > +     NON_COHERENT_SYNC_DMA_FOR_DEVICE = 0,
> > +     NON_COHERENT_SYNC_DMA_FOR_CPU,
> > +     NON_COHERENT_DMA_PREP,
> > +     NON_COHERENT_DMA_PMEM,
> > +};
> > +
> > +/*
> > + * struct riscv_cache_ops - Structure for CMO function pointers
>
> Drop the "function pointers" from everything here IMO.
>
OK.

> > + * @clean_range: Function pointer for clean cache
> > + * @inv_range: Function pointer for invalidate cache
> > + * @flush_range: Function pointer for flushing the cache
> > + * @riscv_dma_noncoherent_cmo_ops: Function pointer for platforms who want
> > + *  to handle CMO themselves. If this function pointer is set rest of the
> > + *  function pointers will be NULL.
>
> Will be NULL? Or must be NULL?
> Assuming you keep it, as I see Arnd is questioning it, I think a better
> description is needed for this one. And a rename of the function name,
> the one you have right now is oddly juxtaposed and a bit confusing.
> I don't really have something much better to suggest, so I am gonna use
> cmo_omnibus here...
>
> @cmo_omnibus: For platforms whose CMO capabilities do not align with the
>               standard cache management operations. If set, the specific
>               ops must be left unset.
>
OK, I'll update the description as above.

> Circling back ages later, cmo_universal()?
>
Sounds good to me.

> > +struct riscv_cache_ops {
> > +     void (*clean_range)(unsigned long addr, unsigned long size);
> > +     void (*inv_range)(unsigned long addr, unsigned long size);
> > +     void (*flush_range)(unsigned long addr, unsigned long size);
> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
> > +                                           enum dma_data_direction dir,
> > +                                           enum dma_noncoherent_ops ops);
> > +};
> > +
> > +extern struct riscv_cache_ops zicbom_cmo_ops;
> > +
> > +#ifdef CONFIG_RISCV_DMA_NONCOHERENT
> > +
> > +extern struct riscv_cache_ops noncoherent_cache_ops;
> > +
> > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops);
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.clean_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.clean_range(addr, size);
> > +     }
> > +}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.flush_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.flush_range(addr, size);
> > +     }
> > +}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size)
> > +{
> > +     if (noncoherent_cache_ops.inv_range) {
> > +             unsigned long addr = (unsigned long)vaddr;
> > +
> > +             noncoherent_cache_ops.inv_range(addr, size);
> > +     }
> > +}
> > +
> > +#else
> > +
> > +static void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops) {}
> > +
> > +static inline void riscv_dma_noncoherent_clean(void *vaddr, size_t size) {}
> > +
> > +static inline void riscv_dma_noncoherent_flush(void *vaddr, size_t size) {}
> > +
> > +static inline void riscv_dma_noncoherent_inval(void *vaddr, size_t size) {}
>
> Can these ever be used if DMA_NONCOHERENT is not set? I think
> riscv/mm/Makefile gates their usage on the symbol, no?
>
I think I can get rid of these.

> > +
> > +#endif
> > +
> > +#endif       /* __ASM_DMA_NONCOHERENT_H */
> > diff --git a/arch/riscv/include/asm/errata_list.h b/arch/riscv/include/asm/errata_list.h
> > index 4180312d2a70..ae3fc8b80edd 100644
>
> [...]
>
> > diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
> > index d919efab6eba..d9445c266bfd 100644
> > --- a/arch/riscv/mm/dma-noncoherent.c
> > +++ b/arch/riscv/mm/dma-noncoherent.c
> > @@ -9,23 +9,82 @@
> >  #include <linux/dma-map-ops.h>
> >  #include <linux/mm.h>
> >  #include <asm/cacheflush.h>
> > +#include <asm/dma-noncoherent.h>
> >
> >  static bool noncoherent_supported;
> >
> > +struct riscv_cache_ops noncoherent_cache_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +EXPORT_SYMBOL(noncoherent_cache_ops);
>
> These exports should be _GPL, no? The file is GPLed. Ditto the others.
>
Agreed.

> > +#ifdef CONFIG_RISCV_ISA_ZICBOM
> > +#define ZICBOM_CMO_OP(_op, _start, _size, _cachesize)                                \
> > +     asm volatile("mv a0, %1\n\t"                                            \
> > +                  "j 2f\n\t"                                                 \
> > +                  "3:\n\t"                                                   \
> > +                  "cbo." __stringify(_op) " (a0)\n\t"                        \
> > +                  "add a0, a0, %0\n\t"                                       \
> > +                  "2:\n\t"                                                   \
> > +                  "bltu a0, %2, 3b\n\t"                                      \
> > +                  : : "r"(_cachesize),                                       \
> > +                      "r"((unsigned long)(_start) & ~((_cachesize) - 1UL)),  \
> > +                      "r"((unsigned long)(_start) + (_size))                 \
> > +                  : "a0")
> > +
> > +static void zicbom_cmo_clean_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(clean, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void zicbom_cmo_flush_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(flush, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +static void zicbom_cmo_inval_range(unsigned long addr, unsigned long size)
> > +{
> > +     ZICBOM_CMO_OP(inval, addr, size, riscv_cbom_block_size);
> > +}
> > +
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = &zicbom_cmo_clean_range,
> > +     .inv_range = &zicbom_cmo_inval_range,
> > +     .flush_range = &zicbom_cmo_flush_range,
> > +};
> > +#else
> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > +     .clean_range = NULL,
> > +     .inv_range = NULL,
> > +     .flush_range = NULL,
> > +     .riscv_dma_noncoherent_cmo_ops = NULL,
> > +};
> > +#endif
> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> > +
> >  void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
> >                             enum dma_data_direction dir)
> >  {
> >       void *vaddr = phys_to_virt(paddr);
> >
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> > +                                                                 NON_COHERENT_SYNC_DMA_FOR_DEVICE);
> > +             return;
> > +     }
> > +
> >       switch (dir) {
> >       case DMA_TO_DEVICE:
> > -             ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_clean(vaddr, size);
> >               break;
> >       case DMA_FROM_DEVICE:
> > -             ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_clean(vaddr, size);
> >               break;
> >       case DMA_BIDIRECTIONAL:
> > -             ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_flush(vaddr, size);
> >               break;
> >       default:
> >               break;
> > @@ -37,12 +96,18 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
> >  {
> >       void *vaddr = phys_to_virt(paddr);
> >
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(vaddr, size, dir,
> > +                                                                 NON_COHERENT_SYNC_DMA_FOR_CPU);
> > +             return;
> > +     }
> > +
> >       switch (dir) {
> >       case DMA_TO_DEVICE:
> >               break;
> >       case DMA_FROM_DEVICE:
> >       case DMA_BIDIRECTIONAL:
> > -             ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
> > +             riscv_dma_noncoherent_flush(vaddr, size);
> >               break;
> >       default:
> >               break;
> > @@ -53,7 +118,13 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
> >  {
> >       void *flush_addr = page_address(page);
> >
> > -     ALT_CMO_OP(flush, flush_addr, size, riscv_cbom_block_size);
> > +     if (noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops) {
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops(flush_addr, size, -1,
> > +                                                                 NON_COHERENT_DMA_PREP);
> > +             return;
> > +     }
> > +
> > +     riscv_dma_noncoherent_flush(flush_addr, size);
> >  }
> >
> >  void arch_setup_dma_ops(struct device *dev, u64 dma_base, u64 size,
> > @@ -78,3 +149,16 @@ void riscv_noncoherent_supported(void)
> >            "Non-coherent DMA support enabled without a block size\n");
> >       noncoherent_supported = true;
> >  }
>
> Barring the function name, which I really don't like - it really is
> confusingly named. Something like cmo_universal() would get it's point
> across a bit better I think.
>
OK, I'll rename riscv_dma_noncoherent_cmo_ops to cmo_universal.

> > +void riscv_noncoherent_register_cache_ops(struct riscv_cache_ops *ops)
> > +{
> > +     if (!ops)
> > +             return;
> > +
> > +     if (ops->riscv_dma_noncoherent_cmo_ops)
> > +             noncoherent_cache_ops.riscv_dma_noncoherent_cmo_ops =
> > +                             ops->riscv_dma_noncoherent_cmo_ops;
> > +     else
> > +             noncoherent_cache_ops = *ops;
>
> Can we just chop off the extra step here? Behaviourally this is going to
> be no different whether someone sets the universal op and the individual
> ones anyway.
>
Agreed.

> > +}
> > +EXPORT_SYMBOL(riscv_noncoherent_register_cache_ops);
>
> _GPL again I think!
>
OK.

> I'm happy with this approach, modulo the minor bits I've pointed out.
> I do really like that it takes us away from messing with an asm
> alternative every time someone wants to do this stuff differently or for
> a new platform.
>
\o/

> I would like Heiko to have a look at what you've done here, but ye looks
> promising IMO.
>
> Thanks,
> Conor.
>

Cheers,
Prabhakar
Arnd Bergmann Jan. 8, 2023, 12:07 a.m. UTC | #9
On Sat, Jan 7, 2023, at 23:10, Lad, Prabhakar wrote:

>> > +
>> > +     memset(&thead_cmo_ops, 0x0, sizeof(thead_cmo_ops));
>> > +     if (IS_ENABLED(CONFIG_ERRATA_THEAD_CMO)) {
>> > +             thead_cmo_ops.clean_range = &thead_cmo_clean_range;
>> > +             thead_cmo_ops.inv_range = &thead_cmo_inval_range;
>> > +             thead_cmo_ops.flush_range = &thead_cmo_flush_range;
>> > +             riscv_noncoherent_register_cache_ops(&thead_cmo_ops);
>> > +     }
>>
>> The implementation here looks reasonable, just wonder whether
>> the classification as an 'errata' makes sense. I would probably
>> consider this a 'driver' at this point, but that's just
>> a question of personal preference.
>>
> zicbom is a CPU feature that doesn't have any DT node and hence no
> driver and similarly for T-HEAD SoC.

A driver does not have to be a 'struct platform_driver' that
matches to a device node, my point was more about what to
name it, regardless of how the code is entered.

> Also the arch_setup_dma_ops()
> happens quite early before driver probing due to which we get WARN()
> messages during bootup hence I have implemented it as errata; as
> errata patching happens quite early.

But there is no more patching here, just setting the
function pointers, right?

>> > +struct riscv_cache_ops {
>> > +     void (*clean_range)(unsigned long addr, unsigned long size);
>> > +     void (*inv_range)(unsigned long addr, unsigned long size);
>> > +     void (*flush_range)(unsigned long addr, unsigned long size);
>> > +     void (*riscv_dma_noncoherent_cmo_ops)(void *vaddr, size_t size,
>> > +                                           enum dma_data_direction dir,
>> > +                                           enum dma_noncoherent_ops ops);
>> > +};
>>
>> I don't quite see how the fourth operation is used here.
>> Are there cache controllers that need something beyond
>> clean/inv/flush?
>>
> This is for platforms that dont follow standard cache operations (like
> done in patch 5/6) and there drivers decide on the operations
> depending on the ops and dir.

My feeling is that the set of operations that get called should
not depend on the cache controller but at best the CPU. I tried to
enumerate how zicbom and ax45 differ here, and how that compares
to other architectures:

                  zicbom      ax45,mips,arc      arm           arm64
fromdevice      clean/flush   inval/inval   inval/inval   clean/inval
todevice        clean/-       clean/-       clean/-       clean/-
bidi            flush/flush   flush/inval   clean/inval   clean/inval

So everyone does the same operation for DMA_TO_DEVICE, but
they differ in the DMA_FROM_DEVICE handling, for reasons I
don't quite see:

Your ax45 code does the same as arc and mips. arm and
arm64 skip invalidating the cache before bidi mappings,
but arm has a FIXME comment about that. arm64 does a
'clean' instead of 'inval' when mapping a fromdevice
page, which seems valid but slower than necessary.

Could the zicbom operations be changed to do the same
things as the ax45/mips/arc ones, or are there specific
details in the zicbom spec that require this?

     Arnd
Conor Dooley Jan. 8, 2023, 4:37 p.m. UTC | #10
On Sat, Jan 07, 2023 at 10:21:52PM +0000, Conor Dooley wrote:
> On Sat, Jan 07, 2023 at 10:52:55PM +0100, Arnd Bergmann wrote:
> > On Sat, Jan 7, 2023, at 00:29, Conor Dooley wrote:
> > > On Fri, Jan 06, 2023 at 11:31:33PM +0100, Arnd Bergmann wrote:
> > >> On Fri, Jan 6, 2023, at 19:55, Prabhakar wrote:
> > >> > From: Lad Prabhakar <prabhakar.mahadev-lad.rj@bp.renesas.com>
> > >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > >> > +	.clean_range = &zicbom_cmo_clean_range,
> > >> > +	.inv_range = &zicbom_cmo_inval_range,
> > >> > +	.flush_range = &zicbom_cmo_flush_range,
> > >> > +};
> > >> > +#else
> > >> > +struct riscv_cache_ops zicbom_cmo_ops = {
> > >> > +	.clean_range = NULL,
> > >> > +	.inv_range = NULL,
> > >> > +	.flush_range = NULL,
> > >> > +	.riscv_dma_noncoherent_cmo_ops = NULL,
> > >> > +};
> > >> > +#endif
> > >> > +EXPORT_SYMBOL(zicbom_cmo_ops);
> > >> 
> > >> Same here: If the ZICBOM ISA is disabled, nothing should
> > >> reference zicbom_cmo_ops.
> > >
> > >> Also, since ZICBOM is a standard
> > >> extension, I think it makes sense to always have it enabled,
> > >> at least whenever noncoherent DMA is supported, that way
> > >> it can be the default that gets used in the absence of any
> > >> nonstandard cache controller.
> > >
> > > While I think of it, this is not possible as Zicbom requires toolchain
> > > support whereas the alternative methods for non-coherent DMA do not.
> > 
> > Ah, I see. Would it be possible to use the same .long trick
> > as in the other ones though? Something like
> > 
> > #if CONFIG_AS_VERSION >= 23600 /* or whichever version */
> 
> 
> >      /* proper inline asm */
> > #else
> >      /* .long hack */
> > #endif
> > 
> > That way everyone can use it, and the hack would automatically
> > go away in a few years after linux requires a newer toolchain.
> 
> > Alternatively, the entire noncoherent-dma support could be
> > made to depend on whichever toolchain introduced Zicbom.
> 
> Ehh, I don't think that's a great idea. It'd require far too recent a
> toolchain IMO.
> 
> Ideally, in my opinion, we'd just do something like what Drew has
> proposed for Zicboz, negating the need for a check at all:
> https://lore.kernel.org/linux-riscv/20221027130247.31634-4-ajones@ventanamicro.com/
> 
> Been waiting for that to be re-spun and Palmer to accept it before doing
> the same thing for Zicbom. At present we have this in the arch Kconfig:
> 
> config TOOLCHAIN_HAS_ZICBOM
> 	bool
> 	default y
> 	depends on !64BIT || $(cc-option,-mabi=lp64 -march=rv64ima_zicbom)
> 	depends on !32BIT || $(cc-option,-mabi=ilp32 -march=rv32ima_zicbom)
> 	depends on LLD_VERSION >= 150000 || LD_VERSION >= 23800
> 
> config RISCV_ISA_ZICBOM
> 	bool "Zicbom extension support for non-coherent DMA operation"
> 	depends on TOOLCHAIN_HAS_ZICBOM
> 
> The linker version check is entirely due to the linker having issues if
> it sees zicbom in the ISA string in object files.
> 
> I'd been intending to do that for Zicbom anyway, so I guess I'll just go
> do it & Prabhakar can attach it to his v7..

Should pop up here in a few minutes..
https://lore.kernel.org/linux-riscv/20230108163356.3063839-1-conor@kernel.org/

Hopefully that both works & makes life easier. Certainly from a CI
coverage point of view, relaxing toolchain requirements makes *my* life
easier!
hch Jan. 10, 2023, 7:01 a.m. UTC | #11
On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
> I had another look at the arm64 side, which (like the zicbom
> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
> as that has changed not that long ago, see
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572

which IIRC has been reverted recently.

> I'm still not sure what the correct set of operations has
> to be, but nothing in that patch description sounds ISA
> or even microarchitecture specific.

Nothing is ISA specific, and the only micro architecture related thing
is if the specific core can speculate memory accesses.  See the table
in arch/arc/mm/dma.c for details.

And as commented on the arm64 patch I really hate architectures getting
creative here, as I'd much prefer to move the choice of primitives to
the core DMA code and just provide helpers to invalidate/writeback/
writeback+invalidate from the architectures eventually.
Arnd Bergmann Jan. 10, 2023, 3:03 p.m. UTC | #12
On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote:
> On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
>> I had another look at the arm64 side, which (like the zicbom
>> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
>> as that has changed not that long ago, see
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572
>
> which IIRC has been reverted recently.

To clarify: I was looking at arch_sync_dma_for_device(), which
changed from 'invalidate' to 'clean' last June in commit
c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers
at start of DMA transfer"). I don't see  a revert for that.

The one that was reverted recently is arch_dma_prep_coherent, which
was changed and reverted in

c44094eee32 Aug 23 2022 flush->clean
b7d9aae4048 Dec 6  2022 clean->flush

I'm primarily interested in the streaming mappings (arch_sync_*)
at the moment.

>> I'm still not sure what the correct set of operations has
>> to be, but nothing in that patch description sounds ISA
>> or even microarchitecture specific.
>
> Nothing is ISA specific, and the only micro architecture related thing
> is if the specific core can speculate memory accesses.  See the table
> in arch/arc/mm/dma.c for details.
>
> And as commented on the arm64 patch I really hate architectures getting
> creative here, as I'd much prefer to move the choice of primitives to
> the core DMA code and just provide helpers to invalidate/writeback/
> writeback+invalidate from the architectures eventually.

Agreed, that's why I particularly don't like the idea of giving SoC
specific L2$ drivers the option of making custom choices here.

The arch_dma_prep_coherent() change is arguably arm64 specific
because it is only needed for machines sharing memory with EL3
firmware that enforces buffer ownership, but even for that it would
be nice to have a central definition and some architecture specific
flag that picks one behavior or the other. Same thing for the
speculative access difference.

I looked at all the implementations now and put them in a table[1]
to see what the differences are. The only bit that I think needs
discussion is the dma_sync_single_for_device(DMA_FROM_DEVICE) op
that I mentioned above. I see that arm64, csky, powerpc, riscv
and parisc all write out at least partical cache lines first to
avoid losing dirty data in the part that is not written by the
device[2][3], while the other ones don't[4]. 

The other differences I found are:

- arm and arm64 use clean instead of flush for
  dma_sync_single_for_device(DMA_BIDIRECTIONAL).
  This seems harmless, since there is another
  invalidation at the _for_cpu() step.

- hexagon, m68k, openrisc and sh don't deal with
  speculative loads

- m68k, nios2, openrisc, parisc and xtensa use
  flush instead of clean for DMA_TO_DEVICE, presumably
  because they don't have a flush without invalidate.

- microblaze, parisc and powerpc use the same function
  for _for_device and _for_cpu, which is slightly wasteful
  but harmless.

- openrisc lacks support for bidirectional mappings,
  which is a bug

- sparc32 and xtensa only supports writethrough caches
  and consequently skips the clean before DMA but
  instead invalidate the buffer in the _for_cpu
  operation.

I also see that at least arc, arm, mips and riscv all want
CPU specific cache management operations to be registered
at boot time. While Russell had some concerns about your
suggestion to generalize the arm version, we could start
by moving the abstracted riscv version into
kernel/dma/direct.c and make sure it can be shared with
at least mips and arc, provided that we can agree on the
DMA_FROM_DEVICE behavior.

     Arnd

[1] https://docs.google.com/spreadsheets/d/1qDuMqB6TnRTj_CgUwgIIm_RJ6EZO76qohpTJUMQjUEo/edit#gid=0
[2] https://git.kernel.org/torvalds/c/c50f11c6196f
[3] https://git.kernel.org/torvalds/c/03d70617b8a7
[4] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Will Deacon Jan. 10, 2023, 3:11 p.m. UTC | #13
On Tue, Jan 10, 2023 at 04:03:06PM +0100, Arnd Bergmann wrote:
> On Tue, Jan 10, 2023, at 08:01, Christoph Hellwig wrote:
> > On Mon, Jan 09, 2023 at 01:59:12PM +0100, Arnd Bergmann wrote:
> >> I had another look at the arm64 side, which (like the zicbom
> >> variant) uses 'clean' on dma_sync_single_for_device(DMA_FROM_DEVICE),
> >> as that has changed not that long ago, see
> >> 
> >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c50f11c6196f45c92ca48b16a5071615d4ae0572
> >
> > which IIRC has been reverted recently.
> 
> To clarify: I was looking at arch_sync_dma_for_device(), which
> changed from 'invalidate' to 'clean' last June in commit
> c50f11c6196f ("arm64: mm: Don't invalidate FROM_DEVICE buffers
> at start of DMA transfer"). I don't see  a revert for that.
> 
> The one that was reverted recently is arch_dma_prep_coherent, which
> was changed and reverted in
> 
> c44094eee32 Aug 23 2022 flush->clean
> b7d9aae4048 Dec 6  2022 clean->flush
> 
> I'm primarily interested in the streaming mappings (arch_sync_*)
> at the moment.

Just as an FYI, but we plan to revert the revert (i.e. go back to 'clean')
here once Qualcomm's modem firmware loader has been updated:

https://lore.kernel.org/r/20230109034843.23759-1-quic_sibis@quicinc.com

Will
hch Jan. 21, 2023, 2:37 p.m. UTC | #14
On Fri, Jan 20, 2023 at 06:04:37PM +0100, Arnd Bergmann wrote:
> Having looked at this some more, I see that the powerpc
> version is a bit problematic here as well: this one
> flushes the partial cache lines before and after the
> DMA transfer, while only invalidating the full
> cache lines.

That feels really odd, and might be worth a bug report to the
PPC maintainers.

> Obviously there is no winning either way if the same
> cache line gets written by both CPU and device, I'm
> just trying to figure out what behavior we actually
> want here.

There isn't, and that's why we require DMAed regions to be cache line
aligned.

> Aside from the question for how to handle flush vs invalidate
> on DMA_FROM_DEVICE, I'm still trying to figure out how to
> best handle highmem with architecture specific cache management
> operations. The easy approach would be to leave that up
> to the architecture, passing only a physical address to
> the flush function.

I suspect that is a good enough first step.  Especially as I remember
that some architectures have physical address based cache management
anyway (unless we removed them in the meantime).

> A nicer interface might be to move the
> loop over highmem pages out into common code, flush
> lowmem pages by virtual addresss, and have a separate
> callback for highmem pages that takes a page pointer,
> like

I'd rather avoid multiple callbacks if we can.  But maybe solve
the simple problem first and just pass the paddr and then
iterate from there.

> 
> struct dma_cache_ops {
>         void (*dma_cache_wback_inv)(void *start, unsigned long sz);
>         void (*dma_cache_inv)(void *start, unsigned long sz);
>         void (*dma_cache_wback)(void *start, unsigned long sz);
> #ifdef CONFIG_HIGHMEM
>         void (*dma_cache_wback_inv_high_page)(struct page *, size_t start, unsigned long sz);
>         void (*dma_cache_inv_high_page)(struct page *, size_t start, unsigned long sz);
>         void (*dma_cache_wback_high_page)(struct page *, size_t start, unsigned long sz);

Btw, I really don't think these should be indirect calls.  
For sane architectures there should be exactly one way to call them,
and the onces that have different implementations really should be
using alternatives instead of expensive indirect calls.
hch Jan. 22, 2023, 7:27 a.m. UTC | #15
On Sat, Jan 21, 2023 at 08:30:23PM +0100, Arnd Bergmann wrote:
> > That feels really odd, and might be worth a bug report to the
> > PPC maintainers.
> 
> Right, my first step would be to change all of the current
> outliers to use the same set of operations where possible.

Sounds good.

> > I'd rather avoid multiple callbacks if we can.  But maybe solve
> > the simple problem first and just pass the paddr and then
> > iterate from there.
> 
> Ok, fair enough. This means we can't easily put the kmap_atomic()
> into common code for highmem, though the per-page loop would
> still work.

Yes.  Given how messy many of the ops are I think one step at a time
is always good.

> I was thinking of using STATIC_CALL() as an optimization here, which
> I find easier to read and understand than alternatives. One advantage
> here is that this allows the actual cache operations to be declared
> locally in the architecture without letting drivers call them,
> but still update the common code to work without indirect branches.
> 
> The main downside is that this is currently only optimized on
> powerpc and x86, both of which don't actually need CPU specific
> callbacks. ARC, ARM, and MIPS on the other hand already
> have indirect function pointers, RISC-V would likely benefit the
> most from either alternatives or static_call, as it already
> uses alternatives and has one implementation that is clearly
> preferred over the others.

For now I'd just keep doing direct calls into the arch code, just
for the lower level invalidate, writeback, invalidate+writeback
calls as that helps cementinc the logic of which of those to use
in well documented core code.

And I'm not really sure I'd like to go beyond that - making it too
easy pluggable will make people feel more comfortable doing stupid
things here.  And yes, maybe that's personal because I've warned
the RISC-V people years ago that they'll need architectural
cache management instructions yesterday and the answer was that
no one is going to use them on modern CPUs.  *sigh*
hch Jan. 23, 2023, 2:46 p.m. UTC | #16
On Sun, Jan 22, 2023 at 12:04:35PM +0100, Arnd Bergmann wrote:
> > And I'm not really sure I'd like to go beyond that - making it too
> > easy pluggable will make people feel more comfortable doing stupid
> > things here. 
> 
> I fear the bigger risk is still making the functions callable
> from device driver code than it is to make the functions
> globally settable.
> 
> You introduced the mips version in f8c55dc6e828 ("MIPS: use generic
> dma noncoherent ops for simple noncoherent platforms"), which
> was clearly meant as an implementation detail, yet we already
> have a driver that slipped in with 3bdffa8ffb45 ("Input: Add
> N64 controller driver") that just calls this directly rather
> than using the dma-mapping interface.

MIPS actually has a bit of a history of these odd bypasses that
it seems like this driver copied.  rmk has been very worried by
this bypassing, and in general I agree.  But there's only so
much we can do except for auditing drivers.  Especially as none of
these helpers is exported and built-in only drivers are quite rare.

> > And yes, maybe that's personal because I've warned
> > the RISC-V people years ago that they'll need architectural
> > cache management instructions yesterday and the answer was that
> > no one is going to use them on modern CPUs.  *sigh*
> 
> To be fair, from the ISA point of view, it really shouldn't
> be necessary as long as you have a sane SoC design.
> In practice there are always chips that are cutting corners,
> or use the new CPU core as a drop-in for an existing
> design. Arm SBSA tried to enforce the same thing and also
> failed for pretty much the same reason.

Not wiring up IP blocks for cache coherency is cheap.  So it totally
makes sense for dirt cheap SOCs, or even for low performance periphals
in general.  The GPU folks also believe they can make some things
faster by deliberately turning coherency off on SOCs that support
coherency.

In addition to that cache maintainance is absolutely needed for NVDIMM
support.