Message ID | 20210406210125.241-1-shiraz.saleem@intel.com |
---|---|
Headers | show |
Series | Add Intel Ethernet Protocol Driver for RDMA (irdma) | expand |
> Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA > (irdma) > > On Tue, Apr 06, 2021 at 04:01:02PM -0500, Shiraz Saleem wrote: > > Dave Ertman (4): > > iidc: Introduce iidc.h > > ice: Initialize RDMA support > > ice: Implement iidc operations > > ice: Register auxiliary device to provide RDMA > > > > Michael J. Ruhl (1): > > RDMA/irdma: Add dynamic tracing for CM > > > > Mustafa Ismail (13): > > RDMA/irdma: Register auxiliary driver and implement private channel > > OPs > > RDMA/irdma: Implement device initialization definitions > > RDMA/irdma: Implement HW Admin Queue OPs > > RDMA/irdma: Add HMC backing store setup functions > > RDMA/irdma: Add privileged UDA queue implementation > > RDMA/irdma: Add QoS definitions > > RDMA/irdma: Add connection manager > > RDMA/irdma: Add PBLE resource manager > > RDMA/irdma: Implement device supported verb APIs > > RDMA/irdma: Add RoCEv2 UD OP support > > RDMA/irdma: Add user/kernel shared libraries > > RDMA/irdma: Add miscellaneous utility definitions > > RDMA/irdma: Add ABI definitions > > > > Shiraz Saleem (5): > > ice: Add devlink params support > > i40e: Prep i40e header for aux bus conversion > > i40e: Register auxiliary devices to provide RDMA > > RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw > > RDMA/irdma: Update MAINTAINERS file > > This doesn't apply, and I don't really know why: > > Applying: iidc: Introduce iidc.h > Applying: ice: Initialize RDMA support > Applying: ice: Implement iidc operations > Applying: ice: Register auxiliary device to provide RDMA > Applying: ice: Add devlink params support > Applying: i40e: Prep i40e header for aux bus conversion > Applying: i40e: Register auxiliary devices to provide RDMA > Applying: RDMA/irdma: Register auxiliary driver and implement private channel > OPs > Applying: RDMA/irdma: Implement device initialization definitions > Applying: RDMA/irdma: Implement HW Admin Queue OPs > Applying: RDMA/irdma: Add HMC backing store setup functions > Applying: RDMA/irdma: Add privileged UDA queue implementation > Applying: RDMA/irdma: Add QoS definitions > Applying: RDMA/irdma: Add connection manager > Applying: RDMA/irdma: Add PBLE resource manager > Applying: RDMA/irdma: Implement device supported verb APIs > Applying: RDMA/irdma: Add RoCEv2 UD OP support > Applying: RDMA/irdma: Add user/kernel shared libraries > Applying: RDMA/irdma: Add miscellaneous utility definitions > Applying: RDMA/irdma: Add dynamic tracing for CM > Applying: RDMA/irdma: Add ABI definitions > Applying: RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw Using index > info to reconstruct a base tree... > error: removal patch leaves file contents > error: drivers/infiniband/hw/i40iw/Kconfig: patch does not apply > > Can you investigate and fix it? Perhaps using a 9 year old version of git is the > problem? > Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for review that this one doesn't apply. I can resend without it if your trying to apply. Shiraz
> Subject: RE: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA > (irdma) > > > Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for > > RDMA > > (irdma) > > > > On Tue, Apr 06, 2021 at 04:01:02PM -0500, Shiraz Saleem wrote: > > > Dave Ertman (4): > > > iidc: Introduce iidc.h > > > ice: Initialize RDMA support > > > ice: Implement iidc operations > > > ice: Register auxiliary device to provide RDMA > > > > > > Michael J. Ruhl (1): > > > RDMA/irdma: Add dynamic tracing for CM > > > > > > Mustafa Ismail (13): > > > RDMA/irdma: Register auxiliary driver and implement private channel > > > OPs > > > RDMA/irdma: Implement device initialization definitions > > > RDMA/irdma: Implement HW Admin Queue OPs > > > RDMA/irdma: Add HMC backing store setup functions > > > RDMA/irdma: Add privileged UDA queue implementation > > > RDMA/irdma: Add QoS definitions > > > RDMA/irdma: Add connection manager > > > RDMA/irdma: Add PBLE resource manager > > > RDMA/irdma: Implement device supported verb APIs > > > RDMA/irdma: Add RoCEv2 UD OP support > > > RDMA/irdma: Add user/kernel shared libraries > > > RDMA/irdma: Add miscellaneous utility definitions > > > RDMA/irdma: Add ABI definitions > > > > > > Shiraz Saleem (5): > > > ice: Add devlink params support > > > i40e: Prep i40e header for aux bus conversion > > > i40e: Register auxiliary devices to provide RDMA > > > RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw > > > RDMA/irdma: Update MAINTAINERS file > > > > This doesn't apply, and I don't really know why: > > > > Applying: iidc: Introduce iidc.h > > Applying: ice: Initialize RDMA support > > Applying: ice: Implement iidc operations > > Applying: ice: Register auxiliary device to provide RDMA > > Applying: ice: Add devlink params support > > Applying: i40e: Prep i40e header for aux bus conversion > > Applying: i40e: Register auxiliary devices to provide RDMA > > Applying: RDMA/irdma: Register auxiliary driver and implement private > > channel OPs > > Applying: RDMA/irdma: Implement device initialization definitions > > Applying: RDMA/irdma: Implement HW Admin Queue OPs > > Applying: RDMA/irdma: Add HMC backing store setup functions > > Applying: RDMA/irdma: Add privileged UDA queue implementation > > Applying: RDMA/irdma: Add QoS definitions > > Applying: RDMA/irdma: Add connection manager > > Applying: RDMA/irdma: Add PBLE resource manager > > Applying: RDMA/irdma: Implement device supported verb APIs > > Applying: RDMA/irdma: Add RoCEv2 UD OP support > > Applying: RDMA/irdma: Add user/kernel shared libraries > > Applying: RDMA/irdma: Add miscellaneous utility definitions > > Applying: RDMA/irdma: Add dynamic tracing for CM > > Applying: RDMA/irdma: Add ABI definitions > > Applying: RDMA/irdma: Add irdma Kconfig/Makefile and remove i40iw > > Using index info to reconstruct a base tree... > > error: removal patch leaves file contents > > error: drivers/infiniband/hw/i40iw/Kconfig: patch does not apply > > > > Can you investigate and fix it? Perhaps using a 9 year old version of > > git is the problem? > > > > Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for > review that this one doesn't apply. > > I can resend without it if your trying to apply. > I resend it without using --irreversible-delete. Hopefully, it applies cleanly now.
On Tue, Apr 06, 2021 at 11:30:23PM +0000, Saleem, Shiraz wrote: > Hi Jason - I think its because I used --irreversible-delete flag in git format-patch for review that this one doesn't apply. I doubt it > I can resend without it if your trying to apply. Now it is too big to go to the mailing lists Jason
On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > Add a new generic runtime devlink parameter 'rdma_protocol' > and use it in ice PCI driver. Configuration changes > result in unplugging the auxiliary RDMA device and re-plugging > it with updated values for irdma auxiiary driver to consume at > drv.probe() > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > .../networking/devlink/devlink-params.rst | 6 ++ > Documentation/networking/devlink/ice.rst | 13 +++ > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 +++++++++++++++++++++- > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > include/net/devlink.h | 4 + > net/core/devlink.c | 5 ++ > 7 files changed, 125 insertions(+), 2 deletions(-) > > diff --git a/Documentation/networking/devlink/devlink-params.rst b/Documentation/networking/devlink/devlink-params.rst > index 54c9f10..0b454c3 100644 > +++ b/Documentation/networking/devlink/devlink-params.rst > @@ -114,3 +114,9 @@ own name. > will NACK any attempt of other host to reset the device. This parameter > is useful for setups where a device is shared by different hosts, such > as multi-host setup. > + * - ``rdma_protocol`` > + - string > + - Selects the RDMA protocol selected for multi-protocol devices. > + - ``iwarp`` iWARP > + - ``roce`` RoCE > + - ``ib`` Infiniband I'm still not sure this belongs in devlink. What about devices that support roce and iwarp concurrently? There is nothing at the protocol level that precludes this - doesn't this device allow it? I know Parav is looking at the general problem of how to customize what aux devices are created, that may be a better fit for this. Can you remove the devlink parts to make progress? Jason
> Subject: Re: [PATCH v4 00/23] Add Intel Ethernet Protocol Driver for RDMA > (irdma) > > On Tue, Apr 06, 2021 at 11:30:23PM +0000, Saleem, Shiraz wrote: > > > Hi Jason - I think its because I used --irreversible-delete flag in git format-patch > for review that this one doesn't apply. > > I doubt it The documentation hints at it. https://git-scm.com/docs/git-format-patch -D --irreversible-delete Omit the preimage for deletes, i.e. print only the header but not the diff between the preimage and /dev/null. The resulting patch is not meant to be applied with patch or git apply; this is solely for people who want to just concentrate on reviewing the text after the change. In addition, the output obviously lacks enough information to apply such a patch in reverse, even manually, hence the name of the option. > > > I can resend without it if your trying to apply. > > Now it is too big to go to the mailing lists > It is showing now. https://patchwork.kernel.org/project/linux-rdma/patch/20210407001502.1890-23-shiraz.saleem@intel.com/
On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > +/* Following APIs are implemented by core PCI driver */ > +struct iidc_core_ops { > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > + * completion queues, Tx/Rx queues, etc... > + */ > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > + struct iidc_res *res, > + int partial_acceptable); > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > + struct iidc_res *res); > + > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > + enum iidc_reset_type reset_type); > + > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > + u16 vport_id, bool enable); > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, > + u16 len); > +}; What is this? There is only one implementation: static const struct iidc_core_ops ops = { .alloc_res = ice_cdev_info_alloc_res, .free_res = ice_cdev_info_free_res, .request_reset = ice_cdev_info_request_reset, .update_vport_filter = ice_cdev_info_update_vsi_filter, .vc_send = ice_cdev_info_vc_send, }; So export and call the functions directly. We just had this very same discussion with Broadcom. I notice there is no module dependency between irdma and the ethernet driver because the above ops are avoiding it. This entire idea was already NAK'd once: https://lore.kernel.org/linux-rdma/20180522203831.20624-1-jeffrey.t.kirsher@intel.com/ So please remove it. Jason
On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > +struct iidc_res_base { > + /* Union for future provision e.g. other res_type */ > + union { > + struct iidc_rdma_qset_params qsets; > + } res; Use an anonymous union? There is alot of confusiong provisioning for future types, do you have concrete plans here? I'm a bit confused why this is so different from how mlx5 ended up when it already has multiple types. > +}; > + > +struct iidc_res { > + /* Type of resource. */ > + enum iidc_res_type res_type; > + /* Count requested */ > + u16 cnt_req; > + > + /* Number of resources allocated. Filled in by callee. > + * Based on this value, caller to fill up "resources" > + */ > + u16 res_allocated; > + > + /* Unique handle to resources allocated. Zero if call fails. > + * Allocated by callee and for now used by caller for internal > + * tracking purpose. > + */ > + u32 res_handle; > + > + /* Peer driver has to allocate sufficient memory, to accommodate > + * cnt_requested before calling this function. Calling what function? > + * Memory has to be zero initialized. It is input/output param. > + * As a result of alloc_res API, this structures will be populated. > + */ > + struct iidc_res_base res[1]; So it is a wrongly defined flex array? Confused The usages are all using this as some super-complicated function argument: struct iidc_res rdma_qset_res = {}; rdma_qset_res.res_allocated = 1; rdma_qset_res.res_type = IIDC_RDMA_QSETS_TXSCHED; rdma_qset_res.res[0].res.qsets.vport_id = vsi->vsi_idx; rdma_qset_res.res[0].res.qsets.teid = tc_node->l2_sched_node_id; rdma_qset_res.res[0].res.qsets.qs_handle = tc_node->qs_handle; if (cdev_info->ops->free_res(cdev_info, &rdma_qset_res)) So the answer here is to make your function calls sane and well architected. If you have to pass a union to call a function then something is very wrong with the design. You aren't trying to achieve ABI decoupling of the rdma/ethernet modules with an obfuscated complicated function pointer indirection, are you? Please use sane function calls > +/* Following APIs are implemented by auxiliary drivers and invoked by core PCI > + * driver > + */ > +struct iidc_auxiliary_ops { > + /* This event_handler is meant to be a blocking call. For instance, > + * when a BEFORE_MTU_CHANGE event comes in, the event_handler will not > + * return until the auxiliary driver is ready for the MTU change to > + * happen. > + */ > + void (*event_handler)(struct iidc_core_dev_info *cdev_info, > + struct iidc_event *event); > + > + int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id, > + u8 *msg, u16 len); > +}; This is not the normal pattern: > +struct iidc_auxiliary_drv { > + struct auxiliary_driver adrv; > + struct iidc_auxiliary_ops *ops; > +}; Just put the two functions above in the drv directly: struct iidc_auxiliary_drv { struct auxilary_driver adrv; void (*event_handler)(struct iidc_core_dev_info *cdev_info, *cdev_info, struct iidc_event *event); int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, u16 len); } > + > +#define IIDC_RDMA_NAME "intel_rdma" > +#define IIDC_RDMA_ID 0x00000010 > +#define IIDC_MAX_NUM_AUX 4 > + > +/* The const struct that instantiates cdev_info_id needs to be initialized > + * in the .c with the macro ASSIGN_IIDC_INFO. > + * For example: > + * static const struct cdev_info_id cdev_info_ids[] = ASSIGN_IIDC_INFO; > + */ > +struct cdev_info_id { > + char *name; > + int id; > +}; > + > +#define IIDC_RDMA_INFO { .name = IIDC_RDMA_NAME, .id = IIDC_RDMA_ID }, > + > +#define ASSIGN_IIDC_INFO \ > +{ \ > + IIDC_RDMA_INFO \ > +} I tried to figure out what all this was for and came up short. There is only one user and all this seems unnecessary in this series, add it later when you need it. > + > +#define iidc_priv(x) ((x)->auxiliary_priv) Use a static inline function Jason
> Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > Add a new generic runtime devlink parameter 'rdma_protocol' > > and use it in ice PCI driver. Configuration changes result in > > unplugging the auxiliary RDMA device and re-plugging it with updated > > values for irdma auxiiary driver to consume at > > drv.probe() > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > .../networking/devlink/devlink-params.rst | 6 ++ > > Documentation/networking/devlink/ice.rst | 13 +++ > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 +++++++++++++++++++++- > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > include/net/devlink.h | 4 + > > net/core/devlink.c | 5 ++ > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/networking/devlink/devlink-params.rst > > b/Documentation/networking/devlink/devlink-params.rst > > index 54c9f10..0b454c3 100644 > > +++ b/Documentation/networking/devlink/devlink-params.rst > > @@ -114,3 +114,9 @@ own name. > > will NACK any attempt of other host to reset the device. This parameter > > is useful for setups where a device is shared by different hosts, such > > as multi-host setup. > > + * - ``rdma_protocol`` > > + - string > > + - Selects the RDMA protocol selected for multi-protocol devices. > > + - ``iwarp`` iWARP > > + - ``roce`` RoCE > > + - ``ib`` Infiniband > > I'm still not sure this belongs in devlink. I believe you suggested we use devlink for protocol switch. > > What about devices that support roce and iwarp concurrently? > > There is nothing at the protocol level that precludes this - doesn't this device allow > it? Nope. This device doesn’t support both protocols concurrently on same PCI function. Maybe then it makes sense to move this protocol switch as driver specific devlink? > > I know Parav is looking at the general problem of how to customize what aux > devices are created, that may be a better fit for this. > > Can you remove the devlink parts to make progress? > It is important since otherwise the customer will have no way to use RoCEv2 on this device. Shiraz
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > +/* Following APIs are implemented by core PCI driver */ struct > > +iidc_core_ops { > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > + * completion queues, Tx/Rx queues, etc... > > + */ > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > + struct iidc_res *res, > > + int partial_acceptable); > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > + struct iidc_res *res); > > + > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > + enum iidc_reset_type reset_type); > > + > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > + u16 vport_id, bool enable); > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, > > + u16 len); > > +}; > > What is this? There is only one implementation: > > static const struct iidc_core_ops ops = { > .alloc_res = ice_cdev_info_alloc_res, > .free_res = ice_cdev_info_free_res, > .request_reset = ice_cdev_info_request_reset, > .update_vport_filter = ice_cdev_info_update_vsi_filter, > .vc_send = ice_cdev_info_vc_send, > }; > > So export and call the functions directly. No. Then we end up requiring ice to be loaded even when just want to use irdma with x722 [whose ethernet driver is "i40e"]. This will not allow us to be a unified RDMA driver. > > We just had this very same discussion with Broadcom. Yes, but they only have a single ethernet driver today I presume. Here we have a single RDMA driver and 2 ethernet drivers, i40e and ice. Shiraz
On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote: > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > > > +/* Following APIs are implemented by core PCI driver */ struct > > > +iidc_core_ops { > > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > > + * completion queues, Tx/Rx queues, etc... > > > + */ > > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > > + struct iidc_res *res, > > > + int partial_acceptable); > > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > > + struct iidc_res *res); > > > + > > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > > + enum iidc_reset_type reset_type); > > > + > > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > > + u16 vport_id, bool enable); > > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, > > > + u16 len); > > > +}; > > > > What is this? There is only one implementation: > > > > static const struct iidc_core_ops ops = { > > .alloc_res = ice_cdev_info_alloc_res, > > .free_res = ice_cdev_info_free_res, > > .request_reset = ice_cdev_info_request_reset, > > .update_vport_filter = ice_cdev_info_update_vsi_filter, > > .vc_send = ice_cdev_info_vc_send, > > }; > > > > So export and call the functions directly. > > No. Then we end up requiring ice to be loaded even when just want to > use irdma with x722 [whose ethernet driver is "i40e"]. So what? What does it matter to load a few extra kb of modules? If you had both drivers use the same ops interface you'd have an argument. Jason
On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote: > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > > Add a new generic runtime devlink parameter 'rdma_protocol' > > > and use it in ice PCI driver. Configuration changes result in > > > unplugging the auxiliary RDMA device and re-plugging it with updated > > > values for irdma auxiiary driver to consume at > > > drv.probe() > > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > > .../networking/devlink/devlink-params.rst | 6 ++ > > > Documentation/networking/devlink/ice.rst | 13 +++ > > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 +++++++++++++++++++++- > > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > > include/net/devlink.h | 4 + > > > net/core/devlink.c | 5 ++ > > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > > > diff --git a/Documentation/networking/devlink/devlink-params.rst > > > b/Documentation/networking/devlink/devlink-params.rst > > > index 54c9f10..0b454c3 100644 > > > +++ b/Documentation/networking/devlink/devlink-params.rst > > > @@ -114,3 +114,9 @@ own name. > > > will NACK any attempt of other host to reset the device. This parameter > > > is useful for setups where a device is shared by different hosts, such > > > as multi-host setup. > > > + * - ``rdma_protocol`` > > > + - string > > > + - Selects the RDMA protocol selected for multi-protocol devices. > > > + - ``iwarp`` iWARP > > > + - ``roce`` RoCE > > > + - ``ib`` Infiniband > > > > I'm still not sure this belongs in devlink. > > I believe you suggested we use devlink for protocol switch. Yes, devlink is the right place, but selecting a *single* protocol doesn't seem right, or general enough. Parav is talking about generic ways to customize the aux devices created and that would seem to serve the same function as this. > > I know Parav is looking at the general problem of how to customize what aux > > devices are created, that may be a better fit for this. > > > > Can you remove the devlink parts to make progress? > > It is important since otherwise the customer will have no way to use RoCEv2 on this device. I'm not saying to not having it eventually, I'm just getting tired of looking at 23 patches. You can argue it out after I'm also half thinking of applying this under driver/staging or CONFIG_BROKEN or something just because I am getting sick of looking at it. Jason
On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote: > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote: > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > > > > > +/* Following APIs are implemented by core PCI driver */ struct > > > > +iidc_core_ops { > > > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > > > + * completion queues, Tx/Rx queues, etc... > > > > + */ > > > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > > > + struct iidc_res *res, > > > > + int partial_acceptable); > > > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > > > + struct iidc_res *res); > > > > + > > > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > > > + enum iidc_reset_type reset_type); > > > > + > > > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > > > + u16 vport_id, bool enable); > > > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, > > > > + u16 len); > > > > +}; > > > > > > What is this? There is only one implementation: > > > > > > static const struct iidc_core_ops ops = { > > > .alloc_res = ice_cdev_info_alloc_res, > > > .free_res = ice_cdev_info_free_res, > > > .request_reset = ice_cdev_info_request_reset, > > > .update_vport_filter = ice_cdev_info_update_vsi_filter, > > > .vc_send = ice_cdev_info_vc_send, > > > }; > > > > > > So export and call the functions directly. > > > > No. Then we end up requiring ice to be loaded even when just want to > > use irdma with x722 [whose ethernet driver is "i40e"]. > > So what? What does it matter to load a few extra kb of modules? And if user cares about it, he will blacklist that module anyway. Thanks
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote: > > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote: > > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > > > > > > > +/* Following APIs are implemented by core PCI driver */ struct > > > > > +iidc_core_ops { > > > > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > > > > + * completion queues, Tx/Rx queues, etc... > > > > > + */ > > > > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > > > > + struct iidc_res *res, > > > > > + int partial_acceptable); > > > > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > > > > + struct iidc_res *res); > > > > > + > > > > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > > > > + enum iidc_reset_type reset_type); > > > > > + > > > > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > > > > + u16 vport_id, bool enable); > > > > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 > *msg, > > > > > + u16 len); > > > > > +}; > > > > > > > > What is this? There is only one implementation: > > > > > > > > static const struct iidc_core_ops ops = { > > > > .alloc_res = ice_cdev_info_alloc_res, > > > > .free_res = ice_cdev_info_free_res, > > > > .request_reset = ice_cdev_info_request_reset, > > > > .update_vport_filter = ice_cdev_info_update_vsi_filter, > > > > .vc_send = ice_cdev_info_vc_send, > > > > }; > > > > > > > > So export and call the functions directly. > > > > > > No. Then we end up requiring ice to be loaded even when just want to > > > use irdma with x722 [whose ethernet driver is "i40e"]. > > > > So what? What does it matter to load a few extra kb of modules? > > And if user cares about it, he will blacklist that module anyway. > blacklist ice when you just have an x722 card? How does that solve anything? You wont be able to load irdma then. Shiraz
On Fri, Apr 09, 2021 at 01:38:37AM +0000, Saleem, Shiraz wrote: > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > On Wed, Apr 07, 2021 at 07:43:24PM -0300, Jason Gunthorpe wrote: > > > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote: > > > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > > > > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > > > > > > > > > +/* Following APIs are implemented by core PCI driver */ struct > > > > > > +iidc_core_ops { > > > > > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > > > > > + * completion queues, Tx/Rx queues, etc... > > > > > > + */ > > > > > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > > > > > + struct iidc_res *res, > > > > > > + int partial_acceptable); > > > > > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > > > > > + struct iidc_res *res); > > > > > > + > > > > > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > > > > > + enum iidc_reset_type reset_type); > > > > > > + > > > > > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > > > > > + u16 vport_id, bool enable); > > > > > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 > > *msg, > > > > > > + u16 len); > > > > > > +}; > > > > > > > > > > What is this? There is only one implementation: > > > > > > > > > > static const struct iidc_core_ops ops = { > > > > > .alloc_res = ice_cdev_info_alloc_res, > > > > > .free_res = ice_cdev_info_free_res, > > > > > .request_reset = ice_cdev_info_request_reset, > > > > > .update_vport_filter = ice_cdev_info_update_vsi_filter, > > > > > .vc_send = ice_cdev_info_vc_send, > > > > > }; > > > > > > > > > > So export and call the functions directly. > > > > > > > > No. Then we end up requiring ice to be loaded even when just want to > > > > use irdma with x722 [whose ethernet driver is "i40e"]. > > > > > > So what? What does it matter to load a few extra kb of modules? > > > > And if user cares about it, he will blacklist that module anyway. > > > blacklist ice when you just have an x722 card? How does that solve anything? You wont be able to load irdma then. You will blacklist i40e if you want solely irdma functionality. Thanks > > Shiraz
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > On Wed, Apr 07, 2021 at 08:58:49PM +0000, Saleem, Shiraz wrote: > > > Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > > > > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > > > > > +/* Following APIs are implemented by core PCI driver */ struct > > > > +iidc_core_ops { > > > > + /* APIs to allocate resources such as VEB, VSI, Doorbell queues, > > > > + * completion queues, Tx/Rx queues, etc... > > > > + */ > > > > + int (*alloc_res)(struct iidc_core_dev_info *cdev_info, > > > > + struct iidc_res *res, > > > > + int partial_acceptable); > > > > + int (*free_res)(struct iidc_core_dev_info *cdev_info, > > > > + struct iidc_res *res); > > > > + > > > > + int (*request_reset)(struct iidc_core_dev_info *cdev_info, > > > > + enum iidc_reset_type reset_type); > > > > + > > > > + int (*update_vport_filter)(struct iidc_core_dev_info *cdev_info, > > > > + u16 vport_id, bool enable); > > > > + int (*vc_send)(struct iidc_core_dev_info *cdev_info, u32 vf_id, u8 *msg, > > > > + u16 len); > > > > +}; > > > > > > What is this? There is only one implementation: > > > > > > static const struct iidc_core_ops ops = { > > > .alloc_res = ice_cdev_info_alloc_res, > > > .free_res = ice_cdev_info_free_res, > > > .request_reset = ice_cdev_info_request_reset, > > > .update_vport_filter = ice_cdev_info_update_vsi_filter, > > > .vc_send = ice_cdev_info_vc_send, > > > }; > > > > > > So export and call the functions directly. > > > > No. Then we end up requiring ice to be loaded even when just want to > > use irdma with x722 [whose ethernet driver is "i40e"]. > > So what? What does it matter to load a few extra kb of modules? Because it is an unnecessary thing to force a user to build/load drivers for which they don't have the HW for? The problem gets compounded if we have to do it for all future HW Intel PCI drivers, i.e. depends on ICE && .... IIDC is Intel's converged and generic RDMA <--> PCI driver channel interface; which we intend to use moving forward. And these .ops callbacks are part of this interface which will have different implementations by each HW generation PCI core driver. It is extensible with new ops added to the table for new HW and where implementations of the certain ops on some HW will be NULL. There is a near-term Intel ethernet VF driver which will use IIDC to provide RDMA in the VF, and implement some of these .ops callbacks. There is also intent to move i40e to IIDC. And yes, it allows to load a unified irdma driver without having all the mulit-gen PCI core drivers to be built/loaded as a pre-requisite which is solving a pain-point to the user and does not unnecessarily add a memory foot-print. In the past, with i40e <-> i40iw, I acknowledge such a dependency was decoupled for the wrong reasons [1] and understand where your frustration is coming from. But in a unified irdma driver model connecting to multiple PCI gen drivers, I do think it serves a reason. This has also been voiced over the years in some of our discussions [2] leading to the auxiliary bus and been part of our submissions from the get go. In fact, use of such domain specific .ops from the parent device is an assumption baked into the design when the auxiliary bus was conceived and in the documentation [3] (See Example Usage). [1] https://lore.kernel.org/linux-rdma/20180522205612.GD7502@mellanox.com/ [2] https://lore.kernel.org/linux-rdma/2B0E3F215D1AB84DA946C8BEE234CCC97B2E1D29@ORSMSX101.amr.corp.intel.com/ [3] https://www.kernel.org/doc/html/latest/driver-api/auxiliary_bus.html Shiraz
> Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote: > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > > > Add a new generic runtime devlink parameter 'rdma_protocol' > > > > and use it in ice PCI driver. Configuration changes result in > > > > unplugging the auxiliary RDMA device and re-plugging it with > > > > updated values for irdma auxiiary driver to consume at > > > > drv.probe() > > > > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > > > .../networking/devlink/devlink-params.rst | 6 ++ > > > > Documentation/networking/devlink/ice.rst | 13 +++ > > > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 > +++++++++++++++++++++- > > > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > > > include/net/devlink.h | 4 + > > > > net/core/devlink.c | 5 ++ > > > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/Documentation/networking/devlink/devlink-params.rst > > > > b/Documentation/networking/devlink/devlink-params.rst > > > > index 54c9f10..0b454c3 100644 > > > > +++ b/Documentation/networking/devlink/devlink-params.rst > > > > @@ -114,3 +114,9 @@ own name. > > > > will NACK any attempt of other host to reset the device. This parameter > > > > is useful for setups where a device is shared by different hosts, such > > > > as multi-host setup. > > > > + * - ``rdma_protocol`` > > > > + - string > > > > + - Selects the RDMA protocol selected for multi-protocol devices. > > > > + - ``iwarp`` iWARP > > > > + - ``roce`` RoCE > > > > + - ``ib`` Infiniband > > > > > > I'm still not sure this belongs in devlink. > > > > I believe you suggested we use devlink for protocol switch. > > Yes, devlink is the right place, but selecting a *single* protocol doesn't seem right, > or general enough. > > Parav is talking about generic ways to customize the aux devices created and that > would seem to serve the same function as this. Is there an RFC or something posted for us to look at? > > > > I know Parav is looking at the general problem of how to customize > > > what aux devices are created, that may be a better fit for this. > > > > > > Can you remove the devlink parts to make progress? > > > > It is important since otherwise the customer will have no way to use RoCEv2 on > this device. > > I'm not saying to not having it eventually, I'm just getting tired of looking at 23 > patches. You can argue it out after > Since this device cannot do concurrent protocols per function, is having a driver specific devlink to switch between the 2 also a NACK? There is something similar in another PCI driver. https://www.kernel.org/doc/html/latest/networking/devlink/qed.html Can we not adapt to Parav's solution (if it's a good fit) when it becomes available? Shiraz
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > On Tue, Apr 06, 2021 at 04:01:03PM -0500, Shiraz Saleem wrote: > > > +struct iidc_res_base { > > + /* Union for future provision e.g. other res_type */ > > + union { > > + struct iidc_rdma_qset_params qsets; > > + } res; > > Use an anonymous union? > > There is alot of confusiong provisioning for future types, do you have concrete > plans here? I'm a bit confused why this is so different from how mlx5 ended up > when it already has multiple types. It was initially designed to be extensible for more resource types. But at this point, there is no concrete plan and hence it doesn't need to be a union. > > > +}; > > + > > +struct iidc_res { > > + /* Type of resource. */ > > + enum iidc_res_type res_type; > > + /* Count requested */ > > + u16 cnt_req; > > + > > + /* Number of resources allocated. Filled in by callee. > > + * Based on this value, caller to fill up "resources" > > + */ > > + u16 res_allocated; > > + > > + /* Unique handle to resources allocated. Zero if call fails. > > + * Allocated by callee and for now used by caller for internal > > + * tracking purpose. > > + */ > > + u32 res_handle; > > + > > + /* Peer driver has to allocate sufficient memory, to accommodate > > + * cnt_requested before calling this function. > > Calling what function? Left over cruft from the re-write of IIDC in v2. > > > + * Memory has to be zero initialized. It is input/output param. > > + * As a result of alloc_res API, this structures will be populated. > > + */ > > + struct iidc_res_base res[1]; > > So it is a wrongly defined flex array? Confused Needs fixing. > > The usages are all using this as some super-complicated function argument: > > struct iidc_res rdma_qset_res = {}; > > rdma_qset_res.res_allocated = 1; > rdma_qset_res.res_type = IIDC_RDMA_QSETS_TXSCHED; > rdma_qset_res.res[0].res.qsets.vport_id = vsi->vsi_idx; > rdma_qset_res.res[0].res.qsets.teid = tc_node->l2_sched_node_id; > rdma_qset_res.res[0].res.qsets.qs_handle = tc_node->qs_handle; > > if (cdev_info->ops->free_res(cdev_info, &rdma_qset_res)) > > So the answer here is to make your function calls sane and well architected. If you > have to pass a union to call a function then something is very wrong with the > design. > Based on previous comment, the union will be removed. > You aren't trying to achieve ABI decoupling of the rdma/ethernet modules with an > obfuscated complicated function pointer indirection, are you? As discussed in other thread, this is part of the IIDC interface exporting the core device .ops callbacks. > > > +/* Following APIs are implemented by auxiliary drivers and invoked by > > +core PCI > > + * driver > > + */ > > +struct iidc_auxiliary_ops { > > + /* This event_handler is meant to be a blocking call. For instance, > > + * when a BEFORE_MTU_CHANGE event comes in, the event_handler will > not > > + * return until the auxiliary driver is ready for the MTU change to > > + * happen. > > + */ > > + void (*event_handler)(struct iidc_core_dev_info *cdev_info, > > + struct iidc_event *event); > > + > > + int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id, > > + u8 *msg, u16 len); > > +}; > > This is not the normal pattern: > > > +struct iidc_auxiliary_drv { > > + struct auxiliary_driver adrv; > > + struct iidc_auxiliary_ops *ops; > > +}; > > Just put the two functions above in the drv directly: Ok. > > struct iidc_auxiliary_drv { > struct auxilary_driver adrv; > void (*event_handler)(struct iidc_core_dev_info *cdev_info, *cdev_info, > struct iidc_event *event); > > int (*vc_receive)(struct iidc_core_dev_info *cdev_info, u32 vf_id, > u8 *msg, u16 len); > } > > > + > > +#define IIDC_RDMA_NAME "intel_rdma" > > +#define IIDC_RDMA_ID 0x00000010 > > +#define IIDC_MAX_NUM_AUX 4 > > + > > +/* The const struct that instantiates cdev_info_id needs to be > > +initialized > > + * in the .c with the macro ASSIGN_IIDC_INFO. > > + * For example: > > + * static const struct cdev_info_id cdev_info_ids[] = > > +ASSIGN_IIDC_INFO; */ struct cdev_info_id { > > + char *name; > > + int id; > > +}; > > + > > +#define IIDC_RDMA_INFO { .name = IIDC_RDMA_NAME, .id = > IIDC_RDMA_ID }, > > + > > +#define ASSIGN_IIDC_INFO \ > > +{ \ > > + IIDC_RDMA_INFO \ > > +} > > I tried to figure out what all this was for and came up short. There is only one user > and all this seems unnecessary in this series, add it later when you need it. No plan for new user, so this should go. > > > + > > +#define iidc_priv(x) ((x)->auxiliary_priv) > > Use a static inline function > Ok
On Mon, Apr 12, 2021 at 02:50:43PM +0000, Saleem, Shiraz wrote: > Because it is an unnecessary thing to force a user to build/load drivers for > which they don't have the HW for? Happens automatically in all distros, so I don't agree with this. > The problem gets compounded if we have to do it for all future HW > Intel PCI drivers, i.e. depends on ICE && .... Then someday build a proper pluggable abstraction and put all your ethernet drivers under it. Today you haven't done that and all we have a set of ops that only work with one eth driver and a totally different set of functions that only work with a different driver. It is all just dead code until it gets finished and process is to not merge dead code. > There is a near-term Intel ethernet VF driver which will use IIDC to > provide RDMA in the VF, and implement some of these .ops > callbacks. There is also intent to move i40e to IIDC. "near-term" We are now on year three of Intel talking about this driver! Get the bulk of the thing merged and deal with the rest in followup patches. > But in a unified irdma driver model connecting to multiple PCI gen > drivers, I do think it serves a reason. It is fine as a high level idea, but the implementation has to meet kernel standards. Jason
> From: Saleem, Shiraz <shiraz.saleem@intel.com> > Sent: Monday, April 12, 2021 8:21 PM > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote: > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > > > > Add a new generic runtime devlink parameter 'rdma_protocol' > > > > > and use it in ice PCI driver. Configuration changes result in > > > > > unplugging the auxiliary RDMA device and re-plugging it with > > > > > updated values for irdma auxiiary driver to consume at > > > > > drv.probe() > > > > > > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > > > > .../networking/devlink/devlink-params.rst | 6 ++ > > > > > Documentation/networking/devlink/ice.rst | 13 +++ > > > > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 > > +++++++++++++++++++++- > > > > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > > > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > > > > include/net/devlink.h | 4 + > > > > > net/core/devlink.c | 5 ++ > > > > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > > > > > > > diff --git a/Documentation/networking/devlink/devlink-params.rst > > > > > b/Documentation/networking/devlink/devlink-params.rst > > > > > index 54c9f10..0b454c3 100644 > > > > > +++ b/Documentation/networking/devlink/devlink-params.rst > > > > > @@ -114,3 +114,9 @@ own name. > > > > > will NACK any attempt of other host to reset the device. This > parameter > > > > > is useful for setups where a device is shared by different hosts, > such > > > > > as multi-host setup. > > > > > + * - ``rdma_protocol`` > > > > > + - string > > > > > + - Selects the RDMA protocol selected for multi-protocol devices. > > > > > + - ``iwarp`` iWARP > > > > > + - ``roce`` RoCE > > > > > + - ``ib`` Infiniband > > > > > > > > I'm still not sure this belongs in devlink. > > > > > > I believe you suggested we use devlink for protocol switch. > > > > Yes, devlink is the right place, but selecting a *single* protocol > > doesn't seem right, or general enough. > > > > Parav is talking about generic ways to customize the aux devices > > created and that would seem to serve the same function as this. > > Is there an RFC or something posted for us to look at? I do not have polished RFC content ready yet. But coping the full config sequence snippet from the internal draft (changed for ice example) here as I like to discuss with you in this context. # (1) show auxiliary device types supported by a given devlink device. # applies to pci pf,vf,sf. (in general at devlink instance). $ devlink dev auxdev show pci/0000:06.00.0 pci/0000:06.00.0: current: roce eth new: supported: roce eth iwarp # (2) enable iwarp and ethernet type of aux devices and disable roce. $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on # (3) now see which aux devices will be enable on next reload. $ devlink dev auxdev show pci/0000:06:00.0 pci/0000:06:00.0: current: roce eth new: eth iwarp supported: roce eth iwarp # (4) now reload the device and see which aux devices are created. At this point driver undergoes reconfig for removal of roce and adding iwarp. $ devlink reload pci/0000:06:00.0 # (5) verify which are the aux devices now activated. $ devlink dev auxdev show pci/0000:06:00.0 pci/0000:06:00.0: current: roce eth new: supported: roce eth iwarp Above 'new' section is only shown when its set. (similar to devlink resource). In command two vendor driver can fail the call when iwarp and roce both enabled by user. mlx5_core doesn't have iwarp, but its vdpa type of device. And for other cases we just want to enable roce disabling eth and vdpa.
> From: Parav Pandit <parav@nvidia.com> > Sent: Tuesday, April 13, 2021 12:38 AM > > > From: Saleem, Shiraz <shiraz.saleem@intel.com> > > Sent: Monday, April 12, 2021 8:21 PM > > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote: > > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > > > > > Add a new generic runtime devlink parameter 'rdma_protocol' > > > > > > and use it in ice PCI driver. Configuration changes result in > > > > > > unplugging the auxiliary RDMA device and re-plugging it with > > > > > > updated values for irdma auxiiary driver to consume at > > > > > > drv.probe() > > > > > > > > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > > > > > .../networking/devlink/devlink-params.rst | 6 ++ > > > > > > Documentation/networking/devlink/ice.rst | 13 +++ > > > > > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 > > > +++++++++++++++++++++- > > > > > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > > > > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > > > > > include/net/devlink.h | 4 + > > > > > > net/core/devlink.c | 5 ++ > > > > > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git > > > > > > a/Documentation/networking/devlink/devlink-params.rst > > > > > > b/Documentation/networking/devlink/devlink-params.rst > > > > > > index 54c9f10..0b454c3 100644 > > > > > > +++ b/Documentation/networking/devlink/devlink-params.rst > > > > > > @@ -114,3 +114,9 @@ own name. > > > > > > will NACK any attempt of other host to reset the > > > > > > device. This > > parameter > > > > > > is useful for setups where a device is shared by > > > > > > different hosts, > > such > > > > > > as multi-host setup. > > > > > > + * - ``rdma_protocol`` > > > > > > + - string > > > > > > + - Selects the RDMA protocol selected for multi-protocol devices. > > > > > > + - ``iwarp`` iWARP > > > > > > + - ``roce`` RoCE > > > > > > + - ``ib`` Infiniband > > > > > > > > > > I'm still not sure this belongs in devlink. > > > > > > > > I believe you suggested we use devlink for protocol switch. > > > > > > Yes, devlink is the right place, but selecting a *single* protocol > > > doesn't seem right, or general enough. > > > > > > Parav is talking about generic ways to customize the aux devices > > > created and that would seem to serve the same function as this. > > > > Is there an RFC or something posted for us to look at? > I do not have polished RFC content ready yet. > But coping the full config sequence snippet from the internal draft (changed > for ice example) here as I like to discuss with you in this context. > > # (1) show auxiliary device types supported by a given devlink device. > # applies to pci pf,vf,sf. (in general at devlink instance). > $ devlink dev auxdev show pci/0000:06.00.0 > pci/0000:06.00.0: > current: > roce eth > new: > supported: > roce eth iwarp > > # (2) enable iwarp and ethernet type of aux devices and disable roce. > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on > > # (3) now see which aux devices will be enable on next reload. > $ devlink dev auxdev show pci/0000:06:00.0 > pci/0000:06:00.0: > current: > roce eth > new: > eth iwarp > supported: > roce eth iwarp > > # (4) now reload the device and see which aux devices are created. > At this point driver undergoes reconfig for removal of roce and adding iwarp. > $ devlink reload pci/0000:06:00.0 > > # (5) verify which are the aux devices now activated. > $ devlink dev auxdev show pci/0000:06:00.0 > pci/0000:06:00.0: > current: > roce eth This was copy paste error from my command #1. It should be "roce eth" should be "iwarp eth". This is because in step #3, iwarp was enabled. > new: > supported: > roce eth iwarp > > Above 'new' section is only shown when its set. (similar to devlink resource). > In command two vendor driver can fail the call when iwarp and roce both > enabled by user. > mlx5_core doesn't have iwarp, but its vdpa type of device. And for other > cases we just want to enable roce disabling eth and vdpa.
> Subject: RE: [PATCH v4 05/23] ice: Add devlink params support > > > > > From: Saleem, Shiraz <shiraz.saleem@intel.com> > > Sent: Monday, April 12, 2021 8:21 PM > > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > On Wed, Apr 07, 2021 at 08:58:25PM +0000, Saleem, Shiraz wrote: > > > > > Subject: Re: [PATCH v4 05/23] ice: Add devlink params support > > > > > > > > > > On Tue, Apr 06, 2021 at 04:01:07PM -0500, Shiraz Saleem wrote: > > > > > > Add a new generic runtime devlink parameter 'rdma_protocol' > > > > > > and use it in ice PCI driver. Configuration changes result in > > > > > > unplugging the auxiliary RDMA device and re-plugging it with > > > > > > updated values for irdma auxiiary driver to consume at > > > > > > drv.probe() > > > > > > > > > > > > Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> > > > > > > .../networking/devlink/devlink-params.rst | 6 ++ > > > > > > Documentation/networking/devlink/ice.rst | 13 +++ > > > > > > drivers/net/ethernet/intel/ice/ice_devlink.c | 92 > > > +++++++++++++++++++++- > > > > > > drivers/net/ethernet/intel/ice/ice_devlink.h | 5 ++ > > > > > > drivers/net/ethernet/intel/ice/ice_main.c | 2 + > > > > > > include/net/devlink.h | 4 + > > > > > > net/core/devlink.c | 5 ++ > > > > > > 7 files changed, 125 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git > > > > > > a/Documentation/networking/devlink/devlink-params.rst > > > > > > b/Documentation/networking/devlink/devlink-params.rst > > > > > > index 54c9f10..0b454c3 100644 > > > > > > +++ b/Documentation/networking/devlink/devlink-params.rst > > > > > > @@ -114,3 +114,9 @@ own name. > > > > > > will NACK any attempt of other host to reset the > > > > > > device. This > > parameter > > > > > > is useful for setups where a device is shared by > > > > > > different hosts, > > such > > > > > > as multi-host setup. > > > > > > + * - ``rdma_protocol`` > > > > > > + - string > > > > > > + - Selects the RDMA protocol selected for multi-protocol devices. > > > > > > + - ``iwarp`` iWARP > > > > > > + - ``roce`` RoCE > > > > > > + - ``ib`` Infiniband > > > > > > > > > > I'm still not sure this belongs in devlink. > > > > > > > > I believe you suggested we use devlink for protocol switch. > > > > > > Yes, devlink is the right place, but selecting a *single* protocol > > > doesn't seem right, or general enough. > > > > > > Parav is talking about generic ways to customize the aux devices > > > created and that would seem to serve the same function as this. > > > > Is there an RFC or something posted for us to look at? > I do not have polished RFC content ready yet. > But coping the full config sequence snippet from the internal draft (changed for ice > example) here as I like to discuss with you in this context. Thanks Parav! Some comments below. > > # (1) show auxiliary device types supported by a given devlink device. > # applies to pci pf,vf,sf. (in general at devlink instance). > $ devlink dev auxdev show pci/0000:06.00.0 > pci/0000:06.00.0: > current: > roce eth > new: > supported: > roce eth iwarp > > # (2) enable iwarp and ethernet type of aux devices and disable roce. > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on > > # (3) now see which aux devices will be enable on next reload. > $ devlink dev auxdev show pci/0000:06:00.0 > pci/0000:06:00.0: > current: > roce eth > new: > eth iwarp > supported: > roce eth iwarp > > # (4) now reload the device and see which aux devices are created. > At this point driver undergoes reconfig for removal of roce and adding iwarp. > $ devlink reload pci/0000:06:00.0 I see this is modeled like devlink resource. Do we really to need a PCI driver re-init to switch the type of the auxdev hanging off the PCI dev? Why not just allow the setting to apply dynamically during a 'set' itself with an unplug/plug of the auxdev with correct type. Shiraz
> From: Saleem, Shiraz <shiraz.saleem@intel.com> > Sent: Tuesday, April 13, 2021 8:11 PM [..] > > > > Parav is talking about generic ways to customize the aux devices > > > > created and that would seem to serve the same function as this. > > > > > > Is there an RFC or something posted for us to look at? > > I do not have polished RFC content ready yet. > > But coping the full config sequence snippet from the internal draft > > (changed for ice > > example) here as I like to discuss with you in this context. > > Thanks Parav! Some comments below. > > > > > # (1) show auxiliary device types supported by a given devlink device. > > # applies to pci pf,vf,sf. (in general at devlink instance). > > $ devlink dev auxdev show pci/0000:06.00.0 > > pci/0000:06.00.0: > > current: > > roce eth > > new: > > supported: > > roce eth iwarp > > > > # (2) enable iwarp and ethernet type of aux devices and disable roce. > > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on > > > > # (3) now see which aux devices will be enable on next reload. > > $ devlink dev auxdev show pci/0000:06:00.0 > > pci/0000:06:00.0: > > current: > > roce eth > > new: > > eth iwarp > > supported: > > roce eth iwarp > > > > # (4) now reload the device and see which aux devices are created. > > At this point driver undergoes reconfig for removal of roce and adding > iwarp. > > $ devlink reload pci/0000:06:00.0 > > I see this is modeled like devlink resource. > > Do we really to need a PCI driver re-init to switch the type of the auxdev > hanging off the PCI dev? > I don't see a need to re-init the whole PCI driver. Since only aux device config is changed only that piece to get reloaded. > Why not just allow the setting to apply dynamically during a 'set' itself with an > unplug/plug of the auxdev with correct type. > This suggestion came up in the internal discussion too. However such task needs to synchronize with devlink reload command and also with driver remove() sequence. So locking wise and depending on amount of config change, it is close to what reload will do. For example other resource config or other params setting also to take effect. So to avoid defining multiple config sequence, doing as part of already existing devlink reload, it brings simple sequence to user. For example, 1. enable/disable desired aux devices 2. configure device resources 3. set some device params 4. do devlink reload and apply settings done in #1 to #3
> Subject: RE: [PATCH v4 05/23] ice: Add devlink params support > > > > > From: Saleem, Shiraz <shiraz.saleem@intel.com> > > Sent: Tuesday, April 13, 2021 8:11 PM > [..] > > > > > > Parav is talking about generic ways to customize the aux devices > > > > > created and that would seem to serve the same function as this. > > > > > > > > Is there an RFC or something posted for us to look at? > > > I do not have polished RFC content ready yet. > > > But coping the full config sequence snippet from the internal draft > > > (changed for ice > > > example) here as I like to discuss with you in this context. > > > > Thanks Parav! Some comments below. > > > > > > > > # (1) show auxiliary device types supported by a given devlink device. > > > # applies to pci pf,vf,sf. (in general at devlink instance). > > > $ devlink dev auxdev show pci/0000:06.00.0 > > > pci/0000:06.00.0: > > > current: > > > roce eth > > > new: > > > supported: > > > roce eth iwarp > > > > > > # (2) enable iwarp and ethernet type of aux devices and disable roce. > > > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on > > > > > > # (3) now see which aux devices will be enable on next reload. > > > $ devlink dev auxdev show pci/0000:06:00.0 > > > pci/0000:06:00.0: > > > current: > > > roce eth > > > new: > > > eth iwarp > > > supported: > > > roce eth iwarp > > > > > > # (4) now reload the device and see which aux devices are created. > > > At this point driver undergoes reconfig for removal of roce and > > > adding > > iwarp. > > > $ devlink reload pci/0000:06:00.0 > > > > I see this is modeled like devlink resource. > > > > Do we really to need a PCI driver re-init to switch the type of the > > auxdev hanging off the PCI dev? > > > I don't see a need to re-init the whole PCI driver. Since only aux device config is > changed only that piece to get reloaded. But that is what mlx5 and other implementations does on reload no? i.e. a PCI driver reinit. I can see an ice implementation of reload morphing to similar over time to support a new config that requires a true reinit of PCI driver entities. > > > Why not just allow the setting to apply dynamically during a 'set' > > itself with an unplug/plug of the auxdev with correct type. > > > This suggestion came up in the internal discussion too. > However such task needs to synchronize with devlink reload command and also > with driver remove() sequence. > So locking wise and depending on amount of config change, it is close to what > reload will do. Holding this mutex across the auxiliary device unplug/plug in "set" wont cut it? https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304 > For example other resource config or other params setting also to take effect. > So to avoid defining multiple config sequence, doing as part of already existing > devlink reload, it brings simple sequence to user. > > For example, > 1. enable/disable desired aux devices > 2. configure device resources > 3. set some device params > 4. do devlink reload and apply settings done in #1 to #3 Sure. But a user might also just want to operate on just an auxiliary device configuration change. As in #1. And he ends up having everything hanging off the PF to get blown out, including potentially the VFs. That feels like too big a hammer. Shiraz
+ Jiri. > From: Saleem, Shiraz <shiraz.saleem@intel.com> > Sent: Wednesday, April 14, 2021 5:51 AM > > > Subject: RE: [PATCH v4 05/23] ice: Add devlink params support > > > > > > > > > From: Saleem, Shiraz <shiraz.saleem@intel.com> > > > Sent: Tuesday, April 13, 2021 8:11 PM > > [..] > > > > > > > > Parav is talking about generic ways to customize the aux > > > > > > devices created and that would seem to serve the same function as > this. > > > > > > > > > > Is there an RFC or something posted for us to look at? > > > > I do not have polished RFC content ready yet. > > > > But coping the full config sequence snippet from the internal > > > > draft (changed for ice > > > > example) here as I like to discuss with you in this context. > > > > > > Thanks Parav! Some comments below. > > > > > > > > > > > # (1) show auxiliary device types supported by a given devlink device. > > > > # applies to pci pf,vf,sf. (in general at devlink instance). > > > > $ devlink dev auxdev show pci/0000:06.00.0 > > > > pci/0000:06.00.0: > > > > current: > > > > roce eth > > > > new: > > > > supported: > > > > roce eth iwarp > > > > > > > > # (2) enable iwarp and ethernet type of aux devices and disable roce. > > > > $ devlink dev auxdev set pci/0000:06:00.0 roce off iwarp on > > > > > > > > # (3) now see which aux devices will be enable on next reload. > > > > $ devlink dev auxdev show pci/0000:06:00.0 > > > > pci/0000:06:00.0: > > > > current: > > > > roce eth > > > > new: > > > > eth iwarp > > > > supported: > > > > roce eth iwarp > > > > > > > > # (4) now reload the device and see which aux devices are created. > > > > At this point driver undergoes reconfig for removal of roce and > > > > adding > > > iwarp. > > > > $ devlink reload pci/0000:06:00.0 > > > > > > I see this is modeled like devlink resource. > > > > > > Do we really to need a PCI driver re-init to switch the type of the > > > auxdev hanging off the PCI dev? > > > > > I don't see a need to re-init the whole PCI driver. Since only aux > > device config is changed only that piece to get reloaded. > > But that is what mlx5 and other implementations does on reload no? i.e. a > PCI driver reinit. Currently yes, reload does PCI re-init. However I am not seeing the value of reload if no config (param, resource, auxdev) is changed. > I can see an ice implementation of reload morphing to similar over time to > support a new config that requires a true reinit of PCI driver entities. > Sure. > > > > > Why not just allow the setting to apply dynamically during a 'set' > > > itself with an unplug/plug of the auxdev with correct type. > > > > > This suggestion came up in the internal discussion too. > > However such task needs to synchronize with devlink reload command and > > also with driver remove() sequence. > > So locking wise and depending on amount of config change, it is close > > to what reload will do. > > Holding this mutex across the auxiliary device unplug/plug in "set" wont cut > it? > https://elixir.bootlin.com/linux/v5.12- > rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304 > Currently devlink reload for mlx5 is source of lockdep assert, use after free access and a deadlock in net ns. :-( Multiple of us (Leon, Saeed, Moshe) working on it resolve it. So I want to stay away from intf_mutex for now. > > For example other resource config or other params setting also to take > effect. > > So to avoid defining multiple config sequence, doing as part of > > already existing devlink reload, it brings simple sequence to user. > > > > For example, > > 1. enable/disable desired aux devices > > 2. configure device resources > > 3. set some device params > > 4. do devlink reload and apply settings done in #1 to #3 > > Sure. But a user might also just want to operate on just an auxiliary device > configuration change. As in #1. > And he ends up having everything hanging off the PF to get blown out, > including potentially the VFs. That feels like too big a hammer. This is certainly not desired. If we want aux device enable/disable to take effect when its done without reload than above flow should be redefined as, 1. configure device resources (optional) 2. set some device params (optional) 3. enable/disable desired aux devices Step-3 needs to apply the settings of (1) and (2) without user doing devlink reload. devlink core doesn't know on step #3, that reload_down() and reload_up() to be done. So driver internally needs to implement reload_down(), up() on callback of #3. This builds parallel framework to devlink reload. Jiri, What do you think of it?
> Subject: Re: [PATCH v4 01/23] iidc: Introduce iidc.h > > On Mon, Apr 12, 2021 at 02:50:43PM +0000, Saleem, Shiraz wrote: > [....] > > There is a near-term Intel ethernet VF driver which will use IIDC to > > provide RDMA in the VF, and implement some of these .ops callbacks. > > There is also intent to move i40e to IIDC. > > "near-term" We are now on year three of Intel talking about this driver! > > Get the bulk of the thing merged and deal with the rest in followup patches. We will submit with symbols exported from ice and direct calls from irdma. Shiraz
On Wed, Apr 14, 2021 at 12:21:08AM +0000, Saleem, Shiraz wrote: > > Subject: RE: [PATCH v4 05/23] ice: Add devlink params support <...> > > > Why not just allow the setting to apply dynamically during a 'set' > > > itself with an unplug/plug of the auxdev with correct type. > > > > > This suggestion came up in the internal discussion too. > > However such task needs to synchronize with devlink reload command and also > > with driver remove() sequence. > > So locking wise and depending on amount of config change, it is close to what > > reload will do. > > Holding this mutex across the auxiliary device unplug/plug in "set" wont cut it? > https://elixir.bootlin.com/linux/v5.12-rc7/source/drivers/net/ethernet/mellanox/mlx5/core/main.c#L1304 Like Parav said, we are working to fix it and already have one working solution, unfortunately it has one eyebrow raising change and we are trying another one. You can take a look here to get sense of the scope: https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma.git/log/?h=devlink-core Thanks