[ath-next,v4,0/9] wifi: ath12k: fixes for rmmod and recovery issues with hardware grouping

Message ID 20250408-fix_reboot_issues_with_hw_grouping-v4-0-95e7bf048595@oss.qualcomm.com

Message

Aditya Kumar Singh April 8, 2025, 6:06 a.m. UTC
With hardware grouping, rmmod triggers a kernel crash with the following signature:

$ rmmod ath12k.ko
Unable to handle kernel paging request at virtual address 000000000000d1a8
[...]
Call trace:
 ath12k_reg_free+0x14/0x74 [ath12k] (P)
 ath12k_core_hw_group_destroy+0x7c/0xb4 [ath12k] (L)
 ath12k_core_hw_group_destroy+0x7c/0xb4 [ath12k]
 ath12k_core_deinit+0xd8/0x124 [ath12k]
 ath12k_pci_remove+0x6c/0x130 [ath12k]
 pci_device_remove+0x44/0xe8
 device_remove+0x4c/0x80
 device_release_driver_internal+0x1d0/0x22c
 driver_detach+0x50/0x98
 bus_remove_driver+0x70/0xf4
 driver_unregister+0x30/0x60
 pci_unregister_driver+0x24/0x9c
 ath12k_pci_exit+0x18/0x24 [ath12k]
 __arm64_sys_delete_module+0x1a0/0x2a8
 invoke_syscall+0x48/0x110
 el0_svc_common.constprop.0+0x40/0xe0
 do_el0_svc+0x1c/0x28
 el0_svc+0x30/0xd0
 el0t_64_sync_handler+0x10c/0x138
 el0t_64_sync+0x198/0x19c
Code: a9bd7bfd 910003fd a9025bf5 91402015 (f968d6a1)
---[ end trace 0000000000000000 ]---
Segmentation fault

This series fixes this stability issue. With it applied, 100+ iterations
of rmmod and insmod work without issues.

Firmware recovery with grouping is also unreliable: a NULL pointer crash
or a further firmware assert is seen at random. This series fixes that
as well.

With this series in place, 100+ iterations of firmware recovery with one
3-link AP MLD up work fine.

---
Changes in v4:
- Rebased on ToT.
- Fixed potential deadlock warning.
- Moved to oss email from quicinc.
- Link to v3: https://lore.kernel.org/r/20250124-fix_reboot_issues_with_hw_grouping-v3-0-329030b18d9e@quicinc.com

Changes in v3:
- Rebased on ToT due to FTM changes conflict.
- Link to v2: https://lore.kernel.org/r/20250120-fix_reboot_issues_with_hw_grouping-v2-0-b7d073bb2a22@quicinc.com

Changes in v2:
- Rebased on ToT.
- No changes in 1-4, 6-10.
- Removed regd_freed flag in 5.
- Link to v1: https://lore.kernel.org/r/20250109-fix_reboot_issues_with_hw_grouping-v1-0-fb39ec03451e@quicinc.com

---
Aditya Kumar Singh (9):
      wifi: ath12k: fix SLUB BUG - Object already free in ath12k_reg_free()
      wifi: ath12k: add reference counting for core attachment to hardware group
      wifi: ath12k: fix failed to set mhi state error during reboot with hardware grouping
      wifi: ath12k: fix ATH12K_FLAG_REGISTERED flag handling
      wifi: ath12k: fix firmware assert during reboot with hardware grouping
      wifi: ath12k: fix ath12k_core_pre_reconfigure_recovery() with grouping
      wifi: ath12k: handle ath12k_core_restart() with hardware grouping
      wifi: ath12k: handle ath12k_core_reset() with hardware grouping
      wifi: ath12k: reset MLO global memory during recovery

 drivers/net/wireless/ath/ath12k/core.c | 110 ++++++++++++++++++++++++++++++---
 drivers/net/wireless/ath/ath12k/core.h |  15 +----
 drivers/net/wireless/ath/ath12k/mac.c  |   6 --
 drivers/net/wireless/ath/ath12k/pci.c  |  26 +++++++-
 drivers/net/wireless/ath/ath12k/qmi.c  |  22 +++++++
 drivers/net/wireless/ath/ath12k/qmi.h  |   2 +
 drivers/net/wireless/ath/ath12k/reg.c  |   4 ++
 7 files changed, 156 insertions(+), 29 deletions(-)
---
base-commit: ac17b1211841c98a9b4c2900ba2a7f457c80cf90
change-id: 20241218-fix_reboot_issues_with_hw_grouping-0c2d367a587b

Comments

Vasanthakumar Thiagarajan April 8, 2025, 9:50 a.m. UTC | #1
On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> During rmmod of the ath12k module with SLUB debug enabled, the following
> print is seen:
> 
> =============================================================================
> BUG kmalloc-1k (Not tainted): Object already free
> -----------------------------------------------------------------------------
> 
> Allocated in ath12k_reg_build_regd+0x94/0xa20 [ath12k] age=10470 cpu=0 pid=0
>   __kmalloc_noprof+0xf4/0x368
>   ath12k_reg_build_regd+0x94/0xa20 [ath12k]
>   ath12k_wmi_op_rx+0x199c/0x2c14 [ath12k]
>   ath12k_htc_rx_completion_handler+0x398/0x554 [ath12k]
>   ath12k_ce_per_engine_service+0x248/0x368 [ath12k]
>   ath12k_pci_ce_workqueue+0x28/0x50 [ath12k]
>   process_one_work+0x14c/0x28c
>   bh_worker+0x22c/0x27c
>   workqueue_softirq_action+0x80/0x90
>   tasklet_action+0x14/0x3c
>   handle_softirqs+0x108/0x240
>   __do_softirq+0x14/0x20
> Freed in ath12k_reg_free+0x40/0x74 [ath12k] age=136 cpu=2 pid=166
>   kfree+0x148/0x248
>   ath12k_reg_free+0x40/0x74 [ath12k]
>   ath12k_core_hw_group_destroy+0x68/0xac [ath12k]
>   ath12k_core_deinit+0xd8/0x124 [ath12k]
>   ath12k_pci_remove+0x6c/0x130 [ath12k]
>   pci_device_remove+0x44/0xe8
>   device_remove+0x4c/0x80
>   device_release_driver_internal+0x1d0/0x22c
>   driver_detach+0x50/0x98
>   bus_remove_driver+0x70/0xf4
>   driver_unregister+0x30/0x60
>   pci_unregister_driver+0x24/0x9c
>   ath12k_pci_exit+0x18/0x24 [ath12k]
>   __arm64_sys_delete_module+0x1a0/0x2a8
>   invoke_syscall+0x48/0x110
>   el0_svc_common.constprop.0+0x40/0xe0
> Slab 0xfffffdffc0033600 objects=10 used=6 fp=0xffff000000cdcc00 flags=0x3fffe0000000240(workingset|head|node=0|zone=0|lastcpupid=0x1ffff)
> Object 0xffff000000cdcc00 @offset=19456 fp=0xffff000000cde400
> [...]
> 
> This issue arises because in ath12k_core_hw_group_destroy(), each device
> calls ath12k_core_soc_destroy() for itself and all its partners within the
> same group. Since ath12k_core_hw_group_destroy() is invoked for each
> device, this results in a double free condition, eventually causing the
> SLUB bug.
> 
> To resolve this, set the freed pointers to NULL. Since reads of these
> pointers could race with the free, guard them with the available mutex
> lock.
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Fixes: 6f245ea0ec6c ("wifi: ath12k: introduce device group abstraction")
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
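The fix described in the quoted commit message (set freed pointers to NULL and guard them with a mutex) can be illustrated with a minimal userspace sketch. This is not the ath12k code: struct demo_dev, demo_dev_init() and demo_reg_free() are hypothetical names, and a pthread mutex stands in for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-device structure; not the ath12k layout. */
struct demo_dev {
	pthread_mutex_t lock;	/* protects regd */
	int *regd;		/* heap-allocated data (placeholder type) */
	int free_count;		/* counts real frees, for demonstration only */
};

static void demo_dev_init(struct demo_dev *dev)
{
	pthread_mutex_init(&dev->lock, NULL);
	dev->regd = calloc(4, sizeof(*dev->regd));
	dev->free_count = 0;
}

/* Safe to call more than once: a repeated call sees NULL and does nothing. */
static void demo_reg_free(struct demo_dev *dev)
{
	pthread_mutex_lock(&dev->lock);
	if (dev->regd) {
		free(dev->regd);
		dev->regd = NULL;	/* later callers skip the free */
		dev->free_count++;
	}
	assert(dev->regd == NULL);
	pthread_mutex_unlock(&dev->lock);
}
```

Calling demo_reg_free() twice on the same device frees the buffer only once, which is the property needed when every device in a group tears down all of its partners.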
Vasanthakumar Thiagarajan April 8, 2025, 9:52 a.m. UTC | #2
On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> With hardware grouping, during reboot, whenever a device is removed, it
> powers down itself and all its partner devices in the same group. Since
> every device does this, each device is powered down multiple times, and
> the following error messages can be seen:
> 
> ath12k_pci 0002:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0002:01:00.0: failed to set mhi state: POWER_OFF(3)
> ath12k_pci 0002:01:00.0: failed to set mhi state DEINIT(1) in current mhi state (0x0)
> ath12k_pci 0002:01:00.0: failed to set mhi state: DEINIT(1)
> ath12k_pci 0003:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0003:01:00.0: failed to set mhi state: POWER_OFF(3)
> ath12k_pci 0003:01:00.0: failed to set mhi state DEINIT(1) in current mhi state (0x0)
> ath12k_pci 0003:01:00.0: failed to set mhi state: DEINIT(1)
> ath12k_pci 0004:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0004:01:00.0: failed to set mhi state: POWER_OFF(3)
> 
> To prevent this, check if the ATH12K_PCI_FLAG_INIT_DONE flag is already
> set before powering down. If it is set, it indicates that another partner
> device has already performed the power down, and this device can skip this
> step.
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
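The idea of skipping a repeated power-down based on a state flag can be sketched as follows. This is only a model under stated assumptions: struct demo_pci, the powered field and its polarity, and the function names are hypothetical and do not reproduce how the driver uses ATH12K_PCI_FLAG_INIT_DONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device state; not the ath12k_pci structure. */
struct demo_pci {
	bool powered;		/* assumed flag: true between power-up and power-down */
	int power_off_calls;	/* how many times the hardware was really powered off */
};

static void demo_power_up(struct demo_pci *p)
{
	p->powered = true;
}

/* Idempotent power-down: only the first caller touches the hardware. */
static void demo_power_down(struct demo_pci *p)
{
	if (!p->powered)
		return;		/* a partner already powered this device down */

	p->power_off_calls++;	/* stand-in for the real power-off sequence */
	p->powered = false;
	assert(!p->powered);
}
```

With this guard, a second demo_power_down() on the same device returns early instead of attempting another state transition.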
Vasanthakumar Thiagarajan April 8, 2025, 9:53 a.m. UTC | #3
On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> At present, during PCI shutdown, the power down is only executed for a
> single device. However, when operating in a group, all devices need to be
> powered down simultaneously. Failure to do so will result in a firmware
> assertion.
> 
> Hence, introduce a new ath12k_pci_hw_group_power_down() and call it during
> power down. This will ensure that all partner devices are properly powered
> down.
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
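A group-wide power-down, as described in the quoted commit message, might look roughly like the sketch below. It assumes a hypothetical group structure holding pointers to its member devices; the names are illustrative and the loop powers devices down in one pass rather than modelling the real ath12k_pci_hw_group_power_down().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_DEVICES 4

/* Hypothetical device and group types; not the ath12k layout. */
struct demo_device {
	bool powered;
};

struct demo_group {
	struct demo_device *devs[DEMO_MAX_DEVICES];
	size_t num_devices;
};

/* Powers down every member still powered; returns how many were powered down. */
static int demo_group_power_down(struct demo_group *g)
{
	int count = 0;

	assert(g->num_devices <= DEMO_MAX_DEVICES);
	for (size_t i = 0; i < g->num_devices; i++) {
		struct demo_device *d = g->devs[i];

		if (!d || !d->powered)
			continue;	/* empty slot or already down */
		d->powered = false;	/* stand-in for the real power-down */
		count++;
	}
	return count;
}
```

A shutdown path that calls demo_group_power_down() once leaves no partner device powered, regardless of which device triggered the shutdown.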
Vasanthakumar Thiagarajan April 8, 2025, 9:56 a.m. UTC | #4
On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> Currently, when ath12k_core_restart() is called and the ab->is_reset flag
> is set, it invokes ieee80211_restart_hw() for all hardware in the same
> group. However, other hardware might still be in the recovery process,
> making this call inappropriate once grouping is involved.
> 
> To address this, add a condition to check if the group is ready. If the
> group is not ready, do not call ieee80211_restart_hw().
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
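The readiness check described in the quoted commit message can be sketched as below. The commit message does not define "ready", so this sketch assumes it means that no member is still recovering; all type and function names are hypothetical, and a counter stands in for ieee80211_restart_hw().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_HW 4

/* Hypothetical per-hardware state; not the ath12k or mac80211 layout. */
struct demo_hw {
	bool recovering;	/* true while this hardware is still recovering */
	int restart_calls;	/* counts restart requests, for demonstration */
};

struct demo_hw_group {
	struct demo_hw *hw[DEMO_MAX_HW];
	size_t num_hw;
};

/* Assumption: the group is ready when no member is still recovering. */
static bool demo_group_ready(const struct demo_hw_group *g)
{
	assert(g->num_hw <= DEMO_MAX_HW);
	for (size_t i = 0; i < g->num_hw; i++)
		if (g->hw[i]->recovering)
			return false;
	return true;
}

/* Returns true if the restart was issued, false if it was deferred. */
static bool demo_group_restart(struct demo_hw_group *g)
{
	if (!demo_group_ready(g))
		return false;

	for (size_t i = 0; i < g->num_hw; i++)
		g->hw[i]->restart_calls++;	/* stand-in for the restart call */
	return true;
}
```

The restart is issued for the whole group or not at all, so no hardware is restarted while a partner is still mid-recovery.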
Vasanthakumar Thiagarajan April 8, 2025, 10 a.m. UTC | #5
On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> When operating with multiple devices grouped together, the firmware stores
> data related to the state machine of each partner device in the MLO global
> memory region. If the firmware crashes, it updates the state to 'crashed'.
> During recovery, this memory is shared with the firmware again, and upon
> detecting the 'crashed' state, it reasserts. This leads to a loop of
> firmware asserts and it never recovers.
> 
> Hence, to fix this issue, once all devices in the group have been asserted
> and powered down, reset the MLO global memory region.
> 
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
> 
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
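The reset described in the quoted commit message can be modelled in a few lines. This is a sketch under assumptions: the shared region is represented by a small byte array, the group size and field names are hypothetical, and zeroing with memset() is an assumed interpretation of "reset".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_NUM_DEVICES 3
#define DEMO_MLO_MEM_SIZE 16

/* Hypothetical group state; not the ath12k layout. */
struct demo_mlo_group {
	unsigned char mlo_mem[DEMO_MLO_MEM_SIZE]; /* placeholder for the shared region */
	bool powered_down[DEMO_NUM_DEVICES];	  /* per-device power-down status */
};

/*
 * Clears the shared region only after every device has powered down.
 * Returns true if the region was reset, false if a device is still up.
 */
static bool demo_reset_mlo_mem(struct demo_mlo_group *g)
{
	for (size_t i = 0; i < DEMO_NUM_DEVICES; i++)
		if (!g->powered_down[i])
			return false;

	memset(g->mlo_mem, 0, sizeof(g->mlo_mem));
	assert(g->mlo_mem[0] == 0);
	return true;
}
```

Clearing the region only after the last device is down means stale state left by the crash is gone before the memory is handed back to the firmware.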