Message ID | 20250408-fix_reboot_issues_with_hw_grouping-v4-0-95e7bf048595@oss.qualcomm.com |
---|---|
Series | wifi: ath12k: fixes for rmmod and recovery issues with hardware grouping |

On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> During rmmod of ath12k module with SLUB debug enabled, following print is
> seen -
>
> =============================================================================
> BUG kmalloc-1k (Not tainted): Object already free
> -----------------------------------------------------------------------------
>
> Allocated in ath12k_reg_build_regd+0x94/0xa20 [ath12k] age=10470 cpu=0 pid=0
>  __kmalloc_noprof+0xf4/0x368
>  ath12k_reg_build_regd+0x94/0xa20 [ath12k]
>  ath12k_wmi_op_rx+0x199c/0x2c14 [ath12k]
>  ath12k_htc_rx_completion_handler+0x398/0x554 [ath12k]
>  ath12k_ce_per_engine_service+0x248/0x368 [ath12k]
>  ath12k_pci_ce_workqueue+0x28/0x50 [ath12k]
>  process_one_work+0x14c/0x28c
>  bh_worker+0x22c/0x27c
>  workqueue_softirq_action+0x80/0x90
>  tasklet_action+0x14/0x3c
>  handle_softirqs+0x108/0x240
>  __do_softirq+0x14/0x20
> Freed in ath12k_reg_free+0x40/0x74 [ath12k] age=136 cpu=2 pid=166
>  kfree+0x148/0x248
>  ath12k_reg_free+0x40/0x74 [ath12k]
>  ath12k_core_hw_group_destroy+0x68/0xac [ath12k]
>  ath12k_core_deinit+0xd8/0x124 [ath12k]
>  ath12k_pci_remove+0x6c/0x130 [ath12k]
>  pci_device_remove+0x44/0xe8
>  device_remove+0x4c/0x80
>  device_release_driver_internal+0x1d0/0x22c
>  driver_detach+0x50/0x98
>  bus_remove_driver+0x70/0xf4
>  driver_unregister+0x30/0x60
>  pci_unregister_driver+0x24/0x9c
>  ath12k_pci_exit+0x18/0x24 [ath12k]
>  __arm64_sys_delete_module+0x1a0/0x2a8
>  invoke_syscall+0x48/0x110
>  el0_svc_common.constprop.0+0x40/0xe0
> Slab 0xfffffdffc0033600 objects=10 used=6 fp=0xffff000000cdcc00 flags=0x3fffe0000000240(workingset|head|node=0|zone=0|lastcpupid=0x1ffff)
> Object 0xffff000000cdcc00 @offset=19456 fp=0xffff000000cde400
> [...]
>
> This issue arises because in ath12k_core_hw_group_destroy(), each device
> calls ath12k_core_soc_destroy() for itself and all its partners within the
> same group. Since ath12k_core_hw_group_destroy() is invoked for each
> device, this results in a double free condition, eventually causing the
> SLUB bug.
>
> To resolve this, set the freed pointers to NULL. And since there could be
> a race condition to read these pointers, guard these with the available
> mutex lock.
>
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>
> Fixes: 6f245ea0ec6c ("wifi: ath12k: introduce device group abstraction")
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
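
For readers following the thread, the "set the freed pointer to NULL under the mutex" idea can be illustrated with a small userspace sketch. This is not the driver code: `struct demo_dev`, `demo_dev_init()` and `demo_dev_free_regd()` are hypothetical names, and pthread mutexes and `free()` stand in for the kernel mutex and `kfree()`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a device that owns one heap buffer. */
struct demo_dev {
	pthread_mutex_t lock;	/* guards regd */
	int *regd;		/* buffer released during teardown */
	int free_calls;		/* counts real frees, for illustration only */
};

static void demo_dev_init(struct demo_dev *dev)
{
	pthread_mutex_init(&dev->lock, NULL);
	dev->regd = calloc(16, sizeof(*dev->regd));
	dev->free_calls = 0;
}

/* Safe to call repeatedly: later callers see NULL and do nothing. */
static void demo_dev_free_regd(struct demo_dev *dev)
{
	pthread_mutex_lock(&dev->lock);
	if (dev->regd) {
		free(dev->regd);
		dev->regd = NULL;	/* a second call must not free again */
		dev->free_calls++;
	}
	pthread_mutex_unlock(&dev->lock);
}
```

Calling `demo_dev_free_regd()` once per group member on the same device then frees the buffer only on the first call, which is the behaviour the commit message describes.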

On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> With hardware grouping, during reboot, whenever a device is removed, it
> powers down itself and all its partner devices in the same group. Now this
> is done by all devices and hence there is multiple power down for devices
> and hence the following error messages can be seen:
>
> ath12k_pci 0002:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0002:01:00.0: failed to set mhi state: POWER_OFF(3)
> ath12k_pci 0002:01:00.0: failed to set mhi state DEINIT(1) in current mhi state (0x0)
> ath12k_pci 0002:01:00.0: failed to set mhi state: DEINIT(1)
> ath12k_pci 0003:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0003:01:00.0: failed to set mhi state: POWER_OFF(3)
> ath12k_pci 0003:01:00.0: failed to set mhi state DEINIT(1) in current mhi state (0x0)
> ath12k_pci 0003:01:00.0: failed to set mhi state: DEINIT(1)
> ath12k_pci 0004:01:00.0: failed to set mhi state POWER_OFF(3) in current mhi state (0x0)
> ath12k_pci 0004:01:00.0: failed to set mhi state: POWER_OFF(3)
>
> To prevent this, check if the ATH12K_PCI_FLAG_INIT_DONE flag is already
> set before powering down. If it is set, it indicates that another partner
> device has already performed the power down, and this device can skip this
> step.
>
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
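
A minimal userspace sketch of a flag-guarded, idempotent power down follows. All names are hypothetical, and the sketch assumes a per-device flag that is true while the device is initialized and is cleared by the first power down; how ATH12K_PCI_FLAG_INIT_DONE is actually set, cleared and tested in the driver is not shown in this thread.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_GROUP_SIZE 3

/* Hypothetical per-device state. */
struct demo_pci_dev {
	bool init_done;		/* assumed: true while the device is powered up */
	int power_off_ops;	/* how often the hardware was really powered off */
};

static void demo_power_down(struct demo_pci_dev *dev)
{
	if (!dev->init_done)
		return;		/* a partner already powered this device down */

	dev->init_done = false;
	dev->power_off_ops++;	/* stand-in for the real power-off sequence */
}

/* Removing any one device powers down every member of its group. */
static void demo_remove(struct demo_pci_dev *group, int n)
{
	for (int i = 0; i < n; i++)
		demo_power_down(&group[i]);
}
```

With the guard in place, running `demo_remove()` once per device still powers each device off exactly once, so the repeated state-transition failures quoted above would not be triggered in this model.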

On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> At present, during PCI shutdown, the power down is only executed for a
> single device. However, when operating in a group, all devices need to be
> powered down simultaneously. Failure to do so will result in a firmware
> assertion.
>
> Hence, introduce a new ath12k_pci_hw_group_power_down() and call it during
> power down. This will ensure that all partner devices are properly powered
> down.
>
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>

On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> Currently, when ath12k_core_restart() is called and the ab->is_reset flag
> is set, it invokes ieee80211_restart_hw() for all hardware in the same
> group. However, other hardware might still be in the recovery process,
> making this call inappropriate with grouping into picture.
>
> To address this, add a condition to check if the group is ready. If the
> group is not ready, do not call ieee80211_restart_hw().
>
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
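
The "only restart once the group is ready" condition can be modelled roughly as below. This is a sketch under assumptions: readiness is tracked here with a simple counter of recovered devices, which may differ from how the driver decides that a group is ready, and `struct demo_group`, `demo_group_ready()` and `demo_core_restart()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical group state. */
struct demo_group {
	int num_devices;	/* devices that belong to the group */
	int num_ready;		/* devices that finished their own recovery */
	int restart_calls;	/* stand-in for ieee80211_restart_hw() calls */
};

static bool demo_group_ready(const struct demo_group *g)
{
	return g->num_ready == g->num_devices;
}

/* Called once per device when that device has finished restarting. */
static void demo_core_restart(struct demo_group *g)
{
	g->num_ready++;

	if (!demo_group_ready(g))
		return;		/* partners still recovering: do not restart yet */

	g->restart_calls++;	/* whole group recovered: restart the hardware */
}
```

In this model the restart is triggered only by the last device to recover, never while a partner is still mid-recovery.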

On 4/8/2025 11:36 AM, Aditya Kumar Singh wrote:
> When operating with multiple devices grouped together, the firmware stores
> data related to the state machine of each partner device in the MLO global
> memory region. If the firmware crashes, it updates the state to 'crashed'.
> During recovery, this memory is shared with the firmware again, and upon
> detecting the 'crashed' state, it reasserts. This leads to a loop of
> firmware asserts and it never recovers.
>
> Hence to fix this issue, once all devices in the group have been asserted
> and powered down, reset the MLO global memory region.
>
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.3.1-00173-QCAHKSWPL_SILICONZ-1
> Tested-on: QCN9274 hw2.0 PCI WLAN.WBE.1.4.1-00199-QCAHKSWPL_SILICONZ-1
> Tested-on: WCN7850 hw2.0 PCI WLAN.HMT.1.0.c5-00481-QCAHMTSWPL_V1.0_V2.0_SILICONZ-3
>
> Signed-off-by: Aditya Kumar Singh <aditya.kumar.singh@oss.qualcomm.com>

Reviewed-by: Vasanthakumar Thiagarajan <vasanthakumar.thiagarajan@oss.qualcomm.com>
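
As a rough illustration of "reset the shared region once every device in the group is down", consider the userspace sketch below. The buffer size, the counter and all names are assumptions made for the example; the real layout of the MLO global memory region and the driver's bookkeeping are not described in this thread.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical group state with a shared region the firmware writes to. */
struct demo_mlo_group {
	unsigned char mlo_mem[64];	/* stand-in for the MLO global memory */
	int num_devices;		/* devices in the group */
	int num_powered_down;		/* devices already asserted and powered down */
};

/* Called once for each device after it has been powered down. */
static void demo_device_powered_down(struct demo_mlo_group *g)
{
	g->num_powered_down++;

	if (g->num_powered_down < g->num_devices)
		return;		/* a partner is still up: leave the region alone */

	/* Last device is down: wipe stale state before the next firmware boot. */
	memset(g->mlo_mem, 0, sizeof(g->mlo_mem));
	g->num_powered_down = 0;
}
```

Clearing the region only after the last device is down means the firmware never sees a leftover 'crashed' marker when the memory is shared with it again during recovery.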