Message ID | 20250314012927.150860-1-jiangjianjun3@huawei.com |
---|---|
Series | scsi: scsi_error: Introduce new error handle mechanism |
On 3/13/25 6:29 PM, JiangJianJun wrote:
> This is preparation for a general target based error handle strategy
> to check if to wake up actual error handler.

I don't like this change because it slows down the hot path for LLD drivers that do not set starget->can_queue. Why is this change necessary? What are the alternatives?

Thanks,

Bart.
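The hot-path concern can be illustrated with a small user-space model. This is only a sketch, not the code from the patch or from the kernel: `struct model_target`, `model_target_get()` and `model_target_put()` are hypothetical names, and the only assumption carried over from the discussion is that a `can_queue` value of zero or less means the driver set no per-target limit. With the guard in place, such drivers pay for no atomic operation per command; counting unconditionally would remove that shortcut.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for a SCSI target; not the kernel struct. */
struct model_target {
    int can_queue;     /* <= 0: the driver set no per-target queue limit */
    atomic_int busy;   /* commands currently in flight (only if limited) */
};

/* Returns true if one more command may be dispatched to the target. */
static bool model_target_get(struct model_target *t)
{
    assert(t != NULL);

    /* Fast path: no limit configured, so no atomic operation is needed. */
    if (t->can_queue <= 0)
        return true;

    /* Slow path: count the command and back out if the limit is exceeded. */
    int busy = atomic_fetch_add(&t->busy, 1) + 1;
    if (busy > t->can_queue) {
        atomic_fetch_sub(&t->busy, 1);
        return false;
    }
    return true;
}

/* Releases the slot taken by a successful model_target_get(). */
static void model_target_put(struct model_target *t)
{
    assert(t != NULL);

    if (t->can_queue > 0)
        atomic_fetch_sub(&t->busy, 1);
}
```

In this model a target with `can_queue == 0` never touches `busy`, while a target with `can_queue == 1` accepts one command and rejects the second until `model_target_put()` is called.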
On 3/14/25 02:29, JiangJianJun wrote:
> It's unbearable for systems with large scale scsi devices share HBAs to
> block all devices' IOs when handle error commands, we need a new error
> handle mechanism to address this issue.
>
> I consulted about this issue a year ago, the discuss link can be found in
> reference. Hannes replied about why we have to block the SCSI host
> then perform error recovery kindly. I think it's unnecessary to block
> SCSI host for all drivers and can try a small level recovery(LUN based for
> example) first to avoid block the SCSI host.
>
Technically, yes.
There are, however, some issues which would need to be addressed if someone were to design a new error handler.

1. The 'LUN Reset' TMF (as it's currently being used) is badly scoped; it will reset the LUN itself, affecting all ports to that LUN. So in a multipathed/multiported environment all initiators will be affected, even if they haven't experienced an error. Is that what we want? Shouldn't we rather use the 'Reset IT Nexus' TMF here? And, of course, the 'Target Reset' TMF has been dropped from SAM, so I really don't see the point in spending time here ...

2. Irrespective of the EH granularity, any error handling requires that all activity on that level has to be stopped. If you need to issue a LUN reset, you need to stop I/O for that LUN.

3. The current EH framework is designed around 'struct scsi_cmnd'. Which means that the command _initiating_ the error handling can only be returned once the _entire_ error handling (with all escalations) is finished. And more often than not, the application is waiting on that command to be completed before the next I/O is sent. And that really limits the effectiveness of any improved error handler; the application ultimately has to wait for a host reset before it can continue.

But anyway. We already have a mechanism for asynchronous command aborts; have you checked if you can adapt it for LUN reset, too?
That would be the easiest solution, I guess ...

Cheers,

Hannes
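Point 3 above can be made concrete with a small user-space model of an escalation ladder. This is a sketch under stated assumptions, not the kernel's error handler: the `eh_level` enum is a simplified ladder (abort, LUN reset, target reset, bus reset, host reset), and `eh_escalate()`, `struct eh_result` and `succeed_at()` are hypothetical names introduced only for illustration. The point it shows is that the caller gets a result, and so could complete the failed command, only after the loop has run through every level it needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified escalation ladder, ordered from narrowest to widest scope. */
enum eh_level {
    EH_ABORT,
    EH_LUN_RESET,
    EH_TARGET_RESET,
    EH_BUS_RESET,
    EH_HOST_RESET,
    EH_LEVELS
};

/* Hypothetical recovery callback: true if recovery at this level worked. */
typedef bool (*eh_try_fn)(enum eh_level level, void *ctx);

struct eh_result {
    bool recovered;             /* did any level succeed? */
    enum eh_level final_level;  /* last level that was attempted */
    int attempts;               /* number of levels tried */
};

/*
 * Try each level in turn, starting at 'start'. The function returns only
 * once a level succeeds or the ladder is exhausted, which models why the
 * command that triggered recovery stays outstanding for the whole time.
 */
static struct eh_result eh_escalate(enum eh_level start, eh_try_fn try_level,
                                    void *ctx)
{
    struct eh_result res = { false, start, 0 };

    assert(try_level != NULL);

    for (int lvl = start; lvl < EH_LEVELS; lvl++) {
        res.final_level = (enum eh_level)lvl;
        res.attempts++;
        if (try_level(res.final_level, ctx)) {
            res.recovered = true;
            break;
        }
    }
    return res;
}

/* Example callback: recovery works at the level stored in ctx or wider. */
static bool succeed_at(enum eh_level level, void *ctx)
{
    return level >= *(enum eh_level *)ctx;
}
```

With `succeed_at()` configured to need a host reset, a recovery that starts at `EH_LUN_RESET` walks four levels before it returns, so a finer starting granularity alone does not shorten the wait for the submitter of the failed command.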
On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can
> only be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application
> is waiting on that command to be completed before the next I/O
> is sent. And that really limits the effectiveness of any improved
> error handler; the application ultimately has to wait for a
> host reset before it can continue.

And someone needs to get your old series to fix that merged before we even start talking about any major EH change.
Sorry for the late reply! I'm working on fixing and testing these issues before re-sending the series.

-----Original Message-----
From: Christoph Hellwig <hch@infradead.org>
Sent: March 20, 2025 14:06
To: Hannes Reinecke <hare@suse.de>
Cc: Jiangjianjun <jiangjianjun3@huawei.com>; jejb@linux.ibm.com; martin.petersen@oracle.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; lixiaokeng <lixiaokeng@huawei.com>; hewenliang (C) <hewenliang4@huawei.com>; Yangkunlin(Poincare) <yangkunlin7@huawei.com>
Subject: Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism

On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
> 3. The current EH framework is designed around 'struct scsi_cmnd'.
> Which means that the command _initiating_ the error handling can only
> be returned once the _entire_ error handling (with all
> escalations) is finished. And more often than not, the application is
> waiting on that command to be completed before the next I/O is sent.
> And that really limits the effectiveness of any improved error
> handler; the application ultimately has to wait for a host reset
> before it can continue.

And someone needs to get your old series to fix that merged before we even start talking about any major EH change.
On 31/03/2025 04:10, Jiangjianjun wrote:
> Sorry for late message! I'm working on fixing and testing these issues before re-emailing.

What are you actually working on?

It seems that Hannes' "scsi: EH rework, main part" series, and maybe this one, can help resolve the following issue:
https://lore.kernel.org/linux-block/eef1e927-c9b2-c61d-7f48-92e65d8b0418@huawei.com/

with a fix attempted in:
https://lore.kernel.org/linux-ide/20241031140731.224589-4-cassel@kernel.org/

so that we don't see "fixes" like:
https://lore.kernel.org/linux-scsi/20250329073236.2300582-1-liyihang9@huawei.com/T/#m80bcb3f57fd176b7ce41b1f26e8560de6ad52c9d

>
> -----Original Message-----
> From: Christoph Hellwig <hch@infradead.org>
> Sent: March 20, 2025 14:06
> To: Hannes Reinecke <hare@suse.de>
> Cc: Jiangjianjun <jiangjianjun3@huawei.com>; jejb@linux.ibm.com; martin.petersen@oracle.com; linux-scsi@vger.kernel.org; linux-kernel@vger.kernel.org; lixiaokeng <lixiaokeng@huawei.com>; hewenliang (C) <hewenliang4@huawei.com>; Yangkunlin(Poincare) <yangkunlin7@huawei.com>
> Subject: Re: [RFC PATCH v3 00/19] scsi: scsi_error: Introduce new error handle mechanism
>
> On Fri, Mar 14, 2025 at 10:01:40AM +0100, Hannes Reinecke wrote:
>> 3. The current EH framework is designed around 'struct scsi_cmnd'.
>> Which means that the command _initiating_ the error handling can only
>> be returned once the _entire_ error handling (with all
>> escalations) is finished. And more often than not, the application is
>> waiting on that command to be completed before the next I/O is sent.
>> And that really limits the effectiveness of any improved error
>> handler; the application ultimately has to wait for a host reset
>> before it can continue.
>
> And someone needs to get your old series to fix that merged before we even start talking about any major EH change.
>