Message ID | 20200604112040.22144-1-wsa+renesas@sang-engineering.com |
---|---|
Series | renesas_sdhi: fix hang when SCC loses its clock |
Hi Shimoda-san,

tftp 0x58000000 r8a77965-salvator-xs.dtb; tftp 0x50000000 Image-m3n-wsa; booti 0x50000000 - 0x58000000

> > > > +	/* Tuning done, no special handling for SCC clock needed anymore */
> > > > +	priv->keep_scc_freq = false;
> > > > +
> > >
> > > Setting keep_scc_freq to false is only here. But, I'm thinking
> > > we should set it in some error paths like below somehow too:
> > >  - error paths before hs400_complete() in mmc_select_hs400().
> > >  - error path of mmc_execute_tuning() in mmc_retune().
> >
> > Hmm, I guess you are right. That would kind of spoil my approach taken
> > here. Maybe we need another flag in the core like 'doing_tune' to
> > supplement 'doing_retune', so our driver knows when any kind of tuning is
> > going on?
>
> Adding such a new flag is better, I think.

So, I added a flag to the MMC core and I think it should work. However,
I can't test it currently because, sadly, the issue disappeared again :(
I can't even reproduce the issue with the same codebase and config which
I used when I was last working on it. And back then, the issue was
happening.

I am currently at a loss as to what really triggers this hang. I added some
code to enforce reading something from the SCC with the hclk disabled.
However, that read works fine here today, no hang. So, it seems that
keeping hclk enabled will fix the hang. However, it doesn't look like it
will hang just because we allow hclk to be disabled. It seems something
else is part of the equation, too...

I kept trying to figure this out for the last two days, but no success
so far.

Will keep you updated.

Thanks,

   Wolfram
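For readers following the 'doing_tune' idea quoted above, here is a minimal user-space sketch in C. It assumes a flag that the core sets before any kind of tuning and clears on every exit path, so the driver does not have to reset its own state in each error path. All names (model_host, model_tune, model_execute_tuning) are hypothetical stand-ins for illustration; this is not the MMC core API and not the actual patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for host state; not the real struct mmc_host. */
struct model_host {
	bool doing_tune;           /* assumed flag: some tuning is in progress */
	bool flag_seen_by_driver;  /* what the simulated driver observed */
	int tune_result;           /* value the simulated tuning step returns */
};

/* Simulated driver tuning step; it only records whether the flag was set. */
static int model_execute_tuning(struct model_host *host)
{
	host->flag_seen_by_driver = host->doing_tune;
	return host->tune_result;
}

/*
 * The core owns the flag: set before tuning, cleared afterwards regardless
 * of the result, so a failed tuning attempt cannot leave it set.
 */
static int model_tune(struct model_host *host)
{
	int err;

	assert(!host->doing_tune); /* this model does not nest tuning */
	host->doing_tune = true;
	err = model_execute_tuning(host);
	host->doing_tune = false;

	return err;
}
```

With tune_result set to -EILSEQ, model_tune() returns the error, the simulated driver has seen the flag as set, and the flag is clear again afterwards. That is the property the error-path discussion above asks for.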
Hi Wolfram-san,

> From: Wolfram Sang, Sent: Friday, August 14, 2020 4:15 PM
>
> > > > > +	/* Tuning done, no special handling for SCC clock needed anymore */
> > > > > +	priv->keep_scc_freq = false;
> > > > > +
> > > >
> > > > Setting keep_scc_freq to false is only here. But, I'm thinking
> > > > we should set it in some error paths like below somehow too:
> > > >  - error paths before hs400_complete() in mmc_select_hs400().
> > > >  - error path of mmc_execute_tuning() in mmc_retune().
> > >
> > > Hmm, I guess you are right. That would kind of spoil my approach taken
> > > here. Maybe we need another flag in the core like 'doing_tune' to
> > > supplement 'doing_retune', so our driver knows when any kind of tuning is
> > > going on?
> >
> > Adding such a new flag is better, I think.
>
> So, I added a flag to the MMC core and I think it should work. However,
> I can't test it currently because, sadly, the issue disappeared again :(

I got a report from a colleague about this issue. According to the report,
this issue is related to retuning. When retuning happens, the mmc core
calls mmc_hs400_to_hs200(), which first sets the clock to 52 MHz.
So, this can cause the issue.

It's difficult to cause retuning in a normal situation. But, according to
the report, if we add code that makes the sdhi driver report an error
once at the first CMD18, we can cause retuning and then the issue happens.

Best regards,
Yoshihiro Shimoda
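The reproduction recipe above (report an error once at the first CMD18) can be sketched as a one-shot fault injector. This is a self-contained user-space model in C, not code from the sdhi driver. The names model_injector and model_request_result are hypothetical, and the choice of -EILSEQ as the injected error is an assumption, since the report does not say which error code was used.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_CMD18 18u /* READ_MULTIPLE_BLOCK opcode */

/* Hypothetical one-shot injector state. */
struct model_injector {
	bool fired; /* true once the fake error has been reported */
};

/*
 * Return the result to report for a finished command. The first CMD18 is
 * turned into an error exactly once; everything else passes through.
 */
static int model_request_result(struct model_injector *inj,
				unsigned int opcode, int real_err)
{
	assert(inj);

	if (opcode == MODEL_CMD18 && !inj->fired) {
		inj->fired = true;
		return -EILSEQ; /* assumed error code, see note above */
	}

	return real_err;
}
```

Keeping the "fired" state in a struct rather than a static variable makes the injector easy to reset between test runs.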
Hi Shimoda-san,

> I got a report from a colleague about this issue. According to the report,
> this issue is related to retuning. When retuning happens, the mmc core
> calls mmc_hs400_to_hs200() and then mmc_hs400_to_hs200() will set the clock
> as 52MHz at first. So, it's possible to cause the issue.
>
> It's difficult to cause retuning in normal situation. But, according to
> the report, if we add a code which the sdhi driver reports an error
> at the first CMD18 once, we can cause retuning and then the issue happens.

I took the liberty of a different approach because I wanted to reproduce
the issue when doing the initial tuning and not a retune. My new series
adds (and checks) a flag for doing_initial_tune, so I really wanted to
exercise this code path. This is a real problem, too, because I saw
this with my boards back then.

And hallelujah, today I saw it again, once. I switched to my H3-ES2.0
board which I haven't used for weeks. And when booting that for the
first time, I got a failure including logs. Later boots just went fine.
And because of the logs, I could finally inject an error which will
reproducibly cause the boot to hang because of a stalled SCC. Tada, here
is the patch:

From: Wolfram Sang <wsa+renesas@sang-engineering.com>
Subject: [PATCH] GOLD: simulate stalled SCC

Geez, this took ages to find...
---
 drivers/mmc/core/mmc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/mmc/core/mmc.c b/drivers/mmc/core/mmc.c
index 216bd1aed373..6b3056437b37 100644
--- a/drivers/mmc/core/mmc.c
+++ b/drivers/mmc/core/mmc.c
@@ -1218,6 +1218,7 @@ static int mmc_select_hs400(struct mmc_card *card)
 		host->ops->hs400_complete(host);
 
 	err = mmc_switch_status(card, true);
+	err = -EILSEQ;
 	if (err)
 		goto out_err;
 

Interestingly, the other mmc_switch_status() in mmc_select_hs400() was
not stalling the SCC. Anyhow, after this failure, the MMC core switches
back to 300kHz and the SCC clock is off, but for some reason the SCC is
still accessed. I will investigate why. The good news is that my new
patch set fixes the hang as expected. The board will continue to boot, so
we probably want to have this series. However, I have the feeling that
this SCC access which hangs the board might be a bug because of an
unintended code path. I mean, this is also one reason why the bug
triggers so rarely these days. We have been fixing a lot of things and
the SCC is only accessed when it should be accessed. We will see. I also
need to test other boards.

So much for now, I hope I can report more later.

Happy hacking and kind regards,

   Wolfram
> not stalling the SCC. Anyhow, after this failure, the MMC core switches
> back to 300kHz and the SCC clock is off but for some reason SCC is still
> accessed. I will investigate why. The good news is that my new patch set
> fixes the hang as expected. The board will continue to boot so we
> probably want to have this series. However, I have the feeling that this
> SCC access which hangs the board might be a bug because of an unintended
> code path. I mean, this is also one reason why the bug triggers so
> rarely these days. We have been fixing a lot of things and the SCC is
> only accessed when it should be accessed. We will see. I also need to
> test other boards, too.

Some more good news: I can reproduce the issue now not only with H3-ES2.0
but also with my M3-N.

Interesting news: The hang comes from a code path I would not have
expected. It is not caused by accessing an SCC register; it is this line
from renesas_sdhi_set_clock() which causes the issue:

	sd_ctrl_write16(host, CTL_SD_CARD_CLK_CTL, clk & CLK_CTL_DIV_MASK);

I mean, I can guess that the clock setting has something to do with the
SCC, but I can't see the direct connection in the documentation I have.
I will stop that research here and will now prepare my series to leave
the SCC clock enabled as long as some tuning is in progress.
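One possible reading of "leave the SCC clock enabled as long as some tuning is in progress" is a guard in the clock update path that keeps the current rate while a tuning flag is set. The sketch below is a user-space model in C under that assumption only. The names model_clk and model_clk_update are hypothetical, and this is not the series mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the clock feeding the SCC. */
struct model_clk {
	unsigned long rate_hz;    /* currently programmed rate */
	bool tuning_in_progress;  /* assumed flag provided by the core */
};

/*
 * Apply a new rate unless tuning is in progress; in that case keep the
 * current rate so the SCC does not lose its clock. Returns the rate that
 * is in effect after the call.
 */
static unsigned long model_clk_update(struct model_clk *clk,
				      unsigned long new_hz)
{
	assert(clk);

	if (clk->tuning_in_progress)
		return clk->rate_hz; /* request ignored while tuning */

	clk->rate_hz = new_hz;
	return clk->rate_hz;
}
```

In this model, a request to drop to 300 kHz during tuning leaves the rate untouched, and the same request succeeds once the flag is cleared.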