rn5t618_wdt: WATCHDOG_NOWAYOUT not working

Message ID	CAMeyCbiVVTmpV8pazAPLtew0yQSYeb5ykeHzCs_A=nBUK7d1bg@mail.gmail.com
State	New
Headers	show Return-Path: <linux-watchdog-owner@kernel.org> MIME-Version: 1.0 From: Kegl Rohit <keglrohit@gmail.com> Date: Fri, 12 Aug 2022 11:56:04 +0200 Message-ID: <CAMeyCbiVVTmpV8pazAPLtew0yQSYeb5ykeHzCs_A=nBUK7d1bg@mail.gmail.com> Subject: rn5t618_wdt: WATCHDOG_NOWAYOUT not working To: linux-watchdog@vger.kernel.org Content-Type: multipart/mixed; boundary="000000000000fb551205e6084ac9" Precedence: bulk
Series	rn5t618_wdt: WATCHDOG_NOWAYOUT not working \| expand rn5t618_wdt: WATCHDOG_NOWAYOUT not working

Message ID

CAMeyCbiVVTmpV8pazAPLtew0yQSYeb5ykeHzCs_A=nBUK7d1bg@mail.gmail.com

State

New

Headers

MIME-Version: 1.0
From: Kegl Rohit <keglrohit@gmail.com>
Date: Fri, 12 Aug 2022 11:56:04 +0200
Message-ID: 
 <CAMeyCbiVVTmpV8pazAPLtew0yQSYeb5ykeHzCs_A=nBUK7d1bg@mail.gmail.com>
Subject: rn5t618_wdt: WATCHDOG_NOWAYOUT not working
To: linux-watchdog@vger.kernel.org
Content-Type: multipart/mixed; boundary="000000000000fb551205e6084ac9"
Precedence: bulk

Series

rn5t618_wdt: WATCHDOG_NOWAYOUT not working | expand

Commit Message

Kegl Rohit Aug. 12, 2022, 9:56 a.m. UTC

Hello!

Our board uses the RN5T567 as /dev/watchdog source via i2c communication.
RN5T567 is using the rn5t618_wdt.c driver

Our kernel has CONFIG_WATCHDOG_NOWAYOUT=y enabled

# Starting the wdt works as expected
echo -n '1' > /dev/watchdog

# Stopping the wdt works as expected no reboot will be issued
echo -n 'V' > /dev/watchdog

# Starting the wdt again will enable the wdt again
# BUT while the wdt is triggered every second the system reboots
while true; do echo -n '1' > /dev/watchdog; sleep 1; done

Digging deeper into the issue I could find out that the remap is
initialized to cache the register accesses RN5T618_WATCHDOG (0x0b) and
RN5T618_PWRIRQ (0x13).

So I expect that because of this caching the IRQ status bit was never reset.
The status register must not be cached because its set by the RN5T567.
Also it is not ideal to cache the access to the watchdog register
which resets the counter via read write cycle.

debugfs shows the regmap setting for these registers:
[root@imx7d /sys/kernel/debug]# cat regmap/0-0033/access
// third column means volatile yes or no
…
0b: y y n n
…
13: y y n n

After marking these registers volatile, stopping the wdt and starting
again seems to work.

Furthermore it is not necessary to do a RN5T618_WATCHDOG read AND
write cycle to reset the wdt counter.
The source code states:
/* The counter is restarted after a R/W access to watchdog register */

The RN5T567 datasheet states:
“The count value of watchdog timer is cleared by accessing (R/W) to
this register.”

Tests showed that a single read is enough. I did not check other chip
variants which use the same driver.

In my opinion a write cycle is even dangerous if there is some strange
situation and the write cycle disables the wdt or changes the wdt
settings stored in this register.

I still don't know why the irq status bit is cleared on every ping()
but I kept it there.
Attached is my patch tested with RN5T567.
Is the rn5t618_wdt.c driver maintained? Strange that such issue was
never noticed.

Comments

Guenter Roeck Aug. 12, 2022, 10:08 p.m. UTC | #1

On 8/12/22 02:56, Kegl Rohit wrote:
> Hello!
> 
> Our board uses the RN5T567 as /dev/watchdog source via i2c communication.
> RN5T567 is using the rn5t618_wdt.c driver
> 
> Our kernel has CONFIG_WATCHDOG_NOWAYOUT=y enabled
> 
> # Starting the wdt works as expected
> echo -n '1' > /dev/watchdog
> 
> # Stopping the wdt works as expected no reboot will be issued
> echo -n 'V' > /dev/watchdog
> 
> # Starting the wdt again will enable the wdt again
> # BUT while the wdt is triggered every second the system reboots
> while true; do echo -n '1' > /dev/watchdog; sleep 1; done
> 
> Digging deeper into the issue I could find out that the remap is
> initialized to cache the register accesses RN5T618_WATCHDOG (0x0b) and
> RN5T618_PWRIRQ (0x13).
> 
> So I expect that because of this caching the IRQ status bit was never reset.
> The status register must not be cached because its set by the RN5T567.
> Also it is not ideal to cache the access to the watchdog register
> which resets the counter via read write cycle.
> 
> debugfs shows the regmap setting for these registers:
> [root@imx7d /sys/kernel/debug]# cat regmap/0-0033/access
> // third column means volatile yes or no
> …
> 0b: y y n n
> …
> 13: y y n n
> 
> After marking these registers volatile, stopping the wdt and starting
> again seems to work.
> 
> Furthermore it is not necessary to do a RN5T618_WATCHDOG read AND
> write cycle to reset the wdt counter.
> The source code states:
> /* The counter is restarted after a R/W access to watchdog register */
> 
> The RN5T567 datasheet states:
> “The count value of watchdog timer is cleared by accessing (R/W) to
> this register.”
> 
> Tests showed that a single read is enough. I did not check other chip
> variants which use the same driver.
> 
> In my opinion a write cycle is even dangerous if there is some strange
> situation and the write cycle disables the wdt or changes the wdt
> settings stored in this register.
> 
> I still don't know why the irq status bit is cleared on every ping()
> but I kept it there.
> Attached is my patch tested with RN5T567.

It is sent as attachment and thus unusable. In addition to that, it is
not a proper patch.

> Is the rn5t618_wdt.c driver maintained? Strange that such issue was
> never noticed.

I'd rather assume that your usage pattern is one that no one ever tried
before because it is quite unusual. Why enable NOWAYOUT just to violate
it continuously ?

Anyway, I think you are a bit off track here.

The register 0x13 should probably be volatile. However, register 0x0b
is not updated by the chip and thus does not need to be volatile.

Replacing the various regmap_update_bits() with regmap_write() is just
papering over the problem. To force the write to the chip even if nothing
changed, regmap_write_bits() should be used instead of regmap_update_bits()
where appropriate.
That is probably also the core of the problem: rn5t618_wdt_start() calls
regmap_update_bits(). Since no bit was changed, there is actually no write
to the chip. Ultimately this causes a timeout since the watchdog is never
stopped and the counter is never reset. To avoid that and to make
sure that the counter is reset, it is probably sufficient to replace the
regmap_update_bits() to enable the watchdog with regmap_write_bits().

The read operation in rn5t618_wdt_ping() doesn't really do anything
because the register value is cached. The read only happens to have
a value to write. I can not follow your logic that that write could
somehow be dangerous. It writes the value that is expected for the
register. There is nothing dangerous about it. If there was indeed
a problem, you would have to provide specifics; you can not just make
such a claim without basis in fact.

The interrupt status register at 0x13 should probably be declared volatile
to avoid "loss" of interrupt status for other drivers. Clearing the interrupt
bit is only necessary to ensure that an expired interrupt is caught if the
'ping' comes late. The same is true for the start function. In normal operation
it should not be needed.

Either case, if you want your code to be considered, you'll have to split
it into two patches, one for mfd and one for watchdog, and you'll have to submit
proper patches as outlined in Documentation/process/submitting-patches.rst.

Thanks,
Guenter

Kegl Rohit Aug. 14, 2022, 4:44 p.m. UTC | #2

Thanks for clarifying some parts.

> I'd rather assume that your usage pattern is one that no one ever tried
> before because it is quite unusual. Why enable NOWAYOUT just to violate
> it continuously ?
Why violate? NOWAYOUT should be a pretty straight forward kernel config...
When starting an in-place firmware update its quite useful to stop the
wdt before terminating all applications.
After the upgrade is done the applications and wdt are restarted.


> Anyway, I think you are a bit off track here.
> The register 0x13 should probably be volatile.

> However, register 0x0b is not updated by the chip and thus does not need to be volatile.
That right, but if regmap_read and regmap_write is cached then it
would be necessary to cause some read or write.
I did not find documentation about the regmap caching.
I am new to regmap, therefore i asked for others input.

For example this part:

/* The counter is restarted after a R/W access to watchdog register */
ret = regmap_read(wdt->rn5t618->regmap, RN5T618_WATCHDOG, &val);
if (ret)
return ret;
ret = regmap_write(wdt->rn5t618->regmap, RN5T618_WATCHDOG, val);

Will regmap_read always bypass the cache?
Will regmap_write always bypass the cache?
Then it is not needed to set 0x0b volatile.


> Replacing the various regmap_update_bits() with regmap_write() is just
> papering over the problem. To force the write to the chip even if nothing
> changed, regmap_write_bits() should be used instead of regmap_update_bits()
> where appropriate.

Will regmap_write_bits always bypass the cache?

If 0x13 is set volatile regmap_write_bits will always execute an I2C
read and write transmission?
This I2C read would be overhead in each ping(). The datasheet says an
IRQ can be reset by simply writing '0' to the corresponding bit.
Writing '1' to any other bits has no effect.
So I used regmap_write(wdt->rn5t618->regmap, RN5T618_PWRIRQ,
~RN5T618_PWRIRQ_IR_WDOG) to clear only the wdt irq bit and save a i2c
read transmission in each ping().


> That is probably also the core of the problem: rn5t618_wdt_start() calls
> regmap_update_bits(). Since no bit was changed, there is actually no write
> to the chip. Ultimately this causes a timeout since the watchdog is never
> stopped and the counter is never reset. To avoid that and to make
> sure that the counter is reset, it is probably sufficient to replace the
> regmap_update_bits() to enable the watchdog with regmap_write_bits().

You meant regmap_update_bits(wdt->rn5t618->regmap, RN5T618_PWRIRQ
(0x13), RN5T618_PWRIRQ_IR_WDOG, 0) in rn5t618_wdt_ping()?
No other function will reset the wdt irq status bit. And no other
function accesses the volatile RN5T618_PWRIRQ (0x13) register.
So regmap_update_bits RN5T618_PWRIREN (not RN5T618_PWRIRQ 0x13!)
usage in rn5t618_wdt_start is fine for me and the watchdog was stopped
as I stated!


> The read operation in rn5t618_wdt_ping() doesn't really do anything
> because the register value is cached. The read only happens to have
> a value to write.

So regmap_read(wdt->rn5t618->regmap, RN5T618_WATCHDOG, &val) is cached?
If it is cached and regmap_write(wdt->rn5t618->regmap,
RN5T618_WATCHDOG, val) is also cached then the counter will never be
reset.
But the ping is working and so one of them or both must be non cached.


> I can not follow your logic that that write could
> somehow be dangerous. It writes the value that is expected for the
> register. There is nothing dangerous about it. If there was indeed
> a problem, you would have to provide specifics; you can not just make
> such a claim without basis in fact.

My main point is the unnecessary additional I2C write or read
transmission. A single read or write of the register should be enough.

Dangerous is maybe somewhat overcautious. I2C is not the most reliable
thing, so if the write went wrong maybe the wdt will be disabled
without knowing.
So a read (+checking the enable bit) is in my opinion more defensive
than a read+write with zero checks. Only i2c transfer return code is
checked, not sure how reliable this is.


> The interrupt status register at 0x13 should probably be declared volatile
> to avoid "loss" of interrupt status for other drivers. Clearing the interrupt
> bit is only necessary to ensure that an expired interrupt is caught if the
> 'ping' comes late. The same is true for the start function. In normal operation
> it should not be needed.

Ok, I have the same thought. The datasheet says something about an
additional 1 second until the wdt really fires.
And maybe if the irq is cleared before this additional second is
expired the pending wdt reset is cleared.
So it makes sense to reset the irq status in every ping to gain some
extra time e.g. if the wdt timeout is set to the minimum of 1 second.


> Either case, if you want your code to be considered, you'll have to split
> it into two patches, one for mfd and one for watchdog, and you'll have to submit
> proper patches as outlined in Documentation/process/submitting-patches.rst.

As I am not an expert and do not want to break some foreign code, I
only wanted to share my findings and get feedback if they are wrong.
It would be great if it gets fixed for everyone. There are some
reports online which are maybe facing the same issue, but could not
find the real cause.

On Sat, Aug 13, 2022 at 12:08 AM Guenter Roeck <linux@roeck-us.net> wrote:
>
> On 8/12/22 02:56, Kegl Rohit wrote:
> > Hello!
> >
> > Our board uses the RN5T567 as /dev/watchdog source via i2c communication.
> > RN5T567 is using the rn5t618_wdt.c driver
> >
> > Our kernel has CONFIG_WATCHDOG_NOWAYOUT=y enabled
> >
> > # Starting the wdt works as expected
> > echo -n '1' > /dev/watchdog
> >
> > # Stopping the wdt works as expected no reboot will be issued
> > echo -n 'V' > /dev/watchdog
> >
> > # Starting the wdt again will enable the wdt again
> > # BUT while the wdt is triggered every second the system reboots
> > while true; do echo -n '1' > /dev/watchdog; sleep 1; done
> >
> > Digging deeper into the issue I could find out that the remap is
> > initialized to cache the register accesses RN5T618_WATCHDOG (0x0b) and
> > RN5T618_PWRIRQ (0x13).
> >
> > So I expect that because of this caching the IRQ status bit was never reset.
> > The status register must not be cached because its set by the RN5T567.
> > Also it is not ideal to cache the access to the watchdog register
> > which resets the counter via read write cycle.
> >
> > debugfs shows the regmap setting for these registers:
> > [root@imx7d /sys/kernel/debug]# cat regmap/0-0033/access
> > // third column means volatile yes or no
> > …
> > 0b: y y n n
> > …
> > 13: y y n n
> >
> > After marking these registers volatile, stopping the wdt and starting
> > again seems to work.
> >
> > Furthermore it is not necessary to do a RN5T618_WATCHDOG read AND
> > write cycle to reset the wdt counter.
> > The source code states:
> > /* The counter is restarted after a R/W access to watchdog register */
> >
> > The RN5T567 datasheet states:
> > “The count value of watchdog timer is cleared by accessing (R/W) to
> > this register.”
> >
> > Tests showed that a single read is enough. I did not check other chip
> > variants which use the same driver.
> >
> > In my opinion a write cycle is even dangerous if there is some strange
> > situation and the write cycle disables the wdt or changes the wdt
> > settings stored in this register.
> >
> > I still don't know why the irq status bit is cleared on every ping()
> > but I kept it there.
> > Attached is my patch tested with RN5T567.
>
> It is sent as attachment and thus unusable. In addition to that, it is
> not a proper patch.
>
> > Is the rn5t618_wdt.c driver maintained? Strange that such issue was
> > never noticed.
>
> I'd rather assume that your usage pattern is one that no one ever tried
> before because it is quite unusual. Why enable NOWAYOUT just to violate
> it continuously ?
>
> Anyway, I think you are a bit off track here.
>
> The register 0x13 should probably be volatile. However, register 0x0b
> is not updated by the chip and thus does not need to be volatile.
>
> Replacing the various regmap_update_bits() with regmap_write() is just
> papering over the problem. To force the write to the chip even if nothing
> changed, regmap_write_bits() should be used instead of regmap_update_bits()
> where appropriate.
> That is probably also the core of the problem: rn5t618_wdt_start() calls
> regmap_update_bits(). Since no bit was changed, there is actually no write
> to the chip. Ultimately this causes a timeout since the watchdog is never
> stopped and the counter is never reset. To avoid that and to make
> sure that the counter is reset, it is probably sufficient to replace the
> regmap_update_bits() to enable the watchdog with regmap_write_bits().
>
> The read operation in rn5t618_wdt_ping() doesn't really do anything
> because the register value is cached. The read only happens to have
> a value to write. I can not follow your logic that that write could
> somehow be dangerous. It writes the value that is expected for the
> register. There is nothing dangerous about it. If there was indeed
> a problem, you would have to provide specifics; you can not just make
> such a claim without basis in fact.
>
> The interrupt status register at 0x13 should probably be declared volatile
> to avoid "loss" of interrupt status for other drivers. Clearing the interrupt
> bit is only necessary to ensure that an expired interrupt is caught if the
> 'ping' comes late. The same is true for the start function. In normal operation
> it should not be needed.
>
> Either case, if you want your code to be considered, you'll have to split
> it into two patches, one for mfd and one for watchdog, and you'll have to submit
> proper patches as outlined in Documentation/process/submitting-patches.rst.
>
> Thanks,
> Guenter

diff --git a/drivers/mfd/rn5t618.c b/drivers/mfd/rn5t618.c
index 652a5e60067f..b414abcdfecb 100644
--- a/drivers/mfd/rn5t618.c
+++ b/drivers/mfd/rn5t618.c
@@ -47,8 +47,10 @@  static bool rn5t618_volatile_reg(struct device *dev, unsigned int reg)
 	case RN5T618_RTC_SECONDS ... RN5T618_RTC_YEAR:
 	case RN5T618_CHGSTATE:
 	case RN5T618_CHGCTRL_IRR ... RN5T618_CHGERR_MONI:
 	case RN5T618_CONTROL ... RN5T618_CC_AVEREG0:
+	case RN5T618_WATCHDOG: // should not be cached because of r/w to reset counter
+	case RN5T618_PWRIRQ:   // should not be cached because its set by hardware
 		return true;
 	default:
 		return false;
 	}
diff --git a/drivers/watchdog/rn5t618_wdt.c b/drivers/watchdog/rn5t618_wdt.c
index 6e524c8e26a8..83d3f006f812 100644
--- a/drivers/watchdog/rn5t618_wdt.c
+++ b/drivers/watchdog/rn5t618_wdt.c
@@ -75,8 +75,13 @@  static int rn5t618_wdt_start(struct watchdog_device *wdt_dev)
 	ret = rn5t618_wdt_set_timeout(wdt_dev, wdt_dev->timeout);
 	if (ret)
 		return ret;
 
+	/* Clear pending watchdog interrupt */
+	ret = regmap_write(wdt->rn5t618->regmap, RN5T618_PWRIRQ, ~RN5T618_PWRIRQ_IR_WDOG);
+	if (ret)
+		return ret;
+
 	/* enable repower-on */
 	ret = regmap_update_bits(wdt->rn5t618->regmap, RN5T618_REPCNT,
 				 RN5T618_REPCNT_REPWRON,
 				 RN5T618_REPCNT_REPWRON);
@@ -114,15 +119,10 @@  static int rn5t618_wdt_ping(struct watchdog_device *wdt_dev)
 	ret = regmap_read(wdt->rn5t618->regmap, RN5T618_WATCHDOG, &val);
 	if (ret)
 		return ret;
 
-	ret = regmap_write(wdt->rn5t618->regmap, RN5T618_WATCHDOG, val);
-	if (ret)
-		return ret;
-
 	/* Clear pending watchdog interrupt */
-	return regmap_update_bits(wdt->rn5t618->regmap, RN5T618_PWRIRQ,
-				  RN5T618_PWRIRQ_IR_WDOG, 0);
+	return regmap_write(wdt->rn5t618->regmap, RN5T618_PWRIRQ, ~RN5T618_PWRIRQ_IR_WDOG);
 }
 
 static const struct watchdog_info rn5t618_wdt_info = {
 	.options	= WDIOF_SETTIMEOUT | WDIOF_MAGICCLOSE |
-- 
2.30.2

rn5t618_wdt: WATCHDOG_NOWAYOUT not working

Commit Message

Comments

Patch