Message ID | cover.1718906288.git.mchehab+huawei@kernel.org |
---|---|
Headers | show |
Series | Fix CPER issues related to UEFI 2.9A Errata | expand |
On Thu, 20 Jun 2024 at 20:01, Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > The UEFI 2.9A errata makes clear how ARM processor type encoding should > be done: it is meant to be equal to Generic processor, using a bitmask. > > The current code assumes, for both generic and ARM processor types > that this is an integer, which is an incorrect assumption. > > Fix it. While here, also fix a compilation issue when using W=1. > > After the change, Kernel will properly decode receiving two errors at the same > message, as defined at UEFI spec: > > [ 75.282430] Memory failure: 0x5cdfd: recovery action for free buddy page: Recovered > [ 94.973081] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1 > [ 94.973770] {2}[Hardware Error]: event severity: recoverable > [ 94.974334] {2}[Hardware Error]: Error 0, type: recoverable > [ 94.974962] {2}[Hardware Error]: section_type: ARM processor error > [ 94.975586] {2}[Hardware Error]: MIDR: 0x000000000000cd24 > [ 94.976202] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x000000000000ab12 > [ 94.977011] {2}[Hardware Error]: error affinity level: 2 > [ 94.977593] {2}[Hardware Error]: running state: 0x1 > [ 94.978135] {2}[Hardware Error]: Power State Coordination Interface state: 4660 > [ 94.978884] {2}[Hardware Error]: Error info structure 0: > [ 94.979463] {2}[Hardware Error]: num errors: 3 > [ 94.979971] {2}[Hardware Error]: first error captured > [ 94.980523] {2}[Hardware Error]: propagated error captured > [ 94.981110] {2}[Hardware Error]: overflow occurred, error info is incomplete > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error > [ 94.982606] {2}[Hardware Error]: error_info: 0x000000000091000f > [ 94.983249] {2}[Hardware Error]: transaction type: Data Access > [ 94.983891] {2}[Hardware Error]: cache error, operation type: Data write > [ 94.984559] {2}[Hardware Error]: TLB error, operation type: Data write > [ 94.985215] {2}[Hardware Error]: cache level: 2 > [ 94.985749] {2}[Hardware Error]: TLB level: 2 > [ 94.986277] {2}[Hardware Error]: processor context not corrupted > > And the error code is properly decoded according with table N.17 from UEFI 2.10 > spec: > > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error > > Mauro Carvalho Chehab (3): > efi/cper: Adjust infopfx size to accept an extra space > efi/cper: Add a new helper function to print bitmasks > efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs > Hello Mauro, How this is v4 different from the preceding 3 revisions that you sent over the past 2 days? I would expect an experienced maintainer like yourself to be familiar with the common practice here: please leave some time between sending revisions so people can take a look. And if there is a pressing need to deviate from this rule, at least put an explanation in the commit log of how the series differs from the preceding one. Thanks, Ard.
On Fri, 21 Jun 2024 10:20:36 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> wrote: > On Thu, 20 Jun 2024 20:01:45 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > Sometimes it is desired to produce a single log line for errors. > > Add a new helper function for such purpose. > > > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > --- > > drivers/firmware/efi/cper.c | 59 +++++++++++++++++++++++++++++++++++++ > > include/linux/cper.h | 3 ++ > > 2 files changed, 62 insertions(+) > > > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > > index 7d2cdd9e2227..9bf27af3e870 100644 > > --- a/drivers/firmware/efi/cper.c > > +++ b/drivers/firmware/efi/cper.c > > @@ -106,6 +106,65 @@ void cper_print_bits(const char *pfx, unsigned int bits, > > printk("%s\n", buf); > > } > > > > +/* > > It's exported and in a header used by other code, so why not make > this kernel-doc? /** > > > + * cper_bits_to_str - return a string for set bits > > + * @buf: buffer to store the output string > > + * @buf_size: size of the output string buffer > > + * @bits: bit mask > > + * @strs: string array, indexed by bit position > > + * @strs_size: size of the string array: @strs > > If it had been kernel doc, W=1 would have told you mask is > missing. > > Passing a 0 for mask seems probably not worth while. > If all bits of the unsigned int are set then people can pass ~0 > > Or make this cper_bits_to_str_masked() and have > cper_bits_to_str() that doesn't take a mask. > Mask definitely needs docs as I misunderstood it :( Also needs to be contiguous I think which is also a bit unusual. > If you do that, some simplifications can be easily made. > > > > > + * > > + * add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits, > > + * add the corresponding string in @strs to @buf. > > + */ > > +char *cper_bits_to_str(char *buf, int buf_size, unsigned int bits, > > Perhaps make bits an unsigned long as then you can use the > for_each_set_bit() etc. > > > + const char * const strs[], unsigned int strs_size, > > + unsigned int mask) > > +{ > > + int i, size, first_bit; > > + int len = buf_size; > > + const char *start; > > + char *str = buf; > > + > > + if (strs_size < 16) > > + size = snprintf(str, len, "0x%02x: ", bits); > > + if (strs_size < 32) > > + size = snprintf(str, len, "0x%04x: ", bits); > > + > > + len -= size; > > + str += size; > > + > > + start = str; > > + > > + if (mask) { > > + first_bit = ffs(mask) - 1; > > + if (bits & ~mask) { > > + size = strscpy(str, "reserved bit(s)", len); > > + len -= size; > > + str += size; > > + } > > + } else { > > + first_bit = 0; > > + } > Might be worth > > bits = bits & mask; > > Obviously setting bits that aren't in the mask is > odd though so maybe a warning print if that happens? > > + > > > for_each_bit_set(i, &bits, strs_size) { Ah. I'd missed the offset and gotten function name wrong. i = first_bit for_each_set_bit_from(i, &bits, strs_size + first_bit) and look up based on i - first_bit > ... > > } > > > + for (i = 0; i < strs_size; i++) { > > + if (!(bits & (1U << (i + first_bit)))) > > + continue; > > + > > + if (*start && len > 0) { > > + *str = '|'; > > + len--; > > + str++; > > + } > > + > > + size = strscpy(str, strs[i], len); > > + len -= size; > > + str += size; > > + } > > + return buf; > > +} > > +EXPORT_SYMBOL_GPL(cper_bits_to_str); > > + > > static const char * const proc_type_strs[] = { > > "IA32/X64", > > "IA64", > > diff --git a/include/linux/cper.h b/include/linux/cper.h > > index 265b0f8fc0b3..856e8f00a7fb 100644 > > --- a/include/linux/cper.h > > +++ b/include/linux/cper.h > > @@ -584,6 +584,9 @@ const char *cper_mem_err_type_str(unsigned int); > > const char *cper_mem_err_status_str(u64 status); > > void cper_print_bits(const char *prefix, unsigned int bits, > > const char * const strs[], unsigned int strs_size); > > +char *cper_bits_to_str(char *buf, int buf_size, unsigned int bits, > > + const char * const strs[], unsigned int strs_size, > > + unsigned int mask); > > void cper_mem_err_pack(const struct cper_sec_mem_err *, > > struct cper_mem_err_compact *); > > const char *cper_mem_err_unpack(struct trace_seq *, >
Em Fri, 21 Jun 2024 10:20:36 +0100 Jonathan Cameron <Jonathan.Cameron@Huawei.com> escreveu: > On Thu, 20 Jun 2024 20:01:45 +0200 > Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote: > > > Sometimes it is desired to produce a single log line for errors. > > Add a new helper function for such purpose. > > > > Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org> > > --- > > drivers/firmware/efi/cper.c | 59 +++++++++++++++++++++++++++++++++++++ > > include/linux/cper.h | 3 ++ > > 2 files changed, 62 insertions(+) > > > > diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c > > index 7d2cdd9e2227..9bf27af3e870 100644 > > --- a/drivers/firmware/efi/cper.c > > +++ b/drivers/firmware/efi/cper.c > > @@ -106,6 +106,65 @@ void cper_print_bits(const char *pfx, unsigned int bits, > > printk("%s\n", buf); > > } > > > > +/* > > It's exported and in a header used by other code, so why not make > this kernel-doc? /** I tried to preserve the original non-kernel-doc way, as I'm not sure why other comments on this file are not marked as kernel-doc stuff. The code there at cper.c also has other coding style issues - for instance it uses printk() without an error level. Anyway, I intend to submit later on a separate patch series converting the existing function documentation stuff to kernel-doc (and adding to Documentation if not there already), and maybe addressing some other coding style issues. Yet, I would prefer to have such changes out of this fix patch series. > > + * cper_bits_to_str - return a string for set bits > > + * @buf: buffer to store the output string > > + * @buf_size: size of the output string buffer > > + * @bits: bit mask > > + * @strs: string array, indexed by bit position > > + * @strs_size: size of the string array: @strs > > If it had been kernel doc, W=1 would have told you mask is > missing. Yeah, I saw that just after hitting send :-) I'll add mask at v5. > Passing a 0 for mask seems probably not worth while. > If all bits of the unsigned int are set then people can pass ~0 Makes sense. > > Or make this cper_bits_to_str_masked() and have > cper_bits_to_str() that doesn't take a mask. > > If you do that, some simplifications can be easily made. > > > > > + * > > + * add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits, > > + * add the corresponding string in @strs to @buf. > > + */ > > +char *cper_bits_to_str(char *buf, int buf_size, unsigned int bits, > > Perhaps make bits an unsigned long as then you can use the > for_each_set_bit() etc. Ok. > > > + const char * const strs[], unsigned int strs_size, > > + unsigned int mask) > > +{ > > + int i, size, first_bit; > > + int len = buf_size; > > + const char *start; > > + char *str = buf; > > + > > + if (strs_size < 16) > > + size = snprintf(str, len, "0x%02x: ", bits); > > + if (strs_size < 32) > > + size = snprintf(str, len, "0x%04x: ", bits); > > + > > + len -= size; > > + str += size; > > + > > + start = str; > > + > > + if (mask) { > > + first_bit = ffs(mask) - 1; > > + if (bits & ~mask) { > > + size = strscpy(str, "reserved bit(s)", len); > > + len -= size; > > + str += size; > > + } > > + } else { > > + first_bit = 0; > > + } > Might be worth > > bits = bits & mask; No need to to that if we use for_each_set_bit(). > > Obviously setting bits that aren't in the mask is > odd though so maybe a warning print if that happens? The code already warns about that printing: "reserved bit(s)" at the output buffer. Regards, Mauro
Hi Ard, Em Fri, 21 Jun 2024 09:45:16 +0200 Ard Biesheuvel <ardb@kernel.org> escreveu: > On Thu, 20 Jun 2024 at 20:01, Mauro Carvalho Chehab > <mchehab+huawei@kernel.org> wrote: > > > > The UEFI 2.9A errata makes clear how ARM processor type encoding should > > be done: it is meant to be equal to Generic processor, using a bitmask. > > > > The current code assumes, for both generic and ARM processor types > > that this is an integer, which is an incorrect assumption. > > > > Fix it. While here, also fix a compilation issue when using W=1. > > > > After the change, Kernel will properly decode receiving two errors at the same > > message, as defined at UEFI spec: > > > > [ 75.282430] Memory failure: 0x5cdfd: recovery action for free buddy page: Recovered > > [ 94.973081] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1 > > [ 94.973770] {2}[Hardware Error]: event severity: recoverable > > [ 94.974334] {2}[Hardware Error]: Error 0, type: recoverable > > [ 94.974962] {2}[Hardware Error]: section_type: ARM processor error > > [ 94.975586] {2}[Hardware Error]: MIDR: 0x000000000000cd24 > > [ 94.976202] {2}[Hardware Error]: Multiprocessor Affinity Register (MPIDR): 0x000000000000ab12 > > [ 94.977011] {2}[Hardware Error]: error affinity level: 2 > > [ 94.977593] {2}[Hardware Error]: running state: 0x1 > > [ 94.978135] {2}[Hardware Error]: Power State Coordination Interface state: 4660 > > [ 94.978884] {2}[Hardware Error]: Error info structure 0: > > [ 94.979463] {2}[Hardware Error]: num errors: 3 > > [ 94.979971] {2}[Hardware Error]: first error captured > > [ 94.980523] {2}[Hardware Error]: propagated error captured > > [ 94.981110] {2}[Hardware Error]: overflow occurred, error info is incomplete > > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error > > [ 94.982606] {2}[Hardware Error]: error_info: 0x000000000091000f > > [ 94.983249] {2}[Hardware Error]: transaction type: Data Access > > [ 94.983891] {2}[Hardware Error]: cache error, operation type: Data write > > [ 94.984559] {2}[Hardware Error]: TLB error, operation type: Data write > > [ 94.985215] {2}[Hardware Error]: cache level: 2 > > [ 94.985749] {2}[Hardware Error]: TLB level: 2 > > [ 94.986277] {2}[Hardware Error]: processor context not corrupted > > > > And the error code is properly decoded according with table N.17 from UEFI 2.10 > > spec: > > > > [ 94.981893] {2}[Hardware Error]: error_type: 0x0006: cache error|TLB error > > > > Mauro Carvalho Chehab (3): > > efi/cper: Adjust infopfx size to accept an extra space > > efi/cper: Add a new helper function to print bitmasks > > efi/cper: align ARM CPER type with UEFI 2.9A/2.10 specs > > > > Hello Mauro, > > How this is v4 different from the preceding 3 revisions that you sent > over the past 2 days? > > I would expect an experienced maintainer like yourself to be familiar > with the common practice here: please leave some time between sending > revisions so people can take a look. And if there is a pressing need > to deviate from this rule, at least put an explanation in the commit > log of how the series differs from the preceding one. Sorry, I'll add a version review on that. Basically I was missing a test environment to do error injection. When I got it enabled, and fixed to cope with UEFI 2.9A/2.10 expected behavior, I was able to discover some issues and to do some code improvements. v1: - (tagged as RFC) was mostly to give a heads up that the current implementation is not following the spec. It also touches only cper code. v2: - It fixes the way printks are handled on both cper_arm and ghes drivers; v3: - It adds a helper function to produce a buffer describing the error bits at cper's printk and ghes pr_warn_bitrated. It also fixes a W=1 error while building cper; v4: - The print function had some bugs on it, which was discovered with the help of an error injection tool I'm now using. I have already another version ready to send. It does some code cleanup and address the issues pointed by Tony and Jonathan. If you prefer, I can hold it until Monday to give you some time to look at it. Thanks, Mauro