mbox series

[0/2] Make ELOG log and trace consistently with GHES

Message ID 20240527144356.246220-1-fabio.m.de.francesco@linux.intel.com
Headers show
Series Make ELOG log and trace consistently with GHES | expand

Message

Fabio M. De Francesco May 27, 2024, 2:43 p.m. UTC
When Firmware First is enabled, BIOS handles errors first and then it
makes them available to the kernel via the Common Platform Error Record
(CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
via one of two similar paths, either ELOG or GHES.

Currently, ELOG and GHES show some inconsistencies in how they print to
the kernel log as well as in how they report to userspace via trace
events.

Make the two mentioned paths act similarly for what relates to logging
and tracing.

--- Changes for v1 ---

	- Drop the RFC prefix and restart from PATCH v1
	- Drop patch 3/3 because a discussion on it has not yet been
	  settled
	- Drop namespacing in export of pci_print_aer while() (Dan)
	- Don't use '#ifdef' in *.c files (Dan)
	- Drop a reference on pdev after operation is complete (Dan)
	- Don't log an error message if pdev is NULL (Dan)

--- Changes for RFC v2 ---
	
	- 0/3: rework the subject line and the letter.
        - 1/3: no changes.
        - 2/3: trace CPER PCIe Section only if CONFIG_ACPI_APEI_PCIEAER
          is defined; the kernel test robot reported the use of two
          undefined symbols because the test for the config option was
          missing; rewrite the subject line and part of commit message.
        - 3/3: no changes.

Fabio M. De Francesco (2):
  ACPI: extlog: Trace CPER Non-standard Section Body
  ACPI: extlog: Trace CPER PCI Express Error Section

 drivers/acpi/acpi_extlog.c | 35 +++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/aer.c     |  2 +-
 include/linux/aer.h        |  9 ++++++---
 3 files changed, 42 insertions(+), 4 deletions(-)

Comments

Fabio M. De Francesco July 23, 2024, 1:43 p.m. UTC | #1
On Monday, May 27, 2024 4:43:39 PM GMT+2 Fabio M. De Francesco wrote:
> When Firmware First is enabled, BIOS handles errors first and then it
> makes them available to the kernel via the Common Platform Error Record
> (CPER) sections (UEFI 2.10 Appendix N). Linux parses the CPER sections
> via one of two similar paths, either ELOG or GHES.
> 
> Currently, ELOG and GHES show some inconsistencies in how they print to
> the kernel log as well as in how they report to userspace via trace
> events.
> 
> Make the two mentioned paths act similarly for what relates to logging
> and tracing.

Gentle ping.
Thanks,

Fabio

> --- Changes for v1 ---
> 
> 	- Drop the RFC prefix and restart from PATCH v1
> 	- Drop patch 3/3 because a discussion on it has not yet been
> 	  settled
> 	- Drop namespacing in export of pci_print_aer while() (Dan)
> 	- Don't use '#ifdef' in *.c files (Dan)
> 	- Drop a reference on pdev after operation is complete (Dan)
> 	- Don't log an error message if pdev is NULL (Dan)
> 
> --- Changes for RFC v2 ---
> 	
> 	- 0/3: rework the subject line and the letter.
>         - 1/3: no changes.
>         - 2/3: trace CPER PCIe Section only if CONFIG_ACPI_APEI_PCIEAER
>           is defined; the kernel test robot reported the use of two
>           undefined symbols because the test for the config option was
>           missing; rewrite the subject line and part of commit message.
>         - 3/3: no changes.
> 
> Fabio M. De Francesco (2):
>   ACPI: extlog: Trace CPER Non-standard Section Body
>   ACPI: extlog: Trace CPER PCI Express Error Section
> 
>  drivers/acpi/acpi_extlog.c | 35 +++++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/aer.c     |  2 +-
>  include/linux/aer.h        |  9 ++++++---
>  3 files changed, 42 insertions(+), 4 deletions(-)
> 
>
Dan Williams Aug. 6, 2024, 7:21 p.m. UTC | #2
Fabio M. De Francesco wrote:
> In extlog_print(), trace "Non-standard Section Body" reported by firmware
> to the OS via Common Platform Error Record (CPER) (UEFI v2.10 Appendix N
> 2.3) to add further debug information and so to make ELOG log
> consistently with ghes_do_proc() (GHES).

I think this description could be clearer, how about:

---

ghes_do_proc() has a catch-all for unknown or unhandled CPER formats
(UEFI v2.10 Appendix N 2.3), extlog_print() does not. This gap was
noticed by a RAS test that injected CXL protocol errors which were
notified to extlog_print() via the IOMCA (I/O Machine Check
Architecture) mechanism. Bring parity to the extlog_print() path by
including a similar trace_non_standard_event().

---

> 
> Cc: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Fabio M. De Francesco <fabio.m.de.francesco@linux.intel.com>
> ---
>  drivers/acpi/acpi_extlog.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/acpi/acpi_extlog.c b/drivers/acpi/acpi_extlog.c
> index f055609d4b64..e025ae390737 100644
> --- a/drivers/acpi/acpi_extlog.c
> +++ b/drivers/acpi/acpi_extlog.c
> @@ -179,6 +179,12 @@ static int extlog_print(struct notifier_block *nb, unsigned long val,
>  			if (gdata->error_data_length >= sizeof(*mem))
>  				trace_extlog_mem_event(mem, err_seq, fru_id, fru_text,
>  						       (u8)gdata->error_severity);
> +		} else {
> +			void *err = acpi_hest_get_payload(gdata);
> +
> +			trace_non_standard_event(sec_type, fru_id, fru_text,
> +						 gdata->error_severity, err,
> +						 gdata->error_data_length);
>  		}

...with the above changelog update the code change looks good to me, you
can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>