Message ID: 20250106121017.1620-5-shiju.jose@huawei.com
State: New
Series: EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers
On Mon, Jan 06, 2025 at 12:10:00PM +0000, shiju.jose@huawei.com wrote:
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
> +What: /sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel

So this is new. I don't remember seeing that when I looked at your patches the
last time.

Looks like you have all those attributes and now you've decided to add a min
and max for each one, in addition.

And UI-wise it is a madness as there are gazillion single-value files now.

"Attributes should be ASCII text files, preferably with only one value per
file. It is noted that it may not be efficient to contain only one value per
file, so it is socially acceptable to express an array of values of the same
type."

So you don't need those - you can simply express each attribute as a range:

echo "1:2" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

or if you wanna scrub only one bank:

echo "1:1" > /sys/bus/edac/devices/<dev-name>/mem_repairX/bank

What is the use case of that thing? Someone might find it useful so let's add
it preemptively?

Pfff.
On Thu, Jan 09, 2025 at 11:00:43AM +0000, Shiju Jose wrote:
> The min_ and max_ attributes of the control attributes are added for your
> feedback on V15 to expose supported ranges of these control attributes to
> the user, in the following links.

Sure, but you can make that differently:

cat /sys/bus/edac/devices/<dev-name>/mem_repairX/bank
[x:y]

which is the allowed range. echo ... then writes in the bank.

> ... so we would propose we do not add max_ and min_ for now and see how the
> use cases evolve.

Yes, you should apply that same methodology to the rest of the new features
you're adding: only add functionality for the stuff that is actually being
used now. You can always extend it later.

Changing an already user-visible API is a whole different story and a lot lot
harder, even impossible.

So I'd suggest you prune the EDAC patches from all the hypothetical usage and
then send only what remains so that I can try to queue them.

Thx.
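A minimal sketch, in kernel-style C, of the scheme suggested in this message:
a single "bank" attribute whose read reports the allowed range and whose write
sets the bank to operate on. The mem_repair_ctx context struct and its fields
are hypothetical, for illustration only, not the actual EDAC series code.

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/sysfs.h>

/* Hypothetical per-feature context, for illustration only. */
struct mem_repair_ctx {
	u32 min_bank;
	u32 max_bank;
	u32 bank;
};

/* Read: report the allowed range, e.g. "[0:31]". */
static ssize_t bank_show(struct device *dev, struct device_attribute *attr,
			 char *buf)
{
	struct mem_repair_ctx *ctx = dev_get_drvdata(dev);

	return sysfs_emit(buf, "[%u:%u]\n", ctx->min_bank, ctx->max_bank);
}

/* Write: accept a single bank number, rejecting out-of-range values. */
static ssize_t bank_store(struct device *dev, struct device_attribute *attr,
			  const char *buf, size_t len)
{
	struct mem_repair_ctx *ctx = dev_get_drvdata(dev);
	u32 bank;
	int ret;

	ret = kstrtou32(buf, 0, &bank);
	if (ret)
		return ret;

	if (bank < ctx->min_bank || bank > ctx->max_bank)
		return -EINVAL;

	ctx->bank = bank;
	return len;
}
static DEVICE_ATTR_RW(bank);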
On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> Ok. To me the fact it's not a single write was relevant. Seems not
> in your mental model of how this works. For me a single write
> that you cannot query back is fine, setting lots of parameters and
> being unable to query any of them less so. I guess you disagree.

Why can't you query it back?

grep -r . /sysfs/dir/

All files' values have been previously set and should still be there on a
read, I'd strongly hope. Your ->read routines should give the values back.

> In interests of progress I'm not going to argue further. No one is
> going to use this interface by hand anyway so the loss of usability
> I'm seeing doesn't matter a lot.

I had the suspicion that this user interface is not really going to be used by
a user but by a tool. But then if you don't have a tool, you're lost.

This is one of the reasons why you can control ftrace directly on the shell
too - without a tool. This is very useful in certain cases where you cannot
run some userspace tools.

> In at least the CXL case I'm fairly sure most of them are not discoverable.
> Until you see errors you have no idea what the memory topology is.

Ok.

> For that you'd need to have a path to read back what happened.

So how is this scrubbing going to work? You get an error, you parse it for all
the attributes and you go and write those attributes into the scrub interface
and it starts scrubbing?

But then why do you even need the interface at all?

Why can't the kernel automatically collect all those attributes and start the
scrubbing automatically - no need for any user interaction...?

So why do you *actually* even need user interaction here and why can't the
kernel be smart enough to start the scrub automatically?

> Ok. Then can we just drop the range discoverability entirely or we go with
> your suggestion and do not support read back of what has been
> requested but instead have the reads return a range if known or "" /
> return -EOPNOTSUPP if simply not known?

Probably.

> I can live with that though to me we are heading in the direction of
> a less intuitive interface to save a small number of additional files.

This is not the point. I already alluded to this earlier - we're talking about
a user visible interface which, once it goes out, is cast in stone forever.

So those files better have a good reason to exist...

And if we're not sure yet, we can upstream only those which are fine now and
then continue discussing the rest.

HTH.
On Thu, 9 Jan 2025 17:19:02 +0100
Borislav Petkov <bp@alien8.de> wrote:

> On Thu, Jan 09, 2025 at 04:01:59PM +0000, Jonathan Cameron wrote:
> > Ok. To me the fact it's not a single write was relevant. Seems not
> > in your mental model of how this works. For me a single write
> > that you cannot query back is fine, setting lots of parameters and
> > being unable to query any of them less so. I guess you disagree.
>
> Why can't you query it back?
>
> grep -r . /sysfs/dir/
>
> All files' values have been previously set and should still be there on
> a read, I'd strongly hope. Your ->read routines should give the values back.

Today you can. Seems we are talking cross purposes.

I'm confused. I thought your proposal was for "bank" attribute to present an
allowed range on read. "bank" attribute is currently written to and read back
as the value of the bank on which to conduct a repair. Maybe this disconnect
is down to the fact max_ and min_ attributes should have been marked as RO in
the docs. They aren't controls, just presentation of limits to userspace.

Was the intent a separate bank_range type attribute rather than max_bank,
min_bank? One of those would be absolutely fine (similar to the _available
attributes in IIO - I added those years ago to meet a similar need and we've
never had any issues with those).

> > In interests of progress I'm not going to argue further. No one is
> > going to use this interface by hand anyway so the loss of usability
> > I'm seeing doesn't matter a lot.
>
> I had the suspicion that this user interface is not really going to be used by
> a user but by a tool. But then if you don't have a tool, you're lost.
>
> This is one of the reasons why you can control ftrace directly on the shell
> too - without a tool. This is very useful in certain cases where you cannot
> run some userspace tools.

I fully agree. What I was saying was in response to me thinking you wanted it
to not be possible to read back the user set values (overlapping uses of a
single bank attribute which wasn't what you meant). That is useful for a user
wanting to do the cat /sys/... that you mention above, but not vital if they
are directly reading the tracepoints for the error records and poking the
sysfs interface. Given it seems I misunderstood that suggestion, ignore my
reply to that as irrelevant.

> > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > Until you see errors you have no idea what the memory topology is.
>
> Ok.
>
> > For that you'd need to have a path to read back what happened.
>
> So how is this scrubbing going to work? You get an error, you parse it for all
> the attributes and you go and write those attributes into the scrub interface
> and it starts scrubbing?

Repair not scrubbing. They are different things we should keep separate,
scrub corrects the value, if it can, but doesn't change the underlying memory
to new memory cells to avoid repeated errors. Replacing scrub with repair
(which I think was the intent here)...

You get error records that describe the error seen in hardware, write back the
values into this interface and tell it to repair the memory. This is not
necessarily a synchronous or immediate thing - instead typically based on
trend analysis.

As an example, the decision might be that bit of ram threw up 3 errors over a
month including multiple system reboots (for other reasons) and that is over
some threshold so we use a spare memory line to replace it.

>
> But then why do you even need the interface at all?
>
> Why can't the kernel automatically collect all those attributes and start the
> scrubbing automatically - no need for any user interaction...?
>
> So why do you *actually* even need user interaction here and why can't the
> kernel be smart enough to start the scrub automatically?

Short answer, it needs to be very smart and there isn't a case of one size
fits all - hence suggested approach of making it a user space problem.

There are hardware autonomous solutions and ones handled by host firmware.
That is how repair is done in many servers - at most software sees a slight
latency spike as the memory is repaired under the hood. Some CXL devices will
do this as well. Those CXL devices may provide an additional repair interface
for the less clear cut decisions that need more data processing / analysis
than the device firmware is doing. Other CXL devices will take the view the OS
is best placed to make all the decisions - those sometimes will give a
'maintenance needed' indication in the error records but that is still a hint
the host may or may not take any notice of.

Given in the systems being considered here, software is triggering the repair,
we want to allow for policy in the decision.

In simple cases we could push that policy into the kernel e.g. just repair
the moment we see an error record.

These repair resources are very limited in number, so immediately repairing
may be a bad idea. We want to build up a history of errors before making
such a decision. That can be done in kernel.

The decision to repair memory is heavily influenced by policy and time
considerations against device resource constraints.

Some options that are hard to do in kernel.

1. Typical asynchronous error report for a corrected error.

Tells us memory had an error (perhaps from a scrubbing engine on the device
running checks). No need to take action immediately. Instead build up more data
over time and if lots of errors occur make the decision to repair as now we are
sure it is worth doing rather than a single random event. We may tune scrubbing
engines to check this memory more frequently and adjust our data analysis to
take that into account for setting thresholds etc. When an admin considers it a
good time to take action, offline the memory and repair before bringing it back
into use (sometimes by rebooting the machine). Sometimes repair can be
triggered in a software transparent way, sometimes not. This also applies to
uncorrectable errors though in that case you can't necessarily repair it
without ever seeing a synchronous poison with all the impacts that has.

2. Soft repair across boots. We are actually storing the error records, then only
applying the fix on reboot before using the memory - so maintaining a list
of bad memory and saving it to a file to read back on boot. We could provide
another kernel interface to get this info and reinject it after reboot instead
of doing it in userspace but that is another ABI to design.

3. Complex policy across fleets. A lot of work is going on around prediction
techniques that may change the local policy on each node dependent on the
overall reliability patterns of a particular batch of devices and local
characteristics, service guarantees etc. If it is hard repair, then once
you've run out you need to schedule an engineer out to replace the DIMM.
All complex inputs to the decision.

Similar cases like CPU offlining on repeated errors are done in userspace
(e.g. RAS Daemon) for similar reasons of long term data gathering and
potentially complex algorithms.
>
> > Ok. Then can we just drop the range discoverability entirely or we go with
> > your suggestion and do not support read back of what has been
> > requested but instead have the reads return a range if known or "" /
> > return -EOPNOTSUPP if simply not known?
>
> Probably.

Too many options in the above paragraph so just to check... Probably to which?
If it's a separate attribute from the one we write the control to, then we do
what is already done here and don't present the interface at all if the range
isn't discoverable.

> > I can live with that though to me we are heading in the direction of
> > a less intuitive interface to save a small number of additional files.
>
> This is not the point. I already alluded to this earlier - we're talking about
> a user visible interface which, once it goes out, is cast in stone forever.
>
> So those files better have a good reason to exist...
>
> And if we're not sure yet, we can upstream only those which are fine now and
> then continue discussing the rest.

Ok. Best path is drop the available range support then (so no min_ max_ or
anything to replace them for now).

Added bonus is we don't have to rush this conversation and can make sure we
come to the right solution driven by use cases.

Jonathan

> HTH.
>
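For contrast with the earlier sketch, the separate read-only range attribute
Jonathan mentions (the IIO _available-style option) could look roughly like
the following, reusing the same hypothetical mem_repair_ctx struct and
includes from the previous sketch; again illustrative only, not the actual
series code.

/* Read-only companion to the writable "bank" control, e.g. "0 31". */
static ssize_t bank_range_show(struct device *dev,
			       struct device_attribute *attr, char *buf)
{
	struct mem_repair_ctx *ctx = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%u %u\n", ctx->min_bank, ctx->max_bank);
}
static DEVICE_ATTR_RO(bank_range);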
Jonathan Cameron wrote:
> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).

I think less is more in this case.

The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too wide
for userspace to have a chance at writing a competent tool. At least I am
struggling with where to even begin with those ABIs if I was asked to write a
tool. Does a tool already exist for those?

Some questions that read on those ABIs are:

1/ What if the platform has translation between HPA (CXL decode) and SPA
(physical addresses reported in trace points that PIO and DMA see)?

2/ What if memory is interleaved across repair domains?

3/ What if the device does not use DDR terminology / topology terms for
repair?

I expect the flow rasdaemon would want is that the current PFA (leaky bucket
Pre-Failure Analysis) decides that the number of soft-offlines it has
performed exceeds some threshold and it wants to attempt to repair memory.
However, what is missing today for volatile memory is that some failures can
be repaired with in-band writes and some failures need heavier hammers like
Post-Package-Repair to actively swap in whole new banks of memory. So don't we
need something like "soft-offline-undo" on the way to PPR?

So, yes, +1 to simpler for now where software effectively just needs to deal
with a handful of "region repair" buttons and the semantics of those are
coarse and sub-optimal. Wait for a future where a tool author says, "we have
had good success getting bulk offlined pages back into service, but now we
need this specific finer grained kernel interface to avoid wasting spare banks
prematurely".

Anything more complex than a set of /sys/devices/system/memory/ devices has a
/sys/bus/edac/devices/devX/repair button, feels like a generation ahead of
where the initial sophistication needs to lie.

That said, I do not closely follow ras tooling to say whether someone has
already identified the critical need for a fine grained repair ABI?
On Thu, 9 Jan 2025 15:51:39 -0800
Dan Williams <dan.j.williams@intel.com> wrote:

> Jonathan Cameron wrote:
> > Ok. Best path is drop the available range support then (so no min_ max_ or
> > anything to replace them for now).
>
> I think less is more in this case.

A few notes before I get to specific questions.

Key in the discussion that follows is that the 'repair' is separate from the
'decision to repair'. They mostly need different information, all of which is
in the error trace points. A lot of the questions are about the 'decision to
repair' part, not the repair itself.

Critical point is that on some devices (all CXL ones) there may be no direct
way to discover the mapping from HPA/SPA/DPA to bank, row etc. other than the
error record. The memory mapping functions are an internal detail not exposed
on any interfaces. Odd though it may seem, those mapping functions are
considered confidential enough that manufacturers don't always publish them
(though I believe they are fairly easy to reverse engineer) - I know a team
whose job involves designing those. Anyhow, short of the kernel or RAS Daemon
carrying a look up table of all known devices (no support for new ones until
they are added) we can't do a reverse map from DPA etc. to bank.

There are complex ways round this like storing the mappings when issuing an
error record, to build up the necessary reverse map, but that would have to be
preserved across boot. These errors tend not to be frequent, so cross reboot /
kexec etc. need to be incorporated.

PPR on CXL does use DPA, but memory sparing commands are meant to supersede
that interface (the reason for that is perhaps bordering on consortium
confidential, but let's say it doesn't work well for some cases). Memory
sparing does not use DPA.

I'd advise mostly ignoring PPR and looking at memory sparing in the CXL spec
if you want to consider an example. For PPR DPA is used (there is an HPA
option that might be available). DPA is still needed for on boot soft repair
(or we could delay that until regions are configured, but then we'd need to do
DPA to HPA mapping as that will be different on a new config, and then write
HPA for the kernel to map it back to DPA).

> The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too
> wide for userspace to have a chance at writing a competent tool. At
> least I am struggling with where to even begin with those ABIs if I was
> asked to write a tool. Does a tool already exist for those?

There is little choice on that - those are the controls for this type of
repair. If we do something like a magic 'key' based on a concatenation of
those values we need to define that description to replace a clean self
describing interface. I'm not 100% against that but I think it would be a
terrible interface design and I don't think anyone is really in favor of it.

All a userspace tool does is read the error record fields of exactly those
names. From that it will log data (already happening under those names in RAS
daemon alongside HPA / DPA). Then, in the simplest case, a threshold is passed
and we write those values to the repair interface.

There is zero need in that simple case for these to be understood at all. You
can think of them as a complex key but divided into well defined fields. For
more complex decision algorithms, that structure info may be needed to make
the decision. As a dumb example, maybe certain banks are more error prone on a
batch of devices so we need a lower threshold before repairing.
Simplest case is maybe 20-30 lines of code looping over the result of an SQL
query on the RASDaemon DB and writing the values to the files. Not the most
challenging userspace tool. The complexity is in the analysis of the errors,
not this part. I don't think we bothered doing this one yet in rasdaemon
because we considered it obvious enough that an example wasn't needed.
(Mauro / Shiju, does that estimate sound reasonable?) We would need a couple
of variants but those map 1:1 with the variants of error record parsing and
logging RAS Daemon already has.

> Some questions that read on those ABIs are:
>
> 1/ What if the platform has translation between HPA (CXL decode) and SPA
> (physical addresses reported in trace points that PIO and DMA see)?

See later for discussion of other interfaces. This is assuming the
repair key is not HPA (it is on some systems / situations) - if it's
the repair key then that is easily dealt with.

HPA / SPA more or less irrelevant for repair itself, they are relevant
for the decision to repair. In the 'on reboot' soft repair case they may
not even exist at the time of repair as it would be expected to happen
before we've brought up a region (to get the RAM into a good state at boot).

For cases where the memory decoders are configured and so there is an HPA
to DPA mapping:
The trace reports provide both all these somewhat magic values and
the HPA. Thus we can do the HPA aware stuff on that before then looking
up the other bit of the appropriate error reports to get the bank row etc.

> 2/ What if memory is interleaved across repair domains?

Also not relevant to a repair control and only a little relevant to the
decision to repair. The domains would be handled separately but if
we have to offline a chunk of memory to do it (needed for repair
on some devices) that may be a bigger chunk if fine grained interleave
in use and that may affect the decision.

> 3/ What if the device does not use DDR terminology / topology terms for
> repair?

Then we provide the additional interfaces assuming they correspond to well
known terms. If someone is using a magic key then we can get grumpy
with them, but that can also be supported.

Mostly I'd expect a new technology to overlap a lot of the existing
interface and maybe add one or two more; which layer in the stack for
HBM for instance.

The main alternative is where the device takes an HPA / SPA / DPA. We have one
driver that does that queued up behind this series that uses HPA. PPR uses
DPA. In that case userspace first tries to see if it can repair by HPA then
DPA and if not moves on to see if it can use the fuller description.
We will see devices supporting HPA / DPA (which to use depends on when you
are doing the operation and what has been configured) but mostly I'd expect
either HPA/DPA or fine grained on a given repair instance.

HPA only works if the address decoders are always configured (so not on CXL).
What is actually happening in that case is typically that a firmware is
involved that can look up address decoders etc, and map the control HPA
to Bank / row etc to issue the actual low level commands. This keeps
the memory mapping completely secret rather than exposing it in error
records.

> I expect the flow rasdaemon would want is that the current PFA (leaky
> bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> has performed exceeds some threshold and it wants to attempt to repair
> memory.
Sparing may happen prior to the point where we'd have done a soft offline if
non disruptive (whether it is can be read from another bit of the ABI). Memory
repair might be much less disruptive than soft-offline! I rather hope memory
manufacturers build that, but I'm aware of at least one case where they didn't
and the memory must be offline.

> However, what is missing today for volatile memory is that some failures
> can be repaired with in-band writes and some failures need heavier
> hammers like Post-Package-Repair to actively swap in whole new banks of
> memory. So don't we need something like "soft-offline-undo" on the way
> to PPR?

Ultimately we may do. That discussion was in one of the earlier threads on the
more heavyweight case of recovery from poison (unfortunately I can't find the
thread) - the ask was for example code so that the complexity could be weighed
against the demand for this sort of live repair, or a lesser version where
repair can only be done once a region is offline (and parts poisoned).

However, there are other usecases where this isn't needed which is why
that isn't a precursor for this series.

Initial enablement targets two situations:
1) Repair can be done in a non disruptive way - no need to soft offline at all.
2) Repair can be done at boot before memory is onlined or on admin
action to take the whole region offline, then repair specific chunks of
memory before bringing it back online.

> So, yes, +1 to simpler for now where software effectively just needs to
> deal with a handful of "region repair" buttons and the semantics of
> those are coarse and sub-optimal. Wait for a future where a tool author
> says, "we have had good success getting bulk offlined pages back into
> service, but now we need this specific finer grained kernel interface to
> avoid wasting spare banks prematurely".

Depends on where you think that interface is. I can absolutely see that
as a control to RAS Daemon. Option 2 above, region is offline, repair
all dodgy looking fine grained buckets.

Note though that a suboptimal repair may mean permanent use of very rare
resources. So there needs to be a control at the finest granularity as well.
Which order those get added to userspace tools doesn't matter to me.

If you mean that interface in the kernel, it brings some non trivial
requirements. The kernel would need all of:
1) Tracking interface for all error records so the reverse map from region
to specific bank / row etc is available for a subset of entries. The
kernel would need to know which of those are important (soft offline
might help in that use case, otherwise that means decision algorithms
are in kernel or we have fine grained queue for region repair in parallel
with soft-offline).
2) A way to inject the reverse map information from a userspace store
(to deal with reboot etc).

That sounds a lot harder to deal with than relying on the userspace program
that already does the tracking across boots.
I'm fully on board with simple interfaces for common cases like repair the bad
memory in this region. I'm just strongly against moving the complexity of
doing that into the kernel.

Jonathan
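The "20-30 lines looping over an SQL query" flow Jonathan describes could look
roughly like the userspace sketch below: take the fields a daemon logged from
the error record and write them back into the per-feature repair attributes.
The device directory name, the exact attribute set and the "repair" trigger
file name are assumptions for illustration, not the actual ABI or rasdaemon
code.

#include <stdio.h>
#include <stdlib.h>

static int write_attr(const char *dir, const char *attr, long long val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", dir, attr);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%lld\n", val);
	return fclose(f);
}

int main(void)
{
	/*
	 * Values as a daemon would have logged them from an error record;
	 * the device/feature directory name here is made up.
	 */
	const char *dir = "/sys/bus/edac/devices/cxl_mem0/mem_repair0";
	const struct { const char *attr; long long val; } rec[] = {
		{ "bank_group", 1 }, { "bank", 3 }, { "rank", 0 },
		{ "row", 0x2a00 }, { "column", 17 }, { "channel", 0 },
	};

	for (size_t i = 0; i < sizeof(rec) / sizeof(rec[0]); i++)
		if (write_attr(dir, rec[i].attr, rec[i].val))
			return EXIT_FAILURE;

	/* The name of the trigger attribute ("repair") is an assumption. */
	return write_attr(dir, "repair", 1) ? EXIT_FAILURE : EXIT_SUCCESS;
}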
Jonathan Cameron wrote: > On Thu, 9 Jan 2025 15:51:39 -0800 > Dan Williams <dan.j.williams@intel.com> wrote: > > > Jonathan Cameron wrote: > > > Ok. Best path is drop the available range support then (so no min_ max_ or > > > anything to replace them for now). > > > > I think less is more in this case. > > A few notes before I get to specific questions. > > Key in the discussion that follows is that the 'repair' is a separate > from the 'decision to repair'. They mostly need different information > all of which is in the error trace points. A lot of the questions > are about the 'decision to repair' part not the repair itself. > [snipped the parts I agree with] > I'd advise mostly ignoring PPR and looking at memory sparing in > the CXL spec if you want to consider an example. For PPR DPA is used > (there is an HPA option that might be available). DPA is still needed > for on boot soft repair (or we could delay that until regions configured, > but then we'd need to do DPA to HPA mapping as that will be different > on a new config, and then write HPA for the kernel to map it back to DPA. This is helpful because I was indeed getting lost in what kind of "repair" was being discussed in the thread. Ok, lets focus on sparing commands. > > > > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too > > wide for userspace to have a chance at writing a competent tool. At > > least I am struggling with where to even begin with those ABIs if I was > > asked to write a tool. Does a tool already exist for those? > > There is little choice on that - those are the controls for this type > of repair. If we do something like a magic 'key' based on a concatenation > of those values we need to define that description to replace a clean > self describing interface. I'm not 100% against that but I think it would > be a terrible interface design and I don't think anyone is really in favor of it. > > All a userspace tool does is read the error record fields of > exactly those names. From that it will log data (already happening under > those names in RAS daemon alongside HPA/ DPA). Then, in simplest case, > a threshold is passed and we write those values to the repair interface. > > There is zero need in that simple case for these to be understood at all. This is where you lose me. The error record is a point in time snapshot of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security model for memory operations is based on coordinating with the kernel's understanding of how that SPA is *currently* being used. The kernel can not just take userspace's word for it that potentially data changing, or temporary loss-of-use operations are safe to execute just because once upon a time userspace saw an error record that implicated a given SPA in the past, especially over reboot. The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot events. It follows that the kernel has a security/integrity interest in declining to act on invalidated tuples. This is solvable within a single boot as the kernel can cache the error records and userspace can ask to "trigger sparing based on cached record X". For the reboot case when the error record cache is flushed the kernel needs a reliable way to refill that cache, not an ABI for userspace to say "take my word for it, this *should* be safe". 
[snipped the explanation of replaying the old trace record parameters data
back through sysfs, because that is precisely the hang up I have with the
proposal]

> > Some questions that read on those ABIs are:
> >
> > 1/ What if the platform has translation between HPA (CXL decode) and SPA
> > (physical addresses reported in trace points that PIO and DMA see)?
>
> See later for discussion of other interfaces. This is assuming the
> repair key is not HPA (it is on some systems / situations) - if it's
> the repair key then that is easily dealt with.
>
> HPA / SPA more or less irrelevant for repair itself, they are relevant
> for the decision to repair. In the 'on reboot' soft repair case they may
> not even exist at the time of repair as it would be expected to happen
> before we've brought up a region (to get the RAM into a good state at boot).
>
> For cases where the memory decoders are configured and so there is an HPA
> to DPA mapping:
> The trace reports provide both all these somewhat magic values and
> the HPA. Thus we can do the HPA aware stuff on that before then looking
> up the other bit of the appropriate error reports to get the bank row etc.
>
> > 2/ What if memory is interleaved across repair domains?
>
> Also not relevant to a repair control and only a little relevant to the
> decision to repair. The domains would be handled separately but if
> we have to offline a chunk of memory to do it (needed for repair
> on some devices) that may be a bigger chunk if fine grained interleave
> in use and that may affect the decision.

Again, the repair control assumes that the kernel can just trust userspace to
get it right. When the kernel knows the SPA implications it can add safety
like "you are going to issue sparing on deviceA that will temporarily take
deviceA offline. CXL subsystem tells me deviceA is interleaved with deviceB in
SPA so the whole SPA range needs to be offline before this operation
proceeds". That is not something that userspace can reliably coordinate.
> > HPA only works if the address decoders are always configured (so not on CXL) > What is actually happening in that case is typically that a firmware is > involved that can look up address decoders etc, and map the control HPA > to Bank / row etc to issue the actual low level commands. This keeps > the memory mapping completely secret rather than exposing it in error > records. > > > > > I expect the flow rasdaemon would want is that the current PFA (leaky > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it > > has performed exceeds some threshold and it wants to attempt to repair > > memory. > > Sparing may happen prior to point where we'd have done a soft offline > if non disruptive (whether it is can be read from another bit of the > ABI). Memory repair might be much less disruptive than soft-offline! > I rather hope memory manufacturers build that, but I'm aware of at least > one case where they didn't and the memory must be offline. That's a good point, spare before offline makes sense. [..] > However, there are other usecases where this isn't needed which is why > that isn't a precursor for this series. > > Initial enablement targets two situations: > 1) Repair can be done in non disruptive way - no need to soft offline at all. Modulo needing to quiesce access over the sparing event? > 2) Repair can be done at boot before memory is onlined or on admin > action to take the whole region offline, then repair specific chunks of > memory before bringing it back online. Which is userspace racing the kernel to online memory? > > So, yes, +1 to simpler for now where software effectively just needs to > > deal with a handful of "region repair" buttons and the semantics of > > those are coarse and sub-optimal. Wait for a future where a tool author > > says, "we have had good success getting bulk offlined pages back into > > service, but now we need this specific finer grained kernel interface to > > avoid wasting spare banks prematurely". > > Depends on where you think that interface is. I can absolutely see that > as a control to RAS Daemon. Option 2 above, region is offline, repair > all dodgy looking fine grained buckets. > > Note though that a suboptimal repair may mean permanent use of very rare > resources. So there needs to be a control a the finest granularity as well. > Which order those get added to userspace tools doesn't matter to me. > > If you mean that interface in kernel it brings some non trivial requirements. > The kernel would need all of: > 1) Tracking interface for all error records so the reverse map from region > to specific bank / row etc is available for a subset of entries. The > kernel would need to know which of those are important (soft offline > might help in that use case, otherwise that means decision algorithms > are in kernel or we have fine grained queue for region repair in parallel > with soft-offline). > 2) A way to inject the reverse map information from a userspace store > (to deal with reboot etc). Not a way to inject the reverse map information, a way to inject the error records and assert that memory topology changes have not invalidated those records. > That sounds a lot harder to deal with than relying on the usespace program > that already does the tracking across boots. I am stuck behind the barrier of userspace must not assume it knows better than the kernel about the SPA impact of a DIMM sparing event. 
The kernel needs evidence either live records from within the same kernel boot or validated records from a previous boot. ...devices could also help us out here with a way to replay DIMM error events. That would allow for refreshing error records even if the memory topology change because the new record would generate a refreshed SPA:HPA:DPA:DIMM tuple. > > Anything more complex than a set of /sys/devices/system/memory/ > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a > > generation ahead of where the initial sophistication needs to lie. > > > > That said, I do not closely follow ras tooling to say whether someone > > has already identified the critical need for a fine grained repair ABI? > > It's not that we necessarily want to repair at fine grain, it's that > the control interface to hardware is fine grained and the reverse mapping > often unknown except for specific error records. > > I'm fully on board with simple interfaces for common cases like repair > the bad memory in this region. I'm just strongly against moving the > complexity of doing that into the kernel. Yes, we are just caught up on where that "...but no simpler" line is drawn.
On Thu, Jan 09, 2025 at 06:34:48PM +0000, Jonathan Cameron wrote:
> Today you can. Seems we are talking cross purposes.
>
> I'm confused. I thought your proposal was for "bank" attribute to present an
> allowed range on read. "bank" attribute is currently written to and read back
> as the value of the bank on which to conduct a repair. Maybe this disconnect
> is down to the fact max_ and min_ attributes should have been marked as RO in
> the docs. They aren't controls, just presentation of limits to userspace.
>
> Was the intent a separate bank_range type attribute rather than max_bank,
> min_bank?

I don't know - I'm just throwing ideas out there. You could do:

cat /sys/.../bank

and that gives you

[<low> <current_value> <high>]

So you have all the needed information. Dunno if this would be abusing sysfs
rules too much tho.

> > > In at least the CXL case I'm fairly sure most of them are not discoverable.
> > > Until you see errors you have no idea what the memory topology is.
> >
> > Ok.
> >
> > > For that you'd need to have a path to read back what happened.
> >
> > So how is this scrubbing going to work? You get an error, you parse it for
> > all the attributes and you go and write those attributes into the scrub
> > interface and it starts scrubbing?
>
> Repair not scrubbing. They are different things we should keep separate,
> scrub corrects the value, if it can, but doesn't change the underlying memory
> to new memory cells to avoid repeated errors. Replacing scrub with repair
> (which I think was the intent here)...

Really?

So how is scrubbing defined for CXL? You read memory, do ECC check on it,
report any potential errors but write back the *original* wrong value?!

I thought the point of scrubbing is to repair it while at it too...

> You get error records that describe the error seen in hardware, write back the
> values into this interface and tell it to repair the memory. This is not
> necessarily a synchronous or immediate thing - instead typically based on
> trend analysis.

This is just silly: I'm scrubbing, I found an error, I should simply fix it
while at it. Why would I need an additional command to repair it?!

> As an example, the decision might be that bit of ram threw up 3 errors
> over a month including multiple system reboots (for other reasons) and
> that is over some threshold so we use a spare memory line to replace it.

Right.

> Short answer, it needs to be very smart and there isn't a case of one size
> fits all - hence suggested approach of making it a user space problem.

Making it a userspace problem is pretty much always a sign that the hw design
failed.

> Given in the systems being considered here, software is triggering the repair,
> we want to allow for policy in the decision.

Right, you can leave a high-level decision to userspace: repair only when
idle, repair only non-correctable errors, blabla but exposing every single
aspect of every single error... meh.

> In simple cases we could push that policy into the kernel e.g. just repair
> the moment we see an error record.
>
> These repair resources are very limited in number, so immediately repairing
> may be a bad idea. We want to build up a history of errors before making
> such a decision. That can be done in kernel.

Yap, we are doing this now: drivers/ras/cec.c

Userspace involvement is minimal, if at all. It is mostly controlling the
parameters of the leaky bucket.

> The decision to repair memory is heavily influenced by policy and time
> considerations against device resource constraints.
> Some options that are hard to do in kernel.
>
> 1. Typical asynchronous error report for a corrected error.
>
> Tells us memory had an error (perhaps from a scrubbing engine on the device
> running checks). No need to take action immediately. Instead build up more data
> over time and if lots of errors occur make the decision to repair as now we are
> sure it is worth doing rather than a single random event. We may tune scrubbing
> engines to check this memory more frequently and adjust our data analysis to
> take that into account for setting thresholds etc.

See above. What happens when your daemon dies and loses all that collected
data?

> 2. Soft repair across boots. We are actually storing the error records, then only
> applying the fix on reboot before using the memory - so maintaining a list
> of bad memory and saving it to a file to read back on boot. We could provide
> another kernel interface to get this info and reinject it after reboot instead
> of doing it in userspace but that is another ABI to design.

We did something similar recently: drivers/ras/amd/fmpm.c. It basically
"replays" errors from persistent storage as that memory cannot be replaced.

> 3. Complex policy across fleets. A lot of work is going on around prediction
> techniques that may change the local policy on each node dependent on the
> overall reliability patterns of a particular batch of devices and local
> characteristics, service guarantees etc. If it is hard repair, then once
> you've run out you need to schedule an engineer out to replace the DIMM.
> All complex inputs to the decision.

You probably could say here: "repair or report when this and that." or
"offline page and report error" and similar high-level decisions by leaving
the details to the kernel instead of looking at every possible error in
userspace and returning back to the kernel to state your decision.

> Similar cases like CPU offlining on repeated errors are done in userspace (e.g.
> RAS Daemon) for similar reasons of long term data gathering and potentially
> complex algorithms.
>
> > > Ok. Then can we just drop the range discoverability entirely or we go with
> > > your suggestion and do not support read back of what has been
> > > requested but instead have the reads return a range if known or "" /
> > > return -EOPNOTSUPP if simply not known?
> >
> > Probably.
>
> Too many options in the above paragraph so just to check... Probably to which?
> If it's a separate attribute from the one we write the control to, then we do
> what is already done here and don't present the interface at all if the range
> isn't discoverable.

Probably means I still don't get a warm and fuzzy feeling about this design.
As I've noted above.

> Ok. Best path is drop the available range support then (so no min_ max_ or
> anything to replace them for now).
>
> Added bonus is we don't have to rush this conversation and can make sure we
> come to the right solution driven by use cases.

Yap, that sounds like a prudent idea.

What I'm trying to say, basically, is, this interface through sysfs is a *lot*
of attributes, there's no clear cut use case where we can judge how useful it
is and, as I alluded to above, I really think that you should leave the
high-level decisions to userspace and let the kernel do its job. This'll make
your interface a lot simpler.

And if you really need to control every single aspect of scrubbing in
userspace, then you can always come later with a proper design and use case.
But again, I really think you should keep as much recovery logic in the kernel and as automatic as possible. Only when you really really need user input, only then you should allow it... I hope I'm making sense here... Thx.
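For illustration of the "build up a history of errors before deciding" policy
both sides refer to, here is a toy leaky-bucket helper: each tracked location
leaks one count per interval and repair is only suggested once the decayed
count crosses a threshold. This is not drivers/ras/cec.c; the names, interval
and threshold are made up.

#include <stdbool.h>
#include <stdint.h>

struct leaky_bucket {
	uint64_t last_seen;	/* timestamp of last error, in seconds */
	unsigned int count;	/* decayed error count */
};

#define LEAK_INTERVAL_SECS	(24 * 60 * 60)	/* leak one count per day */
#define REPAIR_THRESHOLD	3		/* act after 3 "recent" errors */

/* Record one more error at time 'now'; return true if repair looks worth it. */
bool record_error_and_check(struct leaky_bucket *b, uint64_t now)
{
	uint64_t leaked = (now - b->last_seen) / LEAK_INTERVAL_SECS;

	if (leaked >= b->count)
		b->count = 0;
	else
		b->count -= leaked;

	b->count++;
	b->last_seen = now;

	return b->count >= REPAIR_THRESHOLD;
}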
On Fri, 10 Jan 2025 14:49:03 -0800 Dan Williams <dan.j.williams@intel.com> wrote: > Jonathan Cameron wrote: > > On Thu, 9 Jan 2025 15:51:39 -0800 > > Dan Williams <dan.j.williams@intel.com> wrote: > > > > > Jonathan Cameron wrote: > > > > Ok. Best path is drop the available range support then (so no min_ max_ or > > > > anything to replace them for now). > > > > > > I think less is more in this case. > > > > A few notes before I get to specific questions. > > > > Key in the discussion that follows is that the 'repair' is a separate > > from the 'decision to repair'. They mostly need different information > > all of which is in the error trace points. A lot of the questions > > are about the 'decision to repair' part not the repair itself. > > > [snipped the parts I agree with] > > I'd advise mostly ignoring PPR and looking at memory sparing in > > the CXL spec if you want to consider an example. For PPR DPA is used > > (there is an HPA option that might be available). DPA is still needed > > for on boot soft repair (or we could delay that until regions configured, > > but then we'd need to do DPA to HPA mapping as that will be different > > on a new config, and then write HPA for the kernel to map it back to DPA. > > This is helpful because I was indeed getting lost in what kind of > "repair" was being discussed in the thread. Ok, lets focus on sparing > commands. > > > > > > > > > The hpa, dpa, nibble, column, channel, bank, rank, row... ABI looks too > > > wide for userspace to have a chance at writing a competent tool. At > > > least I am struggling with where to even begin with those ABIs if I was > > > asked to write a tool. Does a tool already exist for those? > > > > There is little choice on that - those are the controls for this type > > of repair. If we do something like a magic 'key' based on a concatenation > > of those values we need to define that description to replace a clean > > self describing interface. I'm not 100% against that but I think it would > > be a terrible interface design and I don't think anyone is really in favor of it. > > > > All a userspace tool does is read the error record fields of > > exactly those names. From that it will log data (already happening under > > those names in RAS daemon alongside HPA/ DPA). Then, in simplest case, > > a threshold is passed and we write those values to the repair interface. > > > > There is zero need in that simple case for these to be understood at all. > > This is where you lose me. The error record is a point in time snapshot > of the SPA:HPA:DPA:<proprietary internal "DIMM" mapping>. The security > model for memory operations is based on coordinating with the kernel's > understanding of how that SPA is *currently* being used. Whilst it is being used I agree. Key is to only do disruptive / data changing actions when it is not being used. > > The kernel can not just take userspace's word for it that potentially > data changing, or temporary loss-of-use operations are safe to execute > just because once upon a time userspace saw an error record that > implicated a given SPA in the past, especially over reboot. There are two cases (discoverable from hardware) 1) Non disruptive. No security concern as the device guarantees to not interrupt traffic and the memory contents is copied to the new location. Basically software never knows it happened. 2) Disruptive. We only allow this if the memory is offline. In the CXL case the CXL specific code must check no memory on the device is online so we aren't disrupting anything. 
The other implementation we have code for (will post after this lands) has finer granularity constraints and only the page needs to be offline. As it is offline the content is not preserved anyway. We may need to add extra constraints along with future support for temporal persistence / sharing but we can do that as part of adding that support in general. (Personally I think in those cases memory repair is a job for the out of band management anyway). In neither case am I seeing a security concern. Am I missing something? > > The SPA:HPA:DPA:DIMM tuple is invalidated on reconfiguration and reboot > events. It follows that the kernel has a security/integrity interest in > declining to act on invalidated tuples. This is solvable within a single > boot as the kernel can cache the error records and userspace can ask to > "trigger sparing based on cached record X". The above rules remove this complexity. Either it is always safe by device design, or the memory is offline and we'll zero fill it anyway so no security concern. > > For the reboot case when the error record cache is flushed the kernel > needs a reliable way to refill that cache, not an ABI for userspace to > say "take my word for it, this *should* be safe". It is safe because of 1 and 2 above we are not editing data in use except in a fashion that the device guarantees is safe. If you don't trust the device on this you have bigger problems. > > [snipped the explanation of replaying the old trace record parameters > data back through sysfs, because that is precisely the hang up I have > with the proposal] > > > > > > > > > Some questions that read on those ABIs are: > > > > > > 1/ What if the platform has translation between HPA (CXL decode) and SPA > > > (physical addresses reported in trace points that PIO and DMA see)? > > > > See later for discussion of other interfaces.. This is assuming the > > repair key is not HPA (it is on some systems / situations) - if it's > > the repair key then that is easily dealt with. > > > > HPA / SPA more or less irrelevant for repair itself, they are relevant > > for the decision to repair. In the 'on reboot' soft repair case they may > > not even exist at the time of repair as it would be expected to happen > > before we've brought up a region (to get the RAM into a good state at boot). > > > > For cases where the memory decoders are configured and so there is an HPA > > to DPA mapping: > > The trace reports provide both all these somewhat magic values and > > the HPA. Thus we can do the HPA aware stuff on that before then looking > > up the other bit of the appropriate error reports to get the bank row etc. > > > > > > > > 2/ What if memory is interleaved across repair domains? > > > > Also not relevant to a repair control and only a little relevant to the > > decision to repair. The domains would be handled separately but if > > we are have to offline a chunk of memory to do it (needed for repair > > on some devices) that may be a bigger chunk if fine grained interleave > > in use and that may affect the decision. > > Again, the repair control assumes that the kernel can just trust > userspace to get it right. When the kernel knows the SPA implications it > can add safety like "you are going to issue sparing on deviceA that will > temporarily take deviceA offline. CXL subsystem tells me deviceA is > interleaved with deviceB in SPA so the whole SPA range needs to be > offline before this operation proceeds". That is not someting that > userspace can reliably coordinate. 
Absolutely the kernel has to enforce this. Same way we protect against poison
injection in some cases. Right now the enforcement is slightly wrong (Shiju is
looking at it again) as we were enforcing at the wrong granularity (specific
dpa, not device). Identifying that hole is a good outcome of this discussion
making us take another look.

Enforcing this is one of the key jobs of the CXL specific driver. We
considered doing it in the core, but the granularity differences between our
initial few examples meant we decided on specific driver implementations of
the checks for now.

> > > 3/ What if the device does not use DDR terminology / topology terms for
> > > repair?
> >
> > Then we provide the additional interfaces assuming they correspond to well
> > known terms. If someone is using a magic key then we can get grumpy
> > with them, but that can also be supported.
> >
> > Mostly I'd expect a new technology to overlap a lot of the existing
> > interface and maybe add one or two more; which layer in the stack for
> > HBM for instance.
>
> The concern is the assertion that sysfs needs to care about all these
> parameters vs an ABI that says "repair errorX". If persistence and
> validity of error records is the concern lets build an ABI for that and
> not depend upon trust in userspace to properly coordinate memory
> integrity concerns.

It doesn't have to. It just has to ensure that the memory device is in the
correct state. So check, not coordinate.

At a larger scale, coordination is already doable (subject to races that we
must avoid by holding locks): tear down the regions so there are no mappings
on the device you want to repair. Don't bring them up again until after you
are done. The main use case is probably do it before you bring the mappings
up, but same result.

> > The main alternative is where the device takes an HPA / SPA / DPA. We have one
> > driver that does that queued up behind this series that uses HPA. PPR uses
> > DPA. In that case userspace first tries to see if it can repair by HPA then
> > DPA and if not moves on to see if it can use the fuller description.
> > We will see devices supporting HPA / DPA (which to use depends on when you
> > are doing the operation and what has been configured) but mostly I'd expect
> > either HPA/DPA or fine grained on a given repair instance.
> >
> > HPA only works if the address decoders are always configured (so not on CXL).
> > What is actually happening in that case is typically that a firmware is
> > involved that can look up address decoders etc, and map the control HPA
> > to Bank / row etc to issue the actual low level commands. This keeps
> > the memory mapping completely secret rather than exposing it in error
> > records.
> >
> > > I expect the flow rasdaemon would want is that the current PFA (leaky
> > > bucket Pre-Failure Analysis) decides that the number of soft-offlines it
> > > has performed exceeds some threshold and it wants to attempt to repair
> > > memory.
> >
> > Sparing may happen prior to the point where we'd have done a soft offline
> > if non disruptive (whether it is can be read from another bit of the
> > ABI). Memory repair might be much less disruptive than soft-offline!
> > I rather hope memory manufacturers build that, but I'm aware of at least
> > one case where they didn't and the memory must be offline.
>
> That's a good point, spare before offline makes sense.

If transparent and resources not constrained. Very much not if we have to tear
down the memory first.

> [..]
> > However, there are other usecases where this isn't needed which is why
> > that isn't a precursor for this series.
> >
> > Initial enablement targets two situations:
> > 1) Repair can be done in a non disruptive way - no need to soft offline at all.
>
> Modulo needing to quiesce access over the sparing event?

Absolutely. This is only doable in devices that don't need to quiesce.

> > 2) Repair can be done at boot before memory is onlined or on admin
> > action to take the whole region offline, then repair specific chunks of
> > memory before bringing it back online.
>
> Which is userspace racing the kernel to online memory?

If you are doing this scheme you don't automatically online memory. So both
are in userspace control and can be easily sequenced. If you aren't auto
onlining then buy devices with hard PPR and do it by offlining manually,
repairing and rebooting. Or buy devices that don't need to quiesce and cross
your fingers the dodgy ram doesn't throw an error before you get that far.
Little choice if you decide to online right at the start as normal memory.

> > > So, yes, +1 to simpler for now where software effectively just needs to
> > > deal with a handful of "region repair" buttons and the semantics of
> > > those are coarse and sub-optimal. Wait for a future where a tool author
> > > says, "we have had good success getting bulk offlined pages back into
> > > service, but now we need this specific finer grained kernel interface to
> > > avoid wasting spare banks prematurely".
> >
> > Depends on where you think that interface is. I can absolutely see that
> > as a control to RAS Daemon. Option 2 above, region is offline, repair
> > all dodgy looking fine grained buckets.
> >
> > Note though that a suboptimal repair may mean permanent use of very rare
> > resources. So there needs to be a control at the finest granularity as well.
> > Which order those get added to userspace tools doesn't matter to me.
> >
> > If you mean that interface in the kernel, it brings some non trivial
> > requirements. The kernel would need all of:
> > 1) Tracking interface for all error records so the reverse map from region
> > to specific bank / row etc is available for a subset of entries. The
> > kernel would need to know which of those are important (soft offline
> > might help in that use case, otherwise that means decision algorithms
> > are in kernel or we have fine grained queue for region repair in parallel
> > with soft-offline).
> > 2) A way to inject the reverse map information from a userspace store
> > (to deal with reboot etc).
>
> Not a way to inject the reverse map information, a way to inject the
> error records and assert that memory topology changes have not
> invalidated those records.

There is no way to tell that the topology hasn't changed.

For the reasons above, I don't think we care. Instead of trying to stop
userspace repairing the wrong memory, make sure it is safe for it to do that.
(The kernel is rarely in the business of preventing the slightly stupid)

> > That sounds a lot harder to deal with than relying on the userspace program
> > that already does the tracking across boots.
>
> I am stuck behind the barrier of userspace must not assume it knows
> better than the kernel about the SPA impact of a DIMM sparing
> event.

I think this is the wrong approach. The operation must be 'safe'.
With that in place we absolutely can let userspace assume it knows better than the kernel. > > ...devices could also help us out here with a way to replay DIMM error > events. That would allow for refreshing error records even if the > memory topology change because the new record would generate a refreshed > SPA:HPA:DPA:DIMM tuple. Maybe, but I'm not seeing necessity. > > > > Anything more complex than a set of /sys/devices/system/memory/ > > > devices has a /sys/bus/edac/devices/devX/repair button, feels like a > > > generation ahead of where the initial sophistication needs to lie. > > > > > > That said, I do not closely follow ras tooling to say whether someone > > > has already identified the critical need for a fine grained repair ABI? > > > > It's not that we necessarily want to repair at fine grain, it's that > > the control interface to hardware is fine grained and the reverse mapping > > often unknown except for specific error records. > > > > I'm fully on board with simple interfaces for common cases like repair > > the bad memory in this region. I'm just strongly against moving the > > complexity of doing that into the kernel. > > Yes, we are just caught up on where that "...but no simpler" line is > drawn. > Sure. For now, I've proposed we split the two cases. 1) HPA / DPA repair (PPR) 2) Memory topology based repair (Sparing) If we can make progress on (1) perhaps we can come to a conclusion on what is required. Note that so far I see no reason why either should do any checking against errors observed by the kernel given the security guarantees above. Userspace can repair the wrong bit of memory. That's pointless and burning limited resources, but not a security problem. Jonathan
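To make the proposed split concrete, here is a minimal shell sketch of what a userspace tool would write in each case. The attribute names come from the sysfs ABI document later in this series; the EDAC device name (cxl_mem0), the mem_repairX instance numbers, the example values and the write-only 'repair' trigger attribute are illustrative assumptions rather than part of the posted ABI text.

  # Case 1: repair by physical address (PPR). A single address copied from
  # the error record selects the memory; dpa_support reports whether the
  # device wants a DPA or an HPA.
  R=/sys/bus/edac/devices/cxl_mem0/mem_repair0
  cat "$R"/repair_function        # e.g. a soft or hard PPR function
  echo 0x18a000 > "$R"/dpa        # DPA copied from the CXL error record
  echo 1 > "$R"/repair            # assumed write-only trigger attribute

  # Case 2: repair by topology description (memory sparing). The DRAM
  # topology fields carried in the same error record select the memory,
  # here the row-sparing set: channel, rank, bank_group, bank and row.
  R=/sys/bus/edac/devices/cxl_mem0/mem_repair1
  echo 2 > "$R"/channel
  echo 1 > "$R"/rank
  echo 0 > "$R"/bank_group
  echo 3 > "$R"/bank
  echo 0x17a > "$R"/row
  echo 1 > "$R"/repair            # assumed trigger, as above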
On Tue, 14 Jan 2025 11:35:21 -0800, Dan Williams <dan.j.williams@intel.com> wrote: > > There is no way to tell that the topology hasn't changed. > > For the reasons above, I don't think we care. Instead of trying to stop > > userspace repairing the wrong memory, make sure it is safe for it to do that. > > (The kernel is rarely in the business of preventing the slightly stupid) > > If the policy is "error records with SPA from the current boot can be > trusted" and "userspace requests outside of current boot error records > must only be submitted to known offline" then I think we are aligned. Surely userspace cannot infer if past errors on SPA are for the same DPA block, but it may still decide between soft/hard PPR based on different criteria adopted by the machine admins - or instead use memory sparing. So, yeah, sanity checks at the kernel level to identify the "trust" level based on having DPA data or not make sense, but the final decision about the action should be taken in userspace. Thanks, Mauro
On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote: > We can do that if you prefer. I'm not that fussed how this is handled > because, for tooling at least, I don't see why we'd ever read it. > It's for human parsing only and the above is fine. Is there even a concrete use case for humans currently? Because if not, we might as well not do it at all and keep it simple. All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of the interface and what's the 30K ft big picture for all this. If this all is just wishful thinking on the part of how this is going to be used, then I agree with Dan: less is more. But I need to read the rest of that thread when there's time. ... > Repair cam be a feature of the DIMMs themselves or it can be a feature > of the memory controller. It is basically replacing them with spare > memory from somewhere else (usually elsewhere on same DIMMs that have > a bit of spare capacity for this). Bit like a hot spare in a RAID setup. Ooh, this is what you call repair. That's using a spare rank or so, under which I know it as one example. What I thought you mean with repair is what you mean with "correct". Ok, I see. > In some other systems the OS gets the errors and is responsible for making > the decision. This decision has been kept away from the OS in my world so far. So yes, the FW doing all the RAS recovery work is more like it. And the FW is the better agent in some sense because it has a lot more intimate knowledge of the platform. However... > Sticking to the corrected error case (uncorrected handling > is going to require a lot more work given we've lost data, Dan asked about that > in the other branch of the thread), the OS as a whole (kernel + userspace) > gets the error records and makes the policy decision to repair based on > assessment of risk vs resource availability to make a repair. > > Two reasons for this > 1) Hardware isn't necessarily capable of repairing autonomously as > other actions may be needed (memory traffic to some granularity of > memory may need to be stopped to avoid timeouts). Note there are many > graduations of this from A) can do it live with no visible effect, through > B) offline a page, to C) offlining the whole device. > 2) Policy can be a lot more sophisticated than a BMC can do. ... yes, that's why you can't rely only on the FW to do recovery but involve the OS too. Basically what I've been saying all those years. Oh well... > In some cases perhaps, but another very strong driver is that policy is involved. > > We can either try put a complex design in firmware and poke it with N opaque > parameters from a userspace tool or via some out of band method or we can put > the algorithm in userspace where it can be designed to incorporate lessons learnt > over time. We will start simple and see what is appropriate as this starts > to get used in large fleets. This stuff is a reasonable target for AI type > algorithms etc that we aren't going to put in the kernel. > > Doing this at all is a reliability optimization, normally it isn't required for > correct operation. I'm not saying you should put an AI engine into the kernel - all I'm saying is, the stuff which the kernel can decide itself without user input doesn't need user input. Only toggle: the kernel should do this correction and/or repair automatically or not. What is clear here is that you can't design an interface properly right now for algorithms which you don't have yet. And there's experience missing from running this in large fleets. 
But the interface you're adding now will remain forever cast in stone. Just for us to realize one day that we're not really using it but it is sitting out there dead in the water and we can't retract it. Or we're not using it as originally designed but differently and we need this and that hack to make it work for the current sensible use case. So the way it looks to me right now is, you want this to be in debugfs. You want to go nuts there, collect experience, algorithms, lessons learned etc and *then*, the parts which are really useful and sensible should be moved to sysfs and cast in stone. But not preemptively like that. > Offline has no permanent cost and no limit on number of times you can > do it. Repair is definitely a limited resource and may permanently use > up that resource (discoverable as a policy wants to know that too!) > In some cases once you run out of repair resources you have to send an > engineer to replace the memory before you can do it again. Yes, and until you can do that and because cloud doesn't want to *ever* reboot, you must do diminished but still present machine capabilities by offlining pages and cordoning off faulty hw, etc, etc. > Ok. I guess it is an option (I wasn't aware of that work). > > I was thinking that was far more complex to deal with than just doing it in > userspace tooling. From a quick look that solution seems to rely on ACPI ERSR > infrastructure to provide that persistence that we won't generally have but > I suppose we can read it from the filesystem or other persistent stores. > We'd need to be a lot more general about that as can't make system assumptions > that can be made in AMD specific code. > > So could be done, I don't think it is a good idea in this case, but that > example does suggest it is possible. You can look at this as specialized solutions. Could they be more general? Ofc. But we don't have a general RAS architecture which is vendor-agnostic. > In approach we are targetting, there is no round trip situation. We let the kernel > deal with any synchronous error just fine and run it's existing logic > to offline problematic memory. That needs to be timely and to carry on operating > exactly as it always has. > > In parallel with that we gather the error reports that we will already be > gathering and run analysis on those. From that we decide if a memory is likely to fail > again and perform a sparing operation if appropriate. > Effectively this is 'free'. All the information is already there in userspace > and already understood by tools like rasdaemon, we are not expanding that > reporting interface at all. That is fair. I think you can do that even now if the errors logged have enough hw information to classify them and use them for predictive analysis. > Ok. It seems you correlate number of files with complexity. No, wrong. I'm looking at the interface and am wondering how is this going to be used and whether it is worth it to have it cast in stone forever. > I correlated difficulty of understanding those files with complexity. > Everyone one of the files is clearly defined and aligned with long history > of how to describe DRAM (see how long CPER records have used these > fields for example - they go back to the beginning). Ok, then pls point me to the actual use cases how those files are going to be used or they are used already. 
> I'm all in favor of building an interface up by providing minimum first > and then adding to it, but here what is proposed is the minimum for basic > functionality and the alternative of doing the whole thing in kernel both > puts complexity in the wrong place and restricts us in what is possible. There's another point to consider: if this is the correct and proper solution for *your* fleet, that doesn't necessarily mean it is the correct and generic solution for *everybody* using the kernel. So you can imagine that I'd like to have a generic solution which can maximally include everyone instead of *some* special case only. > To some degree but I think there is a major mismatch in what we think > this is for. > > What I've asked Shiju to look at is splitting the repair infrastructure > into two cases so that maybe we can make partial progress: > > 1) Systems that support repair by Physical Address > - Covers Post Package Repair for CXL > > 2) Systems that support repair by description of the underlying hardware > - Covers Memory Sparing interfaces for CXL. > > We need both longer term anyway, but maybe 1 is less controversial simply > on basis it has fewer control parameters > > This still fundamentally puts the policy in userspace where I > believe it belongs. Ok, this is more concrete. Let's start with those. Can I have some more details on how this works pls and who does what? Is it generic enough? If not, can it live in debugfs for now? See above what I mean about this. Big picture: what is the kernel's role here? To be a parrot to carry data back'n'forth or can it simply do clear-cut decisions itself without the need for userspace involvement? So far I get the idea that this is something for your RAS needs. This should have general usability for the rest of the kernel users - otherwise it should remain a vendor-specific solution until it is needed by others and can be generalized. Also, can already existing solutions in the kernel be generalized so that you can use them too and others can benefit from your improvements? I hope this makes more sense. Thx.
On Tue, 21 Jan 2025 17:16:53 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote: > > We can do that if you prefer. I'm not that fussed how this is handled > > because, for tooling at least, I don't see why we'd ever read it. > > It's for human parsing only and the above is fine. > Note we are dropping the min / max stuff in most cases as it was only added as it seems to be our misguided attempt to resolve an earlier review comment. That reduces some of the complexity and wasn't useful anyway. We are also splitting the patch set up differently so maybe we can move the discussion on to the 'extended' case for repair without also blocking the simple memory address based one. > Is there even a concrete use case for humans currently? Because if not, we > might as well not do it at all and keep it simple. Clearly we need to provide more evidence of use cases: 'Show us your code' seems to apply here. We'll do that over the next few weeks. The concrete use case is repair of CXL memory devices using sparing, based on simple algorithms applied to the data RAS Daemon already has. The interface for the reasons discussed in the long thread with Dan is the minimum required to provide the information needed to allow for two use cases. We enumerated them explicitly in the discussion with Dan because they possibly affected 'safety'. 1) Power up, pre memory online, (typically non persistent) repair of known bad memory. There are two interface options for this, inject the prior mapping from device physical address space (host address is not necessarily relevant here as no address decoders have been programmed yet in CXL - that happens as part of the flow to bring the memory up), or use the information that userspace already has (bank, rank etc) to select what memory is to be replaced with spare capacity. Given the injection interface and the repair interface have to convey the same data, the interface complexity is identical and we might as well have a single step 'repair' rather than 1. Inject prior records then 2. Pass a physical address that is matched to one of those records. There are no security related concerns here as we always treat this as new memory and zero it etc as part of onlining. 2) Online case. Here the restriction Dan proposed was that we 'check' that we have seen an error record on this boot that matches the full description. That is matching both the physical address and the topology (as that mapping can change from boot to boot, but not whilst the memory is in use). This doesn't prevent any use case we have come up with yet because, if we are making a post initial onlining decision to repair we can assume there is a new error record that provided new information on which we are acting. Hence the kernel had the information to check. Whilst I wasn't convinced that we had a definite security problem without this protection, it requires minimal changes and doesn't block the flows we care about so we are fine with adding this check. > > All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of > the interface and what's the 30K ft big picture for all this. Ok. We'll put together an example script / RASdaemon code to show how it is used. I think you may be surprised at how simple this is and hopefully that will show that the interface is appropriate. > > If this all is just wishful thinking on the part of how this is going to be > used, then I agree with Dan: less is more. 
But I need to read the rest of that > thread when there's time. > I'll let Dan speak for himself, but my understanding is that what Dan is focused on is safety and other than tidying up a little isn't proposing an significant interface change. > ... > > Repair cam be a feature of the DIMMs themselves or it can be a feature > > of the memory controller. It is basically replacing them with spare > > memory from somewhere else (usually elsewhere on same DIMMs that have > > a bit of spare capacity for this). Bit like a hot spare in a RAID setup. > > Ooh, this is what you call repair. That's using a spare rank or so, under > which I know it as one example. > > What I thought you mean with repair is what you mean with "correct". Ok, > I see. > > > In some other systems the OS gets the errors and is responsible for making > > the decision. > > This decision has been kept away from the OS in my world so far. So yes, the > FW doing all the RAS recovery work is more like it. And the FW is the better > agent in some sense because it has a lot more intimate knowledge of the > platform. However... > > > Sticking to the corrected error case (uncorrected handling > > is going to require a lot more work given we've lost data, Dan asked about that > > in the other branch of the thread), the OS as a whole (kernel + userspace) > > gets the error records and makes the policy decision to repair based on > > assessment of risk vs resource availability to make a repair. > > > > Two reasons for this > > 1) Hardware isn't necessarily capable of repairing autonomously as > > other actions may be needed (memory traffic to some granularity of > > memory may need to be stopped to avoid timeouts). Note there are many > > graduations of this from A) can do it live with no visible effect, through > > B) offline a page, to C) offlining the whole device. > > 2) Policy can be a lot more sophisticated than a BMC can do. > > ... yes, that's why you can't rely only on the FW to do recovery but involve > the OS too. Basically what I've been saying all those years. Oh well... This we agree on. > > > In some cases perhaps, but another very strong driver is that policy is involved. > > > > We can either try put a complex design in firmware and poke it with N opaque > > parameters from a userspace tool or via some out of band method or we can put > > the algorithm in userspace where it can be designed to incorporate lessons learnt > > over time. We will start simple and see what is appropriate as this starts > > to get used in large fleets. This stuff is a reasonable target for AI type > > algorithms etc that we aren't going to put in the kernel. > > > > Doing this at all is a reliability optimization, normally it isn't required for > > correct operation. > > I'm not saying you should put an AI engine into the kernel - all I'm saying > is, the stuff which the kernel can decide itself without user input doesn't > need user input. Only toggle: the kernel should do this correction and/or > repair automatically or not. This we disagree on. For this persistent case in particular these are limited resources. Once you have used them all you can't do it again. Using them carefully is key. An exception is mentioned below as a possible extension but it relies on a specific subset of allowed device functionality and only covers some use cases (so it's an extra, not a replacement for what this set does). > > What is clear here is that you can't design an interface properly right now > for algorithms which you don't have yet. 
And there's experience missing from > running this in large fleets. With the decision algorithms in userspace, we can design the userspace to kernel interface because we don't care about the algorithm choice - only what it needs to control which is well defined. Algorithms will start simple and then we'll iterate but it won't need changes in this interface because none of it is connected to how we use the data. > > But the interface you're adding now will remain forever cast in stone. Just > for us to realize one day that we're not really using it but it is sitting out > there dead in the water and we can't retract it. Or we're not using it as > originally designed but differently and we need this and that hack to make it > work for the current sensible use case. > > So the way it looks to me right now is, you want this to be in debugfs. You > want to go nuts there, collect experience, algorithms, lessons learned etc and > *then*, the parts which are really useful and sensible should be moved to > sysfs and cast in stone. But not preemptively like that. In general an ABI that is used is cast in stone. To my understanding there is nothing special about debugfs. If we introduce a regression in tooling that uses that interface are we actually any better off than sysfs? https://lwn.net/Articles/309298/ was a good article on this a while back. Maybe there has been a change of opinion on this that I missed. > > > Offline has no permanent cost and no limit on number of times you can > > do it. Repair is definitely a limited resource and may permanently use > > up that resource (discoverable as a policy wants to know that too!) > > In some cases once you run out of repair resources you have to send an > > engineer to replace the memory before you can do it again. > > Yes, and until you can do that and because cloud doesn't want to *ever* > reboot, you must do diminished but still present machine capabilities by > offlining pages and cordoning off faulty hw, etc, etc. Absolutely though the performance impact of punching holes in memory over time is getting some cloud folk pushing back because they can't get their 1GIB pages to put under a VM. Mind you that's not particularly relevant to this thread. > > > Ok. I guess it is an option (I wasn't aware of that work). > > > > I was thinking that was far more complex to deal with than just doing it in > > userspace tooling. From a quick look that solution seems to rely on ACPI ERSR > > infrastructure to provide that persistence that we won't generally have but > > I suppose we can read it from the filesystem or other persistent stores. > > We'd need to be a lot more general about that as can't make system assumptions > > that can be made in AMD specific code. > > > > So could be done, I don't think it is a good idea in this case, but that > > example does suggest it is possible. > > You can look at this as specialized solutions. Could they be more general? > Ofc. But we don't have a general RAS architecture which is vendor-agnostic. It could perhaps be made more general, but so far I'm not seeing why we would do this for this particular feature. It does seem like an interesting avenue to investigate more generally. The use cases discussed in the thread with Dan do not require injection of prior records. Mauro called out his view that complex policy should not be in the kernel as well and as you have gathered I fully agree with him on that! > > > In approach we are targetting, there is no round trip situation. 
We let the kernel > > deal with any synchronous error just fine and run it's existing logic > > to offline problematic memory. That needs to be timely and to carry on operating > > exactly as it always has. > > > > In parallel with that we gather the error reports that we will already be > > gathering and run analysis on those. From that we decide if a memory is likely to fail > > again and perform a sparing operation if appropriate. > > Effectively this is 'free'. All the information is already there in userspace > > and already understood by tools like rasdaemon, we are not expanding that > > reporting interface at all. > > That is fair. I think you can do that even now if the errors logged have > enough hw information to classify them and use them for predictive analysis. Definitely. In general (outside of CXL in particular) we think there are some gaps that we'll look to address in future, but that's simple stuff for another patch series. > > > Ok. It seems you correlate number of files with complexity. > > No, wrong. I'm looking at the interface and am wondering how is this going to > be used and whether it is worth it to have it cast in stone forever. Ok. I'm not concerned by this because of the alignment with old specifications going back a long way. I'm not sure of the history of the CXL definition but it is near identical to the UEFI CPER records that have been in use a long time. For me that convinces me that this form of device description is pretty general and stable. I'm sure we'll get small new things over time (sub-channel came with DDR5 for example) but those are minor additions. > > > I correlated difficulty of understanding those files with complexity. > > Everyone one of the files is clearly defined and aligned with long history > > of how to describe DRAM (see how long CPER records have used these > > fields for example - they go back to the beginning). > > Ok, then pls point me to the actual use cases how those files are going to be > used or they are used already. For this we'll do as we did for scrub control and send a patch set adding tooling to RASdaemon and/or if more appropriate a script along side it. My fault, I falsely thought this one was more obvious and we could leave that until this landed. Seems not! > > > I'm all in favor of building an interface up by providing minimum first > > and then adding to it, but here what is proposed is the minimum for basic > > functionality and the alternative of doing the whole thing in kernel both > > puts complexity in the wrong place and restricts us in what is possible. > > There's another point to consider: if this is the correct and proper solution > for *your* fleet, that doesn't necessarily mean it is the correct and > generic solution for *everybody* using the kernel. So you can imagine that I'd > like to have a generic solution which can maximally include everyone instead > of *some* special case only. This I agree on. However, if CXL takes off (and there seems to be agreement it will to some degree at least) then this interface is fully general for any spec compliant device. It would be nice to have visibility of more OS managed repair interfaces but for now I can only see one other and that is more similar to PPR in CXL which is device/host physical address based. So we have 3 examples on which to build generalizations, but only one fits the model we are discussing here (which is the second part of repair support in v19 patch set). 
> > > To some degree but I think there is a major mismatch in what we think > > this is for. > > > > What I've asked Shiju to look at is splitting the repair infrastructure > > into two cases so that maybe we can make partial progress: > > > > 1) Systems that support repair by Physical Address > > - Covers Post Package Repair for CXL > > > > 2) Systems that support repair by description of the underlying hardware > > - Covers Memory Sparing interfaces for CXL. > > > > We need both longer term anyway, but maybe 1 is less controversial simply > > on basis it has fewer control parameters > > > > This still fundamentally puts the policy in userspace where I > > believe it belongs. > > Ok, this is more concrete. Let's start with those. Can I have some more > details on how this works pls and who does what? Is it generic enough? Sure. We can definitely do that. We have this split in v19 (just undergoing some final docs tidy up etc, should be posted soon). > > If not, can it live in debugfs for now? See above what I mean about this. > > Big picture: what is the kernel's role here? To be a parrot to carry data > back'n'forth or can it simply do clear-cut decisions itself without the need > for userspace involvement? With the additional 'safety' checks Dan has asked for the kernel requirement (beyond a standardized interface / place to look for the controls etc) is now responsible for ensuring that a request to repair memory that is online matches up with an error record that we have received. First instance of this will be CXL native error handling based. For now we've made this device specific because exactly what needs checking depends on the type of repair implementation. My intent was that the kernel never makes decisions for this feature. Potentially in future we could relax that to allow it to do so for a few usecases - the non persistent ones, where it could be considered a way to avoid offlining a page. I see that as a much more complex case though than the userspace managed handling so one for future work. It only applies on some subset of devices and there are not enough in the market yet for us to know if that is going to be commonly supported. They will support repair, but whether they allow online repair rather than only offline is yet to be seen. That question corresponds to a single attribute in sysfs and a couple of checks in the driver code but changes whether the second usecase above is possible. Early devices and the ones in a few years time may make different decisions on this. All options are covered by this driver (autonomous repair is covered for free as nothing to do!) Online sparing to avoid offline is a cute idea only at the moment. > > So far I get the idea that this is something for your RAS needs. This should > have general usability for the rest of the kernel users - otherwise it should > remain a vendor-specific solution until it is needed by others and can be > generalized. CXL is not vendor specific. Our other driver that I keep referring to as 'coming soon' is though. I'll see if I can get a few memory device manufacturers to specifically stick their hands up that they care about this. As an example we presented on this topic with Micron at the LPC CXL uconf (+CC Vandana). I don't have access to Micron parts so this isn't just Huawei using Micron, we simply had two proposals on the same topic so combined the sessions. We have a CXL open source sync call in an hour so I'll ask there. 
> > Also, can already existing solutions in the kernel be generalized so that you > can use them too and others can benefit from your improvements? Maybe for the follow-on topic of non-persistent repair as a path to avoid offlining memory detected as bad. Maybe that counts as generalization (rather than extension). But that's not covering our use case of re-establishing the offline at boot, or the persistent use cases. So it's a value-add feature for a follow-up effort, not a baseline one, which is the intent of this patch set. > > I hope this makes more sense. Thanks for taking the time to continue the discussion and I think we are converging somewhat even if there is further to go. Jonathan > > Thx. >
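The first (pre-online) use case above reduces to a short boot-time sequence. A hedged sketch, assuming the bad-memory records have been exported from the rasdaemon database to a flat file, and with the device name, file path and 'repair' trigger attribute as placeholders; the point is only the ordering, with repair completing while the memory is still offline and the spared ranges zeroed when the memory is later onlined:

  # Hypothetical boot hook: replay sparing requests recorded on previous
  # boots before the CXL region is created and the memory is onlined.
  R=/sys/bus/edac/devices/cxl_mem0/mem_repair1
  while read -r channel rank bank_group bank row; do
          echo "$channel"    > "$R"/channel
          echo "$rank"       > "$R"/rank
          echo "$bank_group" > "$R"/bank_group
          echo "$bank"       > "$R"/bank
          echo "$row"        > "$R"/row
          echo 1             > "$R"/repair  # assumed trigger attribute
  done < /var/lib/rasdaemon/bad-memory.list # assumed export of the rasdaemon DB
  # Only after the loop completes is the region brought up and onlined.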
On Tue, Jan 21, 2025 at 06:16:32PM +0000, Jonathan Cameron wrote: > Clearly we need to provide more evidence of use cases: 'Show us your code' > seems to apply here. We'll do that over the next few weeks. Thanks. > based on simple algorithms applied to the data RAS Daemon already has. > The interface for the reasons discussed in the long thread with Dan > is the minimum required to provide the information needed to allow > for two use cases. We enumerated them explicitly in the discussion with > Dan because they possibly affected 'safety'. > > 1) Power up, pre memory online, (typically non persistent) repair of > known bad memory. Lemme make sure I understand this: during boot you simply know from somewhere that a certain rank (let's use rank for simplicity's sake) is faulty. Before you online the memory, you simply replace that rank in the logic so that the system uses the spare rank while the faulty rank is disabled. > There are two interface options for this, inject the prior mapping from > device physical address space (host address is not necessarily relevant > here as no address decoders have been programmed yet in CXL - that > happens as part of the flow to bring the memory up), or use the > information that userspace already has (bank, rank etc) to select what > memory is to be replaced with spare capacity. Ok, so this is all CXL-specific because this use case relies on userspace being present. Which means you cannot really use this for DIMMs used during boot. So if DIMMs, those should be online-able later, when userspace is there. > Given the injection interface and the repair interface have to > convey the same data, the interface complexity is identical and > we might as well have a single step 'repair' rather than > 1. Inject prior records then What exactly is this injecting? The faulty rank? Which then would cause the respective driver to go and do that repairing. Which then means that you can online that device after rasdaemon has loaded and has the required info to online it. Which then means, rasdaemon needs to be part of the device onlining process. I'm simply conjecturing here - I guess I'll see your detailed use case later. > 2. Pass a physical address that is matched to one of those records. I don't know what that one does. > There are no security related concerns here as we always treat this > as new memory and zero it etc as part of onlining. Right, goes without saying. > 2) Online case. Here the restriction Dan proposed was that we 'check' > that we have seen an error record on this boot that matches the full > description. That is matching both the physical address and the > topology (as that mapping can change from boot to boot, but not whilst > the memory is in use). This doesn't prevent any use case we have > come up with yet because, if we are making a post initial onlining > decision to repair we can assume there is a new error record that > provided new information on which we are acting. Hence the kernel > had the information to check. > > Whilst I wasn't convinced that we had a definite security > problem without this protection, it requires minimal changes and doesn't > block the flows we care about so we are fine with adding this check. I need more detail on that 2nd case - lemme read that other subthread. > Ok. We'll put together an example script / RASdaemon code to show how > it is used. I think you may be surprised at how simple this is and hopefully > that will show that the interface is appropriate. That sounds good, thanks. > This we disagree on. 
For this persistent case in particular these are limited > resources. Once you have used them all you can't do it again. Using them > carefully is key. An exception is mentioned below as a possible extension but > it relies on a specific subset of allowed device functionality and only > covers some use cases (so it's an extra, not a replacement for what this > set does). By "this persistent case" you mean collecting logs per error address, collating them and massaging them or hunting them through a neural network to recognize potential patterns and then act upon them? In any case, I don't mean that - I mean something simple like: "after X errors on address Y, offline page Z." Like we do with .../ras/cec.c. Ofc you can't put really complex handling in the kernel and why would you - it must be *the* best thing after sliced bread to impose that on everyone. All I'm saying is, simple logic like that can be in the kernel if it is useful in the general case. You don't *have* to carry all logic in some userspace daemon - the kernel can be smart too :-) > With the decision algorithms in userspace, we can design the userspace to kernel > interface because we don't care about the algorithm choice - only what it needs > to control which is well defined. Algorithms will start simple and then > we'll iterate but it won't need changes in this interface because none of it > is connected to how we use the data. Are you saying that this interface you have right now is the necessary and sufficient set of sysfs nodes which will be enough for most algorithms in userspace? And you won't have to change it because you realize down the road that it is not enough? > In general an ABI that is used is cast in stone. To my understanding there > is nothing special about debugfs. If we introduce a regression in tooling > that uses that interface are we actually any better off than sysfs? > https://lwn.net/Articles/309298/ was a good article on this a while back. > > Maybe there has been a change of opinion on this that I missed. I don't think so and I can see that article's point. So let's cut to the chase: what are we going to do when the sysfs or debugfs nodes you've added become insufficient and you or someone else needs to change them in the future, for their specific use case? The last paragraph of that article basically sums it up pretty nicely. > Absolutely though the performance impact of punching holes in memory over > time is getting some cloud folk pushing back because they can't get their > 1GIB pages to put under a VM. Mind you that's not particularly relevant > to this thread. What is relevant to this thread is the fact that you can't simply reboot as a RAS recovery action. Not in all cases. > For this we'll do as we did for scrub control and send a patch set adding tooling > to RASdaemon and/or if more appropriate a script along side it. My fault, > I falsely thought this one was more obvious and we could leave that until > this landed. Seems not! Sorry, I can't always guess the use case by looking solely at the sysfs nodes. > This I agree on. However, if CXL takes off (and there seems to be agreement > it will to some degree at least) then this interface is fully general for any spec > compliant device. Ok, sounds good. > Sure. We can definitely do that. We have this split in v19 (just undergoing > some final docs tidy up etc, should be posted soon). Thx. You don't have to rush it - we have merge window anyway. > Early devices and the ones in a few years time may make different > decisions on this. 
All options are covered by this driver (autonomous > repair is covered for free as nothing to do!) Don't forget devices which deviate from the spec because they were implemented wrong. It happens and we have to support them because no one else cares but people have already paid for them and want to use them. > CXL is not vendor specific. Our other driver that I keep referring > to as 'coming soon' is though. I'll see if I can get a few memory > device manufacturers to specifically stick their hands up that they > care about this. As an example we presented on this topic with > Micron at the LPC CXL uconf (+CC Vandana). I don't have access > to Micron parts so this isn't just Huawei using Micron, we simply had two > proposals on the same topic so combined the sessions. We have a CXL > open source sync call in an hour so I'll ask there. Having hw vendors agree on a single driver and Linux implementing it would be ofc optimal. > Maybe for the follow on topic of non persistent repair as a path to > avoid offlining memory detected as bad. Maybe that counts > as generalization (rather than extension). But that's not covering > our usecase of restablishing the offline at boot, or the persistent > usecases. So it's a value add feature for a follow up effort, > not a baseline one which is the intent of this patch set. Ok, I think this whole pile should simply be in two parts: generic, CXL-spec implementing, vendor-agnostic pieces and vendor-specific drivers which use that. It'll be lovely if vendors could agree on this interface you're proposing but I won't hold my breath... > Thanks for taking time to continue the discussion and I think we > are converging somewhat even if there is further to go. Yap, I think so. A lot of things got cleared up for me too, so thanks too. I'm sure you know what the important things are that we need to pay attention when it comes to designing this with a broader audience in mind. Thx.
On Wed, 22 Jan 2025 20:09:17 +0100 Borislav Petkov <bp@alien8.de> wrote: > On Tue, Jan 21, 2025 at 06:16:32PM +0000, Jonathan Cameron wrote: > > Clearly we need to provide more evidence of use cases: 'Show us your code' > > seems to apply here. We'll do that over the next few weeks. > > Thanks. Shiju is just finalizing a v19 + the userspace code. So may make sense to read this reply only after that is out! > > > based on simple algorithms applied to the data RAS Daemon already has. > > The interface for the reasons discussed in the long thread with Dan > > is the minimum required to provide the information needed to allow > > for two use cases. We enumerated them explicitly in the discussion with > > Dan because they possibly affected 'safety'. > > > > 1) Power up, pre memory online, (typically non persistent) repair of > > known bad memory. > > Lemme make sure I understand this: during boot you simply know from somewhere > that a certain rank (let's use rank for simplicity's sake) is faulty. Before > you online the memory, you simply replace that rank in the logic so that the > system uses the spare rank while the faulty rank is disabled. Yes. > > > There are two interface options for this, inject the prior mapping from > > device physical address space (host address is not necessarily relevant > > here as no address decoders have been programmed yet in CXL - that > > happens as part of the flow to bring the memory up), or use the > > information that userspace already has (bank, rank etc) to select what > > memory is to be replaced with spare capacity. > > Ok, so this is all CXL-specific because this use case relies on userspace > being present. Which means you cannot really use this for DIMMs used during > boot. So if DIMMs, those should be online-able later, when userspace is there. In this case yes, this isn't appropriate for general purpose DIMMs. It's not CXL Specific as such but that is the most common type of device providing expansion / memory pooling type facilities today and those are the most common forms of memory not needed at boot (I don't think NVDIMMS ever supported PPR etc). > > > Given the injection interface and the repair interface have to > > convey the same data, the interface complexity is identical and > > we might as well have a single step 'repair' rather than > > 1. Inject prior records then > > What exactly is this injecting? The faulty rank? Which then would cause the > respective driver to go and do that repairing. For this comment I was referring letting the kernel do the stats gathering etc. We would need to put back records from a previous boot. That requires almost the same interface as just telling it to repair. Note the address to physical memory mapping is not stable across boots so we can't just provide a physical address, we need full description. > > Which then means that you can online that device after rasdaemon has loaded > and has the required info to online it. > > Which then means, rasdaemon needs to be part of the device onlining process. Yes. For this flow (sort of) The PoC / proposal is a boot script rather than rasdaemon but it uses the rasdaemon DB. It would definitely be interesting to explore other options in addition to this. Perhaps we get some firmware interface standardization to ask the firmware if it can repair stuff before the memory is in use. Might work for cases where the firmware knows enough about them to do the repair. > > I'm simply conjecturing here - I guess I'll see your detailed use case later. > > > 2. 
Pass a physical address that is matched to one of those records. > > I don't know what that one does. This was about practical restrictions on interface simplification. If we have prior records pushed back into the kernel, we could then 'find' the data we need to repair by looking it up by physical address. In v19 we do that for records from this boot if the memory is online. This is part of the sanity check Dan asked for to harden against userspace repairing something based on stale data, but it is specific to the CXL driver. More generally that sanity check may not be needed if PA to actual memory mapping is a stable thing. > > > There are no security related concerns here as we always treat this > > as new memory and zero it etc as part of onlining. > > Right, goes without saying. > > > 2) Online case. Here the restriction Dan proposed was that we 'check' > > that we have seen an error record on this boot that matches the full > > description. That is matching both the physical address and the > > topology (as that mapping can change from boot to boot, but not whilst > > the memory is in use). This doesn't prevent any use case we have > > come up with yet because, if we are making a post initial onlining > > decision to repair we can assume there is a new error record that > > provided new information on which we are acting. Hence the kernel > > had the information to check. > > > > Whilst I wasn't convinced that we had a definite security > > problem without this protection, it requires minimal changes and doesn't > > block the flows we care about so we are fine with adding this check. > > I need more detail on that 2nd case - lemme read that other subthread. > > > Ok. We'll put together an example script / RASdaemon code to show how > > it is used. I think you may be surprised at how simple this is and hopefully > > that will show that the interface is appropriate. > > That sounds good, thanks. > > > This we disagree on. For this persistent case in particular these are limited > > resources. Once you have used them all you can't do it again. Using them > > carefully is key. An exception is mentioned below as a possible extension but > > it relies on a specific subset of allowed device functionality and only > > covers some use cases (so it's an extra, not a replacement for what this > > set does). > > By "this persistent case" you mean collecting logs per error address, > collating them and massaging them or hunting them through a neural network to > recognize potential patterns and then act upon them? Ah. No not that. I was just meaning the case where it is hard PPR. (hence persistent for all time) Once you've done it you can't go back so after N uses, any more errors mean you need a new device ASAP. That is as decision with a very different threshold to soft PPR where it's a case of you do it until you run out of spares, then you fall back to offlining pages. Next boot you get your spares back again and may use them differently this time. > > In any case, I don't mean that - I mean something simple like: "after X errors > on address Y, offline page Z." Like we do with .../ras/cec.c. Ofc you can't > put really complex handling in the kernel and why would you - it must be *the* > best thing after sliced bread to impose that on everyone. > > All I'm saying is, simple logic like that can be in the kernel if it is useful > in the general case. You don't *have* to carry all logic in some userspace > daemon - the kernel can be smart too :-) True enough. 
I'm not against doing things in kernel in some cases. Even then I want the controls to allow user space to do more complex things. Even in the cases where the devices suggests repair, we may not want to for reasons that device can't know about. > > > With the decision algorithms in userspace, we can design the userspace to kernel > > interface because we don't care about the algorithm choice - only what it needs > > to control which is well defined. Algorithms will start simple and then > > we'll iterate but it won't need changes in this interface because none of it > > is connected to how we use the data. > > Are you saying that this interface you have right now is the necessary and > sufficient set of sysfs nodes which will be enough for most algorithms in > userspace? > > And you won't have to change it because you realize down the road that it is > not enough? > The interface provides all the data, and all the controls to match. Sure, something new might come along that needs additional controls (subchannel for DDR5 showed up recently for instance and are in v19) but that extension should be easy and fit within the ABI. Those new 'features' will need kernel changes and matching rasdaemon changes anyway as there is new data in the error records so this sort of extension should be fine. > > In general an ABI that is used is cast in stone. To my understanding there > > is nothing special about debugfs. If we introduce a regression in tooling > > that uses that interface are we actually any better off than sysfs? > > https://lwn.net/Articles/309298/ was a good article on this a while back. > > > > Maybe there has been a change of opinion on this that I missed. > > I don't think so and I can see that article's point. So let's cut to the > chase: what are we going to do when the sysfs or debugfs nodes you've added > become insufficient and you or someone else needs to change them in the > future, for their specific use case? > > The last paragraph of that article basically sums it up pretty nicely. Agreed. We need an interface we can support indefinitely - there is nothing different between doing it sysfs or debugfs. That should be extensible in a clean fashion to support new data and matching control. We don't have to guarantee that interface supports something 'new' though as our crystal balls aren't perfect, but we do want to make extending to cover the new straight forward. > > > Absolutely though the performance impact of punching holes in memory over > > time is getting some cloud folk pushing back because they can't get their > > 1GIB pages to put under a VM. Mind you that's not particularly relevant > > to this thread. > > What is relevant to this thread is the fact that you can't simply reboot as > a RAS recovery action. Not in all cases. Agreed. We still have the option to soft offline the memory if there is no other choice. > > > For this we'll do as we did for scrub control and send a patch set adding tooling > > to RASdaemon and/or if more appropriate a script along side it. My fault, > > I falsely thought this one was more obvious and we could leave that until > > this landed. Seems not! > > Sorry, I can't always guess the use case by looking solely at the sysfs nodes. > > > This I agree on. However, if CXL takes off (and there seems to be agreement > > it will to some degree at least) then this interface is fully general for any spec > > compliant device. > > Ok, sounds good. > > > Sure. We can definitely do that. 
We have this split in v19 (just undergoing > > some final docs tidy up etc, should be posted soon). > > Thx. > > You don't have to rush it - we have merge window anyway. > > > Early devices and the ones in a few years time may make different > > decisions on this. All options are covered by this driver (autonomous > > repair is covered for free as nothing to do!) > > Don't forget devices which deviate from the spec because they were implemented > wrong. It happens and we have to support them because no one else cares but > people have already paid for them and want to use them. Mostly for CXL stuff we've so far avoided that in the upstream code, but it is indeed the case that quirks will turn up :( I'd hope the majority can be handled in the CXL-specific driver or by massaging the error records on their way out of the kernel. For now we have constrained records to have to be complete as defined by the spec, but I can definitely see we might have a 1 rank only device that doesn't set the valid bit in the error record for rank. In discussion with Shiju, we decided that's a case we'll solve in the driver if it turns out to be relevant. Maybe we'd quirk the error report to fill in the missing data for that device, or maybe we'd relax the constraints on parameters when doing an online repair. That's a question to resolve if anyone ever builds it! > > > CXL is not vendor specific. Our other driver that I keep referring > > to as 'coming soon' is though. I'll see if I can get a few memory > > device manufacturers to specifically stick their hands up that they > > care about this. As an example we presented on this topic with > > Micron at the LPC CXL uconf (+CC Vandana). I don't have access > > to Micron parts so this isn't just Huawei using Micron, we simply had two > > proposals on the same topic so combined the sessions. We have a CXL > > open source sync call in an hour so I'll ask there. > > Having hw vendors agree on a single driver and Linux implementing it would be > ofc optimal. > > > Maybe for the follow-on topic of non-persistent repair as a path to > > avoid offlining memory detected as bad. Maybe that counts > > as generalization (rather than extension). But that's not covering > > our use case of re-establishing the offline at boot, or the persistent > > use cases. So it's a value-add feature for a follow-up effort, > > not a baseline one, which is the intent of this patch set. > > Ok, I think this whole pile should simply be in two parts: generic, CXL-spec > implementing, vendor-agnostic pieces and vendor-specific drivers which use > that. There is room for vendor specific drivers in that longer term and it is happening in the CXL core to move from just supporting spec-compliant type 3 devices (the ones that use the class code) to supporting accelerators (network cards, GPUs etc with local memory). What we have here is built on the CXL core that provides services to all those drivers. The actual opt-in etc. is coming from the type 3 device driver calling the registration call after finding the hardware reports the relevant commands. Overall the CXL subsystem is evolving to allow more reuse for accelerators. Until recently the only public devices were type 3 memory only, so it was a guessing game on where those splits should be. Alejandro and others are now fleshing that out. So far no sign of memory repair on those devices, but if they support it and use standard interfaces, then all good. 
(for type 2 support) https://lore.kernel.org/linux-cxl/20250205151950.25268-1-alucerop@amd.com/T/#t > > It'll be lovely if vendors could agree on this interface you're proposing but > I won't hold my breath... True enough that vendors will do silly things. However, if they want Linux support for a CXL memory device out of the box the only option they have is the driver stack that binds to the class code. Dan and the rest of us are pushing back very hard (possibly too hard) on vendor defined interfaces etc. If we don't see at least an effort to standardize them (not necessarily in the CXL spec, but somewhere) then their chances of getting upstream support is near 0. There have been a few 'discussions' about exceptions to this in the past. Their only option might be fwctl for a few things with it's taints etc. The controls used for this set are explicitly excluded from being used via fwctl: https://lore.kernel.org/linux-cxl/20250204220430.4146187-1-dave.jiang@intel.com/T/#ma478ae9d7529f31ec6f08c2e98432d5721ca0b0e If a vendor wants to do their own thing then good luck to them but don't expect the standard software stack to work. So far I have seen no sign of anyone doing a non compliant memory expansion device and there are quite a few spec compliant ones. We will get weird memory devices with accelerators perhaps but then that memory won't be treated as normal memory anyway and likely has a custom RAS solution. If they do use the spec defined commands, then this support should work fine. Just needs a call from their drive to hook it up. It might not be the best analogy, but I think of the CXL type 3 device spec as being similar to NVME. There are lots of options, but most people will run one standard driver. There may be custom features but the device better be compatible with the NVME driver if they advertise the class code (there are compliance suites etc) > > > Thanks for taking time to continue the discussion and I think we > > are converging somewhat even if there is further to go. > > Yap, I think so. A lot of things got cleared up for me too, so thanks too. > I'm sure you know what the important things are that we need to pay attention > when it comes to designing this with a broader audience in mind. I'll encourage a few more memory vendor folk (who have visibility of more specific implementations / use cases than me) to take another look at v19 and the user space tooling. Hopefully they will point out any remaining holes. Thanks, Jonathan > > Thx. >
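The soft versus hard PPR trade-off discussed above maps onto the persist_mode and repair_safe_when_in_use controls from the ABI file below. A hedged sketch of how a policy daemon might use them, assuming persist_mode accepts the mode names listed in that file as strings and with the device and instance names as placeholders:

  # Prefer a temporary (soft) repair while spare resources last; it costs
  # nothing permanent and can simply be re-applied on a later boot.
  R=/sys/bus/edac/devices/cxl_mem0/mem_repair0
  cat "$R"/repair_safe_when_in_use              # if false, repair only offline memory
  echo EDAC_MEM_REPAIR_SOFT > "$R"/persist_mode # value encoding assumed
  # A hard (permanent) repair consumes a limited resource for good, so a
  # policy would select it only for memory that keeps failing across boots.
  # echo EDAC_MEM_REPAIR_HARD > "$R"/persist_mode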
diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair b/Documentation/ABI/testing/sysfs-edac-memory-repair
new file mode 100644
index 000000000000..e9268f3780ed
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
@@ -0,0 +1,244 @@
+What: /sys/bus/edac/devices/<dev-name>/mem_repairX
+Date: Jan 2025
+KernelVersion: 6.14
+Contact: linux-edac@vger.kernel.org
+Description:
+ The sysfs EDAC bus devices /<dev-name>/mem_repairX subdirectory
+ pertains to the control of memory media repair features, such as
+ PPR (Post Package Repair), memory sparing etc, where the <dev-name>
+ directory corresponds to a device registered with the EDAC
+ device driver for the memory repair features.
+
+ Post Package Repair is a maintenance operation that requests the
+ memory device to perform a repair operation on its media. In detail,
+ it is a memory self-healing feature that fixes a failing memory
+ location by replacing it with a spare row in a DRAM device. For
+ example, a CXL memory device with DRAM components that support PPR
+ features may implement PPR maintenance operations. DRAM components
+ may support two types of PPR functions: hard PPR, for a permanent
+ row repair, and soft PPR, for a temporary row repair. Soft PPR is
+ much faster than hard PPR, but the repair is lost with a power cycle.
+
+ Memory sparing is a repair function that replaces a portion
+ of memory with a portion of functional memory at that same
+ sparing granularity. Memory sparing has cacheline/row/bank/rank
+ sparing granularities. For example, in memory-sparing mode,
+ one memory rank serves as a spare for other ranks on the same
+ channel in case they fail. The spare rank is held in reserve and
+ not used as active memory until a failure is indicated, with
+ reserved capacity subtracted from the total available memory
+ in the system. The DIMM installation order for memory sparing
+ varies based on the number of processors and memory modules
+ installed in the server. After an error threshold is surpassed
+ in a system protected by memory sparing, the content of a failing
+ rank of DIMMs is copied to the spare rank. The failing rank is
+ then taken offline and the spare rank placed online for use as
+ active memory in place of the failed rank.
+
+ The sysfs attribute nodes for a repair feature are only
+ present if the parent driver has implemented the corresponding
+ attr callback function and provided the necessary operations
+ to the EDAC device driver during registration.
+
+ In some states of system configuration (e.g. before address
+ decoders have been configured), memory devices (e.g. CXL)
+ may not have an active mapping in the main host physical
+ address map. As such, the memory to repair must be
+ identified by a device-specific physical addressing scheme
+ using a device physical address (DPA). The DPA and other control
+ attributes to use will be presented in related error records.
+
+What: /sys/bus/edac/devices/<dev-name>/mem_repairX/repair_function
+Date: Jan 2025
+KernelVersion: 6.14
+Contact: linux-edac@vger.kernel.org
+Description:
+ (RO) Memory repair function type. For example, post package
+ repair, memory sparing etc.
+ EDAC_SOFT_PPR - Soft post package repair
+ EDAC_HARD_PPR - Hard post package repair
+ EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
+ EDAC_ROW_MEM_SPARING - Row memory sparing
+ EDAC_BANK_MEM_SPARING - Bank memory sparing
+ EDAC_RANK_MEM_SPARING - Rank memory sparing
+ All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/persist_mode
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write the current persist repair mode set for a
+		repair function. Which persist repair modes a device supports
+		depends on the memory repair function, i.e. on whether the
+		repair is permanent, or temporary and lost with a power cycle.
+		EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair).
+		EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair).
+		All other values are reserved.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa_support
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if the memory device requires the device physical
+		address (DPA) of the memory to repair.
+		False if the memory device requires the host specific physical
+		address (HPA) of the memory to repair.
+		In some states of system configuration (e.g. before address
+		decoders have been configured), memory devices (e.g. CXL)
+		may not have an active mapping in the main host physical
+		address map. As such, the memory to repair must be identified
+		by a device specific physical addressing scheme using a DPA.
+		The device physical address (DPA) to use will be presented in
+		related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair_safe_when_in_use
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) True if memory media is accessible and data is retained
+		during the memory repair operation.
+		The data may not be retained and memory requests may not be
+		correctly processed during a repair operation. In such a case
+		the repair operation should not be executed at runtime.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/hpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Host Physical Address (HPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The HPA to use will be provided in related error records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/dpa
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Device Physical Address (DPA) of the memory to repair.
+		See attribute 'dpa_support' for more details.
+		The specific DPA to use will be provided in related error
+		records.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/nibble_mask
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) Read/Write the nibble mask of the memory to repair.
+		The nibble mask identifies one or more nibbles in error on the
+		memory bus that produced the error event. Nibble mask bit 0
+		shall be set if nibble 0 on the memory bus produced the
+		event, etc. For example, for CXL PPR and sparing, a nibble
+		mask bit set to 1 indicates a request to perform the repair
+		operation in the specific device. All nibble mask bits set
+		to 1 indicate a request to perform the operation in all
+		devices. For CXL memory repair, the specific value of the
+		nibble mask to use will be provided in related error records.
+		For more details, see the nibble mask field in CXL spec ver
+		3.1, section 8.2.9.7.1.2 Table 8-103 soft PPR, section
+		8.2.9.7.1.3 Table 8-104 hard PPR and section 8.2.9.7.1.4
+		Table 8-105 memory sparing.
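
To illustrate how the discovery attributes above fit together, a minimal
shell sketch follows. The device name cxl_mem0, the instance mem_repair0 and
the values shown are hypothetical; which files exist and what they return
depends on which callbacks the parent driver registered:

cd /sys/bus/edac/devices/cxl_mem0/mem_repair0
cat repair_function            # numeric repair function, e.g. EDAC_SOFT_PPR
cat persist_mode               # e.g. EDAC_MEM_REPAIR_SOFT (temporary repair)
cat dpa_support                # 1 -> a DPA, not an HPA, identifies the memory
cat repair_safe_when_in_use    # 1 -> data is retained during the repair
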
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RW) The control attributes associated with the memory address
+		that is to be repaired. The specific values of the attributes
+		to use depend on the portion of memory to repair and may be
+		reported to the host in related error records and may be
+		available to userspace in trace events, such as for CXL
+		memory devices.
+
+		channel - The channel of the memory to repair. A channel is
+		defined as an interface that can be independently accessed
+		for a transaction.
+		rank - The rank of the memory to repair. A rank is defined as
+		a set of memory devices on a channel that together execute a
+		transaction.
+		bank_group - The bank group of the memory to repair.
+		bank - The bank number of the memory to repair.
+		row - The row number of the memory to repair.
+		column - The column number of the memory to repair.
+		sub_channel - The sub-channel of the memory to repair.
+
+		The requirement to set these attributes varies based on the
+		repair function. The attributes are not present in sysfs
+		unless required for a repair function.
+		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.2
+		Table 8-103 soft PPR and Section 8.2.9.7.1.3 Table 8-104 hard
+		PPR operations, these attributes do not need to be set.
+		For example, for the CXL spec ver 3.1, Section 8.2.9.7.1.4
+		Table 8-105 memory sparing, these attributes must be set
+		based on the memory sparing granularity as follows.
+		Channel: Channel associated with the DPA that is to be spared
+		and applies to all subclasses of sparing (cacheline, bank,
+		row and rank sparing).
+		Rank: Rank associated with the DPA that is to be spared and
+		applies to all subclasses of sparing.
+		Bank & Bank Group: Bank & bank group associated with the DPA
+		that is to be spared and apply to the cacheline sparing,
+		row sparing and bank sparing subclasses.
+		Row: Row associated with the DPA that is to be spared and
+		applies to the cacheline sparing and row sparing subclasses.
+		Column: Column associated with the DPA that is to be spared
+		and applies to cacheline sparing only.
+		Sub-channel: Sub-channel associated with the DPA that is to
+		be spared and applies to cacheline sparing only.
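
Tying the address attributes to the repair trigger documented further below,
here is a sketch of what a tool such as rasdaemon might do for a row sparing
repair after a CXL DRAM trace event reports "maintenance needed". The path
and every value written are hypothetical and would normally be taken from
that error record; which attributes must be written depends on the repair
function and sparing granularity described above:

cd /sys/bus/edac/devices/cxl_mem0/mem_repair0
echo 0x3f005e00 > dpa          # DPA reported in the error record
echo 0x1 > nibble_mask
echo 2 > channel
echo 1 > rank
echo 0 > bank_group
echo 3 > bank
echo 0x17a > row
echo 1 > repair                # EDAC_DO_MEM_REPAIR: issue the operation
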
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/min_sub_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_hpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_dpa
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_nibble_mask
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank_group
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_bank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_rank
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_row
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_column
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_channel
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/max_sub_channel
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(RO) The supported range of the control attributes (optional)
+		associated with a memory address that is to be repaired.
+		The memory device may expose the supported range of each
+		attribute; the range depends on the memory device and the
+		portion of memory to repair.
+		Userspace may receive the specific attribute values to use
+		for a repair operation from the memory device via related
+		error records and trace events, such as for CXL memory
+		devices.
+
+What:		/sys/bus/edac/devices/<dev-name>/mem_repairX/repair
+Date:		Jan 2025
+KernelVersion:	6.14
+Contact:	linux-edac@vger.kernel.org
+Description:
+		(WO) Issue the memory repair operation for the specified
+		memory repair attributes. The operation may fail if resources
+		are insufficient based on the requirements of the memory
+		device and repair function.
+		EDAC_DO_MEM_REPAIR - issue repair operation.
+		All other values are reserved.
diff --git a/Documentation/edac/features.rst b/Documentation/edac/features.rst
index ba3ab993ee4f..bfd5533b81b7 100644
--- a/Documentation/edac/features.rst
+++ b/Documentation/edac/features.rst
@@ -97,3 +97,6 @@ RAS features
 ------------
 1. Memory Scrub
 Memory scrub features are documented in `Documentation/edac/scrub.rst`.
+
+2. Memory Repair
+Memory repair features are documented in `Documentation/edac/memory_repair.rst`.
diff --git a/Documentation/edac/index.rst b/Documentation/edac/index.rst
index dfb0c9fb9ab1..d6778f4562dd 100644
--- a/Documentation/edac/index.rst
+++ b/Documentation/edac/index.rst
@@ -8,4 +8,5 @@ EDAC Subsystem
    :maxdepth: 1
 
    features
+   memory_repair
    scrub
diff --git a/Documentation/edac/memory_repair.rst b/Documentation/edac/memory_repair.rst
new file mode 100644
index 000000000000..2787a8a2d6ba
--- /dev/null
+++ b/Documentation/edac/memory_repair.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+EDAC Memory Repair Control
+==========================
+
+Copyright (c) 2024 HiSilicon Limited.
+
+:Author: Shiju Jose <shiju.jose@huawei.com>
+:License: The GNU Free Documentation License, Version 1.2
+          (dual licensed under the GPL v2)
+:Original Reviewers:
+
+- Written for: 6.14
+
+Introduction
+------------
+Memory devices may support repair operations to address issues in their
+memory media. Post Package Repair (PPR) and memory sparing are examples
+of such features.
+
+Post Package Repair (PPR)
+~~~~~~~~~~~~~~~~~~~~~~~~~
+Post Package Repair is a maintenance operation that requests the memory
+device to perform a repair operation on its media. It is a memory
+self-healing feature that fixes a failing memory location by replacing it
+with a spare row in a DRAM device. For example, a CXL memory device with
+DRAM components that support PPR features may implement PPR maintenance
+operations. DRAM components may support two types of PPR functions: hard
+PPR, for a permanent row repair, and soft PPR, for a temporary row repair.
+Soft PPR is much faster than hard PPR, but the repair is lost with a power
+cycle. The data may not be retained and memory requests may not be
+correctly processed during a repair operation. In such a case, the repair
+operation should not be executed at runtime.
+For example, for CXL memory devices, soft PPR and hard PPR repair
+operations may be supported. See CXL spec rev 3.1 sections 8.2.9.7.1.1 PPR
+Maintenance Operations, 8.2.9.7.1.2 sPPR Maintenance Operation and
+8.2.9.7.1.3 hPPR Maintenance Operation for more details.
+
+Memory Sparing
+~~~~~~~~~~~~~~
+Memory sparing is a repair function that replaces a portion of memory with
+a portion of functional memory at that same sparing granularity. Memory
+sparing has cacheline/row/bank/rank sparing granularities. For example, in
+memory-sparing mode, one memory rank serves as a spare for other ranks on
+the same channel in case they fail. The spare rank is held in reserve and
+not used as active memory until a failure is indicated, with reserved
+capacity subtracted from the total available memory in the system. The DIMM
+installation order for memory sparing varies based on the number of
+processors and memory modules installed in the server. After an error
+threshold is surpassed in a system protected by memory sparing, the content
+of a failing rank of DIMMs is copied to the spare rank. The failing rank is
+then taken offline and the spare rank placed online for use as active
+memory in place of the failed rank.
+
+For example, CXL memory devices may support various sparing operation
+subclasses that vary in terms of the scope of the sparing being performed.
+The cacheline sparing subclass refers to a sparing action that can replace
+a full cacheline. Row sparing is provided as an alternative to PPR sparing
+functions and its scope is that of a single DDR row. Bank sparing allows
+an entire bank to be replaced. Rank sparing is defined as an operation
+in which an entire DDR rank is replaced. See CXL spec 3.1 section
+8.2.9.7.1.4 Memory Sparing Maintenance Operations for more details.
+
+Use cases of generic memory repair features control
+----------------------------------------------------
+
+1. The soft PPR, hard PPR and memory-sparing features share similar
+control attributes. Therefore, there is a need for a standardized, generic
+sysfs repair control that is exposed to userspace and used by
+administrators, scripts and tools.
+
+2.
When a CXL device detects an error in a memory component, it may inform +the host of the need for a repair maintenance operation by using an event +record where the "maintenance needed" flag is set. The event record +specifies the device physical address(DPA) and attributes of the memory that +requires repair. The kernel reports the corresponding CXL general media or +DRAM trace event to userspace, and userspace tools (e.g. rasdaemon) initiate +a repair maintenance operation in response to the device request using the +sysfs repair control. + +3. Userspace tools, such as rasdaemon, may request a PPR/sparing on a memory +region when an uncorrected memory error or an excess of corrected memory +errors is reported on that memory. + +4. Multiple PPR/sparing instances may be present per memory device. + +The File System +--------------- + +The control attributes of a registered memory repair instance could be +accessed in the + +/sys/bus/edac/devices/<dev-name>/mem_repairX/ + +sysfs +----- + +Sysfs files are documented in + +`Documentation/ABI/testing/sysfs-edac-memory-repair`. diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile index 3a49304860f0..1de9fe66ac6b 100644 --- a/drivers/edac/Makefile +++ b/drivers/edac/Makefile @@ -10,7 +10,7 @@ obj-$(CONFIG_EDAC) := edac_core.o edac_core-y := edac_mc.o edac_device.o edac_mc_sysfs.o edac_core-y += edac_module.o edac_device_sysfs.o wq.o -edac_core-y += scrub.o ecs.o +edac_core-y += scrub.o ecs.o mem_repair.o edac_core-$(CONFIG_EDAC_DEBUG) += debugfs.o diff --git a/drivers/edac/edac_device.c b/drivers/edac/edac_device.c index 1c1142a2e4e4..a401d81dad8a 100644 --- a/drivers/edac/edac_device.c +++ b/drivers/edac/edac_device.c @@ -575,6 +575,7 @@ static void edac_dev_release(struct device *dev) { struct edac_dev_feat_ctx *ctx = container_of(dev, struct edac_dev_feat_ctx, dev); + kfree(ctx->mem_repair); kfree(ctx->scrub); kfree(ctx->dev.groups); kfree(ctx); @@ -611,6 +612,7 @@ int edac_dev_register(struct device *parent, char *name, const struct attribute_group **ras_attr_groups; struct edac_dev_data *dev_data; struct edac_dev_feat_ctx *ctx; + int mem_repair_cnt = 0; int attr_gcnt = 0; int scrub_cnt = 0; int ret, feat; @@ -628,6 +630,10 @@ int edac_dev_register(struct device *parent, char *name, case RAS_FEAT_ECS: attr_gcnt += ras_features[feat].ecs_info.num_media_frus; break; + case RAS_FEAT_MEM_REPAIR: + attr_gcnt++; + mem_repair_cnt++; + break; default: return -EINVAL; } @@ -651,8 +657,17 @@ int edac_dev_register(struct device *parent, char *name, } } + if (mem_repair_cnt) { + ctx->mem_repair = kcalloc(mem_repair_cnt, sizeof(*ctx->mem_repair), GFP_KERNEL); + if (!ctx->mem_repair) { + ret = -ENOMEM; + goto data_mem_free; + } + } + attr_gcnt = 0; scrub_cnt = 0; + mem_repair_cnt = 0; for (feat = 0; feat < num_features; feat++, ras_features++) { switch (ras_features->ft_type) { case RAS_FEAT_SCRUB: @@ -686,6 +701,23 @@ int edac_dev_register(struct device *parent, char *name, attr_gcnt += ras_features->ecs_info.num_media_frus; break; + case RAS_FEAT_MEM_REPAIR: + if (!ras_features->mem_repair_ops || + mem_repair_cnt != ras_features->instance) + goto data_mem_free; + + dev_data = &ctx->mem_repair[mem_repair_cnt]; + dev_data->instance = mem_repair_cnt; + dev_data->mem_repair_ops = ras_features->mem_repair_ops; + dev_data->private = ras_features->ctx; + ret = edac_mem_repair_get_desc(parent, &ras_attr_groups[attr_gcnt], + ras_features->instance); + if (ret) + goto data_mem_free; + + mem_repair_cnt++; + attr_gcnt++; + break; default: ret = -EINVAL; goto 
data_mem_free; @@ -712,6 +744,7 @@ int edac_dev_register(struct device *parent, char *name, return devm_add_action_or_reset(parent, edac_dev_unreg, &ctx->dev); data_mem_free: + kfree(ctx->mem_repair); kfree(ctx->scrub); groups_free: kfree(ras_attr_groups); diff --git a/drivers/edac/mem_repair.c b/drivers/edac/mem_repair.c new file mode 100755 index 000000000000..e7439fd26c41 --- /dev/null +++ b/drivers/edac/mem_repair.c @@ -0,0 +1,492 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * The generic EDAC memory repair driver is designed to control the memory + * devices with memory repair features, such as Post Package Repair (PPR), + * memory sparing etc. The common sysfs memory repair interface abstracts + * the control of various arbitrary memory repair functionalities into a + * unified set of functions. + * + * Copyright (c) 2024 HiSilicon Limited. + */ + +#include <linux/edac.h> + +enum edac_mem_repair_attributes { + MEM_REPAIR_FUNCTION, + MEM_REPAIR_PERSIST_MODE, + MEM_REPAIR_DPA_SUPPORT, + MEM_REPAIR_SAFE_IN_USE, + MEM_REPAIR_HPA, + MEM_REPAIR_MIN_HPA, + MEM_REPAIR_MAX_HPA, + MEM_REPAIR_DPA, + MEM_REPAIR_MIN_DPA, + MEM_REPAIR_MAX_DPA, + MEM_REPAIR_NIBBLE_MASK, + MEM_REPAIR_MIN_NIBBLE_MASK, + MEM_REPAIR_MAX_NIBBLE_MASK, + MEM_REPAIR_BANK_GROUP, + MEM_REPAIR_MIN_BANK_GROUP, + MEM_REPAIR_MAX_BANK_GROUP, + MEM_REPAIR_BANK, + MEM_REPAIR_MIN_BANK, + MEM_REPAIR_MAX_BANK, + MEM_REPAIR_RANK, + MEM_REPAIR_MIN_RANK, + MEM_REPAIR_MAX_RANK, + MEM_REPAIR_ROW, + MEM_REPAIR_MIN_ROW, + MEM_REPAIR_MAX_ROW, + MEM_REPAIR_COLUMN, + MEM_REPAIR_MIN_COLUMN, + MEM_REPAIR_MAX_COLUMN, + MEM_REPAIR_CHANNEL, + MEM_REPAIR_MIN_CHANNEL, + MEM_REPAIR_MAX_CHANNEL, + MEM_REPAIR_SUB_CHANNEL, + MEM_REPAIR_MIN_SUB_CHANNEL, + MEM_REPAIR_MAX_SUB_CHANNEL, + MEM_DO_REPAIR, + MEM_REPAIR_MAX_ATTRS +}; + +struct edac_mem_repair_dev_attr { + struct device_attribute dev_attr; + u8 instance; +}; + +struct edac_mem_repair_context { + char name[EDAC_FEAT_NAME_LEN]; + struct edac_mem_repair_dev_attr mem_repair_dev_attr[MEM_REPAIR_MAX_ATTRS]; + struct attribute *mem_repair_attrs[MEM_REPAIR_MAX_ATTRS + 1]; + struct attribute_group group; +}; + +#define TO_MEM_REPAIR_DEV_ATTR(_dev_attr) \ + container_of(_dev_attr, struct edac_mem_repair_dev_attr, dev_attr) + +#define EDAC_MEM_REPAIR_ATTR_SHOW(attrib, cb, type, format) \ +static ssize_t attrib##_show(struct device *ras_feat_dev, \ + struct device_attribute *attr, char *buf) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = \ + ctx->mem_repair[inst].mem_repair_ops; \ + type data; \ + int ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ + &data); \ + if (ret) \ + return ret; \ + \ + return sysfs_emit(buf, format, data); \ +} + +EDAC_MEM_REPAIR_ATTR_SHOW(repair_function, get_repair_function, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(persist_mode, get_persist_mode, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(dpa_support, get_dpa_support, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(repair_safe_when_in_use, get_repair_safe_when_in_use, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(hpa, get_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_hpa, get_min_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_hpa, get_max_hpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(dpa, get_dpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_dpa, get_min_dpa, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_dpa, get_max_dpa, u64, "0x%llx\n") 
+EDAC_MEM_REPAIR_ATTR_SHOW(nibble_mask, get_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_nibble_mask, get_min_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_nibble_mask, get_max_nibble_mask, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(bank_group, get_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank_group, get_min_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank_group, get_max_bank_group, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(bank, get_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_bank, get_min_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_bank, get_max_bank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(rank, get_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_rank, get_min_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_rank, get_max_rank, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(row, get_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_row, get_min_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_row, get_max_row, u64, "0x%llx\n") +EDAC_MEM_REPAIR_ATTR_SHOW(column, get_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_column, get_min_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_column, get_max_column, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(channel, get_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_channel, get_min_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_channel, get_max_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(sub_channel, get_sub_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(min_sub_channel, get_min_sub_channel, u32, "%u\n") +EDAC_MEM_REPAIR_ATTR_SHOW(max_sub_channel, get_max_sub_channel, u32, "%u\n") + +#define EDAC_MEM_REPAIR_ATTR_STORE(attrib, cb, type, conv_func) \ +static ssize_t attrib##_store(struct device *ras_feat_dev, \ + struct device_attribute *attr, \ + const char *buf, size_t len) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = \ + ctx->mem_repair[inst].mem_repair_ops; \ + type data; \ + int ret; \ + \ + ret = conv_func(buf, 0, &data); \ + if (ret < 0) \ + return ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, \ + data); \ + if (ret) \ + return ret; \ + \ + return len; \ +} + +EDAC_MEM_REPAIR_ATTR_STORE(persist_mode, set_persist_mode, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(hpa, set_hpa, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(dpa, set_dpa, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(nibble_mask, set_nibble_mask, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(bank_group, set_bank_group, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(bank, set_bank, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(rank, set_rank, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(row, set_row, u64, kstrtou64) +EDAC_MEM_REPAIR_ATTR_STORE(column, set_column, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(channel, set_channel, unsigned long, kstrtoul) +EDAC_MEM_REPAIR_ATTR_STORE(sub_channel, set_sub_channel, unsigned long, kstrtoul) + +#define EDAC_MEM_REPAIR_DO_OP(attrib, cb) \ +static ssize_t attrib##_store(struct device *ras_feat_dev, \ + struct device_attribute *attr, \ + const char *buf, size_t len) \ +{ \ + u8 inst = TO_MEM_REPAIR_DEV_ATTR(attr)->instance; \ + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); \ + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; \ + unsigned long data; \ + int ret; \ + \ + ret = kstrtoul(buf, 0, &data); \ + if (ret < 0) \ + 
return ret; \ + \ + ret = ops->cb(ras_feat_dev->parent, ctx->mem_repair[inst].private, data); \ + if (ret) \ + return ret; \ + \ + return len; \ +} + +EDAC_MEM_REPAIR_DO_OP(repair, do_repair) + +static umode_t mem_repair_attr_visible(struct kobject *kobj, struct attribute *a, int attr_id) +{ + struct device *ras_feat_dev = kobj_to_dev(kobj); + struct device_attribute *dev_attr = container_of(a, struct device_attribute, attr); + struct edac_dev_feat_ctx *ctx = dev_get_drvdata(ras_feat_dev); + u8 inst = TO_MEM_REPAIR_DEV_ATTR(dev_attr)->instance; + const struct edac_mem_repair_ops *ops = ctx->mem_repair[inst].mem_repair_ops; + + switch (attr_id) { + case MEM_REPAIR_FUNCTION: + if (ops->get_repair_function) + return a->mode; + break; + case MEM_REPAIR_PERSIST_MODE: + if (ops->get_persist_mode) { + if (ops->set_persist_mode) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_DPA_SUPPORT: + if (ops->get_dpa_support) + return a->mode; + break; + case MEM_REPAIR_SAFE_IN_USE: + if (ops->get_repair_safe_when_in_use) + return a->mode; + break; + case MEM_REPAIR_HPA: + if (ops->get_hpa) { + if (ops->set_hpa) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_HPA: + if (ops->get_min_hpa) + return a->mode; + break; + case MEM_REPAIR_MAX_HPA: + if (ops->get_max_hpa) + return a->mode; + break; + case MEM_REPAIR_DPA: + if (ops->get_dpa) { + if (ops->set_dpa) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_DPA: + if (ops->get_min_dpa) + return a->mode; + break; + case MEM_REPAIR_MAX_DPA: + if (ops->get_max_dpa) + return a->mode; + break; + case MEM_REPAIR_NIBBLE_MASK: + if (ops->get_nibble_mask) { + if (ops->set_nibble_mask) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_NIBBLE_MASK: + if (ops->get_min_nibble_mask) + return a->mode; + break; + case MEM_REPAIR_MAX_NIBBLE_MASK: + if (ops->get_max_nibble_mask) + return a->mode; + break; + case MEM_REPAIR_BANK_GROUP: + if (ops->get_bank_group) { + if (ops->set_bank_group) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_BANK_GROUP: + if (ops->get_min_bank_group) + return a->mode; + break; + case MEM_REPAIR_MAX_BANK_GROUP: + if (ops->get_max_bank_group) + return a->mode; + break; + case MEM_REPAIR_BANK: + if (ops->get_bank) { + if (ops->set_bank) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_BANK: + if (ops->get_min_bank) + return a->mode; + break; + case MEM_REPAIR_MAX_BANK: + if (ops->get_max_bank) + return a->mode; + break; + case MEM_REPAIR_RANK: + if (ops->get_rank) { + if (ops->set_rank) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_RANK: + if (ops->get_min_rank) + return a->mode; + break; + case MEM_REPAIR_MAX_RANK: + if (ops->get_max_rank) + return a->mode; + break; + case MEM_REPAIR_ROW: + if (ops->get_row) { + if (ops->set_row) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_ROW: + if (ops->get_min_row) + return a->mode; + break; + case MEM_REPAIR_MAX_ROW: + if (ops->get_max_row) + return a->mode; + break; + case MEM_REPAIR_COLUMN: + if (ops->get_column) { + if (ops->set_column) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_COLUMN: + if (ops->get_min_column) + return a->mode; + break; + case MEM_REPAIR_MAX_COLUMN: + if (ops->get_max_column) + return a->mode; + break; + case MEM_REPAIR_CHANNEL: + if (ops->get_channel) { + if (ops->set_channel) + return a->mode; + else + return 0444; + } + break; + case 
MEM_REPAIR_MIN_CHANNEL: + if (ops->get_min_channel) + return a->mode; + break; + case MEM_REPAIR_MAX_CHANNEL: + if (ops->get_max_channel) + return a->mode; + break; + case MEM_REPAIR_SUB_CHANNEL: + if (ops->get_sub_channel) { + if (ops->set_sub_channel) + return a->mode; + else + return 0444; + } + break; + case MEM_REPAIR_MIN_SUB_CHANNEL: + if (ops->get_min_sub_channel) + return a->mode; + break; + case MEM_REPAIR_MAX_SUB_CHANNEL: + if (ops->get_max_sub_channel) + return a->mode; + break; + case MEM_DO_REPAIR: + if (ops->do_repair) + return a->mode; + break; + default: + break; + } + + return 0; +} + +#define EDAC_MEM_REPAIR_ATTR_RO(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RO(_name), \ + .instance = _instance }) + +#define EDAC_MEM_REPAIR_ATTR_WO(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_WO(_name), \ + .instance = _instance }) + +#define EDAC_MEM_REPAIR_ATTR_RW(_name, _instance) \ + ((struct edac_mem_repair_dev_attr) { .dev_attr = __ATTR_RW(_name), \ + .instance = _instance }) + +static int mem_repair_create_desc(struct device *dev, + const struct attribute_group **attr_groups, + u8 instance) +{ + struct edac_mem_repair_context *ctx; + struct attribute_group *group; + int i; + struct edac_mem_repair_dev_attr dev_attr[] = { + [MEM_REPAIR_FUNCTION] = EDAC_MEM_REPAIR_ATTR_RO(repair_function, + instance), + [MEM_REPAIR_PERSIST_MODE] = + EDAC_MEM_REPAIR_ATTR_RW(persist_mode, instance), + [MEM_REPAIR_DPA_SUPPORT] = + EDAC_MEM_REPAIR_ATTR_RO(dpa_support, instance), + [MEM_REPAIR_SAFE_IN_USE] = + EDAC_MEM_REPAIR_ATTR_RO(repair_safe_when_in_use, + instance), + [MEM_REPAIR_HPA] = EDAC_MEM_REPAIR_ATTR_RW(hpa, instance), + [MEM_REPAIR_MIN_HPA] = EDAC_MEM_REPAIR_ATTR_RO(min_hpa, instance), + [MEM_REPAIR_MAX_HPA] = EDAC_MEM_REPAIR_ATTR_RO(max_hpa, instance), + [MEM_REPAIR_DPA] = EDAC_MEM_REPAIR_ATTR_RW(dpa, instance), + [MEM_REPAIR_MIN_DPA] = EDAC_MEM_REPAIR_ATTR_RO(min_dpa, instance), + [MEM_REPAIR_MAX_DPA] = EDAC_MEM_REPAIR_ATTR_RO(max_dpa, instance), + [MEM_REPAIR_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RW(nibble_mask, instance), + [MEM_REPAIR_MIN_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RO(min_nibble_mask, instance), + [MEM_REPAIR_MAX_NIBBLE_MASK] = + EDAC_MEM_REPAIR_ATTR_RO(max_nibble_mask, instance), + [MEM_REPAIR_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RW(bank_group, instance), + [MEM_REPAIR_MIN_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RO(min_bank_group, instance), + [MEM_REPAIR_MAX_BANK_GROUP] = + EDAC_MEM_REPAIR_ATTR_RO(max_bank_group, instance), + [MEM_REPAIR_BANK] = EDAC_MEM_REPAIR_ATTR_RW(bank, instance), + [MEM_REPAIR_MIN_BANK] = EDAC_MEM_REPAIR_ATTR_RO(min_bank, instance), + [MEM_REPAIR_MAX_BANK] = EDAC_MEM_REPAIR_ATTR_RO(max_bank, instance), + [MEM_REPAIR_RANK] = EDAC_MEM_REPAIR_ATTR_RW(rank, instance), + [MEM_REPAIR_MIN_RANK] = EDAC_MEM_REPAIR_ATTR_RO(min_rank, instance), + [MEM_REPAIR_MAX_RANK] = EDAC_MEM_REPAIR_ATTR_RO(max_rank, instance), + [MEM_REPAIR_ROW] = EDAC_MEM_REPAIR_ATTR_RW(row, instance), + [MEM_REPAIR_MIN_ROW] = EDAC_MEM_REPAIR_ATTR_RO(min_row, instance), + [MEM_REPAIR_MAX_ROW] = EDAC_MEM_REPAIR_ATTR_RO(max_row, instance), + [MEM_REPAIR_COLUMN] = EDAC_MEM_REPAIR_ATTR_RW(column, instance), + [MEM_REPAIR_MIN_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(min_column, instance), + [MEM_REPAIR_MAX_COLUMN] = EDAC_MEM_REPAIR_ATTR_RO(max_column, instance), + [MEM_REPAIR_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RW(channel, instance), + [MEM_REPAIR_MIN_CHANNEL] = EDAC_MEM_REPAIR_ATTR_RO(min_channel, instance), + [MEM_REPAIR_MAX_CHANNEL] = 
EDAC_MEM_REPAIR_ATTR_RO(max_channel, instance), + [MEM_REPAIR_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RW(sub_channel, instance), + [MEM_REPAIR_MIN_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RO(min_sub_channel, instance), + [MEM_REPAIR_MAX_SUB_CHANNEL] = + EDAC_MEM_REPAIR_ATTR_RO(max_sub_channel, instance), + [MEM_DO_REPAIR] = EDAC_MEM_REPAIR_ATTR_WO(repair, instance) + }; + + ctx = devm_kzalloc(dev, sizeof(*ctx), GFP_KERNEL); + if (!ctx) + return -ENOMEM; + + for (i = 0; i < MEM_REPAIR_MAX_ATTRS; i++) { + memcpy(&ctx->mem_repair_dev_attr[i].dev_attr, + &dev_attr[i], sizeof(dev_attr[i])); + ctx->mem_repair_attrs[i] = + &ctx->mem_repair_dev_attr[i].dev_attr.attr; + } + + sprintf(ctx->name, "%s%d", "mem_repair", instance); + group = &ctx->group; + group->name = ctx->name; + group->attrs = ctx->mem_repair_attrs; + group->is_visible = mem_repair_attr_visible; + attr_groups[0] = group; + + return 0; +} + +/** + * edac_mem_repair_get_desc - get EDAC memory repair descriptors + * @dev: client device with memory repair feature + * @attr_groups: pointer to attribute group container + * @instance: device's memory repair instance number. + * + * Return: + * * %0 - Success. + * * %-EINVAL - Invalid parameters passed. + * * %-ENOMEM - Dynamic memory allocation failed. + */ +int edac_mem_repair_get_desc(struct device *dev, + const struct attribute_group **attr_groups, u8 instance) +{ + if (!dev || !attr_groups) + return -EINVAL; + + return mem_repair_create_desc(dev, attr_groups, instance); +} diff --git a/include/linux/edac.h b/include/linux/edac.h index 979e91426701..5d07192bf1a7 100644 --- a/include/linux/edac.h +++ b/include/linux/edac.h @@ -668,6 +668,7 @@ static inline struct dimm_info *edac_get_dimm(struct mem_ctl_info *mci, enum edac_dev_feat { RAS_FEAT_SCRUB, RAS_FEAT_ECS, + RAS_FEAT_MEM_REPAIR, RAS_FEAT_MAX }; @@ -729,11 +730,147 @@ int edac_ecs_get_desc(struct device *ecs_dev, const struct attribute_group **attr_groups, u16 num_media_frus); +enum edac_mem_repair_function { + EDAC_SOFT_PPR, + EDAC_HARD_PPR, + EDAC_CACHELINE_MEM_SPARING, + EDAC_ROW_MEM_SPARING, + EDAC_BANK_MEM_SPARING, + EDAC_RANK_MEM_SPARING, +}; + +enum edac_mem_repair_persist_mode { + EDAC_MEM_REPAIR_SOFT, /* soft memory repair */ + EDAC_MEM_REPAIR_HARD, /* hard memory repair */ +}; + +enum edac_mem_repair_cmd { + EDAC_DO_MEM_REPAIR = 1, +}; + +/** + * struct edac_mem_repair_ops - memory repair operations + * (all elements are optional except do_repair, set_hpa/set_dpa) + * @get_repair_function: get the memory repair function, listed in + * enum edac_mem_repair_function. + * @get_persist_mode: get the current persist mode. Persist repair modes supported + * in the device is based on the memory repair function which is + * temporary or permanent and is lost with a power cycle. + * EDAC_MEM_REPAIR_SOFT - Soft repair function (temporary repair). + * EDAC_MEM_REPAIR_HARD - Hard memory repair function (permanent repair). + * All other values are reserved. + * @set_persist_mode: set the persist mode of the memory repair instance. + * @get_dpa_support: get dpa support flag. In some states of system configuration + * (e.g. before address decoders have been configured), memory devices + * (e.g. CXL) may not have an active mapping in the main host address + * physical address map. As such, the memory to repair must be identified + * by a device specific physical addressing scheme using a device physical + * address(DPA). 
The DPA and other control attributes to use for the + * dry_run and repair operations will be presented in related error records. + * @get_repair_safe_when_in_use: get whether memory media is accessible and + * data is retained during repair operation. + * @get_hpa: get current host physical address (HPA). + * @set_hpa: set host physical address (HPA) of memory to repair. + * @get_min_hpa: get the minimum supported host physical address (HPA). + * @get_max_hpa: get the maximum supported host physical address (HPA). + * @get_dpa: get current device physical address (DPA). + * @set_dpa: set device physical address (DPA) of memory to repair. + * @get_min_dpa: get the minimum supported device physical address (DPA). + * @get_max_dpa: get the maximum supported device physical address (DPA). + * @get_nibble_mask: get current nibble mask. + * @set_nibble_mask: set nibble mask of memory to repair. + * @get_min_nibble_mask: get the minimum supported nibble mask. + * @get_max_nibble_mask: get the maximum supported nibble mask. + * @get_bank_group: get current bank group. + * @set_bank_group: set bank group of memory to repair. + * @get_min_bank_group: get the minimum supported bank group. + * @get_max_bank_group: get the maximum supported bank group. + * @get_bank: get current bank. + * @set_bank: set bank of memory to repair. + * @get_min_bank: get the minimum supported bank. + * @get_max_bank: get the maximum supported bank. + * @get_rank: get current rank. + * @set_rank: set rank of memory to repair. + * @get_min_rank: get the minimum supported rank. + * @get_max_rank: get the maximum supported rank. + * @get_row: get current row. + * @set_row: set row of memory to repair. + * @get_min_row: get the minimum supported row. + * @get_max_row: get the maximum supported row. + * @get_column: get current column. + * @set_column: set column of memory to repair. + * @get_min_column: get the minimum supported column. + * @get_max_column: get the maximum supported column. + * @get_channel: get current channel. + * @set_channel: set channel of memory to repair. + * @get_min_channel: get the minimum supported channel. + * @get_max_channel: get the maximum supported channel. + * @get_sub_channel: get current sub channel. + * @set_sub_channel: set sub channel of memory to repair. + * @get_min_sub_channel: get the minimum supported sub channel. + * @get_max_sub_channel: get the maximum supported sub channel. + * @do_repair: Issue memory repair operation for the HPA/DPA and + * other control attributes set for the memory to repair. 
+ */ +struct edac_mem_repair_ops { + int (*get_repair_function)(struct device *dev, void *drv_data, u32 *val); + int (*get_persist_mode)(struct device *dev, void *drv_data, u32 *mode); + int (*set_persist_mode)(struct device *dev, void *drv_data, u32 mode); + int (*get_dpa_support)(struct device *dev, void *drv_data, u32 *val); + int (*get_repair_safe_when_in_use)(struct device *dev, void *drv_data, u32 *val); + int (*get_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*set_hpa)(struct device *dev, void *drv_data, u64 hpa); + int (*get_min_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*get_max_hpa)(struct device *dev, void *drv_data, u64 *hpa); + int (*get_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*set_dpa)(struct device *dev, void *drv_data, u64 dpa); + int (*get_min_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*get_max_dpa)(struct device *dev, void *drv_data, u64 *dpa); + int (*get_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*set_nibble_mask)(struct device *dev, void *drv_data, u64 val); + int (*get_min_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*get_max_nibble_mask)(struct device *dev, void *drv_data, u64 *val); + int (*get_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*set_bank_group)(struct device *dev, void *drv_data, u32 val); + int (*get_min_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_bank_group)(struct device *dev, void *drv_data, u32 *val); + int (*get_bank)(struct device *dev, void *drv_data, u32 *val); + int (*set_bank)(struct device *dev, void *drv_data, u32 val); + int (*get_min_bank)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_bank)(struct device *dev, void *drv_data, u32 *val); + int (*get_rank)(struct device *dev, void *drv_data, u32 *val); + int (*set_rank)(struct device *dev, void *drv_data, u32 val); + int (*get_min_rank)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_rank)(struct device *dev, void *drv_data, u32 *val); + int (*get_row)(struct device *dev, void *drv_data, u64 *val); + int (*set_row)(struct device *dev, void *drv_data, u64 val); + int (*get_min_row)(struct device *dev, void *drv_data, u64 *val); + int (*get_max_row)(struct device *dev, void *drv_data, u64 *val); + int (*get_column)(struct device *dev, void *drv_data, u32 *val); + int (*set_column)(struct device *dev, void *drv_data, u32 val); + int (*get_min_column)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_column)(struct device *dev, void *drv_data, u32 *val); + int (*get_channel)(struct device *dev, void *drv_data, u32 *val); + int (*set_channel)(struct device *dev, void *drv_data, u32 val); + int (*get_min_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*set_sub_channel)(struct device *dev, void *drv_data, u32 val); + int (*get_min_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*get_max_sub_channel)(struct device *dev, void *drv_data, u32 *val); + int (*do_repair)(struct device *dev, void *drv_data, u32 val); +}; + +int edac_mem_repair_get_desc(struct device *dev, + const struct attribute_group **attr_groups, + u8 instance); + /* EDAC device feature information structure */ struct edac_dev_data { union { const struct edac_scrub_ops *scrub_ops; const struct edac_ecs_ops *ecs_ops; + const struct edac_mem_repair_ops 
*mem_repair_ops; }; u8 instance; void *private; @@ -744,6 +881,7 @@ struct edac_dev_feat_ctx { void *private; struct edac_dev_data *scrub; struct edac_dev_data ecs; + struct edac_dev_data *mem_repair; }; struct edac_dev_feature { @@ -752,6 +890,7 @@ struct edac_dev_feature { union { const struct edac_scrub_ops *scrub_ops; const struct edac_ecs_ops *ecs_ops; + const struct edac_mem_repair_ops *mem_repair_ops; }; void *ctx; struct edac_ecs_ex_info ecs_info;
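
For the driver-side view, here is a rough sketch of how a parent driver might
wire one soft PPR instance into this framework. The my_* names are invented
for illustration, only a subset of the ops is implemented, and the
edac_dev_register() call follows the pattern of the existing scrub/ECS users;
its exact argument list is an assumption here since it is not part of this
diff.

#include <linux/edac.h>

/* Hypothetical driver-private context for one PPR instance. */
struct my_ppr_ctx {
	u64 dpa;	/* DPA captured from the last sysfs write */
};

static int my_get_repair_function(struct device *dev, void *drv_data, u32 *val)
{
	*val = EDAC_SOFT_PPR;
	return 0;
}

static int my_set_dpa(struct device *dev, void *drv_data, u64 dpa)
{
	struct my_ppr_ctx *ctx = drv_data;

	ctx->dpa = dpa;
	return 0;
}

static int my_do_repair(struct device *dev, void *drv_data, u32 val)
{
	struct my_ppr_ctx *ctx = drv_data;

	if (val != EDAC_DO_MEM_REPAIR)
		return -EINVAL;

	/*
	 * Issue the device specific (e.g. CXL sPPR) maintenance command
	 * for ctx->dpa here.
	 */
	return 0;
}

static const struct edac_mem_repair_ops my_ppr_ops = {
	.get_repair_function	= my_get_repair_function,
	.set_dpa		= my_set_dpa,
	.do_repair		= my_do_repair,
};

static int my_register_ppr(struct device *parent, char *name,
			   struct my_ppr_ctx *ctx)
{
	struct edac_dev_feature ras_feature = {
		.ft_type	= RAS_FEAT_MEM_REPAIR,
		.instance	= 0,
		.mem_repair_ops	= &my_ppr_ops,
		.ctx		= ctx,
	};

	/* Creates /sys/bus/edac/devices/<name>/mem_repair0/ */
	return edac_dev_register(parent, name, NULL, 1, &ras_feature);
}

A driver exposing several repair functions would pass one struct
edac_dev_feature per instance (instance 0, 1, ...), each with its own ops and
context, and the EDAC device driver then creates the matching mem_repairX
directories.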