[RFC,00/13] x86 User Interrupts support

Message ID	20210913200132.3396598-1-sohil.mehta@intel.com
From: Sohil Mehta <sohil.mehta@intel.com>
To: x86@kernel.org
Cc: Sohil Mehta <sohil.mehta@intel.com>, Tony Luck <tony.luck@intel.com>,
 Dave Hansen <dave.hansen@intel.com>, Thomas Gleixner <tglx@linutronix.de>,
 Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
 "H . Peter Anvin" <hpa@zytor.com>, Andy Lutomirski <luto@kernel.org>,
 Jens Axboe <axboe@kernel.dk>, Christian Brauner <christian@brauner.io>,
 Peter Zijlstra <peterz@infradead.org>, Shuah Khan <shuah@kernel.org>,
 Arnd Bergmann <arnd@arndb.de>, Jonathan Corbet <corbet@lwn.net>,
 Ashok Raj <ashok.raj@intel.com>, Jacob Pan <jacob.jun.pan@linux.intel.com>,
 Gayatri Kammela <gayatri.kammela@intel.com>, Zeng Guang <guang.zeng@intel.com>,
 Dan Williams <dan.j.williams@intel.com>, Randy E Witt <randy.e.witt@intel.com>,
 Ravi V Shankar <ravi.v.shankar@intel.com>, Ramesh Thomas <ramesh.thomas@intel.com>,
 linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: [RFC PATCH 00/13] x86 User Interrupts support
Date: Mon, 13 Sep 2021 13:01:19 -0700

Sohil Mehta Sept. 13, 2021, 8:01 p.m. UTC

User Interrupts Introduction
============================

User Interrupts (Uintr) is a hardware technology that enables delivering
interrupts directly to user space.

Today, virtually all communication across privilege boundaries happens by going
through the kernel. These include signals, pipes, remote procedure calls and
hardware interrupt based notifications. User interrupts provide the foundation
for more efficient (low latency and low CPU utilization) versions of these
common operations by avoiding transitions through the kernel.

In the User Interrupts hardware architecture, a receiver is always expected to
be a user space task. However, a user interrupt can be sent by another user
space task, kernel or an external source (like a device).

In addition to the general infrastructure to receive user interrupts, this
series introduces a single source: interrupts from another user task.  These
are referred to as User IPIs.

The first implementation of User IPIs will be in the Intel processor code-named
Sapphire Rapids. Refer to Chapter 11 of the Intel Architecture instruction set
extensions for details of the hardware architecture [1].

Series-reviewed-by: Tony Luck <tony.luck@intel.com>

Main goals of this RFC
======================
- Introduce this upcoming technology to the community.
This cover letter includes a hardware architecture summary along with the
software architecture and kernel design choices. This post is a bit long as a
result. Hopefully, it helps answer more questions than it creates :) I am also
planning to talk about User Interrupts next week at the LPC Kernel summit.

- Discuss potential use cases.
We are starting to look at actual usages and libraries (like libevent[2] and
liburing[3]) that can take advantage of this technology. Unfortunately, we
don't have much to share on this right now. We need some help from the
community to identify usages that can benefit from this. We would like to make
sure the proposed APIs work for the eventual consumers.

- Get early feedback on the software architecture.
We are hoping to get some feedback on the direction of overall software
architecture - starting with User IPI, extending it for kernel-to-user
interrupt notifications and external interrupts in the future. 

- Discuss some of the main architecture opens.
There is a lot of work that still needs to happen to enable this technology. We
are looking for some input on future patches that would be of interest. Here
are some of the big opens that we are looking to resolve.
* Should Uintr interrupt all blocking system calls like sleep(), read(),
  poll(), etc.? If so, should we implement an SA_RESTART type of mechanism
  similar to signals? - Refer to the "Blocking for interrupts" section below.

* Should the User Interrupt Target Table (UITT) be shared between threads of a
  multi-threaded application or maybe even across processes? - Refer to the
  "Sharing the UITT" section below.

Why care about this? - Micro benchmark performance
==================================================
There is a ~9x or higher performance improvement using User IPI over other IPC
mechanisms for event signaling.

Below is the average normalized latency for 1M ping-pong IPC notifications
with message size=1.

+------------+-------------------------+
| IPC type   |   Relative Latency      |
|            |(normalized to User IPI) |
+------------+-------------------------+
| User IPI   |                     1.0 |
| Signal     |                    14.8 |
| Eventfd    |                     9.7 |
| Pipe       |                    16.3 |
| Domain     |                    17.3 |
+------------+-------------------------+

Results have been estimated based on tests on internal hardware with Linux
v5.14 + User IPI patches.

Original benchmark: https://github.com/goldsborough/ipc-bench
Updated benchmark: https://github.com/intel/uintr-ipc-bench/tree/linux-rfc-v1

*Performance varies by use, configuration and other factors.

How it works underneath? - Hardware Summary
===========================================
User Interrupts is a posted interrupt delivery mechanism. The interrupts are
first posted to a memory location and then delivered to the receiver when it
is running with CPL=3.

Kernel managed architectural data structures
--------------------------------------------
UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
information and notification state (like an ongoing notification, suppressed
notifications).

UITT: User Interrupt Target Table - Stores UPID pointer and vector information
for interrupt routing on the sender side. Referenced by the senduipi
instruction.

The interrupt state of each task is referenced via MSRs which are saved and
restored by the kernel during context switch.

Instructions
------------
senduipi <index> - send a user IPI to a target task based on the UITT index.

clui - Mask user interrupts by clearing UIF (User Interrupt Flag).

stui - Unmask user interrupts by setting UIF.

testui - Test current value of UIF.

uiret - return from a user interrupt handler.

User IPI
--------
When a User IPI sender executes 'senduipi <index>', the hardware looks up the
UITT entry selected by the index and posts the interrupt vector (63-0) into the
receiver's UPID.

If the receiver is running (CPL=3), the sender cpu would send a physical IPI to
the receiver's cpu. On the receiver side this IPI is detected as a User
Interrupt. The User Interrupt handler for the receiver is invoked and the
vector number (63-0) is pushed onto the stack.

Upon execution of 'uiret' in the interrupt handler, control is transferred
back to the instruction that was interrupted.

Refer to Chapter 11 of the Intel Architecture instruction set extensions [1] for
more details.

Application interface - Software Architecture
=============================================
User Interrupts (Uintr) is an opt-in feature (unlike signals). Applications
wanting to use Uintr are expected to register themselves with the kernel using
the Uintr related system calls. A Uintr receiver is always a userspace task. A
Uintr sender can be another userspace task, kernel or a device.

1) A receiver can register/unregister an interrupt handler using the Uintr
receiver related syscalls. 
		uintr_register_handler(handler, flags)
		uintr_unregister_handler(flags)

2) A syscall also allows a receiver to register a vector and create a user
interrupt file descriptor - uintr_fd. 
		uintr_fd = uintr_create_fd(vector, flags)

Uintr can be useful in some of the usages where eventfd or signals are used for
frequent userspace event notifications. The semantics of uintr_fd are somewhat
similar to an eventfd() or the write end of a pipe.

3) Any sender with access to uintr_fd can use it to deliver events (in this
case - interrupts) to a receiver. A sender task can manage its connection with
the receiver using the sender related syscalls based on uintr_fd.
		uipi_index = uintr_register_sender(uintr_fd, flags)

Using an FD abstraction provides a secure mechanism to connect with a receiver.
The FD sharing and isolation mechanisms put in place by the kernel would extend
to Uintr as well. 

4a) After the initial setup, a sender task can use the SENDUIPI instruction
along with the uipi_index to generate user IPIs without any kernel
intervention.
		SENDUIPI <uipi_index>

If the receiver is running (CPL=3), then the user interrupt is delivered
directly without a kernel transition. If the receiver isn't running the
interrupt is delivered when the receiver gets context switched back. If the
receiver is blocked in the kernel, the user interrupt is delivered to the
kernel which then unblocks the intended receiver to deliver the interrupt.

4b) If the sender is the kernel or a device, the uintr_fd can be passed onto
the related kernel entity to allow them to setup a connection and then generate
a user interrupt for event delivery. <The exact details of this API are still
being worked upon.>

For details of the user interface and associated system calls refer to the
Uintr man-pages draft:
https://github.com/intel/uintr-linux-kernel/tree/rfc-v1/tools/uintr/manpages.
We have also included the same content as patch 1 of this series to make it
easier to review.

Refer to the Uintr compiler programming guide [4] for details on Uintr
integration with GCC and Binutils.

Kernel design choices
=====================
Here are some of the reasons and trade-offs for the current design of the APIs.

System call interface
---------------------
Why a system call interface?: The 2 options we considered are using a char
device at /dev or using system calls (current approach). A syscall approach
avoids exposing a core cpu feature through a driver model. Also, we want to
have a user interrupt FD per vector and share a single common interrupt handler
among all vectors. This seems easier for the kernel and userspace to accomplish
using a syscall based approach.

Data sharing using user interrupts: Uintr doesn't include a mechanism to
share/transmit data. The expectation is applications use existing data sharing
mechanisms to share data and use Uintr only for signaling.

An FD for each vector: A uintr_fd is assigned to each vector to allow fine
grained priority and event management by the receiver. The alternative we
considered was to allocate an FD to the interrupt handler and have that
shared with the sender. However, that approach relies on the sender selecting
the vector and moves the vector priority management to the sender. Also, if
multiple senders want to send unique user interrupts they would need to
coordinate the vector selection amongst them.

Extending the APIs: Currently, the system calls are only extendable using the
flags argument. We can add a variable size struct to some of the syscalls if
needed.

Extending existing mechanisms
-----------------------------
Uintr can be beneficial in some of the usages where eventfd() or signals are
used. Since Uintr is hardware-dependent, thread-specific and bypasses the
kernel in the fast path, it makes extending existing mechanisms harder.

Main issues with extending signals:
Signal handlers are defined significantly differently than a User interrupt
handler. An application needs to save/restore registers in a user interrupt
handler and call uiret to return from it. Also, signals can be process directed
(or thread directed) but user interrupts are always thread directed.

Comparison of signals with User Interrupts:
+=====================+===========================+===========================+
|                     | Signals                   | User Interrupts           |
+=====================+===========================+===========================+
| Stacks              | Has alt stacks            | Uses application stack    |
|                     |                           | (alternate stack option   |
|                     |                           | not yet enabled)          |
+---------------------+---------------------------+---------------------------+
| Registers state     | Kernel manages incl.      | App responsible (Use GCC  |
|                     | FPU/XSTATE area           | 'interrupt' attribute for |
|                     |                           | general purpose registers)|
+---------------------+---------------------------+---------------------------+
| Blocking/Masking    | sigprocmask(2)/sa_mask    | CLUI instruction (No per  |
|                     |                           | vector masking)           |
+---------------------+---------------------------+---------------------------+
| Direction           | Uni-directional           | Uni-directional           |
+---------------------+---------------------------+---------------------------+
| Post event          | kill(), signal(),         | SENDUIPI <index> - index  |
|                     | sigqueue(), etc.          | derived from uintr_fd     |
+---------------------+---------------------------+---------------------------+
| Target              | Process-directed or       | Thread-directed           |
|                     | thread-directed           |                           |
+---------------------+---------------------------+---------------------------+
| Fork/inheritance    | Empty signal set          | Nothing is inherited      |
+---------------------+---------------------------+---------------------------+
| Execv               | Pending signals preserved | Nothing is inherited      |
+---------------------+---------------------------+---------------------------+
| Order of delivery   | Undetermined              | High to low vector numbers|
| for multiple signals|                           |                           |
+---------------------+---------------------------+---------------------------+
| Handler re-entry    | All signals except the    | No interrupts can cause   |
|                     | one being handled         | handler re-entry.         |
+---------------------+---------------------------+---------------------------+
| Delivery feedback   | 0 or -1 based on whether  | No feedback on whether the|
|                     | the signal was sent       | interrupt was sent or     |
|                     |                           | received.                 |
+---------------------+---------------------------+---------------------------+

Main issues with extending eventfd():
eventfd() has a counter value that is core to the API. User interrupts can't
have an associated counter since the signaling happens at the user level and
the hardware doesn't have a memory counter mechanism. Also, eventfd can be used
for bi-directional signaling whereas uintr_fd is uni-directional.

Comparison of eventfd with uintr_fd:
+====================+======================+==============================+
|                    | Eventfd              | uintr_fd (User Interrupt FD) |
+====================+======================+==============================+
| Object             | Counter - uint64     | Receiver vector information  |
+--------------------+----------------------+------------------------------+
| Post event         | write() to eventfd   | SENDUIPI <index> - index     |
|                    |                      | derived from uintr_fd        |
+--------------------+----------------------+------------------------------+
| Receive event      | read() on eventfd    | Implicit - Handler is        |
|                    |                      | invoked with associated      |
|                    |                      | vector.                      |
+--------------------+----------------------+------------------------------+
| Direction          | Bi-directional       | Uni-directional              |
+--------------------+----------------------+------------------------------+
| Data transmitted   | Counter - uint64     | None                         |
+--------------------+----------------------+------------------------------+
| Waiting for events | Poll() family of     | No per vector wait.          |
|                    | syscalls             | uintr_wait() allows waiting  |
|                    |                      | for all user interrupts      |
+--------------------+----------------------+------------------------------+

Security Model
==============
User Interrupts is designed as an opt-in feature (unlike signals). The security
model for user interrupts is intended to be similar to eventfd(). The general
idea is that any sender with access to uintr_fd would be able to generate the
associated interrupt vector for the receiver task that created the fd.

Untrusted processes
-------------------
The current implementation expects only trusted and cooperating processes to
communicate using user interrupts. Coordination is expected between processes
for a connection teardown. In situations where coordination doesn't happen
(say, due to abrupt process exit), the kernel would end up keeping shared
resources (like UPID) allocated to avoid faults.

Currently, a sender can easily cause a denial of service for the receiver by
generating a storm of user interrupts. A user interrupt handler is invoked with
interrupts disabled, but upon execution of uiret, interrupts get enabled again
by the hardware. This can lead to the handler being invoked again before normal
execution can resume. There isn't a hardware mechanism to mask specific
interrupt vectors. 

To enable untrusted processes to communicate, we need to add a per-vector
masking option through another syscall (or maybe IOCTL). However, this can add
some complexity to the kernel code. A vector can only be masked by modifying
the UITT entries at the source. We need to be careful about races while
removing and restoring the UPID from the UITT.

Resource limits
---------------
The maximum number of receiver-sender connections would be limited by the
maximum number of open file descriptors and the size of the UITT.

The UITT size is currently fixed, somewhat arbitrarily, at 4kB. We plan to
make it dynamic and configurable in size. RLIMIT_MEMLOCK or ENOMEM should be
triggered when the size limits have been hit.

Main Opens
==========

Blocking for interrupts
-----------------------
User interrupts are delivered to applications immediately if they are running
in userspace. If a receiver task has blocked in the kernel using the placeholder
uintr_wait() syscall, the task would be woken up to deliver the user interrupt.
However, if the task is blocked due to any other blocking call like read() or
sleep(), the interrupt will only get delivered when the application gets
scheduled again. We need to consider if applications need to receive User
Interrupts as soon as they are posted (similar to signals) when they are
blocked due to some other reason. Adding this capability would likely make the
kernel implementation more complex.

Interrupting system calls using User Interrupts would also mean we need to
consider an SA_RESTART type of mechanism. We also need to evaluate if some of
the signal handler related semantics in the kernel can be reused for User
Interrupts.

Sharing the User Interrupt Target Table (UITT)
----------------------------------------------
The current implementation assigns a unique UITT to each task. This assumes
that User interrupts are used for point-to-point communication between 2 tasks.
Also, this keeps the kernel implementation relatively simple.

However, there are benefits to sharing the UITT between threads of a
multi-threaded application. First, they would see a consistent view of the
UITT, i.e. SENDUIPI <index> would mean the same on all threads of the
application. Also, each thread doesn't have to register itself using the common
uintr_fd.
This would simplify the userspace setup and make efficient use of kernel
memory. The potential downside is that the kernel implementation to allocate,
modify, expand and free the UITT would be more complex.

A similar argument can be made for a set of processes that do a lot of IPC
amongst them. They would prefer to have a shared UITT that lets them target any
process from any process. With the current file descriptor based approach, the
connection setup can be time consuming and somewhat cumbersome. We need to
evaluate if this can be made simpler as well.

Kernel page table isolation (KPTI)
----------------------------------
SENDUIPI is a special ring-3 instruction that makes a supervisor mode memory
access to the UPID and UITT memory. The current patches need KPTI to be
disabled for User IPIs to work. To make User IPI work with KPTI, we need to
allocate these structures from a special memory region that has supervisor
access but is mapped into userspace. The plan is to implement a mechanism
similar to LDT.

Processors that support user interrupts are not affected by Meltdown so the
auto mode of KPTI will default to off. Users who want to force enable KPTI will
need to wait for a later version of this patch series to use user interrupts.
Please let us know if you want the development of these patches to be
prioritized (or deprioritized).

FAQs
====
Q: What happens if a process is "surprised" by a user interrupt?
A: Tasks that haven't registered with the kernel and requested user
interrupts are neither expected nor able to receive them.

Q: Do user interrupts affect kernel scheduling?
A: No. If a task is blocked waiting for user interrupts, when the kernel
receives a notification on behalf of that task we only put it back on the
runqueue. Delivery of a user interrupt in no way changes the scheduling
priorities of a task.

Q: Does the sender get to know if the interrupt was delivered?
A: No. User interrupts provide only a posted interrupt delivery mechanism. If
applications need to rely on whether the interrupt was delivered they should
consider a userspace mechanism for feedback (like a shared memory counter or a
user interrupt back to the sender).

Q: Why is there no feedback on interrupt delivery?
A: Being a posted interrupt delivery mechanism, the interrupt delivery
happens in 2 steps:
1) The interrupt information is stored in a memory location (UPID).
2) The physical interrupt is delivered to the interrupt receiver.

The 2nd step could happen immediately, after an extended period, or it might
never happen based on the state of the receiver after step 1. (The receiver
could have disabled interrupts, have been context switched out or it might have
crashed during that time.) This makes it very hard for the hardware to reliably
provide feedback upon execution of SENDUIPI.

Q: Can user interrupts be nested?
A: Yes. Using the STUI instruction in the interrupt handler would allow new
user interrupts to be delivered. However, there is no TPR (Task Priority
Register) like mechanism to allow only higher priority interrupts. Any user
interrupt can be taken when nesting is enabled.

Q: Can a task receive all pending user interrupts in one go?
A: No. The hardware allows only one vector to be processed at a time. If a task
is interested in knowing all the interrupts that are pending then we could add
a syscall that provides the pending interrupts information.

Q: Do the processes need to be pinned to a cpu?
A: No. User interrupts will be routed correctly to whichever cpu the receiver
is running on. The kernel updates the cpu information in the UPID during
context switch.

Q: Why are UPID and UITT allocated by the kernel?
A: If allocated by user space, applications could misuse the UPID and UITT to
write to unauthorized memory and generate interrupts on any cpu. The UPID and
UITT are allocated by the kernel and accessed by the hardware with supervisor
privilege.

Patch structure for this series
===============================
- Man-pages and Kernel documentation (patch 1,2)
- Hardware enumeration (patch 3, 4)
- User IPI kernel vector reservation (patch 5)
- Syscall interface for interrupt receiver, sender and vector
  management(uintr_fd) (patch 6-12)
- Basic selftests (patch 13)

Along with the patches in this RFC, there are additional tests and samples that
are available at:
https://github.com/intel/uintr-linux-kernel/tree/rfc-v1

Links
=====
[1]: https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
[2]: https://libevent.org/
[3]: https://github.com/axboe/liburing
[4]: https://github.com/intel/uintr-compiler-guide/blob/uintr-gcc-11.1/UINTR-compiler-guide.pdf

Sohil Mehta (13):
  x86/uintr/man-page: Include man pages draft for reference
  Documentation/x86: Add documentation for User Interrupts
  x86/cpu: Enumerate User Interrupts support
  x86/fpu/xstate: Enumerate User Interrupts supervisor state
  x86/irq: Reserve a user IPI notification vector
  x86/uintr: Introduce uintr receiver syscalls
  x86/process/64: Add uintr task context switch support
  x86/process/64: Clean up uintr task fork and exit paths
  x86/uintr: Introduce vector registration and uintr_fd syscall
  x86/uintr: Introduce user IPI sender syscalls
  x86/uintr: Introduce uintr_wait() syscall
  x86/uintr: Wire up the user interrupt syscalls
  selftests/x86: Add basic tests for User IPI

 .../admin-guide/kernel-parameters.txt         |   2 +
 Documentation/x86/index.rst                   |   1 +
 Documentation/x86/user-interrupts.rst         | 107 +++
 arch/x86/Kconfig                              |  12 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   6 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   6 +
 arch/x86/include/asm/cpufeatures.h            |   1 +
 arch/x86/include/asm/disabled-features.h      |   8 +-
 arch/x86/include/asm/entry-common.h           |   4 +
 arch/x86/include/asm/fpu/types.h              |  20 +-
 arch/x86/include/asm/fpu/xstate.h             |   3 +-
 arch/x86/include/asm/hardirq.h                |   4 +
 arch/x86/include/asm/idtentry.h               |   5 +
 arch/x86/include/asm/irq_vectors.h            |   6 +-
 arch/x86/include/asm/msr-index.h              |   8 +
 arch/x86/include/asm/processor.h              |   8 +
 arch/x86/include/asm/uintr.h                  |  76 ++
 arch/x86/include/uapi/asm/processor-flags.h   |   2 +
 arch/x86/kernel/Makefile                      |   1 +
 arch/x86/kernel/cpu/common.c                  |  61 ++
 arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
 arch/x86/kernel/fpu/core.c                    |  17 +
 arch/x86/kernel/fpu/xstate.c                  |  20 +-
 arch/x86/kernel/idt.c                         |   4 +
 arch/x86/kernel/irq.c                         |  51 +
 arch/x86/kernel/process.c                     |  10 +
 arch/x86/kernel/process_64.c                  |   4 +
 arch/x86/kernel/uintr_core.c                  | 880 ++++++++++++++++++
 arch/x86/kernel/uintr_fd.c                    | 300 ++++++
 include/linux/syscalls.h                      |   8 +
 include/uapi/asm-generic/unistd.h             |  15 +-
 kernel/sys_ni.c                               |   8 +
 scripts/checksyscalls.sh                      |   6 +
 tools/testing/selftests/x86/Makefile          |  10 +
 tools/testing/selftests/x86/uintr.c           | 147 +++
 tools/uintr/manpages/0_overview.txt           | 265 ++++++
 tools/uintr/manpages/1_register_receiver.txt  | 122 +++
 .../uintr/manpages/2_unregister_receiver.txt  |  62 ++
 tools/uintr/manpages/3_create_fd.txt          | 104 +++
 tools/uintr/manpages/4_register_sender.txt    | 121 +++
 tools/uintr/manpages/5_unregister_sender.txt  |  79 ++
 tools/uintr/manpages/6_wait.txt               |  59 ++
 42 files changed, 2626 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/x86/user-interrupts.rst
 create mode 100644 arch/x86/include/asm/uintr.h
 create mode 100644 arch/x86/kernel/uintr_core.c
 create mode 100644 arch/x86/kernel/uintr_fd.c
 create mode 100644 tools/testing/selftests/x86/uintr.c
 create mode 100644 tools/uintr/manpages/0_overview.txt
 create mode 100644 tools/uintr/manpages/1_register_receiver.txt
 create mode 100644 tools/uintr/manpages/2_unregister_receiver.txt
 create mode 100644 tools/uintr/manpages/3_create_fd.txt
 create mode 100644 tools/uintr/manpages/4_register_sender.txt
 create mode 100644 tools/uintr/manpages/5_unregister_sender.txt
 create mode 100644 tools/uintr/manpages/6_wait.txt


base-commit: 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f

Dave Hansen Sept. 13, 2021, 8:27 p.m. UTC | #1

On 9/13/21 1:01 PM, Sohil Mehta wrote:
> User Interrupts (Uintr) is a hardware technology that enables delivering
> interrupts directly to user space.

Your problem in all of this is going to be convincing folks that this is
a problem worth solving.  I'd start this off with something
attention-grabbing.

Two things.  Good, snazzy writing doesn't repeat words.  You repeated
"interrupt" twice in that first sentence.  It also doesn't get my
attention.  Here's a more concise way of saying it, and also adding
something to get the reader's attention:

	User Interrupts directly deliver events to user space and are
	10x faster than the closest alternative.

Sohil Mehta Sept. 14, 2021, 7:03 p.m. UTC | #2

Resending.. There were some email delivery issues.

On 9/13/2021 1:27 PM, Dave Hansen wrote:
>	User Interrupts directly deliver events to user space and are
>	10x faster than the closest alternative.

Thanks Dave. This is definitely more attention-grabbing than the
previous intro. I'll include this next time.

One thing to note, the 10x gain is only applicable for User IPIs.
For other source of User Interrupts (like kernel-to-user
notifications and other external sources), we don't have the data
yet.

I realized the User IPI data in the cover also needs some
clarification. The 10x gain is only seen when the receiver is
spinning in User space - waiting for interrupts.

If the receiver were to block (wait) in the kernel, the performance
would drop as expected. However, User IPI (blocked) would still be
10% faster than Eventfd and 40% faster than signals.

Here is the updated table:
+---------------------+-------------------------+
| IPC type            |   Relative Latency      |
|                     |(normalized to User IPI) |
+---------------------+-------------------------+
| User IPI            |                     1.0 |
| User IPI (blocked)  |                     8.9 |
| Signal              |                    14.8 |
| Eventfd             |                     9.7 |
| Pipe                |                    16.3 |
| Domain              |                    17.3 |
+---------------------+-------------------------+

--Sohil

Greg Kroah-Hartman Sept. 23, 2021, 12:19 p.m. UTC | #3

On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
> Resending.. There were some email delivery issues.
>
> On 9/13/2021 1:27 PM, Dave Hansen wrote:
> >	User Interrupts directly deliver events to user space and are
> >	10x faster than the closest alternative.
>
> Thanks Dave. This is definitely more attention-grabbing than the
> previous intro. I'll include this next time.
>
> One thing to note, the 10x gain is only applicable for User IPIs.
> For other source of User Interrupts (like kernel-to-user
> notifications and other external sources), we don't have the data
> yet.
>
> I realized the User IPI data in the cover also needs some
> clarification. The 10x gain is only seen when the receiver is
> spinning in User space - waiting for interrupts.
>
> If the receiver were to block (wait) in the kernel, the performance
> would drop as expected. However, User IPI (blocked) would still be
> 10% faster than Eventfd and 40% faster than signals.
>
> Here is the updated table:
> +---------------------+-------------------------+
> | IPC type            |   Relative Latency      |
> |                     |(normalized to User IPI) |
> +---------------------+-------------------------+
> | User IPI            |                     1.0 |
> | User IPI (blocked)  |                     8.9 |
> | Signal              |                    14.8 |
> | Eventfd             |                     9.7 |
> | Pipe                |                    16.3 |
> | Domain              |                    17.3 |
> +---------------------+-------------------------+


Relative is just that, "relative".  If the real values are extremely
tiny, then relative is just "this goes a tiny tiny bit faster than what
you have today in eventfd", right?

So how about "absolute"?  What are we talking here?

And this is really only for the "one userspace task waking up another
userspace task" policies.  What real workload can actually use this?

thanks,

greg k-h

Greg Kroah-Hartman Sept. 23, 2021, 2:09 p.m. UTC | #4

On Thu, Sep 23, 2021 at 02:19:05PM +0200, Greg KH wrote:
> On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
> > Resending.. There were some email delivery issues.
> >
> > On 9/13/2021 1:27 PM, Dave Hansen wrote:
> > >	User Interrupts directly deliver events to user space and are
> > >	10x faster than the closest alternative.
> >
> > Thanks Dave. This is definitely more attention-grabbing than the
> > previous intro. I'll include this next time.
> >
> > One thing to note, the 10x gain is only applicable for User IPIs.
> > For other source of User Interrupts (like kernel-to-user
> > notifications and other external sources), we don't have the data
> > yet.
> >
> > I realized the User IPI data in the cover also needs some
> > clarification. The 10x gain is only seen when the receiver is
> > spinning in User space - waiting for interrupts.
> >
> > If the receiver were to block (wait) in the kernel, the performance
> > would drop as expected. However, User IPI (blocked) would still be
> > 10% faster than Eventfd and 40% faster than signals.
> >
> > Here is the updated table:
> > +---------------------+-------------------------+
> > | IPC type            |   Relative Latency      |
> > |                     |(normalized to User IPI) |
> > +---------------------+-------------------------+
> > | User IPI            |                     1.0 |
> > | User IPI (blocked)  |                     8.9 |
> > | Signal              |                    14.8 |
> > | Eventfd             |                     9.7 |
> > | Pipe                |                    16.3 |
> > | Domain              |                    17.3 |
> > +---------------------+-------------------------+
>
> Relative is just that, "relative".  If the real values are extremely
> tiny, then relative is just "this goes a tiny tiny bit faster than what
> you have today in eventfd", right?
>
> So how about "absolute"?  What are we talking here?
>
> And this is really only for the "one userspace task waking up another
> userspace task" policies.  What real workload can actually use this?

Also, you forgot to list Binder in the above IPC type.

And you forgot to mention that this is tied to one specific CPU type
only.  Are syscalls allowed to be created that would only work on
obscure cpus like this one?

thanks,

greg k-h

Jens Axboe Sept. 23, 2021, 2:39 p.m. UTC | #5

On 9/13/21 2:01 PM, Sohil Mehta wrote:
> - Discuss potential use cases.
> We are starting to look at actual usages and libraries (like libevent[2] and
> liburing[3]) that can take advantage of this technology. Unfortunately, we
> don't have much to share on this right now. We need some help from the
> community to identify usages that can benefit from this. We would like to make
> sure the proposed APIs work for the eventual consumers.

One use case for liburing/io_uring would be to use it instead of eventfd
for notifications. I know some folks do use eventfd right now, though
it's not that common. But if we had support for something like this,
then you could use it to know when to reap events rather than sleep in
the kernel. Or at least to be notified when new events have been posted
to the cq ring.

-- 
Jens Axboe

Dave Hansen Sept. 23, 2021, 2:46 p.m. UTC | #6

On 9/23/21 7:09 AM, Greg KH wrote:
> And you forgot to mention that this is tied to one specific CPU type
> only.  Are syscalls allowed to be created that would only work on
> obscure cpus like this one?

Well, you have to start somewhere.  For example, when memory protection
keys went in, we added three syscalls:

> 329     common  pkey_mprotect           sys_pkey_mprotect
> 330     common  pkey_alloc              sys_pkey_alloc
> 331     common  pkey_free               sys_pkey_free

At the point that I started posting these, you couldn't even buy a
system with this feature.  For a while, there was only one Intel Xeon
generation that had support.

But, if you build it, they will come.  Today, there is powerpc support
and our friends at AMD added support to their processors.  In addition,
protection keys are found across Intel's entire CPU line: from big
Xeons, down to the little Atoms you find in Chromebooks.

I encourage everyone submitting new hardware features to include
information about where their feature will show up to end users *and* to
say how widely it will be available.  I'd actually prefer if maintainers
rejected patches that didn't have this information.

Greg Kroah-Hartman Sept. 23, 2021, 3:07 p.m. UTC | #7

On Thu, Sep 23, 2021 at 07:46:43AM -0700, Dave Hansen wrote:
> I encourage everyone submitting new hardware features to include
> information about where their feature will show up to end users *and* to
> say how widely it will be available.  I'd actually prefer if maintainers
> rejected patches that didn't have this information.

Makes sense.  So, what are the answers to these questions for this new
CPU feature?

thanks,

greg k-h

Thomas Gleixner Sept. 23, 2021, 10:24 p.m. UTC | #8

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> SENDUIPI is a special ring-3 instruction that makes a supervisor mode
> memory access to the UPID and UITT memory. Currently, KPTI needs to be
> off for User IPIs to work.  Processors that support user interrupts are
> not affected by Meltdown so the auto mode of KPTI will default to off.
>
> Users who want to force enable KPTI will need to wait for a later
> version of this patch series that is compatible with KPTI. We need to
> allocate the UPID and UITT structures from a special memory region that
> has supervisor access but it is mapped into userspace. The plan is to
> implement a mechanism similar to LDT.

Seriously?

> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>

This SOB chain is invalid. Ditto in several other patches.

>
> +config X86_USER_INTERRUPTS
> +	bool "User Interrupts (UINTR)"
> +	depends on X86_LOCAL_APIC && X86_64

X86_64 does not work w/o LOCAL_APIC so this dependency is pointless.

> +	depends on CPU_SUP_INTEL
> +	help
> +	  User Interrupts are events that can be delivered directly to
> +	  userspace without a transition through the kernel. The interrupts
> +	  could be generated by another userspace application, kernel or a
> +	  device.
> +
> +	  Refer, Documentation/x86/user-interrupts.rst for details.

"Refer, Documentation..." is not a sentence.

>
> +/* User Interrupt interface */
> +#define MSR_IA32_UINTR_RR		0x985
> +#define MSR_IA32_UINTR_HANDLER		0x986
> +#define MSR_IA32_UINTR_STACKADJUST	0x987
> +#define MSR_IA32_UINTR_MISC		0x988	/* 39:32-UINV, 31:0-UITTSZ */

Bah, these tail comments are crap. Please define proper masks/shift
constants for this instead of using magic numbers in the code.
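
One way to follow this suggestion is sketched below. The macro and
helper names are hypothetical and not taken from the series; only the
bit layout (UINV in bits 39:32, UITTSZ in bits 31:0) comes from the tail
comment in the quoted patch.

```c
#include <stdint.h>

/*
 * Field layout of MSR_IA32_UINTR_MISC as described in the quoted patch:
 * bits 31:0 hold UITTSZ, bits 39:32 hold UINV. Names are illustrative.
 */
#define UINTR_MISC_UITTSZ_SHIFT	0
#define UINTR_MISC_UITTSZ_MASK	(UINT64_C(0xffffffff) << UINTR_MISC_UITTSZ_SHIFT)
#define UINTR_MISC_UINV_SHIFT	32
#define UINTR_MISC_UINV_MASK	(UINT64_C(0xff) << UINTR_MISC_UINV_SHIFT)

/* Extract the notification vector (UINV) from a raw MSR value. */
static inline uint8_t uintr_misc_get_uinv(uint64_t misc)
{
	return (uint8_t)((misc & UINTR_MISC_UINV_MASK) >> UINTR_MISC_UINV_SHIFT);
}

/* Extract the UITT size field from a raw MSR value. */
static inline uint32_t uintr_misc_get_uittsz(uint64_t misc)
{
	return (uint32_t)((misc & UINTR_MISC_UITTSZ_MASK) >> UINTR_MISC_UITTSZ_SHIFT);
}

/* Return misc with UINV replaced by uinv; all other bits are preserved. */
static inline uint64_t uintr_misc_set_uinv(uint64_t misc, uint8_t uinv)
{
	misc &= ~UINTR_MISC_UINV_MASK;
	return misc | ((uint64_t)uinv << UINTR_MISC_UINV_SHIFT);
}
```

Callers would then read and write the fields through the helpers rather
than open-coding shifts by 32 at each use site.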

> +static __always_inline void setup_uintr(struct cpuinfo_x86 *c)


This has to be always inline because it's performance critical or what?

> +{
> +	/* check the boot processor, plus compile options for UINTR. */

Sentences start with uppercase letters.

> +	if (!cpu_feature_enabled(X86_FEATURE_UINTR))
> +		goto disable_uintr;
> +
> +	/* checks the current processor's cpuid bits: */
> +	if (!cpu_has(c, X86_FEATURE_UINTR))
> +		goto disable_uintr;
> +
> +	/*
> +	 * User Interrupts currently doesn't support PTI. For processors that
> +	 * support User interrupts PTI in auto mode will default to off.  Need
> +	 * this check only for users who have force enabled PTI.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_PTI)) {
> +		pr_info_once("x86: User Interrupts (UINTR) not enabled. Please disable PTI using 'nopti' kernel parameter\n");

That message does not make sense. The admin has explicitly added 'pti'
to the kernel command line on a CPU which is not affected. So why would
he now have to add 'nopti' ?

Thanks,

        tglx

Thomas Gleixner Sept. 23, 2021, 11:07 p.m. UTC | #9

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> A user interrupt notification vector is used on the receiver's cpu to
> identify an interrupt as a user interrupt (and not a kernel interrupt).
> Hardware uses the same notification vector to generate an IPI from a
> sender's cpu core when the SENDUIPI instruction is executed.
>
> Typically, the kernel shouldn't receive an interrupt with this vector.
> However, it is possible that the kernel might receive this vector.
>
> Scenario that can cause the spurious interrupt:
>
> Step	cpu 0 (receiver task)		cpu 1 (sender task)
> ----	---------------------		-------------------
> 1	task is running
> 2					executes SENDUIPI
> 3					IPI sent
> 4	context switched out
> 5	IPI delivered
> 	(kernel interrupt detected)
>
> A kernel interrupt can be detected, if a receiver task gets scheduled
> out after the SENDUIPI-based IPI was sent but before the IPI was
> delivered.

What happens if the SENDUIPI is issued when the target task is not on
the CPU? How is that any different from the above?

> The kernel doesn't need to do anything in this case other than receiving
> the interrupt and clearing the local APIC. The user interrupt is always
> stored in the receiver's UPID before the IPI is generated. When the
> receiver gets scheduled back the interrupt would be delivered based on
> its UPID.

So why on earth is that vector reaching the CPU at all?

> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +	seq_printf(p, "%*s: ", prec, "UIS");

No point in printing that when user interrupts are not available/enabled
on the system.

> +	for_each_online_cpu(j)
> +		seq_printf(p, "%10u ", irq_stats(j)->uintr_spurious_count);
> +	seq_puts(p, "  User-interrupt spurious event\n");
>  #endif
>  	return 0;
>  }
> @@ -325,6 +331,33 @@ DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_kvm_posted_intr_nested_ipi)
>  }
>  #endif
>
> +#ifdef CONFIG_X86_USER_INTERRUPTS
> +/*
> + * Handler for UINTR_NOTIFICATION_VECTOR.
> + *
> + * The notification vector is used by the cpu to detect a User Interrupt. In
> + * the typical usage, the cpu would handle this interrupt and clear the local
> + * apic.
> + *
> + * However, it is possible that the kernel might receive this vector. This can
> + * happen if the receiver thread was running when the interrupt was sent but it
> + * got scheduled out before the interrupt was delivered. The kernel doesn't
> + * need to do anything other than clearing the local APIC. A pending user
> + * interrupt is always saved in the receiver's UPID which can be referenced
> + * when the receiver gets scheduled back.
> + *
> + * If the kernel receives a storm of these, it could mean an issue with the
> + * kernel's saving and restoring of the User Interrupt MSR state; Specifically,
> + * the notification vector bits in the IA32_UINTR_MISC_MSR.

Definitely well thought out hardware that.

Thanks,

        tglx

Sohil Mehta Sept. 23, 2021, 11:09 p.m. UTC | #10

On 9/23/2021 5:19 AM, Greg KH wrote:
> On Tue, Sep 14, 2021 at 07:03:36PM +0000, Mehta, Sohil wrote:
>
> Here is the updated table:
> +---------------------+-------------------------+
> | IPC type            |   Relative Latency      |
> |                     |(normalized to User IPI) |
> +---------------------+-------------------------+
> | User IPI            |                     1.0 |
> | User IPI (blocked)  |                     8.9 |
> | Signal              |                    14.8 |
> | Eventfd             |                     9.7 |
> | Pipe                |                    16.3 |
> | Domain              |                    17.3 |
> +---------------------+-------------------------+
> Relative is just that, "relative".  If the real values are extremely
> tiny, then relative is just "this goes a tiny tiny bit faster than what
> you have today in eventfd", right?
>
> So how about "absolute"?  What are we talking here?

Thanks Greg for reviewing the patches.

The reason I have not included absolute numbers is that on a 
pre-production platform it could be misleading. The data here is more of 
an approximation with the final performance expected to trend in this 
direction.

I have used the term "relative" only to signify that this is comparing 
User IPI with others.

Let's say, if eventfd took 9.7 usec on a system then User IPI (running) 
would take 1 usec. So it would still be a 9x improvement.

But, I agree with your point. This is only a micro-benchmark performance 
comparison. The overall gain in a real workload would depend on how it 
uses IPC.

+---------------------+------------------------------+
| IPC type            |       Example Latency        |
|                     |        (microseconds)        |
+---------------------+------------------------------+
| User IPI (running)  |                     1.0 usec |
| User IPI (blocked)  |                     8.9 usec |
| Signal              |                    14.8 usec |
| Eventfd             |                     9.7 usec |
| Pipe                |                    16.3 usec |
| Domain              |                    17.3 usec |
+---------------------+------------------------------+

> And this is really only for the "one userspace task waking up another
> userspace task" policies.  What real workload can actually use this?

A User IPI sender could be registered to send IPIs to multiple targets.
But there is no broadcast mechanism, so it can only target one receiver
every time it executes the SENDUIPI instruction.

Thanks,

Sohil

> thanks,
>
> greg k-h

Sohil Mehta Sept. 23, 2021, 11:24 p.m. UTC | #11

On 9/23/2021 7:09 AM, Greg KH wrote:
> Also, you forgot to list Binder in the above IPC type.

Thanks for pointing that out. In the LPC discussion today there was also 
a suggestion to compare this with Futex wake.

I'll include a comparison with Binder and Futex next time.

I used this IPC benchmark this time but it doesn't include Binder and Futex.

https://github.com/goldsborough/ipc-bench

Would you know if there is anything out there that is more comprehensive 
for benchmarking IPC?

Thanks,

Sohil

Sohil Mehta Sept. 24, 2021, 12:17 a.m. UTC | #12

On 9/23/2021 5:19 AM, Greg KH wrote:

> What real workload can actually use this?

I missed replying to this.

User mode runtimes are one of the usages that we think would benefit
from User IPIs.

Also as Jens mentioned in another thread, this could help kernel to user 
notifications in io_uring (using User Interrupts instead of eventfd for 
signaling).

Libevent is another abstraction that we are evaluating.

Thanks,

Sohil

Thomas Gleixner Sept. 24, 2021, 11:04 a.m. UTC | #13

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and
> wait for user interrupts.
>
> <The current implementation doesn't support waking up from other
> blocking system calls like sleep(), read(), epoll(), etc.
>
> uintr_wait() is a placeholder syscall while we decide on that
> behaviour.>
>
> When the application makes this syscall the notification vector is
> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel
> interrupt which is then used to wake up the process.
>
> Currently, the task wait list is global one. To make the implementation
> scalable there is a need to move to a distributed per-cpu wait list.

How are per cpu wait lists going to solve the problem?

> +
> +/*
> + * Handler for UINTR_KERNEL_VECTOR.
> + */
> +DEFINE_IDTENTRY_SYSVEC(sysvec_uintr_kernel_notification)
> +{
> +	/* TODO: Add entry-exit tracepoints */
> +	ack_APIC_irq();
> +	inc_irq_stat(uintr_kernel_notifications);
> +
> +	uintr_wake_up_process();

So this interrupt happens for any of those notifications. How are they
differentiated?

>
> +int uintr_receiver_wait(void)
> +{
> +	struct uintr_upid_ctx *upid_ctx;
> +	unsigned long flags;
> +
> +	if (!is_uintr_receiver(current))
> +		return -EOPNOTSUPP;
> +
> +	upid_ctx = current->thread.ui_recv->upid_ctx;
> +	upid_ctx->upid->nc.nv = UINTR_KERNEL_VECTOR;
> +	upid_ctx->waiting = true;
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_add(&upid_ctx->node, &uintr_wait_list);
> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
> +
> +	set_current_state(TASK_INTERRUPTIBLE);

Because we have not enough properly implemented wait primitives you need
to open code one which is blatantly wrong vs. a concurrent wake up?

> +	schedule();


How is that correct vs. a spurious wakeup? What takes care that the
entry is removed from the list?

Again. We have proper wait primitives.
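
For illustration, the shape of a correct wait loop can be sketched in
userspace with a mutex and a condition variable. This is only an analogy
for in-kernel primitives such as wait_event_interruptible(); the struct
and function names below are hypothetical and not part of the series.

```c
#include <pthread.h>

/* Hypothetical userspace stand-in for a wait queue plus its condition. */
struct demo_waitq {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	unsigned int pending;	/* events posted but not yet consumed */
};

#define DEMO_WAITQ_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Waker side: publish the event under the lock, then wake a waiter. */
void demo_post(struct demo_waitq *wq)
{
	pthread_mutex_lock(&wq->lock);
	wq->pending++;
	pthread_cond_signal(&wq->cond);
	pthread_mutex_unlock(&wq->lock);
}

/*
 * Waiter side: the condition is re-checked under the lock after every
 * wakeup, so a wakeup that raced ahead of the wait is not lost and a
 * spurious wakeup simply goes back to sleep. Returns the number of
 * events consumed.
 */
unsigned int demo_wait(struct demo_waitq *wq)
{
	unsigned int n;

	pthread_mutex_lock(&wq->lock);
	while (wq->pending == 0)
		pthread_cond_wait(&wq->cond, &wq->lock);
	n = wq->pending;
	wq->pending = 0;
	pthread_mutex_unlock(&wq->lock);
	return n;
}
```

The two properties asked about above both follow from checking the
condition in a loop under the same lock the waker takes: a post that
happens before the wait is still seen, and a wakeup without a pending
event does not return to the caller.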

> +	return -EINTR;
> +}
> +
> +/*
> + * Runs in interrupt context.
> + * Scan through all UPIDs to check if any interrupt is on going.
> + */
> +void uintr_wake_up_process(void)
> +{
> +	struct uintr_upid_ctx *upid_ctx, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
> +		if (test_bit(UPID_ON, (unsigned long*)&upid_ctx->upid->nc.status)) {
> +			set_bit(UPID_SN, (unsigned long *)&upid_ctx->upid->nc.status);
> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
> +			upid_ctx->waiting = false;
> +			wake_up_process(upid_ctx->task);
> +			list_del(&upid_ctx->node);

So any of these notification interrupts does a global mass wake up? How
does that make sense?

> +		}
> +	}
> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
> +}
> +
> +/* Called when task is unregistering/exiting */
> +static void uintr_remove_task_wait(struct task_struct *task)
> +{
> +	struct uintr_upid_ctx *upid_ctx, *tmp;
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uintr_wait_lock, flags);
> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {
> +		if (upid_ctx->task == task) {
> +			pr_debug("wait: Removing task %d from wait\n",
> +				 upid_ctx->task->pid);
> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;
> +			upid_ctx->waiting = false;
> +			list_del(&upid_ctx->node);
> +		}

What? You have to do a global list walk to find the entry which you
added yourself?
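
To illustrate the point about the list walk: with an intrusive doubly
linked list, an entry can be unlinked through its own embedded node. The
following is a minimal userspace sketch with hypothetical names,
modelled loosely on the kernel's list_head idea rather than copied from
it, and it leaves out the locking the real code would still need.

```c
struct list_node {
	struct list_node *prev, *next;
};

/* An empty list is a head that points to itself. */
void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* Insert node right after head. */
void list_add_node(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/*
 * Unlink node using only its own pointers; no walk over the list is
 * needed. The node is re-initialised so a second call is harmless.
 */
void list_del_node(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = node;
}

/* A waiter embeds its link, so the owner can always find and unlink it. */
struct waiter {
	int task_id;
	struct list_node node;
};
```

A task that added its own waiter can therefore remove it in constant
time by calling list_del_node() on the embedded node.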

Thanks,

        tglx

Sohil Mehta Sept. 24, 2021, 7:59 p.m. UTC | #14

On 9/23/2021 3:24 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
>> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
> This SOB chain is invalid. Ditto in several other patches.

Thank you Thomas for reviewing the patches! Really appreciate it.

I'll fix the SOB chain next time. I am planning to reply to the rest of
the comments over the next week.

Thanks,

Sohil

Thomas Gleixner Sept. 25, 2021, 12:08 p.m. UTC | #15

On Fri, Sep 24 2021 at 13:04, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> +int uintr_receiver_wait(void)
>> +{
>> +	struct uintr_upid_ctx *upid_ctx;
>> +	unsigned long flags;
>> +
>> +	if (!is_uintr_receiver(current))
>> +		return -EOPNOTSUPP;
>> +
>> +	upid_ctx = current->thread.ui_recv->upid_ctx;
>> +	upid_ctx->upid->nc.nv = UINTR_KERNEL_VECTOR;
>> +	upid_ctx->waiting = true;
>> +	spin_lock_irqsave(&uintr_wait_lock, flags);
>> +	list_add(&upid_ctx->node, &uintr_wait_list);
>> +	spin_unlock_irqrestore(&uintr_wait_lock, flags);
>> +
>> +	set_current_state(TASK_INTERRUPTIBLE);
>
> Because we have not enough properly implemented wait primitives you need
> to open code one which is blantantly wrong vs. a concurrent wake up?
>
>> +	schedule();
>
> How is that correct vs. a spurious wakeup? What takes care that the
> entry is removed from the list?
>
> Again. We have proper wait primitives.

Aside from that, this is completely broken vs. CPU hotplug.

CPUX
  switchto(tsk)
    tsk->upid.ndst = apicid(smp_processor_id());

  ret_to_user()
  ...
  sys_uintr_wait()
    ...
    schedule()

After that CPU X is unplugged, which means the task won't be woken up by
a user IPI which is issued after CPU X went down.

Thanks,

        tglx

Thomas Gleixner Sept. 25, 2021, 1:30 p.m. UTC | #16

On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
>> The kernel doesn't need to do anything in this case other than receiving
>> the interrupt and clearing the local APIC. The user interrupt is always
>> stored in the receiver's UPID before the IPI is generated. When the
>> receiver gets scheduled back the interrupt would be delivered based on
>> its UPID.
>
> So why on earth is that vector reaching the CPU at all?

Let's see how this works:

  task starts using UINTR.
    set UINTR_NOTIFICATION_VECTOR in MSR_IA32_UINTR_MISC

So from that point on the User-Interrupt Notification Identification
mechanism swallows the vector.

Where this stops working is not limited to context switch. The wreckage
comes from XSAVES:

 "After saving the user-interrupt state component, XSAVES clears
  UINV. (UINV is IA32_UINTR_MISC[39:32]; XSAVES does not modify the
  remainder of that MSR.)"

So the problem is _not_ context switch. The problem is XSAVES and that
can be issued even without a context switch.

The obvious question is: What is the value of clearing UINV?

Absolutely none. That notification vector cannot be used for anything
else, so why would the OS be interested to see it ever? This is about
user space interrupts, right?

UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
be considered a hardware bug.

Thanks,

         tglx

Thomas Gleixner Sept. 26, 2021, 12:39 p.m. UTC | #17

On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
> The obvious question is: What is the value of clearing UINV?
>
> Absolutely none. That notification vector cannot be used for anything
> else, so why would the OS be interested to see it ever? This is about
> user space interupts, right?
>
> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
> be considered a hardware bug.

After decoding the documentation (sigh) and staring at the implications of
keeping UINV armed, I can see the point vs. the UPID lifetime issue when
a task gets scheduled out and migrated to a different CPU.

Not the most pretty solution, but as there needs to be some invalidation
which needs to be undone on return to user space it probably does not
matter much. 

As the whole thing is tightly coupled to XSAVES/XRSTORS we need to
integrate it into that machinery and not pretend that it's something
half independent.

That means we have to handle the setting of the SN bit in UPID whenever
XSTATE is saved either during context switch, when the kernel uses the
FPU or in other places (signals, fpu_clone ...). They all end up in
save_fpregs_to_fpstate() so that might be the place to look at.

While talking about that: fpu_clone() has to invalidate the UINTR state
in the clone's xstate after the memcpy() or xsaves() operation.

Also the restore portion on the way back to user space has to be coupled
more tightly:

arch_exit_to_user_mode_prepare()
{
        ...
        if (unlikely(ti_work & _TIF_UPID))
        	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
        if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
        	switch_fpu_return();
}

upid_set_ndst(upid)
{
	apicid = __this_cpu_read(x86_cpu_to_apicid);

        if (x2apic_enabled())
            upid->ndst.x2apic = apicid;
        else
            upid->ndst.apic = apicid;
}

uintr_restore_upid(bool xrstors_pending)
{
        clear_thread_flag(TIF_UPID);

        // Update destination
        upid_set_ndst(upid);

        // Do we need something stronger here?
        barrier();

        clear_bit(SN, upid->status);

        // Any SENDUIPI after this point sends to this CPU

        // Any bit which was set in upid->pir after SN was set
        // and/or UINV was cleared by XSAVES up to the point
        // where SN was cleared above is not reflected in UIRR.

        // As this runs with interrupts disabled the current state
        // of upid->pir can be read and used for restore. A SENDUIPI
        // which sets a bit in upid->pir after that read will send
        // the notification vector which is going to be handled once
        // the task reenables interrupts on return to user space.
        // If the SENDUIPI set the bit before the read then the
        // notification vector handling will just observe the same
        // PIR state.

        // Needs to be a locked access as there might be a
        // concurrent SENDUIPI modifying it.
        pir = read_locked(upid->pir);

        if (xrstors_pending) {
                // Update the saved xstate for xrstors
                current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
                current->xstate.uintr.uirr = pir;
        } else {
                // Manually restore UIRR and UINV
                wrmsrl(IA32_UINTR_RR, pir);

                misc.val64 = 0;
                misc.uittsz = current->uintr->uittsz;
                misc.uinv = UINTR_NOTIFICATION_VECTOR;
                wrmsrl(IA32_UINTR_MISC, misc.val64);
        }
}

That's how I deciphered the documentation and I don't think this is far
from reality, but I might be wrong as usual.

Hmm?

Thanks,

        tglx

Thomas Gleixner Sept. 26, 2021, 2:41 p.m. UTC | #18

On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and
> wait for user interrupts.
>
> <The current implementation doesn't support waking up from other
> blocking system calls like sleep(), read(), epoll(), etc.
>
> uintr_wait() is a placeholder syscall while we decide on that
> behaviour.>

Which behaviour? You cannot integrate this into [clock_]nanosleep() by
any means or wakeup something which is blocked in read(somefd) via
SENDUIPI.

What you can do is implement read() and poll() support for the
uintrfd. Anything else is just not going to fly.

Adding support for read/poll is pretty much a straightforward variant
of a correctly implemented wait()/wakeup() mechanism.

While poll()/read() support might be useful and poll() also provides a
timeout, having an explicit (timed) wait mechanism might be interesting.

But that brings me to an interesting question. There are two cases:

 1) The task installed a user space interrupt handler. Now it
    wants to play nice and yield the CPU while waiting.

    So it needs to reinstall the UINV vector on return to user and
    update UIRR, but that'd be covered by the existing mechanism. Fine.

 2) Task has no user space interrupt handler installed and just wants
    to use that wait mechanism.

    What is consuming the pending bit(s)? 

    If that's not a valid use case, then the wait has to check for that
    and reject the syscall with EINVAL.

    If it is valid, then how are the pending bits consumed and relayed to
    user space?

The same questions arise when you think about implementing poll/read
support simply because the regular poll/read semantics are:

  poll waits for the event and read consumes the event

which would be similar to #2 above, but with an installed user space
interrupt handler the return from the poll system call would consume the
event immediately (assuming that UIF is set).
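
The regular "poll waits, read consumes" semantics referred to above can
be seen with an existing file descriptor type such as eventfd. The
sketch below uses only eventfd(2), poll(2) and read(2); it is not the
uintrfd API from this series, and the helper names are hypothetical.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Wait for the fd to become readable. Returns 1 if an event is pending,
 * 0 on timeout, -1 on error. The event is not consumed by this call.
 */
int event_wait(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret = poll(&pfd, 1, timeout_ms);

	if (ret < 0)
		return -1;
	return (ret > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Consume the pending event count with read(). Returns the count, or 0
 * if nothing could be read.
 */
uint64_t event_consume(int fd)
{
	uint64_t count = 0;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

With an eventfd, repeated polls keep reporting the fd as readable until
a read() resets the counter, which is the separation between waiting
and consuming that case #2 would need some equivalent for.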

Thanks,

        tglx

Sohil Mehta Sept. 27, 2021, 7:07 p.m. UTC | #19

On 9/26/2021 5:39 AM, Thomas Gleixner wrote:
> On Sat, Sep 25 2021 at 15:30, Thomas Gleixner wrote:
>> On Fri, Sep 24 2021 at 01:07, Thomas Gleixner wrote:
>> The obvious question is: What is the value of clearing UINV?
>>
>> Absolutely none. That notification vector cannot be used for anything
>> else, so why would the OS be interested to see it ever? This is about
>> user space interupts, right?
>>
>> UINV should be set _ONCE_ when CR4.UINTR is enabled and not be touched
>> by XSAVES/XRSTORS at all. Any delivery of this vector to the OS should
>> be considered a hardware bug.
> After decoding the documentation (sigh) and staring at the implications of
> keeping UINV armed, I can see the point vs. the UPID lifetime issue when
> a task gets scheduled out and migrated to a different CPU.

I think you got it right. Here is my understanding of this.

The User-interrupt notification processing moves all the pending 
interrupts from UPID.PIR to the UIRR.

As you mentioned below, XSTATE is saved for several reasons, and each
save writes the UIRR into memory. UIRR should no longer be updated after
it has been saved.

XSAVES clears UINV to stop detecting additional interrupts in the UIRR
after it has been saved.


> Not the most pretty solution, but as there needs to be some invalidation
> which needs to be undone on return to user space it probably does not
> matter much.
>
> As the whole thing is tightly coupled to XSAVES/RSTORS we need to
> integrate it into that machinery and not pretend that it's something
> half independent.

I agree. Thank you for pointing this out.

> That means we have to handle the setting of the SN bit in UPID whenever
> XSTATE is saved either during context switch, when the kernel uses the
> FPU or in other places (signals, fpu_clone ...). They all end up in
> save_fpregs_to_fpstate() so that might be the place to look at.

Yes. The current code doesn't do this. The SN bit should be set whenever
UINTR XSTATE is saved.

> While talking about that: fpu_clone() has to invalidate the UINTR state
> in the clone's xstate after the memcpy() or xsaves() operation.
>
> Also the restore portion on the way back to user space has to be coupled
> more tightly:
>
> arch_exit_to_user_mode_prepare()
> {
>          ...
>          if (unlikely(ti_work & _TIF_UPID))
>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);
>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))
>          	switch_fpu_return();
> }

I am assuming _TIF_UPID would be set every time SN is set and XSTATE is
saved.

> upid_set_ndst(upid)

> {

> 	apicid = __this_cpu_read(x86_cpu_to_apicid);

>

>          if (x2apic_enabled())

>              upid->ndst.x2apic = apicid;

>          else

>              upid->ndst.apic = apicid;

> }

>

> uintr_restore_upid(bool xrstors_pending)

> {

>          clear_thread_flag(TIF_UPID);

>          

> 	// Update destination

>          upid_set_ndst(upid);

>

>          // Do we need something stronger here?

>          barrier();

>

>          clear_bit(SN, upid->status);

>

>          // Any SENDUIPI after this point sends to this CPU

>             

>          // Any bit which was set in upid->pir after SN was set

>          // and/or UINV was cleared by XSAVES up to the point

>          // where SN was cleared above is not reflected in UIRR.

>

> 	// As this runs with interrupts disabled the current state

>          // of upid->pir can be read and used for restore. A SENDUIPI

>          // which sets a bit in upid->pir after that read will send

>          // the notification vector which is going to be handled once

>          // the task reenables interrupts on return to user space.

>          // If the SENDUIPI set the bit before the read then the

>          // notification vector handling will just observe the same

>          // PIR state.

>

>          // Needs to be a locked access as there might be a

>          // concurrent SENDUIPI modifying it.

>          pir = read_locked(upid->pir);

>

>          if (xrstors_pending) {

>          	// Update the saved xstate for xrstors

>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;


XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 
need this again. Is it because it could have been overwritten by calling 
XSAVES twice?


>                  current->xstate.uintr.uirr = pir;


I believe PIR should be ORed. There could be some bits already set in 
the UIRR.

Also, shouldn't UPID->PIR be cleared? If not, we would detect these 
interrupts all over again during the next ring transition.

>          } else {

>                  // Manually restore UIRR and UINV

>                  wrmsrl(IA32_UINTR_RR, pir);

I believe this needs to be a read-modify-write as well.

> 	        misc.val64 = 0;

>                  misc.uittsz = current->uintr->uittsz;

>                  misc.uinv = UINTR_NOTIFICATION_VECTOR;

>                  wrmsrl(IA32_UINTR_MISC, misc.val64);


Thanks! This helps reduce the additional MSR read.

>          }

> }

>

> That's how I deciphered the documentation and I don't think this is far

> from reality, but I might be wrong as usual.

>

> Hmm?


Thank you for the simplification. This is very helpful.

Sohil

Sohil Mehta Sept. 27, 2021, 7:26 p.m. UTC | #20

On 9/23/2021 4:07 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

>> A user interrupt notification vector is used on the receiver's cpu to

>> identify an interrupt as a user interrupt (and not a kernel interrupt).

>> Hardware uses the same notification vector to generate an IPI from a

>> sender's cpu core when the SENDUIPI instruction is executed.

>>

>> Typically, the kernel shouldn't receive an interrupt with this vector.

>> However, it is possible that the kernel might receive this vector.

>>

>> Scenario that can cause the spurious interrupt:

>>

>> Step	cpu 0 (receiver task)		cpu 1 (sender task)

>> ----	---------------------		-------------------

>> 1	task is running

>> 2					executes SENDUIPI

>> 3					IPI sent

>> 4	context switched out

>> 5	IPI delivered

>> 	(kernel interrupt detected)

>>

>> A kernel interrupt can be detected, if a receiver task gets scheduled

>> out after the SENDUIPI-based IPI was sent but before the IPI was

>> delivered.

> What happens if the SENDUIPI is issued when the target task is not on

> the CPU? How is that any different from the above?



This didn't get covered in the other thread, so I thought I would clarify 
this a bit more.

A notification IPI is sent from the CPU that executes SENDUIPI if the 
target task is running (SN is 0).

If the target task is not running, the SN bit in the UPID is set, which 
prevents any notification interrupts from being generated.

However, it is possible that SN is 0 when SENDUIPI is executed, which 
generates the notification IPI, but by the time the IPI arrives on the 
receiver CPU, SN has been set, the task state has been saved and UINV 
has been cleared.

A kernel interrupt is detected in this case. I have a sample that demos 
this. I'll fix the current code and then send out the results.
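
The race described above can be modelled in a few lines of C. This is only a single-threaded sketch, not kernel code: `struct upid_model`, `model_senduipi()` and `model_ipi_is_spurious()` are hypothetical names, and the rule that an IPI is requested only when both SN and ON are clear is taken from a reading of the ISE documentation rather than from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a UPID; field names are illustrative only. */
struct upid_model {
	bool on;      /* a notification is outstanding */
	bool sn;      /* notifications are suppressed */
	uint64_t pir; /* posted interrupt requests, one bit per user vector */
};

/*
 * Sender side: the vector is always posted into PIR. A notification IPI
 * is requested only if SN and ON are both clear (assumption, see above).
 * Returns true if an IPI would be sent.
 */
static bool model_senduipi(struct upid_model *upid, unsigned int uv)
{
	assert(uv < 64);
	upid->pir |= UINT64_C(1) << uv;
	if (upid->sn || upid->on)
		return false;
	upid->on = true;
	return true;
}

/*
 * Receiver side: if the task's UINV is no longer loaded when the IPI
 * arrives (the task was switched out in between), the vector is seen
 * as a kernel interrupt.
 */
static bool model_ipi_is_spurious(bool uinv_loaded)
{
	return !uinv_loaded;
}
```

In this model the posted bit stays in PIR even when the IPI turns out to be spurious, which matches the point that the interrupt is recorded in the UPID before the IPI is generated.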


>> The kernel doesn't need to do anything in this case other than receiving

>> the interrupt and clearing the local APIC. The user interrupt is always

>> stored in the receiver's UPID before the IPI is generated. When the

>> receiver gets scheduled back the interrupt would be delivered based on

>> its UPID.

> So why on earth is that vector reaching the CPU at all?


You covered this in the other thread.

>> +#ifdef CONFIG_X86_USER_INTERRUPTS

>> +	seq_printf(p, "%*s: ", prec, "UIS");

> No point in printing that when user interrupts are not available/enabled

> on the system.

>

Will fix this.

Thanks,

Sohil

Sohil Mehta Sept. 27, 2021, 8:42 p.m. UTC | #21

On 9/23/2021 3:24 PM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

>> SENDUIPI is a special ring-3 instruction that makes a supervisor mode

>> memory access to the UPID and UITT memory. Currently, KPTI needs to be

>> off for User IPIs to work.  Processors that support user interrupts are

>> not affected by Meltdown so the auto mode of KPTI will default to off.

>>

>> Users who want to force enable KPTI will need to wait for a later

>> version of this patch series that is compatible with KPTI. We need to

>> allocate the UPID and UITT structures from a special memory region that

>> has supervisor access but it is mapped into userspace. The plan is to

>> implement a mechanism similar to LDT.

> Seriously?


Are you questioning why we should add KPTI support if the hardware is not 
affected by Meltdown?

or

Why use an LDT like mechanism to do this?

I have listed this as one of the opens in the cover letter as well. I am 
not sure if users who force enable PTI would really care about User 
Interrupts.

Any input here would be helpful.

>

>> +	if (!cpu_feature_enabled(X86_FEATURE_UINTR))

>> +		goto disable_uintr;

>> +

>> +	/* checks the current processor's cpuid bits: */

>> +	if (!cpu_has(c, X86_FEATURE_UINTR))

>> +		goto disable_uintr;

>> +

>> +	/*

>> +	 * User Interrupts currently doesn't support PTI. For processors that

>> +	 * support User interrupts PTI in auto mode will default to off.  Need

>> +	 * this check only for users who have force enabled PTI.

>> +	 */

>> +	if (boot_cpu_has(X86_FEATURE_PTI)) {

>> +		pr_info_once("x86: User Interrupts (UINTR) not enabled. Please disable PTI using 'nopti' kernel parameter\n");

> That message does not make sense. The admin has explicitly added 'pti'

> to the kernel command line on a CPU which is not affected. So why would

> he now have to add 'nopti' ?


Yup. I'll fix this and other issues in this patch.

I thought the user should know why UINTR has been disabled. In 
hindsight, this would have been better covered in the sample Readme or 
something similar.


Thanks,

Sohil

Thomas Gleixner Sept. 28, 2021, 8:11 a.m. UTC | #22

Sohil,

On Mon, Sep 27 2021 at 12:07, Sohil Mehta wrote:
> On 9/26/2021 5:39 AM, Thomas Gleixner wrote:

>

> The User-interrupt notification processing moves all the pending 

> interrupts from UPID.PIR to the UIRR.


Indeed that makes sense. Should have thought about that myself.

>> Also the restore portion on the way back to user space has to be coupled

>> more tightly:

>>

>> arch_exit_to_user_mode_prepare()

>> {

>>          ...

>>          if (unlikely(ti_work & _TIF_UPID))

>>          	uintr_restore_upid(ti_work & _TIF_NEED_FPU_LOAD);

>>          if (unlikely(ti_work & _TIF_NEED_FPU_LOAD))

>>          	switch_fpu_return();

>> }

>

> I am assuming _TIF_UPID would be set everytime SN is set and XSTATE is 

> saved.


Yes.

>> upid_set_ndst(upid)

>> {

>> 	apicid = __this_cpu_read(x86_cpu_to_apicid);

>>

>>          if (x2apic_enabled())

>>              upid->ndst.x2apic = apicid;

>>          else

>>              upid->ndst.apic = apicid;

>> }

>>

>> uintr_restore_upid(bool xrstors_pending)

>> {

>>          clear_thread_flag(TIF_UPID);

>>          

>> 	// Update destination

>>          upid_set_ndst(upid);

>>

>>          // Do we need something stronger here?

>>          barrier();

>>

>>          clear_bit(SN, upid->status);

>>

>>          // Any SENDUIPI after this point sends to this CPU

>>             

>>          // Any bit which was set in upid->pir after SN was set

>>          // and/or UINV was cleared by XSAVES up to the point

>>          // where SN was cleared above is not reflected in UIRR.

>>

>> 	// As this runs with interrupts disabled the current state

>>          // of upid->pir can be read and used for restore. A SENDUIPI

>>          // which sets a bit in upid->pir after that read will send

>>          // the notification vector which is going to be handled once

>>          // the task reenables interrupts on return to user space.

>>          // If the SENDUIPI set the bit before the read then the

>>          // notification vector handling will just observe the same

>>          // PIR state.

>>

>>          // Needs to be a locked access as there might be a

>>          // concurrent SENDUIPI modifying it.

>>          pir = read_locked(upid->pir);

>>

>>          if (xrstors_pending) {

>>          	// Update the saved xstate for xrstors

>>             	current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;

>

> XSAVES saves the UINV value into the XSTATE buffer. I am not sure if we 

> need this again. Is it because it could have been overwritten by calling 

> XSAVES twice?


Yes, that can happen AFAICT. I haven't done a deep analysis, but this
needs to be looked at.

>>                  current->xstate.uintr.uirr = pir;

>

> I believe PIR should be ORed. There could be some bits already set in 

> the UIRR.

>

> Also, shouldn't UPID->PIR be cleared? If not, we would detect these 

> interrupts all over again during the next ring transition.


Right. So that PIR read above needs to be a locked cmpxchg().

>>          } else {

>>                  // Manually restore UIRR and UINV

>>                  wrmsrl(IA32_UINTR_RR, pir);

> I believe read-modify-write here as well.


Sigh, yes.
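
As an illustration of the two points agreed on above (OR the posted bits into UIRR, and clear PIR so they are not seen twice), here is a minimal user-space sketch using C11 atomics. It uses an atomic exchange as one possible way to fetch and clear PIR in a single step; the kernel would use its own locked primitive such as the cmpxchg() mentioned above. The function name `merge_pir_into_uirr()` is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Atomically take every posted bit out of PIR, leaving it zero, and
 * merge those bits into the saved UIRR image. A concurrent sender that
 * sets a bit after the exchange leaves that bit in PIR for later.
 */
static uint64_t merge_pir_into_uirr(_Atomic uint64_t *pir, uint64_t saved_uirr)
{
	assert(pir != NULL);
	uint64_t posted = atomic_exchange(pir, 0);

	return saved_uirr | posted;
}
```

The same merged value would apply to both branches of the pseudocode: written into the saved xstate when XRSTORS is pending, or combined with the current MSR contents otherwise.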

Thanks,

        tglx

Sohil Mehta Sept. 28, 2021, 11:08 p.m. UTC | #23

On 9/24/2021 4:04 AM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

>> Currently, the task wait list is a global one. To make the implementation

>> scalable there is a need to move to a distributed per-cpu wait list.

> How are per cpu wait lists going to solve the problem?

Currently, the global wait list can be concurrently accessed by multiple 
cpus. If we have per-cpu wait lists then the UPID scanning only needs to 
happen on the local cpu's wait list.

After an application calls uintr_wait(), the notification interrupt will 
be delivered only to the cpu where the task blocked. In this case, we 
can reduce the UPID search list and probably get rid of the global 
spinlock as well.

Though, I am not sure how much impact this would have vs. the problem of 
scanning the entire wait list.

>> +

>> +/*

>> + * Handler for UINTR_KERNEL_VECTOR.

>> + */

>> +DEFINE_IDTENTRY_SYSVEC(sysvec_uintr_kernel_notification)

>> +{

>> +	/* TODO: Add entry-exit tracepoints */

>> +	ack_APIC_irq();

>> +	inc_irq_stat(uintr_kernel_notifications);

>> +

>> +	uintr_wake_up_process();

> So this interrupt happens for any of those notifications. How are they

> differentiated?

Unfortunately, there is no help from the hardware here to identify the 
intended target.

When a task blocks we:
* switch the UINV to a kernel NV.
* leave SN as 0
* leave UPID.NDST to the current cpu
* add the task to a wait list

When the notification interrupt arrives:
* Scan the entire wait list to check if the ON bit is set for any UPID 
(very inefficient)
* Set SN to 1 for that task.
* Change the UINV to user NV.
* Remove the task from the list and make it runnable.

We could end up detecting multiple tasks that have the ON bit set. The 
notification interrupt for any task that has ON set is expected to 
arrive soon anyway. So no harm done here.

The main issue here is we would end up scanning the entire list for 
every interrupt. Not sure if there is any way we could optimize this.
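
The steps listed above can be sketched as a small user-space model. This is not the patch code: `struct waiter`, `scan_waiters()` and the vector numbers are placeholders, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { USER_NV = 1, KERNEL_NV = 2 }; /* placeholder vector numbers */

/* Model of one blocked task's UPID state plus wait-list membership. */
struct waiter {
	bool on;      /* UPID.ON: a notification is outstanding */
	bool sn;      /* UPID.SN: suppress further notifications */
	int nv;       /* UPID.NV: vector used for notifications */
	bool waiting; /* still on the wait list */
};

/*
 * Model of the kernel notification handler: walk every waiter and wake
 * the ones whose ON bit is set. Returns the number of tasks woken.
 * The cost is linear in the number of waiters for every interrupt.
 */
static size_t scan_waiters(struct waiter *w, size_t n)
{
	size_t woken = 0;

	assert(w != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!w[i].waiting || !w[i].on)
			continue;
		w[i].sn = true;       /* task is about to become runnable */
		w[i].nv = USER_NV;    /* switch back to the user vector */
		w[i].waiting = false; /* remove from the wait list */
		woken++;
	}
	return woken;
}
```

The model makes the cost visible: every entry is visited on each interrupt, regardless of how many of them actually have ON set.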

> Again. We have proper wait primitives.

I'll use proper wait primitives next time.
>> +	return -EINTR;

>> +}

>> +

>> +/*

>> + * Runs in interrupt context.

>> + * Scan through all UPIDs to check if any interrupt is on going.

>> + */

>> +void uintr_wake_up_process(void)

>> +{

>> +	struct uintr_upid_ctx *upid_ctx, *tmp;

>> +	unsigned long flags;

>> +

>> +	spin_lock_irqsave(&uintr_wait_lock, flags);

>> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {

>> +		if (test_bit(UPID_ON, (unsigned long*)&upid_ctx->upid->nc.status)) {

>> +			set_bit(UPID_SN, (unsigned long *)&upid_ctx->upid->nc.status);

>> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;

>> +			upid_ctx->waiting = false;

>> +			wake_up_process(upid_ctx->task);

>> +			list_del(&upid_ctx->node);

> So any of these notification interrupts does a global mass wake up? How

> does that make sense?

The wake up happens only for the tasks that have a pending interrupt. 
They are going to be woken up soon anyways.

>> +/* Called when task is unregistering/exiting */

>> +static void uintr_remove_task_wait(struct task_struct *task)

>> +{

>> +	struct uintr_upid_ctx *upid_ctx, *tmp;

>> +	unsigned long flags;

>> +

>> +	spin_lock_irqsave(&uintr_wait_lock, flags);

>> +	list_for_each_entry_safe(upid_ctx, tmp, &uintr_wait_list, node) {

>> +		if (upid_ctx->task == task) {

>> +			pr_debug("wait: Removing task %d from wait\n",

>> +				 upid_ctx->task->pid);

>> +			upid_ctx->upid->nc.nv = UINTR_NOTIFICATION_VECTOR;

>> +			upid_ctx->waiting = false;

>> +			list_del(&upid_ctx->node);

>> +		}

> What? You have to do a global list walk to find the entry which you

> added yourself?

Duh! I could have gotten the upid_ctx from the task_struct itself. Will 
fix this.
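
A rough sketch of what that fix could look like, as a user-space model: the task keeps a pointer to its own context, so removal is a constant-time unlink instead of a list walk. All names here (`struct task_model`, `remove_task_wait()`, the list helpers) are hypothetical stand-ins, and the spinlock that the real code would still need is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct node { struct node *prev, *next; };

struct upid_ctx_model {
	struct node node; /* linkage on the wait list */
	bool waiting;     /* true while queued on the wait list */
};

struct task_model {
	struct upid_ctx_model *upid_ctx; /* the task's own context */
};

static void list_init_model(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail_model(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_model(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* O(1) removal: go straight from the task to its own list entry. */
static void remove_task_wait(struct task_model *t)
{
	struct upid_ctx_model *ctx = t->upid_ctx;

	if (!ctx || !ctx->waiting)
		return;
	ctx->waiting = false;
	list_del_model(&ctx->node);
}
```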

Thanks,

Sohil

Sohil Mehta Sept. 28, 2021, 11:13 p.m. UTC | #24

On 9/25/2021 5:08 AM, Thomas Gleixner wrote:
> Aside from that this is completely broken vs. CPU hotplug.

>

Thank you for pointing this out. I hadn't even considered CPU hotplug.

Thanks,
Sohil

Sohil Mehta Sept. 29, 2021, 1:09 a.m. UTC | #25

On 9/26/2021 7:41 AM, Thomas Gleixner wrote:
> On Mon, Sep 13 2021 at 13:01, Sohil Mehta wrote:

>> Add a new system call to allow applications to block in the kernel and

>> wait for user interrupts.

>>

>> <The current implementation doesn't support waking up from other

>> blocking system calls like sleep(), read(), epoll(), etc.

>>

>> uintr_wait() is a placeholder syscall while we decide on that

>> behaviour.>

> Which behaviour? You cannot integrate this into [clock_]nanosleep() by

> any means or wakeup something which is blocked in read(somefd) via

> SENDUIPI.

That is the (wishful) desire.

The idea is to have a behavior similar to signals for all or a subset of 
system calls. i.e. return an EINTR by interrupting the blocked syscall 
and possibly have a SA_RESTART type of mechanism.

Can we use the existing signal infrastructure to generate a temporary 
in-kernel signal upon detection of a pending user interrupt? The 
temporary signal doesn't need to be delivered to the application but it 
would just be a mechanism to interrupt the blocked syscall.

I don't know anything about the signaling subsystem nor have I tried 
prototyping this. So, all this might be completely baseless.

> What you can do is implement read() and poll() support for the

> uintrfd. Anything else is just not going to fly.

>

> Adding support for read/poll is pretty much a straightforward variant

> of a correctly implemented wait()/wakeup() mechanism.

I tried doing this but I ran into a couple of issues.

1) uintrfd is mapped to a single vector (out of 64). But there is no 
easy hardware mechanism to wait for specific vectors. Waiting for one 
vector might mean waiting for all.

2) The scope of uintrfd is process wide. Also, it would be shared with 
senders. But the wait/wake mechanism is specific to the task that 
created the fd and has a UPID allocated.
As you mentioned below, relaying the pending interrupt information of 
another task would be very tricky.

> While poll()/read() support might be useful and poll() also provides a

> timeout, having an explicit (timed) wait mechanism might be interesting.

I prototyped uintr_wait() with the same intention to have an explicit 
timed yield mechanism. There is very little ambiguity about who is 
waiting for what and how we would deliver the interrupts.

> But that brings me to an interesting question. There are two cases:

>

>   1) The task installed a user space interrupt handler. Now it

>      wants to play nice and yield the CPU while waiting.

>

>      So it needs to reinstall the UINV vector on return to user and

>      update UIRR, but that'd be covered by the existing mechanism. Fine.

>

>   2) Task has no user space interrupt handler installed and just wants

>      to use that wait mechanism.

>

>      What is consuming the pending bit(s)?

>

>      If that's not a valid use case, then the wait has to check for that

>      and reject the syscall with EINVAL.

Yeah. I feel this is not a valid use case. But I am no application 
developer. I will try to seek more opinions here.

>      If it is valid, then how are the pending bits consumed and relayed to

>      user space?

This is very tricky. Because a task that owns the UPID might be 
consuming interrupts while the kernel tries to relay the pending 
interrupt information to another task.

> The same questions arise when you think about implementing poll/read

> support simply because the regular poll/read semantics are:

>

>    poll waits for the event and read consumes the event

> which would be similar to #2 above, but with an installed user space

> interrupt handler the return from the poll system call would consume the

> event immediately (assumed that UIF is set).

>

Yup. There is no read data associated with uintrfd. This might be 
confusing for the application.

Overall, I feel signal handler semantics fit better with User interrupts 
handlers. But as you mentioned there might be no easy way to achieve that.

Thanks again for providing your input on this.

Sohil

Andy Lutomirski Sept. 29, 2021, 3:30 a.m. UTC | #26

On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> Add a new system call to allow applications to block in the kernel and

> wait for user interrupts.

>

...

>

> When the application makes this syscall the notification vector is

> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel

> interrupt which is then used to wake up the process.

Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.

Am I missing something?

Andy Lutomirski Sept. 29, 2021, 4:31 a.m. UTC | #27

On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:
> User Interrupts Introduction

> ============================

>

> User Interrupts (Uintr) is a hardware technology that enables delivering

> interrupts directly to user space.

>

> Today, virtually all communication across privilege boundaries happens by going

> through the kernel. These include signals, pipes, remote procedure calls and

> hardware interrupt based notifications. User interrupts provide the foundation

> for more efficient (low latency and low CPU utilization) versions of these

> common operations by avoiding transitions through the kernel.

>

...

I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

(I can imagine some benefit to a hypothetical improved SENDUIPI with identical user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)

--Andy

Sohil Mehta Sept. 29, 2021, 4:56 a.m. UTC | #28

On 9/28/2021 8:30 PM, Andy Lutomirski wrote:
> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:

>> Add a new system call to allow applications to block in the kernel and

>> wait for user interrupts.

>>

> ...

>

>> When the application makes this syscall the notification vector is

>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel

>> interrupt which is then used to wake up the process.

> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.

>

> Am I missing something?

The current kernel implementation reserves 2 notification vectors (NV) 
for the 2 states of a thread (running vs blocked).

NV-1 – used only for tasks that are running. (results in a user 
interrupt or a spurious kernel interrupt)

NV-2 – used only for tasks that are blocked in the kernel. (always 
results in a kernel interrupt)

The UPID.NV bits are switched between NV-1 and NV-2 based on the state 
of the task.

However, NV-1 is also programmed in the running task's MISC_MSR UINV 
bits. This is what tells the ucode that the notification vector received 
is for the user instead of the kernel.

NV-2 is never programmed in the MISC_MSR of a task. When NV-2 arrives on 
any cpu there is never a possibility of it being detected as a User 
Interrupt. It will always be delivered to the kernel.

Does this help clarify the above?
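
To make the distinction concrete, here is a tiny model of how an arriving vector would be classified under the scheme described above. It is only a sketch of the described behaviour, not of the actual ucode: `classify_vector()` and its parameters are hypothetical, and the vector values used with it are up to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum delivery { DELIVER_USER, DELIVER_KERNEL };

/*
 * An arriving vector is treated as a user interrupt notification only
 * if it matches the UINV value currently loaded for the running task.
 * Since NV-2 is never loaded as UINV, it always reaches the kernel.
 * 'uinv_valid' is false when no UINV is loaded (e.g. after XSAVES).
 */
static enum delivery classify_vector(uint8_t vector, bool uinv_valid,
				     uint8_t loaded_uinv)
{
	if (uinv_valid && vector == loaded_uinv)
		return DELIVER_USER;
	return DELIVER_KERNEL;
}
```

With NV-1 loaded as UINV, an NV-1 arrival is handled as a user interrupt and an NV-2 arrival goes to the kernel; with no UINV loaded, both go to the kernel, which is the spurious case for NV-1.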

I just realized, we need to be careful when the notification vectors are 
switched in the UPID. Any pending vectors detected after the switch 
should abort the blocking call. The current code is wrong in a lot of 
places where it touches the UPID.

Thanks,
Sohil

Stefan Hajnoczi Sept. 30, 2021, 4:26 p.m. UTC | #29

On Mon, Sep 13, 2021 at 01:01:19PM -0700, Sohil Mehta wrote:
> User Interrupts Introduction

> ============================

> 

> User Interrupts (Uintr) is a hardware technology that enables delivering

> interrupts directly to user space.

> 

> Today, virtually all communication across privilege boundaries happens by going

> through the kernel. These include signals, pipes, remote procedure calls and

> hardware interrupt based notifications. User interrupts provide the foundation

> for more efficient (low latency and low CPU utilization) versions of these

> common operations by avoiding transitions through the kernel.

> 

> In the User Interrupts hardware architecture, a receiver is always expected to

> be a user space task. However, a user interrupt can be sent by another user

> space task, kernel or an external source (like a device).

> 

> In addition to the general infrastructure to receive user interrupts, this

> series introduces a single source: interrupts from another user task.  These

> are referred to as User IPIs.

> 

> The first implementation of User IPIs will be in the Intel processor code-named

> Sapphire Rapids. Refer to Chapter 11 of the Intel Architecture instruction set

> extensions for details of the hardware architecture [1].

> 

> Series-reviewed-by: Tony Luck <tony.luck@intel.com>

> 

> Main goals of this RFC

> ======================

> - Introduce this upcoming technology to the community.

> This cover letter includes a hardware architecture summary along with the

> software architecture and kernel design choices. This post is a bit long as a

> result. Hopefully, it helps answer more questions than it creates :) I am also

> planning to talk about User Interrupts next week at the LPC Kernel summit.

> 

> - Discuss potential use cases.

> We are starting to look at actual usages and libraries (like libevent[2] and

> liburing[3]) that can take advantage of this technology. Unfortunately, we

> don't have much to share on this right now. We need some help from the

> community to identify usages that can benefit from this. We would like to make

> sure the proposed APIs work for the eventual consumers.

> 

> - Get early feedback on the software architecture.

> We are hoping to get some feedback on the direction of overall software

> architecture - starting with User IPI, extending it for kernel-to-user

> interrupt notifications and external interrupts in the future. 

> 

> - Discuss some of the main architecture opens.

> There is a lot of work that still needs to happen to enable this technology. We

> are looking for some input on future patches that would be of interest. Here

> are some of the big opens that we are looking to resolve.

> * Should Uintr interrupt all blocking system calls like sleep(), read(),

>   poll(), etc? If so, should we implement an SA_RESTART type of mechanism

>   similar to signals? - Refer Blocking for interrupts section below.

> 

> * Should the User Interrupt Target table (UITT) be shared between threads of a

>   multi-threaded application or maybe even across processes? - Refer Sharing

>   the UITT section below.

> 

> Why care about this? - Micro benchmark performance

> ==================================================

> There is a ~9x or higher performance improvement using User IPI over other IPC

> mechanisms for event signaling.

> 

> Below is the average normalized latency for 1M ping-pong IPC notifications

> with message size=1.

> 

> +------------+-------------------------+

> | IPC type   |   Relative Latency      |

> |            |(normalized to User IPI) |

> +------------+-------------------------+

> | User IPI   |                     1.0 |

> | Signal     |                    14.8 |

> | Eventfd    |                     9.7 |


Is this the bi-directional eventfd benchmark?
https://github.com/intel/uintr-ipc-bench/blob/linux-rfc-v1/source/eventfd/eventfd-bi.c

Two things stand out:

1. The server and client threads are racing on the same eventfd.
   Eventfds aren't bi-directional! The eventfd_wait() function has code
   to write the value back, which is a waste of CPU cycles and hinders
   progress. I've never seen eventfd used this way in real applications.
   Can you use two separate eventfds?

2. The fd is in blocking mode and the task may be descheduled, so we're
   measuring eventfd read/write latency plus scheduler/context-switch
   latency. A fairer comparison against user interrupts would be to busy
   wait on a non-blocking fd so the scheduler/context-switch latency is
   mostly avoided. After all, the uintrfd-bi.c benchmark does this in
   uintrfd_wait():

     // Keep spinning until the interrupt is received
     while (!uintr_received[token]);
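
A minimal sketch of the busy-wait variant suggested in point 2, assuming one eventfd per direction as in point 1. The helper names `eventfd_spin_wait()` and `eventfd_notify()` are made up for illustration; only `eventfd()`, `read()` and `write()` are real interfaces, and the fds are expected to be created with `EFD_NONBLOCK`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Spin on a non-blocking eventfd until its counter is non-zero.
 * Returns the counter value that was read (the read resets it to 0),
 * or 0 on an unexpected error.
 */
static uint64_t eventfd_spin_wait(int fd)
{
	uint64_t val;

	for (;;) {
		ssize_t n = read(fd, &val, sizeof(val));

		if (n == (ssize_t)sizeof(val))
			return val;
		/* EAGAIN means "nothing posted yet": keep spinning. */
		if (n < 0 && errno != EAGAIN && errno != EINTR)
			return 0;
	}
}

/* Post one event to the peer's eventfd. Returns 0 on success, -1 on error. */
static int eventfd_notify(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}
```

In a ping-pong benchmark the client would call eventfd_notify() on a "ping" fd and eventfd_spin_wait() on a separate "pong" fd, with the server doing the reverse, so neither side ever reads back its own write.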

Stefan Hajnoczi Sept. 30, 2021, 4:30 p.m. UTC | #30

On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:
> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:

> > User Interrupts Introduction

> > ============================

> >

> > User Interrupts (Uintr) is a hardware technology that enables delivering

> > interrupts directly to user space.

> >

> > Today, virtually all communication across privilege boundaries happens by going

> > through the kernel. These include signals, pipes, remote procedure calls and

> > hardware interrupt based notifications. User interrupts provide the foundation

> > for more efficient (low latency and low CPU utilization) versions of these

> > common operations by avoiding transitions through the kernel.

> >

> 

> ...

> 

> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

> 

> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.


I was wondering the same thing. One thing came to mind:

An application that wants to be *interrupted* from what it's doing
rather than waiting until the next polling point. For example,
applications that are CPU-intensive and have green threads. I can't name
a real application like this though :P.

Stefan

Sohil Mehta Sept. 30, 2021, 5:24 p.m. UTC | #31

On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:

>>

>> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

>>

>> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

> I was wondering the same thing. One thing came to mind:

>

> An application that wants to be *interrupted* from what it's doing

> rather than waiting until the next polling point. For example,

> applications that are CPU-intensive and have green threads. I can't name

> a real application like this though :P.


Thank you Stefan and Andy for giving this some thought.

We are consolidating the information internally on where and how exactly 
we expect to see benefits with real workloads for the various sources of 
User Interrupts. It will take a few days to get back on this one.


> (I can imagine some benefit to a hypothetical improved SENDUIPI with identical user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)


Andy, can you please provide some more context/details on this? Is this 
regarding the blocking syscalls discussion (in patch 11) or something else?


Thanks,
Sohil

Andy Lutomirski Sept. 30, 2021, 5:26 p.m. UTC | #32

On Thu, Sep 30, 2021, at 10:24 AM, Sohil Mehta wrote:
> On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:

>> On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:

>>>

>>> I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

>>>

>>> Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

>> I was wondering the same thing. One thing came to mind:

>>

>> An application that wants to be *interrupted* from what it's doing

>> rather than waiting until the next polling point. For example,

>> applications that are CPU-intensive and have green threads. I can't name

>> a real application like this though :P.

>

> Thank you Stefan and Andy for giving this some thought.

>

> We are consolidating the information internally on where and how exactly 

> we expect to see benefits with real workloads for the various sources of 

> User Interrupts. It will take a few days to get back on this one.


Thanks!

>

>

>> (I can imagine some benefit to a hypothetical improved SENDUIPI with identical user semantics but that supported a proper interaction with the scheduler and blocking syscalls.  But that's not what's documented in the ISE...)

>

> Andy, can you please provide some more context/details on this? Is this 

> regarding the blocking syscalls discussion (in patch 11) or something else?

>


Yes, and I'll follow up there.  I hereby upgrade my opinion of SENDUIPI wakeups to "probably doable but maybe not in a nice way."

Andy Lutomirski Sept. 30, 2021, 6:08 p.m. UTC | #33

On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:
> On 9/28/2021 8:30 PM, Andy Lutomirski wrote:

>> On Mon, Sep 13, 2021, at 1:01 PM, Sohil Mehta wrote:

>>> Add a new system call to allow applications to block in the kernel and

>>> wait for user interrupts.

>>>

>> ...

>>

>>> When the application makes this syscall the notification vector is

>>> switched to a new kernel vector. Any new SENDUIPI will invoke the kernel

>>> interrupt which is then used to wake up the process.

>> Any new SENDUIPI that happens to hit the target CPU's ucode at a time when the kernel vector is enabled will deliver the interrupt.  Any new SENDUIPI that happens to hit the target CPU's ucode at a time when a different UIPI-using task is running will *not* deliver the interrupt, unless I'm missing some magic.  Which means that wakeups will be missed, which I think makes this whole idea a nonstarter.

>>

>> Am I missing something?

>

>

> The current kernel implementation reserves 2 notification vectors (NV) 

> for the 2 states of a thread (running vs blocked).

>

> NV-1 – used only for tasks that are running. (results in a user 

> interrupt or a spurious kernel interrupt)

>

> NV-2 – used only for tasks that are blocked in the kernel. (always 

> results in a kernel interrupt)

>

> The UPID.UINV bits are switched between NV-1 and NV-2 based on the state 

> of the task.

Aha, cute.  So NV-1 is only sent if the target is directly paying attention and, assuming all the atomics are done right, NV-2 will be sent for tasks that are asleep.

Logically, I think these are the possible states for a receiving task:

1. Running.  SENDUIPI will actually deliver the event directly (or not if uintr is masked).  If the task just stopped running and the atomics are right, then the schedule-out code can, I think, notice.

2. Not running, but either runnable or not currently waiting for uintr (e.g. blocked in an unrelated syscall).  This is straightforward -- no IPI or other action is needed other than setting the uintr-pending bit.

3. Blocked and waiting for uintr.  For this to work right, anyone trying to send with SENDUIPI (or maybe a vdso or similar clever wrapper around it) needs to result in either a fault or an IPI so the kernel can process the wakeup.

(Note that, depending on how fancy we get with file descriptors and polling, we need to watch out for the running-and-also-waiting-for-kernel-notification state.  That one will never work right.)
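
The three receiver states above can be written down as a small decision table. This is only a sketch: the enum and function names are hypothetical and do not appear in the patch series; they merely restate which kind of action a SENDUIPI has to trigger in each case.

```c
/* Hypothetical receiver states, mirroring cases 1-3 above. */
enum uintr_rx_state {
	RX_RUNNING,        /* case 1: currently on a CPU */
	RX_NOT_WAITING,    /* case 2: runnable, or blocked on something unrelated */
	RX_WAITING_UINTR,  /* case 3: blocked specifically waiting for a uintr */
};

/* Hypothetical description of what a SENDUIPI must cause for the receiver. */
enum uintr_tx_action {
	TX_DELIVER_DIRECT, /* notification reaches the running task directly */
	TX_POST_ONLY,      /* only the pending bit is set; no IPI is needed */
	TX_KERNEL_WAKEUP,  /* the kernel must be entered (IPI or fault) to wake the task */
};

static enum uintr_tx_action uintr_tx_action_for(enum uintr_rx_state state)
{
	switch (state) {
	case RX_RUNNING:
		return TX_DELIVER_DIRECT;
	case RX_NOT_WAITING:
		return TX_POST_ONLY;
	case RX_WAITING_UINTR:
		return TX_KERNEL_WAKEUP;
	}
	return TX_POST_ONLY; /* not reached for valid enum values */
}
```

Only the last row needs kernel involvement on the send path, which is why it dominates the rest of the discussion.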

3 is the nasty case, and your patch makes it work with this NV-2 trick.  The trick is a bit gross for a couple reasons.  First, it conveys no useful information to the kernel except that an unknown task did SENDUIPI and maybe that the target was most recently on a given CPU.  So a big list search is needed.  Also, it hits an essentially arbitrary and possibly completely innocent victim CPU and task, and people doing any sort of task isolation workload will strongly dislike this.  For some of those users, "strongly" may mean "treat system as completely failed, fail over to something else and call expensive tech support."  So we can't do that.

I think we have three choices:

Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.

Treat the NV-2 as a real interrupt and honor affinity settings.  This will be annoying and slow, I think, if it's even workable at all.

Handle this case with faults instead of interrupts.  We could set a reserved bit in UPID so that SENDUIPI results in #GP, decode it, and process it.  This puts the onus on the actual task causing trouble, which is nice, and it lets us find the UPID and target directly instead of walking all of them.  I don't know how well it would play with hypothetical future hardware-initiated uintrs, though.
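
To make the fault-based option concrete, here is a minimal user-space model of it. Everything in it is an assumption made for illustration: `struct upid_model` is not the architectural UPID layout, `resv_armed` stands in for "some reserved bit is set", and the model ignores the ON bit. It only shows the control flow: an armed UPID makes the sender fault, and the fault handler emulates the post and wakes the waiter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model of a UPID; NOT the architectural layout. */
struct upid_model {
	bool     resv_armed;   /* stands in for "a reserved bit is set" */
	bool     sn;           /* suppress notification */
	uint64_t pir;          /* posted requests, one bit per user vector */
	bool     target_woken; /* model-only: the kernel woke the waiter */
};

enum send_result { SEND_POSTED, SEND_NOTIFIED, SEND_FAULTED };

/* Sender side: an armed UPID makes the instruction fault before posting. */
static enum send_result model_senduipi(struct upid_model *upid, unsigned int uv)
{
	if (upid->resv_armed)
		return SEND_FAULTED;       /* #GP in the sender's context */
	upid->pir |= 1ULL << (uv & 63);
	return upid->sn ? SEND_POSTED : SEND_NOTIFIED;
}

/* Fault handler: emulate the post and wake the blocked receiver. */
static void model_gp_fixup(struct upid_model *upid, unsigned int uv)
{
	upid->pir |= 1ULL << (uv & 63);
	upid->resv_armed = false;          /* receiver is no longer blocked */
	upid->target_woken = true;
}
```

The point of the design is visible in the signatures: the handler is given the UPID of the faulting send, so no search over all waiters is needed.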

Thomas Gleixner Sept. 30, 2021, 7:29 p.m. UTC | #34

On Thu, Sep 30 2021 at 11:08, Andy Lutomirski wrote:
> On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:

> I think we have three choices:

>

> Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.

>

> Treat the NV-2 as a real interrupt and honor affinity settings.  This

> will be annoying and slow, I think, if it's even workable at all.

We can make it a real interrupt in the form of a per-CPU interrupt, but
affinity settings are not really feasible because the affinity is in the
UPID.ndst field. So, yes we can target it to some CPU, but that's racy.

> Handle this case with faults instead of interrupts.  We could set a

> reserved bit in UPID so that SENDUIPI results in #GP, decode it, and

> process it.  This puts the onus on the actual task causing trouble,

> which is nice, and it lets us find the UPID and target directly

> instead of walking all of them.  I don't know how well it would play

> with hypothetical future hardware-initiated uintrs, though.

I thought about that as well and dismissed it due to the hardware
initiated ones but thinking more about it, those need some translation
unit (e.g. irq remapping) anyway, so it might be doable to catch those
as well. So we could just ignore them for now and go for the #GP trick
and deal with the device initiated ones later when they come around :)

But even with that we still need to keep track of the armed ones per CPU
so we can handle CPU hotunplug correctly. Sigh...

Thanks,

        tglx

Andy Lutomirski Sept. 30, 2021, 10:01 p.m. UTC | #35

On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 11:08, Andy Lutomirski wrote:

>> On Tue, Sep 28, 2021, at 9:56 PM, Sohil Mehta wrote:

>> I think we have three choices:

>>

>> Use a fancy wrapper around SENDUIPI.  This is probably a bad idea.

>>

>> Treat the NV-2 as a real interrupt and honor affinity settings.  This

>> will be annoying and slow, I think, if it's even workable at all.

>

> We can make it a real interrupt in form of a per CPU interrupt, but

> affinity settings are not really feasible because the affinity is in the

> UPID.ndst field. So, yes we can target it to some CPU, but that's racy.

>

>> Handle this case with faults instead of interrupts.  We could set a

>> reserved bit in UPID so that SENDUIPI results in #GP, decode it, and

>> process it.  This puts the onus on the actual task causing trouble,

>> which is nice, and it lets us find the UPID and target directly

>> instead of walking all of them.  I don't know how well it would play

>> with hypothetical future hardware-initiated uintrs, though.

>

> I thought about that as well and dismissed it due to the hardware

> initiated ones but thinking more about it, those need some translation

> unit (e.g. irq remapping) anyway, so it might be doable to catch those

> as well. So we could just ignore them for now and go for the #GP trick

> and deal with the device initiated ones later when they come around :)


Sounds good to me. In the long run, if Intel wants device initiated fancy interrupts to work well, they need a new design.

>

> But even with that we still need to keep track of the armed ones per CPU

> so we can handle CPU hotunplug correctly. Sigh...


I don’t think any real work is needed. We will only ever have armed UPIDs (with notification interrupts enabled) for running tasks, and hot-unplugged CPUs don’t have running tasks.  We do need a way to drain pending IPIs before we offline a CPU, but that’s a separate problem and may be unsolvable for all I know. Is there a magic APIC operation to wait until all initiated IPIs targeting the local CPU arrive?  I guess we can also just mask the notification vector so that it won’t crash us if we get a stale IPI after going offline.

>

> Thanks,

>

>         tglx

Thomas Gleixner Oct. 1, 2021, 12:01 a.m. UTC | #36

On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:
> On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:

>>

>> But even with that we still need to keep track of the armed ones per CPU

>> so we can handle CPU hotunplug correctly. Sigh...

>

> I don’t think any real work is needed. We will only ever have armed

> UPIDs (with notification interrupts enabled) for running tasks, and

> hot-unplugged CPUs don’t have running tasks.

That's not the problem. The problem is the wait for uintr case where the
task is obviously not running:

CPU 1
     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     ...
     do {
         ....
         schedule();
     }

CPU 0
    unplug CPU 1

    SENDUIPI(index)
        // Hardware does:
        tblentry = &ttable[index];
        upid = tblentry->upid;
        upid->pir |= tblentry->uv;
        send_IPI(upid->vector, upid->ndst);

So SENDUIPI will send the IPI to the APIC ID provided by T1->upid.ndst
which points to the offlined CPU 1 and therefore is obviously going to
/dev/null. IOW, lost wakeup...

> We do need a way to drain pending IPIs before we offline a CPU, but

> that’s a separate problem and may be unsolvable for all I know. Is

> there a magic APIC operation to wait until all initiated IPIs

> targeting the local CPU arrive?  I guess we can also just mask the

> notification vector so that it won’t crash us if we get a stale IPI

> after going offline.

All of this is solved already otherwise CPU hot unplug would explode in
your face every time. The software IPI send side is carefully
synchronized vs. hotplug (at least in theory). May I ask you politely to
make yourself familiar with all that before touting "We do need..." based
on random assumptions?

The above SENDUIPI vs. CPU hotplug scenario is the same problem as we
have with regular device interrupts which are targeted at an outgoing
CPU. We have magic mechanisms in place to handle that to the extent
possible, but due to the insanity of X86 interrupt handling mechanics
that still leaves a very tiny hole which might cause a lost and
subsequently stale interrupt. Nothing we can fix in software.

So on CPU offline the hotplug code walks through all device interrupts
and checks whether they are targeted at the outgoing CPU. If so they are
rerouted to an online CPU with lots of care to make the possible race
window as small as it gets. That's nowadays only a problem on systems
where interrupt remapping is not available or disabled on the command line.

For tasks which just have the user interrupt armed there is no problem
because SENDUIPI modifies UPID->PIR which is reevaluated when the task
which got migrated to an online CPU is going back to user space.

The uintr_wait() syscall creates the very same problem as we have with
device interrupts. Which means we need to make that wait thing:

     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     list_add(this_cpu_ptr(pcp_uintrs), upid->pcp_uintr);
     ...
     do {
         ....
         schedule();
     }
     list_del_init(upid->pcp_uintr);

and the hotplug code do:

    for_each_entry_safe(upid, this_cpu_ptr(pcp_uintrs), ...) {
       list_del(upid->pcp_uintr);
       upid->ndst = apic_id_of_random_online_cpu();
       if (do_magic_checks_whether_ipi_is_pending())
         send_ipi(upid->vector, upid->ndst);
    }

See?
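
The per-CPU tracking and the hot-unplug walk can be tried out in a small user-space model. This is a sketch under stated assumptions, not kernel code: `struct waiter_model`, `model_wait_arm()`, `model_send()` and `model_cpu_offline()` are hypothetical names, a plain array stands in for per-CPU data, and "IPI delivery" is reduced to setting a flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

struct waiter_model {
	unsigned int ndst;          /* CPU the notification is aimed at */
	uint64_t     pir;           /* posted requests */
	bool         on;            /* notification outstanding */
	bool         woken;         /* model-only: wakeup was processed */
	struct waiter_model *next;  /* link in the per-CPU waiter list */
};

static bool cpu_online_model[MODEL_NR_CPUS] = { true, true, true, true };
static struct waiter_model *pcp_waiters[MODEL_NR_CPUS];

/* uintr_wait() side: aim at this CPU and remember the waiter there. */
static void model_wait_arm(struct waiter_model *w, unsigned int cpu)
{
	w->ndst = cpu;
	w->next = pcp_waiters[cpu];
	pcp_waiters[cpu] = w;
}

/* Sender side: post, then notify ndst. Returns false if the IPI was lost. */
static bool model_send(struct waiter_model *w, unsigned int uv)
{
	w->pir |= 1ULL << (uv & 63);
	w->on = true;
	if (!cpu_online_model[w->ndst])
		return false;           /* went to an offline CPU */
	w->woken = true;
	w->on = false;
	return true;
}

/* Hot-unplug side: retarget every waiter and replay in-flight notifications. */
static void model_cpu_offline(unsigned int cpu, unsigned int new_cpu)
{
	struct waiter_model *w = pcp_waiters[cpu];

	pcp_waiters[cpu] = NULL;
	cpu_online_model[cpu] = false;
	while (w) {
		struct waiter_model *next = w->next; /* arm overwrites w->next */

		model_wait_arm(w, new_cpu);
		if (w->on) {            /* a notification raced with the unplug */
			w->woken = true;
			w->on = false;
		}
		w = next;
	}
}
```

A waiter that is not on any list keeps pointing at the dead CPU, and `model_send()` then reports the lost wakeup described earlier.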

We could map that to the interrupt subsystem by creating a virtual
interrupt domain for this, but that would make uintr_wait() look like
this:

     irq = uintr_alloc_irq();
     request_irq(irq, ......);
     upid = T1->upid;
     upid->vector = UINTR_WAIT_VECTOR;
     upid->ndst = local_apic_id();
     list_add(this_cpu_ptr(pcp_uintrs), upid->pcp_uintr);
     ...
     do {
         ....
         schedule();
     }
     list_del_init(upid->pcp_uintr);
     free_irq(irq);

But the benefit of that is dubious as it creates overhead on both sides
of the sleep and the only real purpose of the irq request would be to
handle CPU hotunplug without the above per CPU list mechanics.

Welcome to my wonderful world!

Thanks,

        tglx

Sohil Mehta Oct. 1, 2021, 12:40 a.m. UTC | #37

On 9/30/2021 9:26 AM, Stefan Hajnoczi wrote:
> On Mon, Sep 13, 2021 at 01:01:19PM -0700, Sohil Mehta wrote:

>> +------------+-------------------------+

>> | IPC type   |   Relative Latency      |

>> |            |(normalized to User IPI) |

>> +------------+-------------------------+

>> | User IPI   |                     1.0 |

>> | Signal     |                    14.8 |

>> | Eventfd    |                     9.7 |

> Is this the bi-directional eventfd benchmark?

> https://github.com/intel/uintr-ipc-bench/blob/linux-rfc-v1/source/eventfd/eventfd-bi.c


Yes. I have left it unmodified from the original source. But, I should 
have looked at it more closely.

> Two things stand out:

>

> 1. The server and client threads are racing on the same eventfd.

>     Eventfds aren't bi-directional! The eventfd_wait() function has code

>     to write the value back, which is a waste of CPU cycles and hinders

>     progress. I've never seen eventfd used this way in real applications.

>     Can you use two separate eventfds?


Sure. I can do that.


> 2. The fd is in blocking mode and the task may be descheduled, so we're

>     measuring eventfd read/write latency plus scheduler/context-switch

>     latency. A fairer comparison against user interrupts would be to busy

>     wait on a non-blocking fd so the scheduler/context-switch latency is

>     mostly avoided. After all, the uintrfd-bi.c benchmark does this in

>     uintrfd_wait():

>

>       // Keep spinning until the interrupt is received

>       while (!uintr_received[token]);


That makes sense. I'll give this a try and send out the updated results.
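
For reference, the two suggestions combined (one eventfd per direction, non-blocking with a busy wait) could look roughly like the sketch below. The helper names `channel_open()`, `channel_notify()` and `channel_wait()` are hypothetical and are not taken from uintr-ipc-bench; only `eventfd()`, `read()`, `write()` and `EFD_NONBLOCK` are real interfaces.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Open one non-blocking eventfd per direction. Returns 0 on success. */
static int channel_open(int *to_server, int *to_client)
{
	*to_server = eventfd(0, EFD_NONBLOCK);
	if (*to_server < 0)
		return -1;
	*to_client = eventfd(0, EFD_NONBLOCK);
	if (*to_client < 0) {
		close(*to_server);
		return -1;
	}
	return 0;
}

/* Add 1 to the eventfd counter. Returns 0 on success. */
static int channel_notify(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Spin until the counter is non-zero; a successful read() resets it to 0. */
static int channel_wait(int fd, uint64_t *count)
{
	for (;;) {
		ssize_t n = read(fd, count, sizeof(*count));

		if (n == (ssize_t)sizeof(*count))
			return 0;
		if (n < 0 && errno != EAGAIN && errno != EINTR)
			return -1;  /* real error, stop spinning */
	}
}
```

With this shape neither side ever writes a value back, and the waiter stays on the CPU instead of going through the scheduler, which is closer to what the uintr benchmark measures.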

Thanks,
Sohil

Andy Lutomirski Oct. 1, 2021, 4:41 a.m. UTC | #38

On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 15:01, Andy Lutomirski wrote:

>> On Thu, Sep 30, 2021, at 12:29 PM, Thomas Gleixner wrote:

>>>

>>> But even with that we still need to keep track of the armed ones per CPU

>>> so we can handle CPU hotunplug correctly. Sigh...

>>

>> I don’t think any real work is needed. We will only ever have armed

>> UPIDs (with notification interrupts enabled) for running tasks, and

>> hot-unplugged CPUs don’t have running tasks.

>

> That's not the problem. The problem is the wait for uintr case where the

> task is obviously not running:

>

> CPU 1

>      upid = T1->upid;

>      upid->vector = UINTR_WAIT_VECTOR;

>      upid->ndst = local_apic_id();

>      ...

>      do {

>          ....

>          schedule();

>      }

>

> CPU 0

>     unplug CPU 1

>

>     SENDUIPI(index)

>         // Hardware does:

>         tblentry = &ttable[index];

>         upid = tblentry->upid;

>         upid->pir |= tblentry->uv;

>         send_IPI(upid->vector, upid->ndst);

>

> So SENDUIPI will send the IPI to the APIC ID provided by T1->upid.ndst

> which points to the offlined CPU 1 and therefore is obviously going to

> /dev/null. IOW, lost wakeup...

Yes, but I don't think this is how we should structure this.

CPU 1
 upid->vector = UINV;
 upid->ndst = local_apic_id();
 exit to usermode;
 return from usermode;
 ...

 schedule();
 fpu__save_crap [see below]:
   if (this task is waiting for a uintr) {
     upid->resv0 = 1;  /* arm #GP */
   } else {
     upid->sn = 1;
   }

>

>> We do need a way to drain pending IPIs before we offline a CPU, but

>> that’s a separate problem and may be unsolvable for all I know. Is

>> there a magic APIC operation to wait until all initiated IPIs

>> targeting the local CPU arrive?  I guess we can also just mask the

>> notification vector so that it won’t crash us if we get a stale IPI

>> after going offline.

>

> All of this is solved already otherwise CPU hot unplug would explode in

> your face every time. The software IPI send side is carefully

> synchronized vs. hotplug (at least in theory). May I ask you politely to

> make yourself familiar with all that before touting "We do need..." based

> on random assumptions?

I'm aware that the software send IPI side is synchronized against hotplug.  But SENDUIPI is not unless we're going to have the CPU offline code IPI every other CPU to make sure that their SENDUIPIs have completed -- we don't control the SENDUIPI code.

After reading the ISE docs again, I think it might be possible to use the ON bit to synchronize.  In the schedule-out path, if we discover that ON = 1, then there is an IPI in flight to us.  In theory, we could wait for it, although actually doing so could be a mess.  That's why I'm asking whether there's a way to tell the APIC to literally wait for all IPIs that are *already sent* to be delivered.

>

> The above SENDUIPI vs. CPU hotplug scenario is the same problem as we

> have with regular device interrupts which are targeted at an outgoing

> CPU. We have magic mechanisms in place to handle that to the extent

> possible, but due to the insanity of X86 interrupt handling mechanics

> that still leaves a very tiny hole which might cause a lost and

> subsequently stale interrupt. Nothing we can fix in software.

>

> So on CPU offline the hotplug code walks through all device interrupts

> and checks whether they are targeted at the outgoing CPU. If so they are

> rerouted to an online CPU with lots of care to make the possible race

> window as small as it gets. That's nowadays only a problem on systems

> where interrupt remapping is not available or disabled via commandline.

>

> For tasks which just have the user interrupt armed there is no problem

> because SENDUIPI modifies UPID->PIR which is reevaluated when the task

> which got migrated to an online CPU is going back to user space.

>

> The uintr_wait() syscall creates the very same problem as we have with

> device interrupts. Which means we need to make that wait thing:

>

>      upid = T1->upid;

>      upid->vector = UINTR_WAIT_VECTOR;

This is exactly what I'm suggesting we *don't* do.  Instead we set a reserved bit, we decode SENDUIPI in the #GP handler, and we emulate, in-kernel, the notification process for non-running tasks.

Now that I read the docs some more, I'm seriously concerned about this XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If we actually use this, then the whole last_cpu "preserve the state in registers" optimization goes out the window.  So does anything that happens to assume that merely saving the state doesn't destroy it on respectable modern CPUs.  XRSTORS will #GP if you XRSTORS twice, which makes me nervous and would need a serious audit of our XRSTORS paths.

This is gross.

--Andy

Pavel Machek Oct. 1, 2021, 8:19 a.m. UTC | #39

Hi!

> Instructions

> ------------

> senduipi <index> - send a user IPI to a target task based on the UITT index.

> 

> clui - Mask user interrupts by clearing UIF (User Interrupt Flag).

> 

> stui - Unmask user interrupts by setting UIF.

> 

> testui - Test current value of UIF.

> 

> uiret - return from a user interrupt handler.


Are other CPU vendors allowed to implement compatible instructions?

If not, we should probably have VDSO entries so kernel can abstract
differences between CPUs.

> Untrusted processes

> -------------------

> The current implementation expects only trusted and cooperating processes to

> communicate using user interrupts. Coordination is expected between processes

> for a connection teardown. In situations where coordination doesn't happen

> (say, due to abrupt process exit), the kernel would end up keeping shared

> resources (like UPID) allocated to avoid faults.


Keeping resources allocated after process exit is a no-no.

Best regards,
								Pavel
-- 
http://www.livejournal.com/~pavelmachek

Thomas Gleixner Oct. 1, 2021, 9:56 a.m. UTC | #40

On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:

>> All of this is solved already otherwise CPU hot unplug would explode in

>> your face every time. The software IPI send side is carefully

>> synchronized vs. hotplug (at least in theory). May I ask you politely to

>> make yourself familiar with all that before touting "We do need..." based

>> on random assumptions?

>

> I'm aware that the software send IPI side is synchronized against

> hotplug.  But SENDUIPI is not unless we're going to have the CPU

> offline code IPI every other CPU to make sure that their SENDUIPIs

> have completed -- we don't control the SENDUIPI code.

That's correct, but on CPU hot unplug _all_ running tasks have been
migrated to an online CPU _before_ the APIC is turned off. So they all
went through schedule() which set the UPID->SN bit. That's obviously
racy, but that has to be handled in exit to user mode anyway because
that's not different from any other migration or preemption. So that's
_not_ a problem at all.

The problem only exists if we can't do the #GP trick for tasks which are
sitting in uintr_wait(). Because then we _have_ to be careful vs. a
concurrent SENDUIPI. But that'd be not any different from the problem
vs. device interrupts which we have today.

If we can use #GP then there is no problem at all and we avoid all the
nasty stuff vs. hotplug and avoid the list walk etc.

> After reading the ISE docs again, I think it might be possible to use

> the ON bit to synchronize.  In the schedule-out path, if we discover

> that ON = 1, then there is an IPI in flight to us.  In theory, we

> could wait for it, although actually doing so could be a mess.  That's

> why I'm asking whether there's a way to tell the APIC to literally

> wait for all IPIs that are *already sent* to be delivered.

You could busy poll with interrupts enabled, but that does not solve
anything. What guarantees that after APIC.IRR is clear no further IPI is
sent? Nothing at all. But again, that's not any different from device
interrupts and we already handle that today:

      cpu down()
      ...
      disable interrupts();
      for_each_interrupt_affine_to_cpu(irq) {
        change_affinity_to_online_cpu(irq, new_target_cpu);
        // Did device send to the old vector?
        if (APIC.IRR & vector_bit(old_vector))
           send_IPI(new_target_cpu, new_vector);
      }

So for uintr_wait() w/o #GP we'd need to do:

      for_each_waiter_on_cpu(w) {
           move_waiter_to_new_target_cpu_wait_list(w);
           w->ndst = new_target_cpu;
           if (w->ON)
              send_IPI(new_target_cpu, UINTR_WAIT_VECTOR);
      }

>> The uintr_wait() syscall creates the very same problem as we have with

>> device interrupts. Which means we need to make that wait thing:

>>

>>      upid = T1->upid;

>>      upid->vector = UINTR_WAIT_VECTOR;

>

> This is exactly what I'm suggesting we *don't* do.  Instead we set a

> reserved bit, we decode SENDUIPI in the #GP handler, and we emulate,

> in-kernel, the notification process for non-running tasks.

Yes, under the assumption that we can use #GP without breaking device
delivery.

> Now that I read the docs some more, I'm seriously concerned about this

> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If

> we actually use this, then the whole last_cpu "preserve the state in

> registers" optimization goes out the window.  So does anything that

> happens to assume that merely saving the state doesn't destroy it on

> respectable modern CPUs. XRSTORS will #GP if you XRSTORS twice, which

> makes me nervous and would need a serious audit of our XRSTORS paths.

I have no idea what you are fantasizing about. You can XRSTORS five
times in a row as long as your XSTATE memory image is correct.

If you don't want to use XSAVES to manage UINTR then you have to manually
fiddle with the MSRs and UIF in schedule() and return to user space.

Also keeping UINV alive when scheduling out creates a lifetime issue
vs. UPID:

CPU 0   CPU 1                   CPU2
        T1 -> schedule         // UPID is live in UINTR MSRs
        do_stuff_in_kernel()
        local_irq_disable();
                                SENDUIPI(T1 -> CPU1)
pull T1
T1 exits
free UPID

        local_irq_enable();
        ucode handles UINV -> UAF

Clearing UINV prevents the ucode from handling the IPI and fiddling with
UPID. The CPU will forward the IPI vector to the kernel which acks it
and does nothing else, i.e. it's a spurious interrupt.

Coming back to state preserving. All that needs to be done for a
situation where the rest of the XSTATE is live in the registers, i.e.
the T -> kthread -> T scheduling scenario, is to restore UINV on exit to
user mode and handle UPID.PIR which might contain newly set bits which
are obviously not yet in UIRR. That can be done by MSR fiddling or
by issuing a self IPI on the UINV vector which will be handled in ucode
on the first user space instruction after return.

When the FPU has to be restored then the state has to be updated in the
XSAVE memory image before doing XRSTORS.
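
The exit-to-user-mode step for the "state still live in registers" case can be sketched as follows. This is a user-space model with hypothetical names (`struct exit_model`, `model_exit_to_user()`); the MSR write and the self IPI are represented by plain field updates, so it only illustrates the ordering, not a real implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the bits that matter on the way back to user mode. */
struct exit_model {
	uint8_t  uinv_msr;      /* stands in for the UINV field of the MSR */
	uint64_t pir;           /* UPID.PIR, may have gained bits while away */
	bool     self_ipi_sent; /* model-only: a self IPI on UINV was issued */
};

/* XSAVES cleared UINV; put it back and make pending posts get noticed. */
static void model_exit_to_user(struct exit_model *m, uint8_t task_uinv)
{
	m->uinv_msr = task_uinv;        /* recognition is enabled again */
	if (m->pir != 0)
		m->self_ipi_sent = true; /* ucode handles it on return to user space */
}
```

If the FPU state has to be restored instead, none of this applies and the memory image is what needs updating before XRSTORS.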

Thanks,

        tglx

Andy Lutomirski Oct. 1, 2021, 3:13 p.m. UTC | #41

On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:

>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:

>

>> Now that I read the docs some more, I'm seriously concerned about this

>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If

>> we actually use this, then the whole last_cpu "preserve the state in

>> registers" optimization goes out the window.  So does anything that

>> happens to assume that merely saving the state doesn't destroy it on

>> respectable modern CPUs. XRSTORS will #GP if you XRSTORS twice, which

>> makes me nervous and would need a serious audit of our XRSTORS paths.

>

> I have no idea what you are fantasizing about. You can XRSTORS five

> times in a row as long as your XSTATE memory image is correct.

I'm just reading TFM, which is some kind of dystopian fantasy.

11.8.2.4 XRSTORS

Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS
causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV
is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)

So if UINV is set in the memory image and you XRSTORS five times in a row, the first one will work assuming UINV was zero.  The second one will #GP.  And:

11.8.2.3 XSAVES
After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
XSAVES does not modify the remainder of that MSR.)

So if we're running a UPID-enabled user task and we switch to a kernel thread, we do XSAVES and UINV is cleared.  Then we switch back to the same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC) and UINV is still clear.

And we had better clear UINV when running a kernel thread because the UPID might get freed or the kernel thread might do some CPL3 shenanigans (via EFI, perhaps? I don't know if any firmwares actually do this).

So all this seems to put UINV into the "independent" category of feature along with LBR.  And the 512-byte wastes from extra copies of the legacy area and the loss of the XMODIFIED optimization will just be collateral damage.
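
The two ISE paragraphs quoted above can be turned into a tiny executable model to check the reading. It is a sketch only: the struct and function names are hypothetical, only UINV and UIRR are modeled, the vector value used in the usage note is arbitrary, and a `false` return stands in for #GP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Register-side and memory-image-side copies of the modeled state. */
struct uintr_regs_model  { uint8_t uinv; uint64_t uirr; };
struct uintr_image_model { uint8_t uinv; uint64_t uirr; };

/* XSAVES: save the component, then clear UINV in the register. */
static void model_xsaves(struct uintr_regs_model *regs,
			 struct uintr_image_model *img)
{
	img->uinv = regs->uinv;
	img->uirr = regs->uirr;
	regs->uinv = 0;
}

/* XRSTORS: fault (return false) if UINV is non-zero, else load the image. */
static bool model_xrstors(struct uintr_regs_model *regs,
			  const struct uintr_image_model *img)
{
	if (regs->uinv != 0)
		return false;           /* models #GP before anything is loaded */
	regs->uinv = img->uinv;
	regs->uirr = img->uirr;
	return true;
}
```

Starting from registers with UINV set (say 0xec), `model_xsaves()` leaves UINV at zero, the first `model_xrstors()` succeeds, and a second one from the same image fails because UINV is now non-zero again.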

Stefan Hajnoczi Oct. 1, 2021, 4:35 p.m. UTC | #42

On Thu, Sep 30, 2021 at 10:24:24AM -0700, Sohil Mehta wrote:
> 

> On 9/30/2021 9:30 AM, Stefan Hajnoczi wrote:

> > On Tue, Sep 28, 2021 at 09:31:34PM -0700, Andy Lutomirski wrote:

> > > 

> > > I spent some time reviewing the docs (ISE) and contemplating how this all fits together, and I have a high level question:

> > > 

> > > Can someone give an example of a realistic workload that would benefit from SENDUIPI and precisely how it would use SENDUIPI?  Or an example of a realistic workload that would benefit from hypothetical device-initiated user interrupts and how it would use them?  I'm having trouble imagining something that wouldn't work as well or better by simply polling, at least on DMA-coherent architectures like x86.

> > I was wondering the same thing. One thing came to mind:

> > 

> > An application that wants to be *interrupted* from what it's doing

> > rather than waiting until the next polling point. For example,

> > applications that are CPU-intensive and have green threads. I can't name

> > a real application like this though :P.

> 

> Thank you Stefan and Andy for giving this some thought.

> 

> We are consolidating the information internally on where and how exactly we

> expect to see benefits with real workloads for the various sources of User

> Interrupts. It will take a few days to get back on this one.

One possible use case came to mind in QEMU's TCG just-in-time compiler:

QEMU's TCG threads execute translated code. There are events that
require interrupting these threads. Today a check is performed at the
start of every translated block. Most of the time the check is false and
it's a waste of CPU.

User interrupts can eliminate the need for checks by interrupting TCG
threads when events occur.
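
The polling pattern being described looks roughly like the sketch below. This is not QEMU code: `exit_request`, `model_run_blocks()` and `model_request_exit()` are hypothetical names, and "executing a translated block" is reduced to incrementing a counter. It only shows where the per-block check sits.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Set by another thread when an event needs the execution thread's attention. */
static atomic_bool exit_request;

/* Run up to max_blocks blocks, checking the flag at the start of each one. */
static unsigned int model_run_blocks(unsigned int max_blocks)
{
	unsigned int executed = 0;

	while (executed < max_blocks) {
		/* The check that is false most of the time. */
		if (atomic_load_explicit(&exit_request, memory_order_acquire))
			break;
		executed++;             /* stand-in for running one block */
	}
	return executed;
}

/* Event side: ask the execution thread to stop at the next block boundary. */
static void model_request_exit(void)
{
	atomic_store_explicit(&exit_request, true, memory_order_release);
}
```

An interrupt-driven variant would drop the load from the loop, at the cost of being interrupted at arbitrary points inside a block rather than only at block boundaries.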

I don't know whether this will improve performance or how feasible it is
to implement, but I've added people who might have ideas. (For a summary
of user interrupts, see
https://lwn.net/SubscriberLink/871113/60652640e11fc5df/.)

Stefan

Richard Henderson Oct. 1, 2021, 4:41 p.m. UTC | #43

On 10/1/21 12:35 PM, Stefan Hajnoczi wrote:
> QEMU's TCG threads execute translated code. There are events that

> require interrupting these threads. Today a check is performed at the

> start of every translated block. Most of the time the check is false and

> it's a waste of CPU.

> 

> User interrupts can eliminate the need for checks by interrupting TCG

> threads when events occur.


We used to use interrupts, and stopped because we need to wait until the guest is in a 
stable state.  The guest is always in a stable state at the beginning of each TB.

See 378df4b2375.


r~

Sohil Mehta Oct. 1, 2021, 6:04 p.m. UTC | #44

On 10/1/2021 8:13 AM, Andy Lutomirski wrote:
>

> I'm just reading TFM, which is some kind of dystopian fantasy.

>

> 11.8.2.4 XRSTORS

>

> Before restoring the user-interrupt state component, XRSTORS verifies that UINV is 0. If it is not, XRSTORS

> causes a general-protection fault (#GP) before loading any part of the user-interrupt state component. (UINV

> is IA32_UINTR_MISC[39:32]; XRSTORS does not check the contents of the remainder of that MSR.)

>

> So if UINV is set in the memory image and you XRSTORS five times in a row, the first one will work assuming UINV was zero.  The second one will #GP.  And:

>

> 11.8.2.3 XSAVES

> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];

> XSAVES does not modify the remainder of that MSR.)

>

> So if we're running a UPID-enabled user task and we switch to a kernel thread, we do XSAVES and UINV is cleared.  Then we switch back to the same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC) and UINV is still clear.

Andy,

I am still catching up with the rest of the discussion but I wanted to 
provide some input here.

Have you had a chance to look at the discussion on this topic in patch 5?
https://lore.kernel.org/lkml/87bl4fcxz8.ffs@tglx/
The pseudo code Thomas provided and my comments on the same cover the 
above situation.

The UINV bits in IA32_UINTR_MISC act as an on/off switch for 
detecting user interrupts (i.e. moving them from UPID.PIR to UIRR). When 
XSAVES saves UIRR into memory we want the switch to atomically turn off 
to stop detecting additional interrupts. When we restore the state back 
the hardware wants to be sure the switch is off before writing to UIRR. 
If not, the UIRR state could potentially be overwritten.
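
To make the switch behavior concrete, here is a toy userspace model of it in
plain C. It is only a sketch of the two SDM paragraphs quoted above, not kernel
code: the struct and the model_* function names are invented for illustration,
and a false return stands in for the #GP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the user-interrupt state component; not kernel code. */
struct uintr_regs {
	uint64_t uirr; /* pending user-interrupt requests */
	uint8_t  uinv; /* notification vector, IA32_UINTR_MISC[39:32] */
};

/* XSAVES: store the component to memory, then clear UINV in the register. */
void model_xsaves(struct uintr_regs *cpu, struct uintr_regs *image)
{
	*image = *cpu;
	cpu->uinv = 0; /* the "switch" turns off after the save */
}

/* XRSTORS: returns false (standing in for #GP) if UINV is still armed. */
bool model_xrstors(struct uintr_regs *cpu, const struct uintr_regs *image)
{
	if (cpu->uinv != 0)
		return false; /* nothing is loaded in this case */
	*cpu = *image;
	return true;
}
```

In this model, a save followed by a restore works once. A second restore from
the same image fails because the first one re-armed UINV, which is the
situation Andy describes.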

That's how I understand the XSAVES/XRSTORS behavior. I can confirm with 
the hardware architects if you want more details here.

Regarding the #GP trick proposal, I am planning to get some feedback 
from the hardware folks to see if any potential issues could arise.

I am on a pre-planned break next week. I apologize (in advance) for the 
delay in responding.

Thanks,
Sohil

Thomas Gleixner Oct. 1, 2021, 9:29 p.m. UTC | #45

On Fri, Oct 01 2021 at 08:13, Andy Lutomirski wrote:

> On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
>> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>>
>>> Now that I read the docs some more, I'm seriously concerned about this
>>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
>>> we actually use this, then the whole last_cpu "preserve the state in
>>> registers" optimization goes out the window.  So does anything that
>>> happens to assume that merely saving the state doesn't destroy it on
>>> respectable modern CPUs.  XRSTORS will #GP if you XRSTORS twice, which
>>> makes me nervous and would need a serious audit of our XRSTORS paths.
>>
>> I have no idea what you are fantasizing about. You can XRSTORS five
>> times in a row as long as your XSTATE memory image is correct.
>
> I'm just reading TFM, which is some kind of dystopian fantasy.
>
> 11.8.2.4 XRSTORS
>
> Before restoring the user-interrupt state component, XRSTORS verifies
> that UINV is 0. If it is not, XRSTORS causes a general-protection
> fault (#GP) before loading any part of the user-interrupt state
> component. (UINV is IA32_UINTR_MISC[39:32]; XRSTORS does not check the
> contents of the remainder of that MSR.)


Duh. I was staring at the SDM and searching for a hint. Stupid me!

> So if UINV is set in the memory image and you XRSTORS five times in a
> row, the first one will work assuming UINV was zero.  The second one
> will #GP.


Yes. I can see what you mean now :)

> 11.8.2.3 XSAVES
> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
> XSAVES does not modify the remainder of that MSR.)
>
> So if we're running a UPID-enabled user task and we switch to a kernel
> thread, we do XSAVES and UINV is cleared.  Then we switch back to the
> same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC)
> and UINV is still clear.


Yes, that has to be mopped up on the way to user space.

> And we had better clear UINV when running a kernel thread because the
> UPID might get freed or the kernel thread might do some CPL3
> shenanigans (via EFI, perhaps? I don't know if any firmwares actually
> do this).


Right. That's what happens already with the current pile.

> So all this seems to put UINV into the "independent" category of
> feature along with LBR.  And the 512-byte wastes from extra copies of
> the legacy area and the loss of the XMODIFIED optimization will just
> be collateral damage.


So we'd end up with two XSAVES on context switch. We can simply do:

        XSAVES();
        fpu.state.xstate.uintr.uinv = 0;

which allows to do as many XRSTORS in a row as we want. Only the final
one on the way to user space will have to restore the real vector if the
register state is not valid:

       if (fpu_state_valid()) {
            if (needs_uinv(current))
               wrmsrl(UINV, vector);
       } else {
            if (needs_uinv(current))
               fpu.state.xstate.uintr.uinv = vector;
            XRSTORS();
       }
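
As a sanity check of the idea, the flow can be modelled in userspace C. This
is a hedged sketch only: struct task_model and the model_* helpers are
hypothetical stand-ins for the real FPU code, and a false return from
model_xrstors() stands in for the #GP.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the UINTR register state and a task; names are made up. */
struct uintr_state {
	uint64_t uirr;
	uint8_t  uinv;
};

struct task_model {
	struct uintr_state image; /* in-memory XSTATE copy */
	bool    needs_uinv;
	uint8_t vector;
};

/* Context-switch save: XSAVES, then zero UINV in the memory image. */
void model_save(struct uintr_state *cpu, struct task_model *t)
{
	t->image = *cpu;   /* XSAVES stores the live state ... */
	cpu->uinv = 0;     /* ... and clears UINV in the register */
	t->image.uinv = 0; /* software step: keep the image safe to restore */
}

/* XRSTORS: false stands in for #GP when UINV is still armed. */
bool model_xrstors(struct uintr_state *cpu, const struct task_model *t)
{
	if (cpu->uinv != 0)
		return false;
	*cpu = t->image;
	return true;
}

/* Final step on the way to user space: only here is the vector re-armed. */
bool model_exit_to_user(struct uintr_state *cpu, struct task_model *t,
			bool regs_valid)
{
	if (regs_valid) {
		if (t->needs_uinv)
			cpu->uinv = t->vector; /* wrmsrl(UINV, vector) */
		return true;
	}
	if (t->needs_uinv)
		t->image.uinv = t->vector;
	return model_xrstors(cpu, t);
}
```

With the image's UINV zeroed at save time, model_xrstors() can be called any
number of times without failing, and UINV only becomes non-zero again in
model_exit_to_user().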

Hmm?

Thanks,

        tglx

Sohil Mehta Oct. 1, 2021, 11 p.m. UTC | #46

On 10/1/2021 2:29 PM, Thomas Gleixner wrote:
> So we'd end up with two XSAVES on context switch. We can simply do:
>
>          XSAVES();
>          fpu.state.xstate.uintr.uinv = 0;

I am a bit confused. Do we need to set UINV to 0 explicitly?

If XSAVES gets called twice during context switch then the UINV in the 
XSTATE buffer automatically gets set to 0. Since XSAVES saves the 
current UINV value in the MISC_MSR which was already set to 0 by the 
previous XSAVES.

Though, this probably happens more due to pure luck than intentional design :)

> which allows to do as many XRSTORS in a row as we want. Only the final
> one on the way to user space will have to restore the real vector if the
> register state is not valid:
>
>         if (fpu_state_valid()) {
>              if (needs_uinv(current))
>                 wrmsrl(UINV, vector);
>         } else {
>              if (needs_uinv(current))
>                 fpu.state.xstate.uintr.uinv = vector;
>              XRSTORS();
>         }

I might have missed some subtle difference. Has this logic changed from 
what you previously suggested for arch_exit_to_user_mode_prepare()?

        if (xrstors_pending) {
                // Update the saved xstate for xrstors.
                // Unconditionally update the UINV since it could have been
                // overwritten by calling XSAVES twice.
                current->xstate.uintr.uinv = UINTR_NOTIFICATION_VECTOR;
                current->xstate.uintr.uirr |= pir;
        } else {
                // Manually restore UIRR and UINV
                rdmsrl(IA32_UINTR_RR, uirr);
                wrmsrl(IA32_UINTR_RR, uirr | pir);

                misc.val64 = 0;
                misc.uittsz = current->uintr->uittsz;
                misc.uinv = UINTR_NOTIFICATION_VECTOR;
                wrmsrl(IA32_UINTR_MISC, misc.val64);
        }
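
For reference, the MISC value written in the manual path can be composed with
plain shifts. This is a small sketch: UINV at bits 39:32 is from the SDM text
quoted earlier in the thread, while UITTSZ in the low 32 bits is an assumption
here (it is not stated in this discussion), and the macro and helper names are
made up.

```c
#include <stdint.h>

/* UINV is IA32_UINTR_MISC[39:32] per the quoted SDM text. */
#define UINTR_MISC_UINV_SHIFT 32
#define UINTR_MISC_UINV_MASK  (0xffULL << UINTR_MISC_UINV_SHIFT)

/* Assumption: UITTSZ occupies bits 31:0 of the same MSR. */
uint64_t uintr_misc_pack(uint32_t uittsz, uint8_t uinv)
{
	return (uint64_t)uittsz | ((uint64_t)uinv << UINTR_MISC_UINV_SHIFT);
}

/* Extract the notification vector from a MISC value. */
uint8_t uintr_misc_uinv(uint64_t misc)
{
	return (uint8_t)((misc & UINTR_MISC_UINV_MASK) >> UINTR_MISC_UINV_SHIFT);
}

/* What XSAVES does to the register: clear UINV, leave the rest untouched. */
uint64_t uintr_misc_clear_uinv(uint64_t misc)
{
	return misc & ~UINTR_MISC_UINV_MASK;
}
```

The last helper mirrors the "XSAVES does not modify the remainder of that MSR"
wording: only the 8 UINV bits change.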

> Hmm?

The one case I can see this failing is if there was another XRSTORS 
after the "final" restore in arch_exit_to_user_mode_prepare()? I think 
that is not possible but I am not an expert on this. Did I misunderstand 
something?

Thanks,
Sohil

Andy Lutomirski Oct. 1, 2021, 11:04 p.m. UTC | #47

> On Oct 1, 2021, at 2:29 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Fri, Oct 01 2021 at 08:13, Andy Lutomirski wrote:
>
>>> On Fri, Oct 1, 2021, at 2:56 AM, Thomas Gleixner wrote:
>>> On Thu, Sep 30 2021 at 21:41, Andy Lutomirski wrote:
>>>>> On Thu, Sep 30, 2021, at 5:01 PM, Thomas Gleixner wrote:
>>>
>>>> Now that I read the docs some more, I'm seriously concerned about this
>>>> XSAVE design.  XSAVES with UINTR is destructive -- it clears UINV.  If
>>>> we actually use this, then the whole last_cpu "preserve the state in
>>>> registers" optimization goes out the window.  So does anything that
>>>> happens to assume that merely saving the state doesn't destroy it on
>>>> respectable modern CPUs.  XRSTORS will #GP if you XRSTORS twice, which
>>>> makes me nervous and would need a serious audit of our XRSTORS paths.
>>>
>>> I have no idea what you are fantasizing about. You can XRSTORS five
>>> times in a row as long as your XSTATE memory image is correct.
>>
>> I'm just reading TFM, which is some kind of dystopian fantasy.
>>
>> 11.8.2.4 XRSTORS
>>
>> Before restoring the user-interrupt state component, XRSTORS verifies
>> that UINV is 0. If it is not, XRSTORS causes a general-protection
>> fault (#GP) before loading any part of the user-interrupt state
>> component. (UINV is IA32_UINTR_MISC[39:32]; XRSTORS does not check the
>> contents of the remainder of that MSR.)
>
> Duh. I was staring at the SDM and searching for a hint. Stupid me!
>
>> So if UINV is set in the memory image and you XRSTORS five times in a
>> row, the first one will work assuming UINV was zero.  The second one
>> will #GP.
>
> Yes. I can see what you mean now :)
>
>> 11.8.2.3 XSAVES
>> After saving the user-interrupt state component, XSAVES clears UINV. (UINV is IA32_UINTR_MISC[39:32];
>> XSAVES does not modify the remainder of that MSR.)
>>
>> So if we're running a UPID-enabled user task and we switch to a kernel
>> thread, we do XSAVES and UINV is cleared.  Then we switch back to the
>> same task and don't do XRSTORS (or otherwise write IA32_UINTR_MISC)
>> and UINV is still clear.
>
> Yes, that has to be mopped up on the way to user space.
>
>> And we had better clear UINV when running a kernel thread because the
>> UPID might get freed or the kernel thread might do some CPL3
>> shenanigans (via EFI, perhaps? I don't know if any firmwares actually
>> do this).
>
> Right. That's what happens already with the current pile.
>
>> So all this seems to put UINV into the "independent" category of
>> feature along with LBR.  And the 512-byte wastes from extra copies of
>> the legacy area and the loss of the XMODIFIED optimization will just
>> be collateral damage.
>
> So we'd end up with two XSAVES on context switch. We can simply do:
>
>        XSAVES();
>        fpu.state.xstate.uintr.uinv = 0;


Could work. As long as UINV is armed, RR can change at any time (maybe just when IF=1? The manual is unclear).  But the first XSAVES disarms UINV, so maybe this won’t confuse any callers.

>
> which allows to do as many XRSTORS in a row as we want. Only the final
> one on the way to user space will have to restore the real vector if the
> register state is not valid:
>
>       if (fpu_state_valid()) {
>            if (needs_uinv(current))
>               wrmsrl(UINV, vector);
>       } else {
>            if (needs_uinv(current))
>               fpu.state.xstate.uintr.uinv = vector;
>            XRSTORS();
>       }
>
> Hmm?


I like it better than anything else I’ve seen.

>
> Thanks,
>
>        tglx

Sohil Mehta Nov. 18, 2021, 9:44 p.m. UTC | #48

On 11/15/2021 7:49 PM, Prakash Sangappa wrote:
> 

> Here are some use cases received from our Databases (Oracle) group.

Thank you Prakash for providing the potential use cases. This would 
really help with the design and validation of the UINTR APIs.

> 
> The aim is to use user interrupts as one mechanism for fast IPC and to signal
> a target thread blocked in the kernel in a system call,
> i.e., to replace the use of signals with user interrupts.
> 

Mimicking this signal behavior would likely add some complexity to the 
implementation. Since there is interest, we'll work on prototyping this 
to evaluate tradeoffs and present them here.

> The following enhancements with respect to sharing the UITT will be beneficial.
>
> Oracle DB creates a large number of multithreaded processes. A thread in a
> process may need to communicate (using user interrupts) with another
> thread in any other process. The current proposal of the receiver sending an
> FD per vector to each sender will be an overhead. Also, every sender
> process/thread allocating a sender table for storing the same receiver UPIDs
> will be duplication, resulting in wasted memory.
>
> In addition to the current FD based registration approach, having a way
> for a group of DB processes to share a sender (UITT) table and allowing
> each of the receiver threads to directly register itself in the shared UITT
> table will be efficient. For this the receiver need not create an fd. The
> receiver's UPID index in the UITT, obtained from the registration, will be
> shared with all senders via shared memory (IPC).
> 

Sharing the UITT between tasks of the same process would be relatively 
easier than sharing the UITT across processes. We would need 
a scalable mechanism to authenticate the sharing of this kernel resource 
across the process boundary.

I am working on a proposal for this. I'll send it out once I have 
something concrete.

> The DB maintains a process table of all the DB processes/threads in shared
> memory. The receiver can register itself in the shared UITT and store
> its UPID index in the process table. The sender will look up the target
> process in the process table to get the UITT index and send the user interrupt.
> 

Thanks,
Sohil

Sohil Mehta Nov. 18, 2021, 10:19 p.m. UTC | #49

On 10/1/2021 1:19 AM, Pavel Machek wrote:
> Hi!
> 

Thank you for reviewing the patches!

>> Instructions
>> ------------
>> senduipi <index> - send a user IPI to a target task based on the UITT index.
>>
>> clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
>>
>> stui - Unmask user interrupts by setting UIF.
>>
>> testui - Test current value of UIF.
>>
>> uiret - return from a user interrupt handler.
> 
> Are other CPU vendors allowed to implement compatible instructions?
> 
> If not, we should probably have VDSO entries so kernel can abstract
> differences between CPUs.
> 

Yes, we are evaluating VDSO support for this.

>> Untrusted processes
>> -------------------
>> The current implementation expects only trusted and cooperating processes to
>> communicate using user interrupts. Coordination is expected between processes
>> for a connection teardown. In situations where coordination doesn't happen
>> (say, due to abrupt process exit), the kernel would end up keeping shared
>> resources (like UPID) allocated to avoid faults.
> 
> Keeping resources allocated after process exit is a no-no.
> 

I meant the resource is still tracked via the shared file descriptor, so 
it will eventually get freed when the FD release happens. I am planning 
to include better documentation on lifetime rules of these shared 
resources next time.

Thanks,
Sohil

Chrisma Pakha Dec. 22, 2021, 4:17 p.m. UTC | #50

On 9/13/21 4:01 PM, Sohil Mehta wrote:
> User Interrupts Introduction
> ============================
>
> User Interrupts (Uintr) is a hardware technology that enables delivering
> interrupts directly to user space.
>
> Today, virtually all communication across privilege boundaries happens by going
> through the kernel. These include signals, pipes, remote procedure calls and
> hardware interrupt based notifications. User interrupts provide the foundation
> for more efficient (low latency and low CPU utilization) versions of these
> common operations by avoiding transitions through the kernel.
>
> In the User Interrupts hardware architecture, a receiver is always expected to
> be a user space task. However, a user interrupt can be sent by another user
> space task, kernel or an external source (like a device).
>
> In addition to the general infrastructure to receive user interrupts, this
> series introduces a single source: interrupts from another user task.  These
> are referred to as User IPIs.
>
> The first implementation of User IPIs will be in the Intel processor code-named
> Sapphire Rapids. Refer to Chapter 11 of the Intel Architecture instruction set
> extensions for details of the hardware architecture [1].
>
> Series-reviewed-by: Tony Luck <tony.luck@intel.com>
>
> Main goals of this RFC
> ======================
> - Introduce this upcoming technology to the community.
> This cover letter includes a hardware architecture summary along with the
> software architecture and kernel design choices. This post is a bit long as a
> result. Hopefully, it helps answer more questions than it creates :) I am also
> planning to talk about User Interrupts next week at the LPC Kernel summit.
>
> - Discuss potential use cases.
> We are starting to look at actual usages and libraries (like libevent[2] and
> liburing[3]) that can take advantage of this technology. Unfortunately, we
> don't have much to share on this right now. We need some help from the
> community to identify usages that can benefit from this. We would like to make
> sure the proposed APIs work for the eventual consumers.
>
> - Get early feedback on the software architecture.
> We are hoping to get some feedback on the direction of overall software
> architecture - starting with User IPI, extending it for kernel-to-user
> interrupt notifications and external interrupts in the future.
>
> - Discuss some of the main architecture opens.
> There is a lot of work that still needs to happen to enable this technology. We
> are looking for some input on future patches that would be of interest. Here
> are some of the big opens that we are looking to resolve.
> * Should Uintr interrupt all blocking system calls like sleep(), read(),
>    poll(), etc? If so, should we implement an SA_RESTART type of mechanism
>    similar to signals? - Refer to the Blocking for interrupts section below.
>
> * Should the User Interrupt Target table (UITT) be shared between threads of a
>    multi-threaded application or maybe even across processes? - Refer to the
>    Sharing the UITT section below.
>
> Why care about this? - Micro benchmark performance
> ==================================================
> There is a ~9x or higher performance improvement using User IPI over other IPC
> mechanisms for event signaling.
>
> Below is the average normalized latency for 1M ping-pong IPC notifications
> with message size=1.
>
> +------------+-------------------------+
> | IPC type   |   Relative Latency      |
> |            |(normalized to User IPI) |
> +------------+-------------------------+
> | User IPI   |                     1.0 |
> | Signal     |                    14.8 |
> | Eventfd    |                     9.7 |
> | Pipe       |                    16.3 |
> | Domain     |                    17.3 |
> +------------+-------------------------+
>
> Results have been estimated based on tests on internal hardware with Linux
> v5.14 + User IPI patches.
>
> Original benchmark: https://github.com/goldsborough/ipc-bench
> Updated benchmark: https://github.com/intel/uintr-ipc-bench/tree/linux-rfc-v1
>
> *Performance varies by use, configuration and other factors.
>
> How it works underneath? - Hardware Summary
> ===========================================
> User Interrupts is a posted interrupt delivery mechanism. The interrupts are
> first posted to a memory location and then delivered to the receiver when they
> are running with CPL=3.
>
> Kernel managed architectural data structures
> --------------------------------------------
> UPID: User Posted Interrupt Descriptor - Holds receiver interrupt vector
> information and notification state (like an ongoing notification, suppressed
> notifications).
>
> UITT: User Interrupt Target Table - Stores UPID pointer and vector information
> for interrupt routing on the sender side. Referenced by the senduipi instruction.
>
> The interrupt state of each task is referenced via MSRs which are saved and
> restored by the kernel during context switch.
>
> Instructions
> ------------
> senduipi <index> - send a user IPI to a target task based on the UITT index.
>
> clui - Mask user interrupts by clearing UIF (User Interrupt Flag).
>
> stui - Unmask user interrupts by setting UIF.
>
> testui - Test current value of UIF.
>
> uiret - return from a user interrupt handler.
>
> User IPI
> --------
> When a User IPI sender executes 'senduipi <index>', the hardware looks up the
> UITT entry pointed to by the index and posts the interrupt vector (63-0)
> into the receiver's UPID.
>
> If the receiver is running (CPL=3), the sender cpu would send a physical IPI to
> the receiver's cpu. On the receiver side this IPI is detected as a User
> Interrupt. The User Interrupt handler for the receiver is invoked and the
> vector number (63-0) is pushed onto the stack.
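
A toy model may help picture the posting step. The sketch below treats the
posted-interrupt requests as a 64-bit bitmap indexed by vector (0-63) and
picks the highest pending vector first, matching the "High to low vector
numbers" ordering in the comparison table of the cover letter. The function
names are hypothetical and this is not how the hardware is programmed; it only
illustrates the bookkeeping.

```c
#include <stdint.h>

/* Post a vector (0-63) into a pending bitmap. Returns -1 if out of range. */
int upid_post(uint64_t *pir, unsigned int vector)
{
	if (vector > 63)
		return -1;
	*pir |= 1ULL << vector; /* posting the same vector twice coalesces */
	return 0;
}

/* Remove and return the highest pending vector, or -1 if none is pending. */
int uirr_take_highest(uint64_t *uirr)
{
	int v = 63;

	if (*uirr == 0)
		return -1;
	while (!(*uirr & (1ULL << v)))
		v--;
	*uirr &= ~(1ULL << v);
	return v;
}
```

Note that there is one bit per vector, so two posts of the same vector before
delivery are seen as a single interrupt in this model.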
>
> Upon execution of 'uiret' in the interrupt handler, control is transferred
> back to the instruction that was interrupted.
>
> Refer to Chapter 11 of the Intel Architecture instruction set extensions [1] for
> more details.
>
> Application interface - Software Architecture
> =============================================
> User Interrupts (Uintr) is an opt-in feature (unlike signals). Applications
> wanting to use Uintr are expected to register themselves with the kernel using
> the Uintr related system calls. A Uintr receiver is always a userspace task. A
> Uintr sender can be another userspace task, kernel or a device.
>
> 1) A receiver can register/unregister an interrupt handler using the Uintr
> receiver related syscalls.
> 		uintr_register_handler(handler, flags)
> 		uintr_unregister_handler(flags)
>
> 2) A syscall also allows a receiver to register a vector and create a user
> interrupt file descriptor - uintr_fd.
> 		uintr_fd = uintr_create_fd(vector, flags)
>
> Uintr can be useful in some of the usages where eventfd or signals are used for
> frequent userspace event notifications. The semantics of uintr_fd are somewhat
> similar to an eventfd() or the write end of a pipe.
>
> 3) Any sender with access to uintr_fd can use it to deliver events (in this
> case - interrupts) to a receiver. A sender task can manage its connection with
> the receiver using the sender related syscalls based on uintr_fd.
> 		uipi_index = uintr_register_sender(uintr_fd, flags)
>
> Using an FD abstraction provides a secure mechanism to connect with a receiver.
> The FD sharing and isolation mechanisms put in place by the kernel would extend
> to Uintr as well.
>
> 4a) After the initial setup, a sender task can use the SENDUIPI instruction
> along with the uipi_index to generate user IPIs without any kernel
> intervention.
> 		SENDUIPI <uipi_index>
>
> If the receiver is running (CPL=3), then the user interrupt is delivered
> directly without a kernel transition. If the receiver isn't running the
> interrupt is delivered when the receiver gets context switched back. If the
> receiver is blocked in the kernel, the user interrupt is delivered to the
> kernel which then unblocks the intended receiver to deliver the interrupt.
>
> 4b) If the sender is the kernel or a device, the uintr_fd can be passed onto
> the related kernel entity to allow them to set up a connection and then generate
> a user interrupt for event delivery. <The exact details of this API are still
> being worked upon.>
>
> For details of the user interface and associated system calls refer to the Uintr
> man-pages draft:
> https://github.com/intel/uintr-linux-kernel/tree/rfc-v1/tools/uintr/manpages.
> We have also included the same content as patch 1 of this series to make it
> easier to review.
>
> Refer to the Uintr compiler programming guide [4] for details on Uintr integration
> with GCC and Binutils.
>
> Kernel design choices
> =====================
> Here are some of the reasons and trade-offs for the current design of the APIs.
>
> System call interface
> ---------------------
> Why a system call interface?: The 2 options we considered are using a char
> device at /dev or use system calls (current approach). A syscall approach
> avoids exposing a core cpu feature through a driver model. Also, we want to
> have a user interrupt FD per vector and share a single common interrupt handler
> among all vectors. This seems easier for the kernel and userspace to accomplish
> using a syscall based approach.
>
> Data sharing using user interrupts: Uintr doesn't include a mechanism to
> share/transmit data. The expectation is applications use existing data sharing
> mechanisms to share data and use Uintr only for signaling.
>
> An FD for each vector: A uintr_fd is assigned to each vector to allow fine
> grained priority and event management by the receiver. The alternative we
> considered was to allocate an FD to the interrupt handler and having that
> shared with the sender. However, that approach relies on the sender selecting
> the vector and moves the vector priority management to the sender. Also, if
> multiple senders want to send unique user interrupts they would need to
> coordinate the vector selection amongst them.
>
> Extending the APIs: Currently, the system calls are only extendable using the
> flags argument. We can add a variable size struct to some of the syscalls if
> needed.
>
> Extending existing mechanisms
> -----------------------------
> Uintr can be beneficial in some of the usages where eventfd() or signals are
> used. Since Uintr is hardware-dependent, thread-specific and bypasses the
> kernel in the fast path, it makes extending existing mechanisms harder.
>
> Main issues with extending signals:
> Signal handlers are defined significantly differently than a User interrupt
> handler. An application needs to save/restore registers in a user interrupt
> handler and call uiret to return from it. Also, signals can be process directed
> (or thread directed) but user interrupts are always thread directed.
>
> Comparison of signals with User Interrupts:
> +=====================+===========================+===========================+
> |                     | Signals                   | User Interrupts           |
> +=====================+===========================+===========================+
> | Stacks              | Has alt stacks            | Uses application stack    |
> |                     |                           | (alternate stack option   |
> |                     |                           | not yet enabled)          |
> +---------------------+---------------------------+---------------------------+
> | Registers state     | Kernel manages incl.      | App responsible (Use GCC  |
> |                     | FPU/XSTATE area           | 'interrupt' attribute for |
> |                     |                           | general purpose registers)|
> +---------------------+---------------------------+---------------------------+
> | Blocking/Masking    | sigprocmask(2)/sa_mask    | CLUI instruction (No per  |
> |                     |                           | vector masking)           |
> +---------------------+---------------------------+---------------------------+
> | Direction           | Uni-directional           | Uni-directional           |
> +---------------------+---------------------------+---------------------------+
> | Post event          | kill(), signal(),         | SENDUIPI <index> - index  |
> |                     | sigqueue(), etc.          | derived from uintr_fd     |
> +---------------------+---------------------------+---------------------------+
> | Target              | Process-directed or       | Thread-directed           |
> |                     | thread-directed           |                           |
> +---------------------+---------------------------+---------------------------+
> | Fork/inheritance    | Empty signal set          | Nothing is inherited      |
> +---------------------+---------------------------+---------------------------+
> | Execv               | Pending signals preserved | Nothing is inherited      |
> +---------------------+---------------------------+---------------------------+
> | Order of delivery   | Undetermined              | High to low vector numbers|
> | for multiple signals|                           |                           |
> +---------------------+---------------------------+---------------------------+
> | Handler re-entry    | All signals except the    | No interrupts can cause   |
> |                     | one being handled         | handler re-entry.         |
> +---------------------+---------------------------+---------------------------+
> | Delivery feedback   | 0 or -1 based on whether  | No feedback on whether the|
> |                     | the signal was sent       | interrupt was sent or     |
> |                     |                           | received.                 |
> +---------------------+---------------------------+---------------------------+
>
> Main issues with extending eventfd():
> eventfd() has a counter value that is core to the API. User interrupts can't
> have an associated counter since the signaling happens at the user level and
> the hardware doesn't have a memory counter mechanism. Also, eventfd can be used
> for bi-directional signaling whereas uintr_fd is uni-directional.
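
To illustrate the counter semantics mentioned above, here is a minimal sketch
using the existing eventfd(2) interface on Linux. The helper name
eventfd_post_and_read() is made up for this example. It posts n events and then
reads them back in one go, which is exactly the accumulate-in-memory behavior
that a user interrupt vector does not have.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Post n events of value 1 to a fresh eventfd, then read the accumulated
 * counter. Returns the counter value, or -1 on error. n == 0 returns 0
 * without reading, since a read on a zero counter would block.
 */
int64_t eventfd_post_and_read(unsigned int n)
{
	uint64_t one = 1, total = 0;
	int fd;

	if (n == 0)
		return 0;

	fd = eventfd(0, 0);
	if (fd < 0)
		return -1;

	for (unsigned int i = 0; i < n; i++) {
		if (write(fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
			close(fd);
			return -1;
		}
	}

	/* A single read returns the sum of all writes and resets the counter. */
	if (read(fd, &total, sizeof(total)) != (ssize_t)sizeof(total)) {
		close(fd);
		return -1;
	}

	close(fd);
	return (int64_t)total;
}
```

The reader learns how many events were posted from the counter value, whereas
with uintr_fd the handler is simply invoked with the vector.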
>
> Comparison of eventfd with uintr_fd:
> +====================+======================+==============================+
> |                    | Eventfd              | uintr_fd (User Interrupt FD) |
> +====================+======================+==============================+
> | Object             | Counter - uint64     | Receiver vector information  |
> +--------------------+----------------------+------------------------------+
> | Post event         | write() to eventfd   | SENDUIPI <index> - index     |
> |                    |                      | derived from uintr_fd        |
> +--------------------+----------------------+------------------------------+
> | Receive event      | read() on eventfd    | Implicit - Handler is        |
> |                    |                      | invoked with associated      |
> |                    |                      | vector.                      |
> +--------------------+----------------------+------------------------------+
> | Direction          | Bi-directional       | Uni-directional              |
> +--------------------+----------------------+------------------------------+
> | Data transmitted   | Counter - uint64     | None                         |
> +--------------------+----------------------+------------------------------+
> | Waiting for events | Poll() family of     | No per vector wait.          |
> |                    | syscalls             | uintr_wait() allows waiting  |
> |                    |                      | for all user interrupts      |
> +--------------------+----------------------+------------------------------+
>
> Security Model
> ==============
> User Interrupts is designed as an opt-in feature (unlike signals). The security
> model for user interrupts is intended to be similar to eventfd(). The general
> idea is that any sender with access to uintr_fd would be able to generate the
> associated interrupt vector for the receiver task that created the fd.
>
> Untrusted processes
> -------------------
> The current implementation expects only trusted and cooperating processes to
> communicate using user interrupts. Coordination is expected between processes
> for a connection teardown. In situations where coordination doesn't happen
> (say, due to abrupt process exit), the kernel would end up keeping shared
> resources (like UPID) allocated to avoid faults.
>
> Currently, a sender can easily cause a denial of service for the receiver by
> generating a storm of user interrupts. A user interrupt handler is invoked with
> interrupts disabled, but upon execution of uiret, interrupts get enabled again
> by the hardware. This can lead to the handler being invoked again before normal
> execution can resume. There isn't a hardware mechanism to mask specific
> interrupt vectors.
>
> To enable untrusted processes to communicate, we need to add a per-vector
> masking option through another syscall (or maybe IOCTL). However, this can add
> some complexity to the kernel code. A vector can only be masked by modifying
> the UITT entries at the source. We need to be careful about races while
> removing and restoring the UPID from the UITT.
>
> Resource limits
> ---------------
> The maximum number of receiver-sender connections would be limited by the
> maximum number of open file descriptors and the size of the UITT.
>
> The UITT size is currently fixed at an arbitrarily chosen 4kB. We plan to
> make it dynamic and configurable in size. RLIMIT_MEMLOCK or ENOMEM should be
> triggered when the size limits have been hit.
>
> Main Opens
> ==========
>
> Blocking for interrupts
> -----------------------
> User interrupts are delivered to applications immediately if they are running
> in userspace. If a receiver task has blocked in the kernel using the placeholder
> uintr_wait() syscall, the task would be woken up to deliver the user interrupt.
> However, if the task is blocked due to any other blocking calls like read(),
> sleep(), etc., the interrupt will only get delivered when the application gets
> scheduled again. We need to consider if applications need to receive User
> Interrupts as soon as they are posted (similar to signals) when they are
> blocked due to some other reason. Adding this capability would likely make the
> kernel implementation more complex.
>
> Interrupting system calls using User Interrupts would also mean we need to
> consider an SA_RESTART type of mechanism. We also need to evaluate if some of
> the signal handler related semantics in the kernel can be reused for User
> Interrupts.
>
> Sharing the User Interrupt Target Table (UITT)
> ----------------------------------------------
> The current implementation assigns a unique UITT to each task. This assumes
> that User interrupts are used for point-to-point communication between 2 tasks.
> Also, this keeps the kernel implementation relatively simple.
>
> However, there are benefits to sharing the UITT between threads of a
> multi-threaded application. One, they would see a consistent view of the UITT.
> i.e. SENDUIPI <index> would mean the same on all threads of the application.
> Also, each thread doesn't have to register itself using the common uintr_fd.
> This would simplify the userspace setup and make efficient use of kernel
> memory. The potential downside is that the kernel implementation to allocate,
> modify, expand and free the UITT would be more complex.
>
> A similar argument can be made for a set of processes that do a lot of IPC
> amongst them. They would prefer to have a shared UITT that lets them target any
> process from any process. With the current file descriptor based approach, the
> connection setup can be time consuming and somewhat cumbersome. We need to
> evaluate if this can be made simpler as well.
>
> Kernel page table isolation (KPTI)
> ----------------------------------
> SENDUIPI is a special ring-3 instruction that makes a supervisor mode memory
> access to the UPID and UITT memory. The current patches need KPTI to be
> disabled for User IPIs to work. To make User IPI work with KPTI, we need to
> allocate these structures from a special memory region that has supervisor
> access but it is mapped into userspace. The plan is to implement a mechanism
> similar to LDT.
>
> Processors that support user interrupts are not affected by Meltdown so the
> auto mode of KPTI will default to off. Users who want to force enable KPTI will
> need to wait for a later version of this patch series to use user interrupts.
> Please let us know if you want the development of these patches to be
> prioritized (or deprioritized).
>
> FAQs
> ====
> Q: What happens if a process is "surprised" by a user interrupt?
> A: Tasks that haven't registered with the kernel and requested user
> interrupts aren't expected or able to receive user interrupts.
>
> Q: Do user interrupts affect kernel scheduling?
> A: No. If a task is blocked waiting for user interrupts, when the kernel
> receives a notification on behalf of that task we only put it back on the
> runqueue. Delivery of a user interrupt in no way changes the scheduling
> priorities of a task.
>
> Q: Does the sender get to know if the interrupt was delivered?
> A: No. User interrupts only provide a posted interrupt delivery mechanism. If
> applications need to rely on whether the interrupt was delivered they should
> consider a userspace mechanism for feedback (like a shared memory counter or a
> user interrupt back to the sender).
>
> Q: Why is there no feedback on interrupt delivery?
> A: Being a posted interrupt delivery mechanism, the interrupt delivery
> happens in 2 steps:
> 1) The interrupt information is stored in a memory location (UPID).
> 2) The physical interrupt is delivered to the interrupt receiver.
>
> The 2nd step could happen immediately, after an extended period, or it might
> never happen based on the state of the receiver after step 1. (The receiver
> could have disabled interrupts, have been context switched out or it might have
> crashed during that time.) This makes it very hard for the hardware to reliably
> provide feedback upon execution of SENDUIPI.
>
> Q: Can user interrupts be nested?
> A: Yes. Using STUI instruction in the interrupt handler would allow new user
> interrupts to be delivered. However, there is no TPR (task priority register)
> like mechanism to allow only higher priority interrupts. Any user interrupt can
> be taken when nesting is enabled.
>
> Q: Can a task receive all pending user interrupts in one go?
> A: No. The hardware allows only one vector to be processed at a time. If a task
> is interested in knowing all the interrupts that are pending then we could add
> a syscall that provides the pending interrupts information.
>
> Q: Do the processes need to be pinned to a cpu?
> A: No. User interrupts will be routed correctly to whichever cpu the receiver
> is running on. The kernel updates the cpu information in the UPID during
> context switch.
>
> Q: Why are UPID and UITT allocated by the kernel?
> A: If allocated by user space, applications could misuse the UPID and UITT to
> write to unauthorized memory and generate interrupts on any cpu. The UPID and
> UITT are allocated by the kernel and accessed by the hardware with supervisor
> privilege.
>
> Patch structure for this series
> ===============================
> - Man-pages and Kernel documentation (patch 1,2)
> - Hardware enumeration (patch 3, 4)
> - User IPI kernel vector reservation (patch 5)
> - Syscall interface for interrupt receiver, sender and vector
>    management(uintr_fd) (patch 6-12)
> - Basic selftests (patch 13)
>
> Along with the patches in this RFC, there are additional tests and samples that
> are available at:
> https://github.com/intel/uintr-linux-kernel/tree/rfc-v1
>
> Links
> =====
> [1]:https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
> [2]:https://libevent.org/
> [3]:https://github.com/axboe/liburing
> [4]:https://github.com/intel/uintr-compiler-guide/blob/uintr-gcc-11.1/UINTR-compiler-guide.pdf
>
> Sohil Mehta (13):
>    x86/uintr/man-page: Include man pages draft for reference
>    Documentation/x86: Add documentation for User Interrupts
>    x86/cpu: Enumerate User Interrupts support
>    x86/fpu/xstate: Enumerate User Interrupts supervisor state
>    x86/irq: Reserve a user IPI notification vector
>    x86/uintr: Introduce uintr receiver syscalls
>    x86/process/64: Add uintr task context switch support
>    x86/process/64: Clean up uintr task fork and exit paths
>    x86/uintr: Introduce vector registration and uintr_fd syscall
>    x86/uintr: Introduce user IPI sender syscalls
>    x86/uintr: Introduce uintr_wait() syscall
>    x86/uintr: Wire up the user interrupt syscalls
>    selftests/x86: Add basic tests for User IPI
>
>   .../admin-guide/kernel-parameters.txt         |   2 +
>   Documentation/x86/index.rst                   |   1 +
>   Documentation/x86/user-interrupts.rst         | 107 +++
>   arch/x86/Kconfig                              |  12 +
>   arch/x86/entry/syscalls/syscall_32.tbl        |   6 +
>   arch/x86/entry/syscalls/syscall_64.tbl        |   6 +
>   arch/x86/include/asm/cpufeatures.h            |   1 +
>   arch/x86/include/asm/disabled-features.h      |   8 +-
>   arch/x86/include/asm/entry-common.h           |   4 +
>   arch/x86/include/asm/fpu/types.h              |  20 +-
>   arch/x86/include/asm/fpu/xstate.h             |   3 +-
>   arch/x86/include/asm/hardirq.h                |   4 +
>   arch/x86/include/asm/idtentry.h               |   5 +
>   arch/x86/include/asm/irq_vectors.h            |   6 +-
>   arch/x86/include/asm/msr-index.h              |   8 +
>   arch/x86/include/asm/processor.h              |   8 +
>   arch/x86/include/asm/uintr.h                  |  76 ++
>   arch/x86/include/uapi/asm/processor-flags.h   |   2 +
>   arch/x86/kernel/Makefile                      |   1 +
>   arch/x86/kernel/cpu/common.c                  |  61 ++
>   arch/x86/kernel/cpu/cpuid-deps.c              |   1 +
>   arch/x86/kernel/fpu/core.c                    |  17 +
>   arch/x86/kernel/fpu/xstate.c                  |  20 +-
>   arch/x86/kernel/idt.c                         |   4 +
>   arch/x86/kernel/irq.c                         |  51 +
>   arch/x86/kernel/process.c                     |  10 +
>   arch/x86/kernel/process_64.c                  |   4 +
>   arch/x86/kernel/uintr_core.c                  | 880 ++++++++++++++++++
>   arch/x86/kernel/uintr_fd.c                    | 300 ++++++
>   include/linux/syscalls.h                      |   8 +
>   include/uapi/asm-generic/unistd.h             |  15 +-
>   kernel/sys_ni.c                               |   8 +
>   scripts/checksyscalls.sh                      |   6 +
>   tools/testing/selftests/x86/Makefile          |  10 +
>   tools/testing/selftests/x86/uintr.c           | 147 +++
>   tools/uintr/manpages/0_overview.txt           | 265 ++++++
>   tools/uintr/manpages/1_register_receiver.txt  | 122 +++
>   .../uintr/manpages/2_unregister_receiver.txt  |  62 ++
>   tools/uintr/manpages/3_create_fd.txt          | 104 +++
>   tools/uintr/manpages/4_register_sender.txt    | 121 +++
>   tools/uintr/manpages/5_unregister_sender.txt  |  79 ++
>   tools/uintr/manpages/6_wait.txt               |  59 ++
>   42 files changed, 2626 insertions(+), 8 deletions(-)
>   create mode 100644 Documentation/x86/user-interrupts.rst
>   create mode 100644 arch/x86/include/asm/uintr.h
>   create mode 100644 arch/x86/kernel/uintr_core.c
>   create mode 100644 arch/x86/kernel/uintr_fd.c
>   create mode 100644 tools/testing/selftests/x86/uintr.c
>   create mode 100644 tools/uintr/manpages/0_overview.txt
>   create mode 100644 tools/uintr/manpages/1_register_receiver.txt
>   create mode 100644 tools/uintr/manpages/2_unregister_receiver.txt
>   create mode 100644 tools/uintr/manpages/3_create_fd.txt
>   create mode 100644 tools/uintr/manpages/4_register_sender.txt
>   create mode 100644 tools/uintr/manpages/5_unregister_sender.txt
>   create mode 100644 tools/uintr/manpages/6_wait.txt
>
>
> base-commit: 6880fa6c56601bb8ed59df6c30fd390cc5f6dd8f

Hi All,
My apologies if this email was sent twice.
I was not sure if my previous email followed the proper reply instruction.
I resent this email using the first reply method (Saving the mbox file, 
importing it into my mail client, and using reply-to-all from there).
The following is our understanding of the proposed User Interrupt.

----------------------------------------------------------------------------------------

We have been exploring how user-level interrupts (UIs) can be used to
improve performance and programmability in several different areas:
e.g., parallel programming, memory management, I/O, and floating-point
libraries.  Before we venture into the discussion here, we want to
make sure we understand the proposed model. We describe our
understanding below in four sections:

1. Current target use cases
2. Preparing for future use cases
3. Basic Understanding
4. Multi-threaded parallel programming example

If people on this thread could either confirm or point out our
misunderstandings, we would appreciate it.

# Current Use Cases

The current RFC is focused on sending an interrupt from one user-space
thread (UST) to another user-space thread (UST2UST).  These threads
could be in different processes, as long as the sender has access to
the receiver's User Interrupt File Descriptor (uifd).  Based on our
understanding, UIs are currently targeted as a low overhead
alternative for the current IPC mechanisms.

# Preparing for future use cases

Based on the RFC, we are aware that allowing a device and the kernel
to send a UI is still in development.  Both these cases would support
imprecise interrupts.  We can see a clear use case for the Device to
user-space thread (D2UST) UI, for example, supporting a fast way for a
GPU to inform a thread that it has finished executing a particular
kernel. If someone could point out an example for Kernel to
user-space thread (K2UST) UI, we would appreciate it.

In our work, we have also been exploring precise UIs from the
currently running thread.  We call these CPU to UST (CPU2UST) UIs.
For example, a SIGSEGV generated by writing to a read-only page, or a
SIGFPE generated by dividing a number by zero.

- QUESTION: Is there a rough draft/plan that we can refer to that
   describes the current thinking on these three cases?

- QUESTION: Are there use cases for K2UST, or is K2UST the same as CPU2UST?

# Basic Understanding

First, we would like to make sure that our understanding of the 
terminology and the data structures is correct.

- User Interrupt Vector (UIV): The identity of the user interrupt.
- User Interrupt Target Table (UITT):
   This allows the sender to locate the "address" of the receiver
   through the uifd.
- ui_frame: Argument passed to the UI handler. It contains a stack
   pointer, saved flags, and an instruction pointer.
- Sender: The thread that issues the `_senduipi`.
- Receiver: The thread that receives the UI from the sender.

The following outlines our understanding of the current API for UIs.

- Each thread that can receive UIs has exactly one handler
   registered with `uintr_register_handler` (a syscall).
- Each thread that registers a handler calls `uintr_create_fd` for
   every user-level interrupt vector (UIV) that it expects to receive.
- The only information delivered to the handler is the UIV.
- There are 64 UIVs that can be used per thread.
- A thread that wants to send a UI must register the receiver's uifd
   with `uintr_register_sender` (a syscall).
   This returns an index the sender uses to locate the receiver.
- `_senduipi(index)` sends a user interrupt to a particular destination.
   The sender's UITT and index determine the destination.
- A thread uses `_stui` (and `_clui`) to enable (and disable) the
   reception of UIs.
- For now, there is no mechanism to mask a particular UIV.
- A UI is delivered to the receiver immediately only if it is currently
   running.
- If a thread executes `uintr_wait()`, it will be scheduled only
   after receiving a UI.
   There is no guarantee on the delay between the processor receiving
   the UI and when the thread is scheduled.
- If a thread is the target of a UI and another thread is running, or
   the target thread is blocked in the kernel,
   then the target thread will handle the UI when it is next scheduled.
- Ordinary interrupts (interrupts delivered with CPL=0) have higher
   priority than user interrupts.
- The UI handler saves only general-purpose registers (e.g., it does not
   save floating-point registers).
- User Interrupts with a higher UIV are given higher priority than those
   with a smaller UIV.
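
To make the 64-vector limit and the priority rule concrete, here is a
minimal software model in plain C. It is only a sketch of our
understanding: the 64-bit `pending` bitmap and the helper names
`pending_post` and `pending_take_highest` are assumptions made for
illustration and are not part of the RFC's API or of the hardware
definition.

```c
#include <assert.h>
#include <stdint.h>

#define NR_UIVS 64  /* 64 UIVs per thread, as listed above */

/* Hypothetical model: one bit per pending vector. */
void pending_post(uint64_t *pending, unsigned vector)
{
	if (vector < NR_UIVS)
		*pending |= (uint64_t)1 << vector;
}

/*
 * Hand out one vector per call, highest UIV first, and clear it.
 * Returns -1 when nothing is pending.
 */
int pending_take_highest(uint64_t *pending)
{
	for (int v = NR_UIVS - 1; v >= 0; v--) {
		uint64_t bit = (uint64_t)1 << v;

		if (*pending & bit) {
			*pending &= ~bit;
			return v;
		}
	}
	return -1;
}
```

In this model, posting the same vector twice before it is taken leaves a
single pending bit, so the receiver sees it once.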

## Private UITT

The current RFC focuses on a private UITT where each thread has its own
UITT.  Thus, different threads executing `_senduipi(index1)` with the
same `index1` may cause different receiver threads to be interrupted.

In many cases, the receiver of an interrupt needs to know which thread
sent the interrupt. If we understand the proposal correctly, there are
only 64 user-level interrupt vectors (UIVs), and the UIV is the only
information transmitted to the receiver. The UIV itself allows the
receiver to distinguish different senders through careful management
of the receiver's UIV.

- QUESTION: Given the code below where the same UIV is registered twice:
```c
   uintr_fd1 = uintr_create_fd(vector1, flags);
   uintr_fd2 = uintr_create_fd(vector1, flags);
```
Would `uintr_fd1` be the same as `uintr_fd2`, or would it be registered 
with a different index in the UITT table?

- QUESTION: If it is registered in a different index, would the
   receiver be able to distinguish the sender if `uintr_fd1` and
   `uintr_fd2` are used from two different threads?

- QUESTION: What is the intended future use of the `flags` argument?

## Shared UITT

In the case of the shared UITT model, all the threads share the same
UITT and thus, if two different threads execute `_senduipi(index)`
with the same index, they would both cause an interrupt in the
same destination/receiver.

- QUESTION: Since both threads use the same entry (same
   destination/receiver), does this mean that the receiver will not be
   able to distinguish the sender of the interrupt?

# Multi-threaded parallel programming example

One of the uses for UIs that we have been exploring is combining the
message-passing and shared memory models for parallel programming.  In
our approach, message-passing is used for synchronization and shared
memory for data sharing.  The message passing part of the programming
pattern is based loosely on Active Messages (See ISCA92), where a
particular thread can turn off/on interrupts to ignore incoming
messages so they can execute critical sections without having to
notify any other threads in the system.

- QUESTION: Is there any data on the performance impact of `_stui` and 
`_clui`?

----------------------------------------------------------------------------------------


Thank you.
Best regards,
Chrisma

Sohil Mehta Jan. 7, 2022, 2:08 a.m. UTC | #51

Hi Chrisma,

On 12/22/2021 8:17 AM, Chrisma Pakha wrote:
> 
> The following is our understanding of the proposed User Interrupt.
> 

Thank you for giving this some thought.

> 
> We have been exploring how user-level interrupts (UIs) can be used to
> improve performance and programmability in several different areas:
> e.g., parallel programming, memory management, I/O, and floating-point
> libraries.

Can you please share more details on this? It would really help improve 
the API design.

> 
> # Current Use Cases
> 
> The Current RFC is focused on sending an interrupt from one user-space
> thread (UST) to another user-space thread (UST2UST).  These threads
> could be in different processes, as long as the sender has access to
> the receiver's User Interrupt File Descriptor (uifd).  Based on our
> understanding, UIs are currently targeted as a low overhead
> alternative for the current IPC mechanisms.
> 

That's correct.

> # Preparing for future use cases
> If someone could point out an example for Kernel to
> user-space thread (K2UST) UI, we would appreciate it.
> 

The idea here is to improve the kernel-to-user event notification latency.
Theoretically, this can be useful when the kernel sees event completion
on one CPU but wants to signal (notify) a thread actively running on
some other CPU. The receiver thread can save some cycles by avoiding
ring transitions to receive the event.

IO_URING is one of the examples for kernel-to-user event notifications. 
We are evaluating whether providing a UINTR based completion mechanism 
can offer a benefit over eventfd based completions. The benefits in
practice are yet to be measured and proven.

> In our work, we have also been exploring precise UIs from the
> currently running thread.  We call these CPU to UST (CPU2UST) UIs.
> For example, a SIGSEGV generated by writing to a read-only page, a
> SIGFPE generated by dividing a number by zero.
> 

It is definitely possible in the future to deliver CPU events as User
Interrupts. The hardware architecture for this is still being worked on 
internally.

That said, our focus isn't on exceptions being delivered as User Interrupts.
Do you have details on what type of benefit is expected?

> - QUESTION: Is there a rough draft/plan that we can refer to that
> describes the current thinking on these three cases?
> 
> - QUESTION: Are there use cases for K2UST, or is K2UST the same as CPU2UST?
> 

No, K2UST isn't the same as CPU2UST. We would expect limited benefits 
from K2UST but on the other hand CPU2UST can provide significant speedup 
since it avoids the kernel completely.

Unfortunately, due to the large scope of the feature, the hardware 
architecture development is happening in stages. I don't have detailed 
plans for each of the sources of User Interrupts.

Here is our rough plan:

1. Provide a common infrastructure to receive User Interrupts. This is 
independent of the source of the interrupt. The intention here is to 
keep the software APIs generic and extendable so that future sources can 
be added without causing much disturbance to the older APIs.

2. Introduce various sources of User Interrupts in stages:

UST2UST - This RFC. Available in the upcoming Sapphire Rapids processor.

K2UST - Also available in upcoming Sapphire Rapids. Working towards 
proving the value before sending something out.

D2UST - Future processor. Hardware architecture being worked on 
internally. Not much to share right now.

CPU2UST - Future processor. Hardware architecture being worked on 
internally. Not much to share right now.

> # Basic Understanding
> 

The overall description you have mentioned below looks good to me. I 
have added some minor comments for clarification.

Also, the abbreviations that you have used are somewhat different from 
the ones I have used in the patches.

> First, we would like to make sure that our understanding of the 
> terminology and the data structures is correct.
> 
> - User Interrupt Vector (UIV): The identity of the user interrupt.
> - User Interrupt Target Table (UITT):
>    This allows the sender to locate the "address" of the receiver 
> through the uifd.

The UITT refers to the 'UPID' address which is different from the uifd 
that you mention below.

> Below outlines our understanding of the current API for UIs.
> 

All of the statements below seem accurate.

However, some of the restrictions below are due to hardware design and 
some are mainly due to the software implementation. The software design 
and APIs might change significantly as this patch series evolves.

Please feel free to provide input wherever you think the APIs can be 
improved.

> - Each thread that can receive UIs has exactly one handler
>    registered with `uintr_register_handler` (a syscall).
> - Each thread that registers a handler calls `uintr_create_fd` for
>    every user-level interrupt vector (UIV) that they expect to receive.
> - The only information delivered to the handler is the UIV.
> - There are 64 UIVs that can be used per thread.

Though only one generic handler is registered with the hardware, an 
application can choose to implement 64 unique sub-handlers in user space 
based on each unique UIV.
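
A rough sketch of such a dispatcher in plain C, for illustration only.
The names (`set_subhandler`, `dispatch_uiv`, `count_uiv`) are
hypothetical and are not part of the patches or of any library. The
function `dispatch_uiv()` stands in for the body of the single generic
handler and would be called with the vector that handler received.

```c
#include <assert.h>
#include <stddef.h>

#define NR_UIVS 64

typedef void (*uiv_subhandler_t)(unsigned vector);

static uiv_subhandler_t subhandlers[NR_UIVS];

/* Install a sub-handler for one vector. Returns 0, or -1 on a bad vector. */
int set_subhandler(unsigned vector, uiv_subhandler_t fn)
{
	if (vector >= NR_UIVS)
		return -1;
	subhandlers[vector] = fn;
	return 0;
}

/* Returns 1 if a sub-handler ran for this vector, 0 otherwise. */
int dispatch_uiv(unsigned vector)
{
	if (vector >= NR_UIVS || subhandlers[vector] == NULL)
		return 0;
	subhandlers[vector](vector);
	return 1;
}

/* Example sub-handler: counts how often each vector fired. */
unsigned long uiv_counts[NR_UIVS];

void count_uiv(unsigned vector)
{
	uiv_counts[vector]++;
}
```

The table would need to be fully set up before user interrupts are
enabled, since this sketch does no locking.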

> - A thread that wants to send a UI must register the receiver's uifd 
> with `uintr_register_sender`  (a syscall).
>    This returns an index the sender uses to locate the receiver.
> - `_senduipi(index)` sends a user interrupt to a particular destination.
>    The sender's UITT and index determine the destination.
> - A thread uses `_stui` (and `_clui`) to enable (and disable) the 
> reception of UIs.
> - As for now, there is no mechanism to mask a particular UIV.
> - A UI is delivered to the receiver immediately only if it is currently 
> running.
> - If a thread executes the `uintr_wait()`, it will be scheduled only 
> after receiving a UI.
>    There is no guarantee on the delay between the processor receiving 
> the UI and when the thread is scheduled.
> - If a thread is the target of a UI and another thread is running, or 
> the target thread is blocked in the kernel,
>    then the target thread will handle the UI when it is next scheduled.
> - Ordinary interrupts (interrupt delivered with CPL=0) have a higher 
> priority over user interrupts.
> - UI handler only saves general-purpose registers (e.g., do not save 
> floating-point registers).

The saving and restoring of the registers is done by gcc when the -muintr
flag along with the 'interrupt' attribute is used. Applications can 
choose to save floating point registers as part of the interrupt handler 
as well.

To make it easier for applications we are working on implementing a thin 
library that can help with some of this common functionality like saving 
floating point registers or redirecting to 64 sub-handlers.

> - User Interrupts with higher UIV are given a higher priority than those 
> with smaller UIV.
> 
> ## Private UITT
> 
> The Current RFC focuses on a private UITT where each thread has its own
> UITT.  Thus, different threads executing `_senduipi(index1)` with the
> same `index1` may cause different receiver threads to be interrupted.
> 

That's right.

> In many cases, the receiver of an interrupt needs to know which thread
> sent the interrupt. If we understand the proposal correctly, there are
> only 64 user-level interrupt vectors (UIVs), and the UIV is the only
> information transmitted to the receiver. The UIV itself allows the
> receiver to distinguish different senders through careful management
> of the receiver's UIV.
> 

That's correct. User Interrupts mainly provide a doorbell mechanism
with the actual data expected to be shared through some existing mechanism.

If multiple senders want to share the same interrupt vector then they 
would have to rely on some sort of shared memory (or similar) mechanism 
to relay the relevant information to the receiver. This would likely 
come with some latency cost.
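
One possible shape for such a shared-memory scheme, as a minimal sketch
using C11 atomics. Everything here is an assumption for illustration:
the `sender_flags` layout, the helper names, and the limit of 8 senders.
The doorbell itself (SENDUIPI) is left out so the sketch runs anywhere.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_SENDERS 8  /* arbitrary for this sketch */

/* Lives in memory shared by the receiver and all senders. */
struct sender_flags {
	atomic_uint posted[MAX_SENDERS];
};

/* Sender side: record who is sending. The doorbell would follow. */
int mark_sender(struct sender_flags *sf, unsigned id)
{
	if (id >= MAX_SENDERS)
		return -1;
	atomic_fetch_add_explicit(&sf->posted[id], 1, memory_order_release);
	return 0;
}

/*
 * Receiver side, called from the handler for the shared vector: return a
 * bitmask of senders that posted since the last call and reset them.
 */
unsigned collect_senders(struct sender_flags *sf)
{
	unsigned mask = 0;

	for (unsigned id = 0; id < MAX_SENDERS; id++)
		if (atomic_exchange_explicit(&sf->posted[id], 0,
					     memory_order_acquire))
			mask |= 1u << id;
	return mask;
}
```

The extra scan in the handler is where the latency cost mentioned above
would show up.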

> - QUESTION: Given the code below where the same UIV is registered twice:
> ```c
>    uintr_fd1 = uintr_create_fd(vector1, flags)
>    uintr_fd2 = uintr_create_fd(vector1, flags)
> ```
> Would `uintr_fd1` be the same as `uintr_fd2`, or would it be registered 
> with a different index in the UITT table?

In the current design, if the same thread tries to register the same 
vector again, the second uintr_create_fd() would fail with an EBUSY error
code.
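
As a simplified model of that check (not the kernel code from the
patches), the per-thread bookkeeping can be pictured as a bitmap of
vectors that already have an fd. The struct, the function name, and the
-EINVAL case for an out-of-range vector are assumptions of this sketch,
and the model returns 0 where the real call would return a file
descriptor.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-thread record of vectors that already have an fd. */
struct vector_table {
	uint64_t in_use;  /* bit v set: vector v is already registered */
};

/* Returns 0 on success or a negative errno value. */
int model_claim_vector(struct vector_table *t, unsigned vector)
{
	if (vector >= 64)
		return -EINVAL;  /* assumption of this sketch */
	if (t->in_use & ((uint64_t)1 << vector))
		return -EBUSY;   /* same vector registered twice */
	t->in_use |= (uint64_t)1 << vector;
	return 0;
}
```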

> 
> - QUESTION: If it is registered in a different index, would the
>    receiver be able to distinguish the sender if `uintr_fd1` and
>    `uintr_fd2` are used from two different threads?
> 
> - QUESTION: What is the intended future use of the `flags` argument?
> 

In the uintr_create_fd() call, flags would be used to provide options 
such as O_CLOEXEC. In general, I added a flags argument to all the system
calls to keep them extendable when new boolean options need to be added.

> ## Shared UITT
> 
> In the case of the shared UITT model, all the threads share the same
> UITT and thus, if two different threads execute `_senduipi(index)`
> with the same index, they would both cause an interrupt in the
> same destination/receiver.
> 
> - QUESTION: Since both threads use the same entry (same
>    destination/receiver), does this mean that the receiver will not be
>    able to distinguish the sender of the interrupt?
> 

Yes. However, this is true even in the case of a private UITT. This isn't
because the senders used the same UITT index; rather, it is the result of
the senders generating the same UIVs.

For example, suppose a receiver created 2 FDs with 2 unique vectors:

	uintr_fd1 = uintr_create_fd(vector1, flags);
	uintr_fd2 = uintr_create_fd(vector2, flags);

In the case of a private UITT, both sender threads can register
themselves with uintr_fd1. They might get different uitt indexes 
returned to them. But when they generate a User interrupt using their 
respective index, the end result would be the same. The receiver will 
see the same vector1 being generated. There is no way for the receiver 
to distinguish the sender without some additional information being 
shared somewhere.
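
The same point as a small software model in C. This is a sketch under
stated assumptions, not the kernel or hardware layout: each UITT entry
is modeled as a pointer to the receiver's state plus the vector to
post, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Receiver-side state: one pending bit per vector. */
struct model_upid {
	uint64_t pending;
};

/* One entry of a sender's private table. */
struct model_uitt_entry {
	struct model_upid *upid;
	unsigned vector;
};

/* Model of SENDUIPI <index>: post the entry's vector to its receiver. */
void model_senduipi(const struct model_uitt_entry *uitt, unsigned index)
{
	uitt[index].upid->pending |= (uint64_t)1 << uitt[index].vector;
}
```

If sender A holds the entry at index 0 of its table and sender B holds
it at index 3 of its own table, both for the same receiver and vector,
the receiver's pending state is identical after either send. The index
never reaches the receiver.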

> # Multi-threaded parallel programming example
> 
> One of the uses for UIs that we have been exploring is combining the
> message-passing and shared memory models for parallel programming.  In
> our approach, message-passing is used for synchronization and shared
> memory for data sharing.  The message passing part of the programming
> pattern is based loosely on Active Messages (See ISCA92), where a
> particular thread can turn off/on interrupts to ignore incoming
> messages so they can execute critical sections without having to
> notify any other threads in the system.
> 

This looks like a good fit for the User IPI (UST2UST) implementation in
this RFC. Have you had a chance to evaluate the current API design for 
this usage?

Also, is any of the above work publicly available?

> - QUESTION: Is there any data on the performance impact of `_stui` and 
> `_clui`?
> 

_stui and _clui are expected to have very minimal overhead since they 
only modify a local flag. I'll try to measure this next time I am doing 
some performance measurement.

Thanks,
Sohil

Chrisma Pakha Jan. 17, 2022, 1:14 a.m. UTC | #52

Hi Sohil,

Thank you for your reply and the clarification.

>>
>> We have been exploring how user-level interrupts (UIs) can be used to
>> improve performance and programmability in several different areas:
>> e.g., parallel programming, memory management, I/O, and floating-point
>> libraries.
>
> Can you please share more details on this? It would really help 
> improve the API design.
>
Of course! Below we describe a few use cases for both user-level 
interrupts (UIs) and user-level exceptions (UEs). We realize that the 
current proposal is targeted towards UIs, but we also describe some UEs 
use cases because we believe handling exceptions without going through 
the kernel may provide even more of a benefit than UIs. We hope these 
use cases can influence the direction of the API so that it can be made 
forward compatible for future hardware revisions.

To be clear, we distinguish between interrupts (generated from an
external source, such as another core or device), which are most likely
imprecise and asynchronous, and exceptions (generated by the currently
executing program), which need to be precise and synchronous.

# UST2UST

A UI is a mechanism to allow two or more threads to communicate with one 
another asynchronously without requiring the intervention of the kernel 
or a change in privilege. We believe that having UIs can help integrate 
the shared memory model and message passing model for multicore 
processors. This integration makes it easier to build parallel programs, 
allowing developers to take advantage of both models. The shared memory 
model provides an easy way to share data between threads, while the 
message passing model can be used for synchronization between threads.

In the following section, we will describe two use cases for UIs.
- We show how UIs can be used to improve parallel program performance by 
reducing the overhead of exposing parallelism.
- We show how UIs can be used to build efficient active messages.

Both of the use cases we present below require the receiver of a UI to 
know which thread issued it. At the end of the email we describe how we 
would implement this using the current API and suggest an alternative, 
and possibly more streamlined approach.

## Lazy Work Stealing

One of the hurdles in writing parallel programs is ensuring that the 
cost of parallelizing the code does not become a bottleneck in program 
performance. Some of these overheads come from unnecessarily exposing 
too much parallelism, even if all cores are busy. One mechanism to 
reduce this overhead is to lazily expose parallelism only when it is 
needed. This can be done through stack unwinding (similar to how 
Exception Handling works). Whenever a thread (thief) asks for work from 
another thread (victim), the victim will perform stack unwinding, 
creating the work for the thief. This approach to lazy thread creation 
requires some mechanism for the thief to ask for work.

We have implemented a prototype compiler and runtime for this mechanism. 
Our runtime requires a mechanism for the thief to signal the victim when 
it needs work. We implement this signaling through polls because the 
current IPI mechanism is too expensive to use. However, requiring the 
victim to poll can introduce excessive polling overhead and/or introduce 
significant latency between the request and the response. The compiler 
tries to keep the overhead of polling low (<5%) while still ensuring 
that the latency between a work-stealing request and its response is as 
low as possible. Currently, we essentially only poll for work requests 
in the function prologue, keeping the overhead to about 2% of execution 
time on average. This works well for almost all applications. However, 
in some applications, this can add hundreds of microseconds of latency to the
response of a work-stealing request.
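
As an illustration only, the prologue poll can be pictured roughly as
follows. This is a simplified sketch with hypothetical names, not the
prototype's actual code, and the `served` counter is a stand-in for the
stack unwinding that creates work for the thief.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One request flag per victim in this sketch. */
struct steal_request {
	atomic_bool requested;  /* set by a thief, cleared by the victim */
	unsigned served;        /* touched only by the victim */
};

/* Thief side: ask the victim for work. */
void request_work(struct steal_request *r)
{
	atomic_store_explicit(&r->requested, true, memory_order_release);
}

/*
 * Victim side: the check placed in a function prologue. The common case
 * is a single relaxed load that finds no request.
 */
bool poll_for_steal(struct steal_request *r)
{
	if (!atomic_load_explicit(&r->requested, memory_order_relaxed))
		return false;
	if (!atomic_exchange_explicit(&r->requested, false,
				      memory_order_acquire))
		return false;
	r->served++;  /* stand-in for creating work for the thief */
	return true;
}
```

The latency problem is visible here: a request posted just after a poll
waits until the victim reaches the next prologue.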

One reason we use polling today, instead of the victim just taking work, 
is that there are points in the program where work-stealing is not 
allowed. So, in addition to having an inexpensive mechanism to request 
work, we need an inexpensive method to disallow the requests. IOW, the 
compiler only inserts polls at points where it is safe to do so. With 
the UI mechanism in the proposed API, we could signal a work-stealing 
request with a UST2UST UI and disallow such requests by disabling 
interrupts. One nice advantage of the current proposal is that disabling 
interrupts is a *local* operation, making it very inexpensive and not 
causing any interference with the rest of the threads. In other words, 
an important benefit of the proposed UI mechanism is that we can ensure 
atomicity (with respect to work stealing) without having to do any 
global communication.
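
The following is a portable simulation of that masking behaviour, not code that uses the proposed API. The `masked` field plays the role of the per-thread user-interrupt flag (which we would expect to toggle with `clui`/`stui` on hardware that supports UIs), and `deliver_pending` stands in for the hardware invoking the handler. All names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical victim state for a software model of UI masking. */
struct victim {
    bool masked;          /* touched only by the owning thread */
    atomic_uint pending;  /* steal requests posted by thieves, not yet handled */
    unsigned handled;     /* requests the victim has serviced */
};

static void victim_init(struct victim *v)
{
    v->masked = false;
    atomic_init(&v->pending, 0);
    v->handled = 0;
}

/* Thief side: post a request (on real hardware, a UST2UST UI). */
static void thief_request(struct victim *v)
{
    atomic_fetch_add_explicit(&v->pending, 1, memory_order_release);
}

/* Models delivery: requests are serviced only while the victim is unmasked. */
static void deliver_pending(struct victim *v)
{
    if (v->masked)
        return;  /* requests stay pending; the thief is not involved */
    v->handled += atomic_exchange_explicit(&v->pending, 0, memory_order_acquire);
}

/* Enter a region where stealing is not allowed: a plain local store. */
static void steal_disable(struct victim *v)
{
    v->masked = true;
}

/* Leave the region: anything that arrived meanwhile is delivered now. */
static void steal_enable(struct victim *v)
{
    v->masked = false;
    deliver_pending(v);
}
```

Note that `steal_disable` needs no atomic operation and no message to other threads, which is the property we care about.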

## Implementing Active Messages

Active messages can efficiently support the message passing parallel 
programming model. With the proposed API, the UI could signal that an AM 
is being delivered while shared memory data structures could be used for 
the payload. As described in the above use case, this would allow 
receiving threads to provide atomicity by disabling interrupts without 
any global communication.
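
As a rough illustration of "UI for the signal, shared memory for the payload", the sketch below implements a single-sender, single-receiver mailbox with C11 atomics. Everything here (`struct am_mailbox`, `am_send`, `am_drain`, `am_add`) is hypothetical; the point where the sender would raise the UI is only marked with a comment, and `am_drain` is what a UI handler might call on the receiver.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define AM_SLOTS 8  /* ring capacity; a power of two */

typedef void (*am_handler_fn)(void *ctx, long arg);

struct am_msg {
    am_handler_fn fn;  /* code to run on the receiver */
    long arg;          /* payload; real messages could point into shared memory */
};

struct am_mailbox {
    struct am_msg slot[AM_SLOTS];
    atomic_size_t head;  /* next slot to read; written only by the receiver */
    atomic_size_t tail;  /* next slot to write; written only by the sender */
};

static void am_init(struct am_mailbox *mb)
{
    atomic_init(&mb->head, 0);
    atomic_init(&mb->tail, 0);
}

/* Sender: publish the payload, then (on real hardware) raise the UI. */
static bool am_send(struct am_mailbox *mb, am_handler_fn fn, long arg)
{
    size_t tail = atomic_load_explicit(&mb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&mb->head, memory_order_acquire);
    if (tail - head == AM_SLOTS)
        return false;  /* mailbox full */
    mb->slot[tail % AM_SLOTS] = (struct am_msg){ fn, arg };
    atomic_store_explicit(&mb->tail, tail + 1, memory_order_release);
    /* A UST2UST interrupt to the receiver would be sent here. */
    return true;
}

/* Receiver: run every queued message; returns how many were handled. */
static size_t am_drain(struct am_mailbox *mb, void *ctx)
{
    size_t head = atomic_load_explicit(&mb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&mb->tail, memory_order_acquire);
    size_t handled = 0;
    while (head != tail) {
        struct am_msg m = mb->slot[head % AM_SLOTS];
        m.fn(ctx, m.arg);
        head++;
        handled++;
    }
    atomic_store_explicit(&mb->head, head, memory_order_release);
    return handled;
}

/* Example message handler: accumulate the argument into a long. */
static void am_add(void *ctx, long arg)
{
    *(long *)ctx += arg;
}
```

With interrupts disabled on the receiver, messages simply accumulate in the mailbox until the receiver re-enables them and drains the queue.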

Clearly, having a shared address space makes data access and management 
easier for parallel programming. On the other hand, controlling access 
to that data can often be cleaner to implement in message passing 
models. Dogan et al. have shown promising improvements by using explicit 
messaging hardware to accelerate Machine Learning and Graphs workloads 
(see [DHAKWETAL17,DAKK19]). Explicit messaging is used as a 
synchronization mechanism and has better scalability than shared 
memory-based synchronization. The current proposal would support this 
model integration with significantly lower overheads and lower latencies 
compared to what are available on today's machines.

------------------------------------------------------------------------
# D2UST

Applications that frequently interact with external devices can benefit 
from UIs. To achieve high performance, conventional IO approaches 
through the kernel are not appealing because they incur high overhead. They 
require context switching and data transfer from kernel-space to 
user-space, possibly polluting the cache and TLB. One way to 
bypass the kernel is to pin pages to specific physical addresses, 
where these pages act as a buffer between user-space and the device. 
However, since the device cannot directly interrupt the UST, the UST 
needs to poll to check whether data from the device is available, and 
that polling can easily erase any potential performance 
improvements offered by bypassing the kernel in the first place. 
Allowing a device to interrupt a UST under the proposed API would 
eliminate the need to poll and support atomicity as required, which could 
significantly improve application performance.

This would be particularly useful when an application uses a GPU as an 
accelerator for parallel computation and the CPU for serial computation (see 
[WBSACETAL08]). An example would be K-means. Finding which cluster each 
point belongs to is done on the GPU (in parallel), while the means are 
computed on the CPU (serially). As this process is 
iterative, there are multiple transitions between the CPU 
and GPU. Without UIs, the only real option is to poll for GPU task 
completion, complicating control flow if there is also other work for 
the CPU thread to do. With UIs, keeping the GPU busy can be handled by 
the UI handler. The result would be cleaner code and better load 
balancing. To make this work, the D2UST interrupt will have to ensure 
that the process that started the task on the GPU is the same one that 
is currently running. When a different process is running, the interrupt 
will have to be saved by the kernel so it can be delivered to the UST 
when it is next scheduled.

------------------------------------------------------------------------

# CPU2UST

Providing a low cost user-level exception mechanism could fundamentally 
change the approach to implementing many algorithms. Examples range 
across many common tasks, e.g., checking for valid pointers, 
preprocessing floating-point data, garbage collection, etc. Today, due 
to the high cost of exception handling, programmers go to great lengths 
to ensure that exceptions do not happen. Unfortunately, this leads to 
more code and often less performance. Below we describe different 
scenarios where UEs could potentially reduce programming effort and/or 
improve performance.

## API for CPU2UST

For the examples below, we propose a small modification to the proposed 
API to support exceptions. We propose that a handler be registered for a 
particular fault to distinguish the exception type. Potentially, the 
`flags` argument could hold the `signum`, or a bit in the `flags` 
argument could indicate that a third parameter was being included with 
the `signum`. We suggest including `signum` in the current API for 
future use.

```
int uintr_register_handler(u64 handler_address, unsigned int flags, int signum);
```

Since each handler is registered for a particular exception, the handler 
itself would only have one argument, a pointer to the `__uintr_frame`. 
In some cases, the handler might need the `error_code` information 
(e.g., for a page-fault), which could be obtained using a new function, 
`unsigned long long __get_ue_errorcode(void)`.

```
__attribute__ ((interrupt))
void
handlerFunction (struct __uintr_frame *ui_frame)
{
   // Get error code if needed
   // unsigned long long error_code = __get_ue_errorcode();
   ...
}
```

We envision four ways for the user handler to manipulate the thread's 
state. Here we assume that a UE is handled by the thread that causes the 
exception.

1. Continuing the faulting thread of control.
2. Suspending a faulting thread or continuing another thread in the same 
process.
3. Deferring processing of the fault back to the kernel.
4. Or, finally, terminating the thread of control.

In case 1, where the faulting thread is continued, the handler can 
simply use `uiret` (it could potentially modify the return address on 
the stack to change where execution continues). For case 2, we do not 
have a proposed API yet, but potentially some set of functions that 
extend pthreads might be appropriate. For case 3, the handler would use 
a trap to signal that the kernel should continue processing the 
exception. The compiler would have to restore registers appropriately 
before the trap is executed.
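
As a sketch of how a handler might choose between cases 1 and 3, the code below looks up the faulting address in a table of recovery points, in the spirit of an exception fixup table. Nothing here is part of the RFC: `struct ue_frame` is a stand-in for the frame passed to the handler (its field names are assumptions), and `ue_dispatch`, `struct ue_fixup` and `enum ue_action` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the interrupt frame; the field names are assumed. */
struct ue_frame {
    uint64_t rip;
    uint64_t rflags;
    uint64_t rsp;
};

enum ue_action {
    UE_RESUME,          /* case 1: return to user code with uiret */
    UE_DEFER_TO_KERNEL  /* case 3: trap so the kernel continues processing */
};

/* A code range that knows how to recover from a fault inside it. */
struct ue_fixup {
    uint64_t start;      /* first address covered (inclusive) */
    uint64_t end;        /* last address covered (exclusive) */
    uint64_t resume_at;  /* where execution should continue */
};

/* Decide what to do with an exception raised at frame->rip. */
static enum ue_action ue_dispatch(struct ue_frame *frame,
                                  const struct ue_fixup *table, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (frame->rip >= table[i].start && frame->rip < table[i].end) {
            /* Rewrite the return address so execution resumes at the recovery point. */
            frame->rip = table[i].resume_at;
            return UE_RESUME;
        }
    }
    return UE_DEFER_TO_KERNEL;  /* no user-level recovery registered */
}
```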

## Binary rewriting

Binary rewriting is a valuable technique for debugging, optimizing, 
repairing, emulating, and hardening (tightening security) a program 
[WMUW19]. One implementation of binary rewriting is to replace the 
probed instructions (instrumentation points) with a redirect instruction 
(either jump or trap) to the patch instructions. Most developers use 
jump instructions instead of traps due to their lower cost. However, 
because instructions have variable encoding lengths, inserting jump 
instructions requires care, e.g., "instruction punning" [CSDN17] with a 
combination of padding and eviction [DGR20]. On the other hand, the trap 
instruction is only a single byte, allowing it to replace any patched 
instruction. If the trap can be made inexpensive, this would potentially 
allow a simpler approach to binary rewriting without control flow recovery.
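
The size argument can be illustrated on a plain byte buffer. The sketch below contrasts the one-byte `int3` trap (opcode `0xCC`) with a five-byte `jmp rel32`; the helper names (`probe_insert`, `probe_remove`, `jump_fits`) are hypothetical, and patching live code would additionally need writable text pages and care around threads executing the patched bytes.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    X86_INT3 = 0xCC,       /* one-byte breakpoint trap */
    X86_JMP_REL32_LEN = 5  /* 0xE9 opcode plus a 32-bit displacement */
};

/* Remembers what a probe overwrote so it can be undone. */
struct probe {
    size_t offset;
    uint8_t saved;
};

/* A jump only fits if the probed instruction is at least as long as the jump. */
static int jump_fits(size_t insn_len)
{
    return insn_len >= X86_JMP_REL32_LEN;
}

/* Replace the first byte of the instruction at `offset` with a trap. */
static int probe_insert(uint8_t *code, size_t len, size_t offset, struct probe *p)
{
    if (offset >= len)
        return -1;
    p->offset = offset;
    p->saved = code[offset];
    code[offset] = X86_INT3;
    return 0;
}

/* Restore the original byte. */
static void probe_remove(uint8_t *code, const struct probe *p)
{
    code[p->offset] = p->saved;
}
```

Because the trap occupies a single byte, `probe_insert` never touches the following instruction, whatever the length of the one being probed.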

## Binary Emulation for forward/backward compatibility

Some processor families have an all encompassing ISA of which only a 
subset is implemented in hardware for some instances of the family. 
Applications built for the processor family either have to be recompiled 
for each instance or software emulation must handle the unimplemented 
instructions. If there is a UE for the illegal instruction fault, this 
can potentially be made inexpensive enough to avoid recompilation. 
Furthermore, it could be a way to handle legacy code and allow future 
generations to avoid the older crufty instructions that are no longer 
commonly used.

## Floating-Point Performance

Today, floating-point algorithms often preprocess the data in order to 
avoid underflow (or overflow) exceptions. If UEs were low enough cost, 
it is possible that these time consuming data preparation steps could be 
removed and only run if an exception was generated. A simple example is 
the calculation of the Root Mean Square (RMS) of a vector [HFT94,H96]. 
The common approach to calculating a vector's RMS is to scan the input 
vector and then potentially scale it to avoid underflow/overflow. For 
many applications, the common case is that the data does not require 
rescaling. In those cases, one could calculate the RMS on the unscaled 
data and only scale it if a UE was generated.

## Memory: garbage collection and watch points

User-level Page Fault exceptions (ULPFs) are one essential component for 
improving the performance of a wide variety of applications. For 
example, in [AL20], we describe a solution that shows how ULPFs, when 
combined with a mechanism that allows the user a limited ability to 
change a page's permissions without kernel intervention, can be used to 
implement an unlimited number of efficient software watchpoints. Our 
experiments were performed using GEM5, where we made changes to the MMU 
and TLB. However, Intel's Memory Protection Keys for User (MPK) [MPK17] 
combined with UE could also potentially do the trick.

Another example of an application that could benefit from ULPF is 
Concurrent Garbage Collection. Concurrent Garbage collection allows both 
the program (aka mutator) threads and the collector to run in parallel. 
To implement concurrent GC, a read barrier or write barrier is often 
needed (these are GC terms and should not be confused with hardware 
memory barriers). These barriers ensure that the GC invariants are 
maintained before a read or write operation. The write barrier prevents 
the GC from reclaiming a live object that was recently accessed by the 
mutator (in the case of a concurrent mark-sweep) [BDS91]. The read 
barrier prevents the mutator from reading stale objects (in the case of 
concurrent mark-compact) [AL91]. Both read and write barriers can be 
implemented using ULPF. The programmer can use the permission bit in the 
user-level page tables to cheaply turn on/off memory protection (e.g., 
inside the handler).
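
For comparison, the sketch below shows how a page-protection write barrier is commonly built today on Linux, using `mprotect` and a `SIGSEGV` handler. This is the kernel-mediated path that a ULPF would shortcut, not an implementation of ULPF. The `barrier_*` names are hypothetical, and calling `mprotect` from a signal handler is common practice on Linux although POSIX does not list it as async-signal-safe.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static char *barrier_page;                  /* page guarded by the write barrier */
static size_t barrier_page_size;
static volatile sig_atomic_t barrier_hits;  /* writes caught while the barrier was armed */

/* Fault handler: record the write, then drop protection so the write can retry. */
static void barrier_fault(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)barrier_page;
    if (addr < base || addr >= base + barrier_page_size)
        _exit(1);  /* not our page: a real runtime would chain to the old handler */
    barrier_hits++;
    if (mprotect(barrier_page, barrier_page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Install the handler and map one page to guard. */
static int barrier_install(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = barrier_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;

    barrier_page_size = (size_t)sysconf(_SC_PAGESIZE);
    barrier_page = mmap(NULL, barrier_page_size, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return barrier_page == MAP_FAILED ? -1 : 0;
}

/* Arm the barrier: the next write to the page faults into barrier_fault. */
static int barrier_arm(void)
{
    return mprotect(barrier_page, barrier_page_size, PROT_READ);
}
```

Each trapped write here costs a kernel entry, signal delivery, an `mprotect` system call and a signal return; this is the overhead that a ULPF combined with user-level permission changes is meant to avoid.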

Belay et al. [BBMTMETAL12] have shown how to implement Boehm GC [BDS91] (a 
mostly parallel mark-sweep GC used in the Mono project [Mono18] and 
Objective-C [Objc15]) on their platform, Dune. Dune is a platform that 
allows user-space direct access to exceptions and privileged hardware 
features. The results show both speedup and slowdown, where the slowdown 
is attributed to their platform's inherent overhead. On the other hand, 
Click et al. [CTW05] and Tene et al. [TIW11] have built a custom system 
to build a Pauseless GC. This custom system allows fast page fault 
handling. The mechanism described in [AL20] could be extended to 
implement a similar approach.

# References

- [AL91] Appel, Andrew W. and Li, Kai, Virtual Memory Primitives for 
User Programs (1991)
- [AL20] Li, Qingyang, User Level Page Faults (2020), 
http://reports-archive.adm.cs.cmu.edu/anon/2020/CMU-CS-20-124.pdf
- [BBMTMETAL12] Belay, Adam and Bittau, Andrea and Mashtizadeh, Ali and 
Terei, David and Mazières, David and Kozyrakis, Christos, Dune: Safe 
User-Level Access to Privileged CPU Features (2012)
- [BDS91] Boehm, Hans-J. and Demers, Alan J. and Shenker, Scott, Mostly 
Parallel Garbage Collection (1991)
- [CSDN17] Chamith, Buddhika and Svensson, Bo Joel and Dalessandro, Luke 
and Newton, Ryan R., Instruction Punning: Lightweight Instrumentation 
for X86-64 (2017)
- [CTW05] Click, Cliff and Tene, Gil and Wolf, Michael, The Pauseless GC 
Algorithm (2005)
- [DAKK19] Dogan, Halit and Ahmad, Masab and Kahne, Brian and Khan, 
Omer, Accelerating Synchronization Using Moving Compute to Data Model at 
1,000-core Multicore Scale (2019)
- [DHAKWETAL17] Dogan, Halit and Hijaz, Farrukh and Ahmad, Masab and 
Kahne, Brian and Wilson, Peter and Khan, Omer, Accelerating Graph and 
Machine Learning Workloads Using a Shared Memory Multicore Architecture 
with Auxiliary Support for in-Hardware Explicit Messaging (2017)
- [DGR20] Duck, Gregory J. and Gao, Xiang and Roychoudhury, Abhik, 
Binary Rewriting without Control Flow Recovery (2020)
- [ECGS92] von Eicken, Thorsten and Culler, David E. and Goldstein, Seth 
Copen and Schauser, Klaus Erik, Active Messages: A Mechanism for 
Integrated Communication and Computation (1992)
- [H96] Hauser, John R., Handling Floating-Point Exceptions in Numeric 
Programs (1996)
- [HFT94] Hull, T. E. and Fairgrieve, Thomas F. and Tang, Ping-Tak 
Peter, Implementing Complex Elementary Functions Using Exception 
Handling (1994)
- [Mono18] https://www.mono-project.com/docs/advanced/runtime/ (2018)
- [MPK17] 
https://www.kernel.org/doc/Documentation/x86/protection-keys.txt (2017)
- [Objc15] 
https://gcc.gnu.org/onlinedocs/gcc-4.8.5/gcc/Garbage-Collection.html (2015)
- [TIW11] Tene, Gil and Iyengar, Balaji and Wolf, Michael, C4: The 
Continuously Concurrent Compacting Collector (2011)
- [WBSACETAL08] Wong, Henry and Bracy, Anne and Schuchman, Ethan and 
Aamodt, Tor M. and Collins, Jamison D. and Wang, Perry H. and Chinya, 
Gautham and Groen, Ankur Khandelwal and Jiang, Hong and Wang, Hong, 
Pangaea: A tightly-coupled IA32 heterogeneous chip multiprocessor (2008)
- [WMUW19] Wenzl, Matthias and Merzdovnik, Georg and Ullrich, Johanna 
and Weippl, Edgar, From Hack to Elaborate Technique—A Survey on Binary 
Rewriting (2019)

>> # Preparing for future use cases
>> If someone could point out an example for Kernel to
>> user-space thread (K2UST) UI, we would appreciate it.
>>
>
> The idea here is improve the kernel-to-user event notification 
> latency. Theoretically, this can be useful when the kernel sees event 
> completion on one cpu but it want to signal (notify) a thread actively 
> running on some other CPU. The receiver thread can save some cycles by 
> avoiding ring transitions to receive the event.
>
> IO_URING is one of the examples for kernel-to-user event 
> notifications. We are evaluating whether providing a UINTR based 
> completion mechanism can have benefit over eventfd based completions. 
> The benefits in practice are yet to be measured and proven.
>
Thank you for the clarification.

- QUESTION: If the processor has D2UST capability, would this allow the 
device to directly send the interrupt to the target process (the process 
that initiates the I/O through io_uring) instead of the kernel?

>> In our work, we have also been exploring precise UIs from the
>> currently running thread.  We call these CPU to UST (CPU2UST) UIs.
>> For example, a SIGSEGV generated by writing to a read-only page, a
>> SIGFPE generated by dividing a number by zero.
>>
>
> It is definitely possible in future to delivery CPU events as User 
> Interrupts. The hardware architecture for this is still being worked 
> on internally.
>
> Though our focus isn't on exceptions being delivered as User 
> Interrupts. Do you have details on what type of benefit is expected?
>
Described in the use-cases we mentioned above.

>> - QUESTION: Is there is a rough draft/plan that we can refer to that 
>> describes the
>> current thinking on these three cases.
>>
>> - QUESTION: Are there use cases for K2UST, or is K2UST the same as 
>> CPU2UST?
>>
>
> No, K2UST isn't the same as CPU2UST. We would expect limited benefits 
> from K2UST but on the other hand CPU2UST can provide significant 
> speedup since it avoids the kernel completely.
>
> Unfortunately, due to the large scope of the feature, the hardware 
> architecture development is happening in stages. I don't have detailed 
> plans for each of the sources of User Interrupts.
>
> Here is our rough plan:
>
> 1. Provide a common infrastructure to receive User Interrupts. This is 
> independent of the source of the interrupt. The intention here is to 
> keep the software APIs generic and extendable so that future sources 
> can be added without causing much disturbance to the older APIs.
>
> 2. Introduce various sources of User Interrupts in stages:
>
> UST2UST - This RFC. Available in the upcoming Sapphire Rapids processor.
>
> K2UST - Also available in upcoming Sapphire Rapids. Working towards 
> proving the value before sending something out.
>
> D2UST - Future processor. Hardware architecture being worked on 
> internally. Not much to share right now.
>
> CPU2UST - Future processor. Hardware architecture being worked on 
> internally. Not much to share right now.

Thank you for the update, really appreciate it.


>
> The saving and restoring of the registers is done by gcc when the 
> muintr flag along with the 'interrupt' attribute is used. Applications 
> can choose to save floating point registers as part of the interrupt 
> handler as well.
>
> To make it easier for applications we are working on implementing a 
> thin library that can help with some of this common functionality like 
> saving floating point registers or redirecting to 64 sub-handlers.

- QUESTION: Would this thin library also provide a mechanism to share 
data between sender and receiver through shared memory (similar to 
implementing Active message)?
- QUESTION: Is there a plan in the future to allow data to be 
transmitted along with the interrupt?

>> # Multi-threaded parallel programming example
>>
>> One of the uses for UIs that we have been exploring is combining the
>> message-passing and shared memory models for parallel programming.  In
>> our approach, message-passing is used for synchronization and shared
>> memory for data sharing.  The message passing part of the programming
>> pattern is based loosely on Active Messages (See ISCA92), where a
>> particular thread can turn off/on interrupts to ignore incoming
>> messages so they can execute critical sections without having to
>> notify any other threads in the system.
>>
>
> This look like a good fit for the User IPI (UST2UST) implementation in 
> this RFC. Have you had a chance to evaluate the current API design for 
> this usage?

Our approach requires point-to-point communication to implement the 
UST2UST use cases described above. From my understanding, the current 
API requires (n-1)*n descriptors to enable point-to-point communication 
(assuming a private UITT). Here, each receiver assigns a vector to the 
UI file descriptor (uifd) and shares it with the appropriate sender. 
This way, the receivers know the sender based on the vector.

Have other approaches to handling the case where the receiver needs to 
know the sender's identity been explored? In particular, approaches that 
do not require n^2 descriptors be created? In the context of the RFC, 
one possibility we have thought about would be where the sender assigns 
a vector to uifd (maybe based on its cpuid) and shares this information 
with all receivers. This would possibly only require n descriptors.
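
To put numbers on the difference, the small sketch below computes the descriptor counts under the two schemes as we understand them. The function names are ours, and the sender-assigned count reflects our suggestion rather than anything the current API provides.

```c
#include <stddef.h>

/* Receiver-assigned vectors: one descriptor per ordered (sender, receiver) pair. */
static size_t uifds_receiver_assigned(size_t n)
{
    return n * (n - 1);
}

/* Sender-assigned vectors (our suggestion): one descriptor per sender. */
static size_t uifds_sender_assigned(size_t n)
{
    return n;
}
```

For 16 threads this is 240 descriptors versus 16, and for 64 threads it is 4032 versus 64.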

> Also, is any of the above work publicly available?

Not yet. We are still working on it and hope to update you on it.

Best regards,
Chrisma and Seth
