[net,0/5] hv_netvsc: Fix error "nvsp_rndis_pkt_complete error status: 2"

Message ID 20250513000604.1396-1-mhklinux@outlook.com

Message

mhkelley58@gmail.com May 13, 2025, 12:05 a.m. UTC
From: Michael Kelley <mhklinux@outlook.com>

Starting with commit dca5161f9bd0 in the 6.3 kernel, the Linux driver
for Hyper-V synthetic networking (netvsc) occasionally reports
"nvsp_rndis_pkt_complete error status: 2".[1] This error indicates
that Hyper-V has rejected a network packet transmit request from the
guest, and the outgoing network packet is dropped. Higher level
network protocols presumably recover and resend the packet so there is
no functional error, but performance is slightly impacted. Commit
dca5161f9bd0 is not the cause of the error -- it only added reporting
of an error that was already happening without any notice. The error
has presumably been present since the netvsc driver was originally
introduced into Linux.

This patch set fixes the root cause of the problem, which is that the
netvsc driver in Linux may send an incorrectly formatted VMBus message
to Hyper-V when transmitting the network packet. The incorrect
formatting occurs when the rndis header of the VMBus message crosses a
page boundary due to how the Linux skb head memory is aligned. In such
a case, two PFNs are required to describe the location of the rndis
header, even though they are contiguous in guest physical address
(GPA) space. Hyper-V requires that two PFNs be in a single "GPA range"
data structure, but current netvsc code puts each PFN in its own GPA
range, which Hyper-V rejects as an error in the case of the rndis
header.
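
For reference, the "GPA range" here is struct hv_mpb_array, and each
entry of the page buffer ("pb") array that netvsc passes around is a
struct hv_page_buffer, both from include/linux/hyperv.h (lightly
abridged):

/* Single-page buffer: one PFN plus an offset/len within that page */
struct hv_page_buffer {
	u32 len;
	u32 offset;
	u64 pfn;
};

/* Multiple-page buffer ("GPA range"): the number of pfn_array[]
 * entries is implied by len and offset, and the PFNs must be
 * sequential in guest physical address space.
 */
struct hv_mpb_array {
	u32 len;
	u32 offset;
	u64 pfn_array[];
};

So an rndis header that crosses a page boundary must be described by
one hv_mpb_array whose pfn_array[] holds two sequential PFNs, not by
two one-PFN ranges.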

The incorrect formatting occurs only for larger packets that netvsc
must transmit via a VMBus "GPA Direct" message. There's no problem
when netvsc transmits a smaller packet by copying it into a pre-
allocated send buffer slot because the pre-allocated slots don't have
page crossing issues.

After commit 14ad6ed30a10 in the 6.14 kernel, the error occurs much
more frequently in VMs with 16 or more vCPUs. It may occur every few
seconds, or even more frequently, in an ssh session that outputs a lot
of text. Commit 14ad6ed30a10 subtly changes how skb head memory is
allocated, making it much more likely that the rndis header will cross
a page boundary when the vCPU count is 16 or more. The changes in
commit 14ad6ed30a10 are perfectly valid -- they just had the side
effect of making the netvsc bug more prominent.
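
As a concrete illustration (addresses hypothetical): with 4 KiB Hyper-V
pages, a 0x80-byte rndis header starting at offset 0xFC0 within PFN
0x12345 occupies bytes 0xFC0-0xFFF of that page plus bytes 0x000-0x03F
of PFN 0x12346. Hyper-V accepts this only as the single GPA range
{ offset = 0xFC0, len = 0x80, pfn_array = { 0x12345, 0x12346 } };
unfixed netvsc instead emits two one-PFN ranges, which Hyper-V rejects.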

One fix is to check for adjacent PFNs in vmbus_sendpacket_pagebuffer()
and just combine them into a single GPA range. Such a fix is very
contained. But conceptually it is fixing the problem at the wrong
level. So this patch set takes the broader approach of maintaining
the already known grouping of contiguous PFNs at a higher level in
the netvsc driver code, and propagating that grouping down to the
creation of the VMBus message to send to Hyper-V. Maintaining the
grouping fixes this problem, and has the added benefit of allowing
netvsc_dma_map() to make fewer calls to dma_map_single() to do bounce
buffering in CoCo VMs.
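
For comparison, a minimal sketch of that contained alternative (purely
illustrative; no such helper exists in the patch set):

/* Hypothetical: count the GPA ranges needed if physically contiguous
 * pb[] entries were coalesced inside vmbus_sendpacket_pagebuffer().
 */
static u32 count_merged_ranges(const struct hv_page_buffer *pb,
			       u32 pagecount)
{
	u32 i, nranges = 0;

	for (i = 0; i < pagecount; i++) {
		/* A new range starts unless this entry directly
		 * continues the previous one: sequential PFN, the
		 * previous entry runs to the end of its page, and
		 * this one starts at offset 0.
		 */
		if (i == 0 || pb[i].pfn != pb[i - 1].pfn + 1 ||
		    pb[i - 1].offset + pb[i - 1].len != HV_HYP_PAGE_SIZE ||
		    pb[i].offset != 0)
			nranges++;
	}

	return nranges;
}

The adopted approach makes such merging unnecessary: contiguous PFNs
are never split into separate pb[] entries in the first place.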

Patch 1 is a preparatory change to allow vmbus_sendpacket_mpb_desc()
to specify multiple GPA ranges. In current code
vmbus_sendpacket_mpb_desc() is used only by the storvsc synthetic SCSI
driver, and it always creates a single GPA range.
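
With patch 1, the caller supplies the rangecount itself. On the storvsc
side, that presumably amounts to the single added line shown in the
diffstat below; the surrounding call is sketched here from the existing
code, with illustrative variable names:

	payload->range.len = length;
	payload->range.offset = offset_in_hvpg;
	payload->rangecount = 1;	/* storvsc: always one GPA range */

	ret = vmbus_sendpacket_mpb_desc(channel, payload, payload_sz,
					vstor_packet, sizeof(*vstor_packet),
					(u64)(unsigned long)request);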

Patch 2 updates the netvsc driver to use vmbus_sendpacket_mpb_desc()
instead of vmbus_sendpacket_pagebuffer(). Because the higher levels of
netvsc still don't group contiguous PFNs, this patch is functionally
neutral. The VMBus message to Hyper-V still has many GPA ranges, each
with a single PFN. But it lays the groundwork for the next patch.
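
A rough reconstruction of what the new netvsc_build_mpb_array() (named
in the patch commit message quoted in the comments below) might look
like -- a sketch, not the literal patch:

/* Expand each pb[] entry into one GPA range in the outgoing desc.
 * Entries are variable size: each one starts immediately after the
 * previous entry's pfn_array[].
 */
static u32 netvsc_build_mpb_array(struct hv_page_buffer *pb,
				  u32 page_buffer_count,
				  struct vmbus_packet_mpb_array *desc)
{
	struct hv_mpb_array *mpb_entry = &desc->range;
	u32 i, j;

	for (i = 0; i < page_buffer_count; i++) {
		u32 offset = pb[i].offset;  /* must be < HV_HYP_PAGE_SIZE */
		u32 len = pb[i].len;

		mpb_entry->offset = offset;
		mpb_entry->len = len;

		/* One PFN per Hyper-V page spanned by [offset, offset + len) */
		for (j = 0; j < HVPFN_UP(offset + len); j++)
			mpb_entry->pfn_array[j] = pb[i].pfn + j;

		mpb_entry = (struct hv_mpb_array *)&mpb_entry->pfn_array[j];
	}

	desc->rangecount = page_buffer_count;

	/* Bytes used in desc, for the caller to pass to VMBus */
	return (char *)mpb_entry - (char *)desc;
}

With today's one-PFN pb[] entries the inner loop runs once per entry,
so the wire format is unchanged; multi-PFN entries from patch 3 then
fall out naturally.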

Patch 3 changes the higher levels of netvsc to preserve the already
known grouping of contiguous PFNs. When the contiguous groupings are
passed to vmbus_sendpacket_mpb_desc(), GPA ranges containing multiple
PFNs are produced, as expected by Hyper-V. This is the point at which
core problem is fixed.
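
In sketch form (following the series description; not the literal
patch), the helper that fills a page buffer entry can now cover an
entire physically contiguous region with one entry whose len spans page
boundaries, instead of one entry per Hyper-V page:

/* One pb[] entry per contiguous region. netvsc_build_mpb_array() then
 * expands it into a single GPA range listing hvpfn, hvpfn + 1, ...
 */
static u32 fill_pg_buf(unsigned long hvpfn, u32 offset, u32 len,
		       struct hv_page_buffer *pb)
{
	pb->offset = offset;
	pb->len = len;		/* may extend beyond the first page */
	pb->pfn = hvpfn;

	return 1;		/* always consumes exactly one entry */
}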

Patches 4 and 5 remove code that is no longer necessary after the
previous patches.

These changes provide a net reduction of about 65 lines of code, which
is an added benefit.

These changes have been tested in normal VMs, in SEV-SNP and TDX CoCo
VMs, and in Dv6-series VMs where the netvsp implementation is in the
OpenHCL paravisor instead of the Hyper-V host.

These changes are built against kernel version 6.15-rc6.

[1] https://bugzilla.kernel.org/show_bug.cgi?id=217503

Michael Kelley (5):
  Drivers: hv: Allow vmbus_sendpacket_mpb_desc() to create multiple
    ranges
  hv_netvsc: Use vmbus_sendpacket_mpb_desc() to send VMBus messages
  hv_netvsc: Preserve contiguous PFN grouping in the page buffer array
  hv_netvsc: Remove rmsg_pgcnt
  Drivers: hv: vmbus: Remove vmbus_sendpacket_pagebuffer()

 drivers/hv/channel.c              | 65 ++-----------------------------
 drivers/net/hyperv/hyperv_net.h   | 13 ++++++-
 drivers/net/hyperv/netvsc.c       | 57 ++++++++++++++++++++++-----
 drivers/net/hyperv/netvsc_drv.c   | 62 +++++++----------------------
 drivers/net/hyperv/rndis_filter.c | 24 +++---------
 drivers/scsi/storvsc_drv.c        |  1 +
 include/linux/hyperv.h            |  7 ----
 7 files changed, 83 insertions(+), 146 deletions(-)

Comments

Simon Horman May 14, 2025, 9:37 a.m. UTC | #1
On Mon, May 12, 2025 at 05:06:01PM -0700, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> netvsc currently uses vmbus_sendpacket_pagebuffer() to send VMBus
> messages. This function creates a series of GPA ranges, each of which
> contains a single PFN. However, if the rndis header in the VMBus
> message crosses a page boundary, the netvsc protocol with the host
> requires that both PFNs for the rndis header must be in a single "GPA
> range" data structure, which isn't possible with
> vmbus_sendpacket_pagebuffer(). As the first step in fixing this, add a
> new function netvsc_build_mpb_array() to build a VMBus message with
> multiple GPA ranges, each of which may contain multiple PFNs. Use
> vmbus_sendpacket_mpb_desc() to send this VMBus message to the host.
> 
> There's no functional change since higher levels of netvsc don't
> maintain or propagate knowledge of contiguous PFNs. Based on its
> input, netvsc_build_mpb_array() still produces a separate GPA range
> for each PFN and the behavior is the same as with
> vmbus_sendpacket_pagebuffer(). But the groundwork is laid for a
> subsequent patch to provide the necessary grouping.
> 
> Cc: <stable@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  drivers/net/hyperv/netvsc.c | 50 +++++++++++++++++++++++++++++++++----
>  1 file changed, 45 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index d6f5b9ea3109..6d1705f87682 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -1055,6 +1055,42 @@ static int netvsc_dma_map(struct hv_device *hv_dev,
>  	return 0;
>  }
>  
> +/* Build an "array" of mpb entries describing the data to be transferred
> + * over VMBus. After the desc header fields, each "array" entry is variable
> + * size, and each entry starts after the end of the previous entry. The
> + * "offset" and "len" fields for each entry imply the size of the entry.
> + *
> + * The pfns are in HV_HYP_PAGE_SIZE, because all communication with Hyper-V
> + * uses that granularity, even if the system page size of the guest is larger.
> + * Each entry in the input "pb" array must describe a contiguous range of
> + * guest physical memory so that the pfns are sequential if the range crosses
> + * a page boundary. The offset field must be < HV_HYP_PAGE_SIZE.

Hi Michael,

Is there a guarantee that this constraint is met? And moreover, is there a
guarantee that all of the entries will fit in desc? I am slightly concerned
that there may be an overrun lurking here.

...
Michael Kelley May 14, 2025, 3:44 p.m. UTC | #2
From: Simon Horman <horms@kernel.org> Sent: Wednesday, May 14, 2025 2:38 AM
> 
> On Mon, May 12, 2025 at 05:06:01PM -0700, mhkelley58@gmail.com wrote:
> > From: Michael Kelley <mhklinux@outlook.com>
> >
> > ...
> >
> > +/* Build an "array" of mpb entries describing the data to be transferred
> > + * over VMBus. After the desc header fields, each "array" entry is variable
> > + * size, and each entry starts after the end of the previous entry. The
> > + * "offset" and "len" fields for each entry imply the size of the entry.
> > + *
> > + * The pfns are in HV_HYP_PAGE_SIZE, because all communication with Hyper-V
> > + * uses that granularity, even if the system page size of the guest is larger.
> > + * Each entry in the input "pb" array must describe a contiguous range of
> > + * guest physical memory so that the pfns are sequential if the range crosses
> > + * a page boundary. The offset field must be < HV_HYP_PAGE_SIZE.
> 
> Hi Michael,
> 
> Is there a guarantee that this constraint is met? And moreover, is there a
> guarantee that all of the entries will fit in desc? I am slightly concerned
> that there may be an overrun lurking here.
> 

It is indeed up to the caller to ensure that the pb array is properly
constructed. netvsc_build_mpb_array() doesn't do additional validation.
There are only two sources of the pb array, both of which do the right
thing, so additional validation seemed redundant.

An overrun is a concern, but again the callers do the right thing. As
described in my response to Patch 3 of the series, netvsc_xmit()
counts the number of pages ahead of time, and makes sure the count is
within the limit of the amount of space allocated in the "desc" argument
to netvsc_build_mpb_array().
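
For illustration, the kind of ahead-of-time bound described here is
roughly the following (a sketch, not the literal netvsc_xmit() code):

u32 i, total_hv_pages = 0;

for (i = 0; i < pb_count; i++)
	total_hv_pages += HVPFN_UP(pb[i].offset + pb[i].len);

/* "desc" is allocated with room for MAX_PAGE_BUFFER_COUNT pfn_array[]
 * slots, so netvsc_build_mpb_array() cannot overrun it once this
 * check has passed.
 */
if (total_hv_pages > MAX_PAGE_BUFFER_COUNT)
	goto drop;	/* or linearize the skb and retry */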

Michael
patchwork-bot+netdevbpf@kernel.org May 15, 2025, 3 a.m. UTC | #3
Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Mon, 12 May 2025 17:05:59 -0700 you wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> [...]

Here is the summary with links:
  - [net,1/5] Drivers: hv: Allow vmbus_sendpacket_mpb_desc() to create multiple ranges
    https://git.kernel.org/netdev/net/c/380b75d30786
  - [net,2/5] hv_netvsc: Use vmbus_sendpacket_mpb_desc() to send VMBus messages
    https://git.kernel.org/netdev/net/c/4f98616b855c
  - [net,3/5] hv_netvsc: Preserve contiguous PFN grouping in the page buffer array
    https://git.kernel.org/netdev/net/c/41a6328b2c55
  - [net,4/5] hv_netvsc: Remove rmsg_pgcnt
    https://git.kernel.org/netdev/net/c/5bbc644bbf4e
  - [net,5/5] Drivers: hv: vmbus: Remove vmbus_sendpacket_pagebuffer()
    https://git.kernel.org/netdev/net/c/45a442fe369e

You are awesome, thank you!
Simon Horman May 15, 2025, 10:50 a.m. UTC | #4
On Wed, May 14, 2025 at 03:44:35PM +0000, Michael Kelley wrote:
> From: Simon Horman <horms@kernel.org> Sent: Wednesday, May 14, 2025 2:38 AM
> > 
> > On Mon, May 12, 2025 at 05:06:01PM -0700, mhkelley58@gmail.com wrote:
> > > From: Michael Kelley <mhklinux@outlook.com>
> > >
> > > ...
> > 
> > Hi Michael,
> > 
> > Is there a guarantee that this constraint is met? And moreover, is there a
> > guarantee that all of the entries will fit in desc? I am slightly concerned
> > that there may be an overrun lurking here.
> > 
> 
> It is indeed up to the caller to ensure that the pb array is properly
> constructed. netvsc_build_mpb_array() doesn't do additional validation.
> There are only two sources of the pb array, both of which do the right
> thing, so additional validation seemed redundant.
> 
> An overrun is a concern, but again the callers do the right thing. As
> described in my response to Patch 3 of the series, netvsc_xmit()
> counts the number of pages ahead of time, and makes sure the count is
> within the limit of the amount of space allocated in the "desc" argument
> to netvsc_build_mpb_array().

Thanks Michael,

I agree that it is entirely reasonable for callers to be responsible
for correctly constructing the pb array, and that it's not necessary
to add validation to netvsc_build_mpb_array().

Also, based on the above, I'm satisfied that the callers are correctly
constructing the pb array.

With the above clarified in my mind I'm now happy with this patch.

Reviewed-by: Simon Horman <horms@kernel.org>
Simon Horman May 15, 2025, 10:51 a.m. UTC | #5
On Mon, May 12, 2025 at 05:06:00PM -0700, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> vmbus_sendpacket_mpb_desc() is currently used only by the storvsc driver
> and is hardcoded to create a single GPA range. To allow it to also be
> used by the netvsc driver to create multiple GPA ranges, no longer
> hardcode as having a single GPA range. Allow the calling driver to
> specify the rangecount in the supplied descriptor.
> 
> Update the storvsc driver to reflect this new approach.
> 
> Cc: <stable@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Reviewed-by: Simon Horman <horms@kernel.org>
Simon Horman May 15, 2025, 10:55 a.m. UTC | #6
On Mon, May 12, 2025 at 05:06:03PM -0700, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> init_page_array() now always creates a single page buffer array entry
> for the rndis message, even if the rndis message crosses a page
> boundary. As such, the number of page buffer array entries used for
> the rndis message must no longer be tracked -- it is always just 1.
> Remove the rmsg_pgcnt field and use "1" where the value is needed.
> 
> Cc: <stable@vger.kernel.org> # 6.1.x
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>

Reviewed-by: Simon Horman <horms@kernel.org>