[v5,0/6] migration/postcopy: Sync faulted addresses after network recovered

Message ID	20201019225720.172743-1-peterx@redhat.com
Headers	show Return-Path: <SRS0=qJMs=D2=nongnu.org=qemu-devel-bounces+qemu-devel=archiver.kernel.org@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org EF2482240C From: Peter Xu <peterx@redhat.com> To: qemu-devel@nongnu.org Subject: [PATCH v5 0/6] migration/postcopy: Sync faulted addresses after network recovered Date: Mon, 19 Oct 2020 18:57:14 -0400 Message-Id: <20201019225720.172743-1-peterx@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Received-SPF: pass client-ip=63.128.21.124; envelope-from=peterx@redhat.com; helo=us-smtp-delivery-124.mimecast.com Precedence: list Cc: Peter Maydell <peter.maydell@linaro.org>, Thomas Huth <thuth@redhat.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, peterx@redhat.com, Juan Quintela <quintela@redhat.com> Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	migration/postcopy: Sync faulted addresses after network recovered \| expand [v5,0/6] migration/postcopy: Sync faulted addresses after network recovered [v5,1/6] migration: Pass incoming state into qemu_ufd_copy_ioctl() [v5,2/6] migration: Introduce migrate_send_rp_message_req_pages() [v5,3/6] migration: Maintain postcopy faulted addresses [v5,4/6] migration: Sync requested pages after postcopy recovery [v5,5/6] migration/postcopy: Release fd before going into 'postcopy-pause' [v5,6/6] migration-test: Only hide error if !QTEST_LOG

Message ID

20201019225720.172743-1-peterx@redhat.com

Headers

DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org EF2482240C
From: Peter Xu <peterx@redhat.com>
To: qemu-devel@nongnu.org
Subject: [PATCH v5 0/6] migration/postcopy: Sync faulted addresses after
	network recovered
Date: Mon, 19 Oct 2020 18:57:14 -0400
Message-Id: <20201019225720.172743-1-peterx@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Received-SPF: pass client-ip=63.128.21.124; envelope-from=peterx@redhat.com; 
	helo=us-smtp-delivery-124.mimecast.com
X-Spam_score_int: -20
X-Spam_score: -2.1
X-Spam_bar: --
X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001,
	DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1,
	DKIM_VALID_EF=-0.1, 
	RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H5=0.001,
	RCVD_IN_MSPIKE_WL=0.001, 
	SPF_HELO_NONE=0.001, SPF_PASS=-0.001 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <https://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
Cc: Peter Maydell <peter.maydell@linaro.org>, Thomas Huth <thuth@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>, peterx@redhat.com,
	Juan Quintela <quintela@redhat.com>
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>

Series

migration/postcopy: Sync faulted addresses after network recovered | expand

Message

Peter Xu Oct. 19, 2020, 10:57 p.m. UTC

This is v5 of the series.  Probably my first series that got queued/unqueued
twice.

I found a bug in v4 that was about page sizes, however that didn't match with
PeterM's report on big endian hosts.  My manual reproduce on s390x also didn't
reproduce.  However after I ran the tree (with the fix) on travis (thanks
Thomas for suggesting this!) I noticed that s390x passed the test too:

https://travis-ci.com/github/xzpeter/qemu/builds/191031111

I still got two tests that got timed out, however I also noticed that the
master branch got similarly two tests that timed out:

https://travis-ci.com/github/xzpeter/qemu/builds/191012879

There's one difference on the failed test, however I really suspect it's
because the uncertainly of the travis host scheduling the tests, or the master
failures should really be a subset of my own branch (while it's not).

So I decided to give it a 3rd shot... PeterM: would it be easy/possible to kick
the CI again against this series for pulls?  I just don't want to break Dave's
pull the 3rd time. :)

I also picked up the other patch [1] that should also fix some rare failures on
postcopy recovery.  However I bet we haven't yet encounter it, at least not often.

v5 changelog:
- added one test patch for easier debugging for migration-test
- added one fix patch [1] for another postcopy race
- fixed a bug that could trigger when host/guest page size differs

[1] https://lore.kernel.org/qemu-devel/20201007183324.288379-1-peterx@redhat.com/

--------- v4 cover letter --------------

v4:
- use "void */ulong" instead of "uint64_t" where proper in patch 3/4 [Dave]

v3:
- fix build on 32bit hosts & rebase
- remove r-bs for the last 2 patches for Dave due to the changes

v2:
- add r-bs for Dave
- add patch "migration: Properly destroy variables on incoming side" as patch 1
- destroy page_request_mutex in migration_incoming_state_destroy() too [Dave]
- use WITH_QEMU_LOCK_GUARD in two places where we can [Dave]

We've seen conditional guest hangs on destination VM after postcopy recovered.
However the hang will resolve itself after a few minutes.

The problem is: after a postcopy recovery, the prioritized postcopy queue on
the source VM is actually missing.  So all the faulted threads before the
postcopy recovery happened will keep halted until (accidentally) the page got
copied by the background precopy migration stream.

The solution is to also refresh this information after postcopy recovery.  To
achieve this, we need to maintain a list of faulted addresses on the
destination node, so that we can resend the list when necessary.  This work is
done via patch 2-5.

With that, the last thing we need to do is to send this extra information to
source VM after recovered.  Very luckily, this synchronization can be
"emulated" by sending a bunch of page requests (although these pages have been
sent previously!) to source VM just like when we've got a page fault.  Even in
the 1st version of the postcopy code we'll handle duplicated pages well.  So
this fix does not even need a new capability bit and it'll work smoothly on old
QEMUs when we migrate from them to the new QEMUs.

Please review, thanks.

Peter Xu (6):
  migration: Pass incoming state into qemu_ufd_copy_ioctl()
  migration: Introduce migrate_send_rp_message_req_pages()
  migration: Maintain postcopy faulted addresses
  migration: Sync requested pages after postcopy recovery
  migration/postcopy: Release fd before going into 'postcopy-pause'
  migration-test: Only hide error if !QTEST_LOG

 migration/migration.c        | 55 ++++++++++++++++++++++++++++++----
 migration/migration.h        | 21 ++++++++++++-
 migration/postcopy-ram.c     | 25 ++++++++++++----
 migration/savevm.c           | 57 ++++++++++++++++++++++++++++++++++++
 migration/trace-events       |  3 ++
 tests/qtest/migration-test.c |  6 +++-
 6 files changed, 154 insertions(+), 13 deletions(-)

-- 
2.26.2

Comments

Dr. David Alan Gilbert Oct. 21, 2020, 2:23 p.m. UTC | #1

* Peter Xu (peterx@redhat.com) wrote:
> Maintain a list of faulted addresses on the destination host for which we're
> waiting on.  This is implemented using a GTree rather than a real list to make
> sure even there're plenty of vCPUs/threads that are faulting, the lookup will
> still be fast with O(log(N)) (because we'll do that after placing each page).
> It should bring a slight overhead, but ideally that shouldn't be a big problem
> simply because in most cases the requested page list will be short.
> 
> Actually we did similar things for postcopy blocktime measurements.  This patch
> didn't use that simply because:
> 
>   (1) blocktime measurement is towards vcpu threads only, but here we need to
>       record all faulted addresses, including main thread and external
>       thread (like, DPDK via vhost-user).
> 
>   (2) blocktime measurement will require UFFD_FEATURE_THREAD_ID, but here we
>       don't want to add that extra dependency on the kernel version since not
>       necessary.  E.g., we don't need to know which thread faulted on which
>       page, we also don't care about multiple threads faulting on the same
>       page.  But we only care about what addresses are faulted so waiting for a
>       page copying from src.
> 
>   (3) blocktime measurement is not enabled by default.  However we need this by
>       default especially for postcopy recover.
> 
> Another thing to mention is that this patch introduced a new mutex to serialize
> the receivedmap and the page_requested tree, however that serialization does
> not cover other procedures like UFFDIO_COPY.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  migration/migration.c    | 41 +++++++++++++++++++++++++++++++++++++++-
>  migration/migration.h    | 19 ++++++++++++++++++-
>  migration/postcopy-ram.c | 17 ++++++++++++++---
>  migration/trace-events   |  2 ++
>  4 files changed, 74 insertions(+), 5 deletions(-)
> 
> diff --git a/migration/migration.c b/migration/migration.c
> index b2dac6b39c..0b4fcff01f 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
>                                   int new_state);
>  static void migrate_fd_cancel(MigrationState *s);
>  
> +static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> +
> +    return (a > b) - (a < b);
> +}
> +
>  void migration_object_init(void)
>  {
>      MachineState *ms = MACHINE(qdev_get_machine());
> @@ -165,6 +172,8 @@ void migration_object_init(void)
>      qemu_event_init(&current_incoming->main_thread_load_event, false);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
> +    qemu_mutex_init(&current_incoming->page_request_mutex);
> +    current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>  
>      if (!migration_object_check(current_migration, &err)) {
>          error_report_err(err);
> @@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
>  
>      qemu_event_reset(&mis->main_thread_load_event);
>  
> +    if (mis->page_requested) {
> +        g_tree_destroy(mis->page_requested);
> +        mis->page_requested = NULL;
> +    }
> +
>      if (mis->socket_address_list) {
>          qapi_free_SocketAddressList(mis->socket_address_list);
>          mis->socket_address_list = NULL;
> @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  }
>  
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> -                              RAMBlock *rb, ram_addr_t start)
> +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> +    void *aligned = (void *)(uintptr_t)(haddr & qemu_real_host_page_mask);

Can you remind me, what happens here for hugepages?

Dave

> +    bool received;
> +
> +    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
> +        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
> +        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
> +            /*
> +             * The page has not been received, and it's not yet in the page
> +             * request list.  Queue it.  Set the value of element to 1, so that
> +             * things like g_tree_lookup() will return TRUE (1) when found.
> +             */
> +            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> +            mis->page_requested_count++;
> +            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
> +        }
> +    }
> +
> +    /*
> +     * If the page is there, skip sending the message.  We don't even need the
> +     * lock because as long as the page arrived, it'll be there forever.
> +     */
> +    if (received) {
> +        return 0;
> +    }
> +
>      return migrate_send_rp_message_req_pages(mis, rb, start);
>  }
>  
> diff --git a/migration/migration.h b/migration/migration.h
> index e853ccf8b1..8d2d1ce839 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -104,6 +104,23 @@ struct MigrationIncomingState {
>  
>      /* List of listening socket addresses  */
>      SocketAddressList *socket_address_list;
> +
> +    /* A tree of pages that we requested to the source VM */
> +    GTree *page_requested;
> +    /* For debugging purpose only, but would be nice to keep */
> +    int page_requested_count;
> +    /*
> +     * The mutex helps to maintain the requested pages that we sent to the
> +     * source, IOW, to guarantee coherent between the page_requests tree and
> +     * the per-ramblock receivedmap.  Note! This does not guarantee consistency
> +     * of the real page copy procedures (using UFFDIO_[ZERO]COPY).  E.g., even
> +     * if one bit in receivedmap is cleared, UFFDIO_COPY could have happened
> +     * for that page already.  This is intended so that the mutex won't
> +     * serialize and blocked by slow operations like UFFDIO_* ioctls.  However
> +     * this should be enough to make sure the page_requested tree always
> +     * contains valid information.
> +     */
> +    QemuMutex page_request_mutex;
>  };
>  
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -332,7 +349,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
>  void migrate_send_rp_pong(MigrationIncomingState *mis,
>                            uint32_t value);
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
> -                              ram_addr_t start);
> +                              ram_addr_t start, uint64_t haddr);
>  int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>                                        RAMBlock *rb, ram_addr_t start);
>  void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 722034dc01..ca1daf0024 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -684,7 +684,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                          qemu_ram_get_idstr(rb), rb_offset);
>          return postcopy_wake_shared(pcfd, client_addr, rb);
>      }
> -    migrate_send_rp_req_pages(mis, rb, aligned_rbo);
> +    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
>      return 0;
>  }
>  
> @@ -979,7 +979,8 @@ retry:
>               * Send the request to the source - we want to request one
>               * of our host page sizes (which is >= TPS)
>               */
> -            ret = migrate_send_rp_req_pages(mis, rb, rb_offset);
> +            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
> +                                            msg.arg.pagefault.address);
>              if (ret) {
>                  /* May be network failure, try to wait for recovery */
>                  if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
> @@ -1149,10 +1150,20 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>          ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
>      }
>      if (!ret) {
> +        qemu_mutex_lock(&mis->page_request_mutex);
>          ramblock_recv_bitmap_set_range(rb, host_addr,
>                                         pagesize / qemu_target_page_size());
> +        /*
> +         * If this page resolves a page fault for a previous recorded faulted
> +         * address, take a special note to maintain the requested page list.
> +         */
> +        if (g_tree_lookup(mis->page_requested, host_addr)) {
> +            g_tree_remove(mis->page_requested, host_addr);
> +            mis->page_requested_count--;
> +            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
> +        }
> +        qemu_mutex_unlock(&mis->page_request_mutex);
>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
> -
>      }
>      return ret;
>  }
> diff --git a/migration/trace-events b/migration/trace-events
> index 338f38b3dd..e4d5eb94ca 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -162,6 +162,7 @@ postcopy_pause_return_path(void) ""
>  postcopy_pause_return_path_continued(void) ""
>  postcopy_pause_continued(void) ""
>  postcopy_start_set_run(void) ""
> +postcopy_page_req_add(void *addr, int count) "new page req %p total %d"
>  source_return_path_thread_bad_end(void) ""
>  source_return_path_thread_end(void) ""
>  source_return_path_thread_entry(void) ""
> @@ -272,6 +273,7 @@ postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu
>  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
>  postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
>  postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
> +postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
>  
>  get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
>  
> -- 
> 2.26.2
>

Peter Xu Oct. 21, 2020, 3:50 p.m. UTC | #2

On Wed, Oct 21, 2020 at 03:23:45PM +0100, Dr. David Alan Gilbert wrote:
> > @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,

> >  }

> >  

> >  int migrate_send_rp_req_pages(MigrationIncomingState *mis,

> > -                              RAMBlock *rb, ram_addr_t start)

> > +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)

> >  {

> > +    void *aligned = (void *)(uintptr_t)(haddr & qemu_real_host_page_mask);

> 

> Can you remind me, what happens here for hugepages?

Sure.  Previously it was:

  (haddr & (-qemu_target_page_size())

Now it is:

  (haddr & qemu_real_host_page_mask)

Basically we changed the psize alignment from guest to host.

The bug triggered previously on ppc64 where host_psize=64k, then when guest
psize is smaller, e.g., 4k, we can have some addr that aligned to 4k rather
than 64k, so we failed later on checking the host psize alignment (because this
pointer should point to a host page, so it should align with host psize).

-- 
Peter Xu

Dr. David Alan Gilbert Oct. 21, 2020, 5:42 p.m. UTC | #3

* Peter Xu (peterx@redhat.com) wrote:
> On Wed, Oct 21, 2020 at 03:23:45PM +0100, Dr. David Alan Gilbert wrote:

> > > @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,

> > >  }

> > >  

> > >  int migrate_send_rp_req_pages(MigrationIncomingState *mis,

> > > -                              RAMBlock *rb, ram_addr_t start)

> > > +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)

> > >  {

> > > +    void *aligned = (void *)(uintptr_t)(haddr & qemu_real_host_page_mask);

> > 

> > Can you remind me, what happens here for hugepages?

> 

> Sure.  Previously it was:

> 

>   (haddr & (-qemu_target_page_size())

> 

> Now it is:

> 

>   (haddr & qemu_real_host_page_mask)

> 

> Basically we changed the psize alignment from guest to host.

> 

> The bug triggered previously on ppc64 where host_psize=64k, then when guest

> psize is smaller, e.g., 4k, we can have some addr that aligned to 4k rather

> than 64k, so we failed later on checking the host psize alignment (because this

> pointer should point to a host page, so it should align with host psize).


But my question is what happens when we have say a 2MB hugepage?

Dave

> -- 

> Peter Xu

> 

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

Peter Xu Oct. 21, 2020, 6:04 p.m. UTC | #4

On Wed, Oct 21, 2020 at 06:42:19PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Wed, Oct 21, 2020 at 03:23:45PM +0100, Dr. David Alan Gilbert wrote:
> > > > @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
> > > >  }
> > > >  
> > > >  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> > > > -                              RAMBlock *rb, ram_addr_t start)
> > > > +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
> > > >  {
> > > > +    void *aligned = (void *)(uintptr_t)(haddr & qemu_real_host_page_mask);
> > > 
> > > Can you remind me, what happens here for hugepages?
> > 
> > Sure.  Previously it was:
> > 
> >   (haddr & (-qemu_target_page_size())
> > 
> > Now it is:
> > 
> >   (haddr & qemu_real_host_page_mask)
> > 
> > Basically we changed the psize alignment from guest to host.
> > 
> > The bug triggered previously on ppc64 where host_psize=64k, then when guest
> > psize is smaller, e.g., 4k, we can have some addr that aligned to 4k rather
> > than 64k, so we failed later on checking the host psize alignment (because this
> > pointer should point to a host page, so it should align with host psize).
> 
> But my question is what happens when we have say a 2MB hugepage?

Oops, I definitely misread.

Good point, I think it can break hugepages.  So the mask should really be
"(qemu_ram_pagesize(rb) - 1)".

I'll fix and smoke it with some huge pages before another repost.

Thanks!