Message ID: 20201021212721.440373-1-peterx@redhat.com
Series: migration/postcopy: Sync faulted addresses after network recovered
On 21/10/2020 23.27, Peter Xu wrote:
> The errors are very useful when debugging qtest failures, especially when
> QTEST_LOG=1 is set. Let's allow override MigrateStart.hide_stderr when
> QTEST_LOG=1 is specified, because that means the user wants to be verbose.
>
> Not very nice to introduce the first QTEST_LOG env access in migration-test.c,
> however it should be handy. Without this patch, I was hacking error_report()
> when debugging such errors. Let's make things easier.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  tests/qtest/migration-test.c | 6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/tests/qtest/migration-test.c b/tests/qtest/migration-test.c
> index f410ec5996..f2142fbd3c 100644
> --- a/tests/qtest/migration-test.c
> +++ b/tests/qtest/migration-test.c
> @@ -464,6 +464,10 @@ static void migrate_postcopy_start(QTestState *from, QTestState *to)
>  }
>
>  typedef struct {
> +    /*
> +     * QTEST_LOG=1 may override this. When QTEST_LOG=1, we always dump errors
> +     * unconditionally, because it means the user would like to be verbose.
> +     */
>      bool hide_stderr;
>      bool use_shmem;
>      /* only launch the target process */
> @@ -557,7 +561,7 @@ static int test_migrate_start(QTestState **from, QTestState **to,
>
>      g_free(bootpath);
>
> -    if (args->hide_stderr) {
> +    if (!getenv("QTEST_LOG") && args->hide_stderr) {
>          ignore_stderr = "2>/dev/null";
>      } else {
>          ignore_stderr = "";

Reviewed-by: Thomas Huth <thuth@redhat.com>
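The change is small enough to read straight from the diff, but as a standalone sketch of the same decision logic (the helper name should_hide_stderr is made up for this illustration and is not part of the patch):

    #include <stdbool.h>
    #include <stdlib.h>

    /*
     * Sketch of the check added to test_migrate_start(): QTEST_LOG=1 means
     * the user asked for verbose output, so stderr is never hidden even if
     * the test case requested hide_stderr.
     */
    static bool should_hide_stderr(bool hide_stderr)
    {
        return !getenv("QTEST_LOG") && hide_stderr;
    }

With that, running migration-test with QTEST_LOG=1 keeps every error message from the spawned QEMU instances on the terminal.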
* Peter Xu (peterx@redhat.com) wrote:
> Maintain a list of faulted addresses on the destination host for which we're
> waiting on. This is implemented using a GTree rather than a real list to make
> sure even there're plenty of vCPUs/threads that are faulting, the lookup will
> still be fast with O(log(N)) (because we'll do that after placing each page).
> It should bring a slight overhead, but ideally that shouldn't be a big problem
> simply because in most cases the requested page list will be short.
>
> Actually we did similar things for postcopy blocktime measurements. This patch
> didn't use that simply because:
>
>   (1) blocktime measurement is towards vcpu threads only, but here we need to
>       record all faulted addresses, including main thread and external
>       thread (like, DPDK via vhost-user).
>
>   (2) blocktime measurement will require UFFD_FEATURE_THREAD_ID, but here we
>       don't want to add that extra dependency on the kernel version since not
>       necessary. E.g., we don't need to know which thread faulted on which
>       page, we also don't care about multiple threads faulting on the same
>       page. But we only care about what addresses are faulted so waiting for a
>       page copying from src.
>
>   (3) blocktime measurement is not enabled by default. However we need this by
>       default especially for postcopy recover.
>
> Another thing to mention is that this patch introduced a new mutex to serialize
> the receivedmap and the page_requested tree, however that serialization does
> not cover other procedures like UFFDIO_COPY.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c    | 41 +++++++++++++++++++++++++++++++++++++++-
>  migration/migration.h    | 19 ++++++++++++++++++-
>  migration/postcopy-ram.c | 17 ++++++++++++++---
>  migration/trace-events   |  2 ++
>  4 files changed, 74 insertions(+), 5 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 255e69c8aa..e3a958b299 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -143,6 +143,13 @@ static int migration_maybe_pause(MigrationState *s,
>                                   int new_state);
>  static void migrate_fd_cancel(MigrationState *s);
>
> +static gint page_request_addr_cmp(gconstpointer ap, gconstpointer bp)
> +{
> +    uintptr_t a = (uintptr_t) ap, b = (uintptr_t) bp;
> +
> +    return (a > b) - (a < b);
> +}
> +
>  void migration_object_init(void)
>  {
>      MachineState *ms = MACHINE(qdev_get_machine());
> @@ -165,6 +172,8 @@ void migration_object_init(void)
>      qemu_event_init(&current_incoming->main_thread_load_event, false);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_dst, 0);
>      qemu_sem_init(&current_incoming->postcopy_pause_sem_fault, 0);
> +    qemu_mutex_init(&current_incoming->page_request_mutex);
> +    current_incoming->page_requested = g_tree_new(page_request_addr_cmp);
>
>      if (!migration_object_check(current_migration, &err)) {
>          error_report_err(err);
> @@ -240,6 +249,11 @@ void migration_incoming_state_destroy(void)
>
>      qemu_event_reset(&mis->main_thread_load_event);
>
> +    if (mis->page_requested) {
> +        g_tree_destroy(mis->page_requested);
> +        mis->page_requested = NULL;
> +    }
> +
>      if (mis->socket_address_list) {
>          qapi_free_SocketAddressList(mis->socket_address_list);
>          mis->socket_address_list = NULL;
> @@ -354,8 +368,33 @@ int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>  }
>
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis,
> -                              RAMBlock *rb, ram_addr_t start)
> +                              RAMBlock *rb, ram_addr_t start, uint64_t haddr)
>  {
> +    void *aligned = (void *)(uintptr_t)(haddr & (-qemu_ram_pagesize(rb)));
> +    bool received;
> +
> +    WITH_QEMU_LOCK_GUARD(&mis->page_request_mutex) {
> +        received = ramblock_recv_bitmap_test_byte_offset(rb, start);
> +        if (!received && !g_tree_lookup(mis->page_requested, aligned)) {
> +            /*
> +             * The page has not been received, and it's not yet in the page
> +             * request list. Queue it. Set the value of element to 1, so that
> +             * things like g_tree_lookup() will return TRUE (1) when found.
> +             */
> +            g_tree_insert(mis->page_requested, aligned, (gpointer)1);
> +            mis->page_requested_count++;
> +            trace_postcopy_page_req_add(aligned, mis->page_requested_count);
> +        }
> +    }
> +
> +    /*
> +     * If the page is there, skip sending the message. We don't even need the
> +     * lock because as long as the page arrived, it'll be there forever.
> +     */
> +    if (received) {
> +        return 0;
> +    }
> +
>      return migrate_send_rp_message_req_pages(mis, rb, start);
>  }
>
> diff --git a/migration/migration.h b/migration/migration.h
> index e853ccf8b1..8d2d1ce839 100644
> --- a/migration/migration.h
> +++ b/migration/migration.h
> @@ -104,6 +104,23 @@ struct MigrationIncomingState {
>
>      /* List of listening socket addresses */
>      SocketAddressList *socket_address_list;
> +
> +    /* A tree of pages that we requested to the source VM */
> +    GTree *page_requested;
> +    /* For debugging purpose only, but would be nice to keep */
> +    int page_requested_count;
> +    /*
> +     * The mutex helps to maintain the requested pages that we sent to the
> +     * source, IOW, to guarantee coherent between the page_requests tree and
> +     * the per-ramblock receivedmap. Note! This does not guarantee consistency
> +     * of the real page copy procedures (using UFFDIO_[ZERO]COPY). E.g., even
> +     * if one bit in receivedmap is cleared, UFFDIO_COPY could have happened
> +     * for that page already. This is intended so that the mutex won't
> +     * serialize and blocked by slow operations like UFFDIO_* ioctls. However
> +     * this should be enough to make sure the page_requested tree always
> +     * contains valid information.
> +     */
> +    QemuMutex page_request_mutex;
>  };
>
>  MigrationIncomingState *migration_incoming_get_current(void);
> @@ -332,7 +349,7 @@ void migrate_send_rp_shut(MigrationIncomingState *mis,
>  void migrate_send_rp_pong(MigrationIncomingState *mis,
>                            uint32_t value);
>  int migrate_send_rp_req_pages(MigrationIncomingState *mis, RAMBlock *rb,
> -                              ram_addr_t start);
> +                              ram_addr_t start, uint64_t haddr);
>  int migrate_send_rp_message_req_pages(MigrationIncomingState *mis,
>                                        RAMBlock *rb, ram_addr_t start);
>  void migrate_send_rp_recv_bitmap(MigrationIncomingState *mis,
> diff --git a/migration/postcopy-ram.c b/migration/postcopy-ram.c
> index 722034dc01..ca1daf0024 100644
> --- a/migration/postcopy-ram.c
> +++ b/migration/postcopy-ram.c
> @@ -684,7 +684,7 @@ int postcopy_request_shared_page(struct PostCopyFD *pcfd, RAMBlock *rb,
>                                              qemu_ram_get_idstr(rb), rb_offset);
>          return postcopy_wake_shared(pcfd, client_addr, rb);
>      }
> -    migrate_send_rp_req_pages(mis, rb, aligned_rbo);
> +    migrate_send_rp_req_pages(mis, rb, aligned_rbo, client_addr);
>      return 0;
>  }
>
> @@ -979,7 +979,8 @@ retry:
>               * Send the request to the source - we want to request one
>               * of our host page sizes (which is >= TPS)
>               */
> -            ret = migrate_send_rp_req_pages(mis, rb, rb_offset);
> +            ret = migrate_send_rp_req_pages(mis, rb, rb_offset,
> +                                            msg.arg.pagefault.address);
>              if (ret) {
>                  /* May be network failure, try to wait for recovery */
>                  if (ret == -EIO && postcopy_pause_fault_thread(mis)) {
> @@ -1149,10 +1150,20 @@ static int qemu_ufd_copy_ioctl(MigrationIncomingState *mis, void *host_addr,
>          ret = ioctl(userfault_fd, UFFDIO_ZEROPAGE, &zero_struct);
>      }
>      if (!ret) {
> +        qemu_mutex_lock(&mis->page_request_mutex);
>          ramblock_recv_bitmap_set_range(rb, host_addr,
>                                         pagesize / qemu_target_page_size());
> +        /*
> +         * If this page resolves a page fault for a previous recorded faulted
> +         * address, take a special note to maintain the requested page list.
> +         */
> +        if (g_tree_lookup(mis->page_requested, host_addr)) {
> +            g_tree_remove(mis->page_requested, host_addr);
> +            mis->page_requested_count--;
> +            trace_postcopy_page_req_del(host_addr, mis->page_requested_count);
> +        }
> +        qemu_mutex_unlock(&mis->page_request_mutex);
>          mark_postcopy_blocktime_end((uintptr_t)host_addr);
> -
>      }
>      return ret;
>  }
> diff --git a/migration/trace-events b/migration/trace-events
> index 338f38b3dd..e4d5eb94ca 100644
> --- a/migration/trace-events
> +++ b/migration/trace-events
> @@ -162,6 +162,7 @@ postcopy_pause_return_path(void) ""
>  postcopy_pause_return_path_continued(void) ""
>  postcopy_pause_continued(void) ""
>  postcopy_start_set_run(void) ""
> +postcopy_page_req_add(void *addr, int count) "new page req %p total %d"
>  source_return_path_thread_bad_end(void) ""
>  source_return_path_thread_end(void) ""
>  source_return_path_thread_entry(void) ""
> @@ -272,6 +273,7 @@ postcopy_ram_incoming_cleanup_blocktime(uint64_t total) "total blocktime %" PRIu64
>  postcopy_request_shared_page(const char *sharer, const char *rb, uint64_t rb_offset) "for %s in %s offset 0x%"PRIx64
>  postcopy_request_shared_page_present(const char *sharer, const char *rb, uint64_t rb_offset) "%s already %s offset 0x%"PRIx64
>  postcopy_wake_shared(uint64_t client_addr, const char *rb) "at 0x%"PRIx64" in %s"
> +postcopy_page_req_del(void *addr, int count) "resolved page req %p total %d"
>
>  get_mem_fault_cpu_index(int cpu, uint32_t pid) "cpu: %d, pid: %u"
>
> --
> 2.26.2
>
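For readers less familiar with the GLib side of the patch, here is a standalone sketch of the same bookkeeping pattern: a GTree keyed by the faulted host address with a plain value comparator, so insertion, lookup, and removal after page placement are all O(log(N)). This is an illustration only, not QEMU code; the names merely mirror the patch.

    #include <glib.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Same comparator idea as page_request_addr_cmp() in the patch. */
    static gint addr_cmp(gconstpointer ap, gconstpointer bp)
    {
        uintptr_t a = (uintptr_t)ap, b = (uintptr_t)bp;

        return (a > b) - (a < b);
    }

    int main(void)
    {
        GTree *page_requested = g_tree_new(addr_cmp);
        void *faulted = (void *)(uintptr_t)0x7f0000001000;

        /*
         * A fault on a page we have not received yet: record the address.
         * The value is 1 so that g_tree_lookup() returns non-NULL when found.
         */
        if (!g_tree_lookup(page_requested, faulted)) {
            g_tree_insert(page_requested, faulted, (gpointer)1);
        }

        /* Later, once the page has been copied in, drop the entry again. */
        if (g_tree_lookup(page_requested, faulted)) {
            g_tree_remove(page_requested, faulted);
        }

        printf("outstanding requests: %d\n", g_tree_nnodes(page_requested));
        g_tree_destroy(page_requested);
        return 0;
    }

(Builds with: gcc example.c $(pkg-config --cflags --libs glib-2.0).)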
* Peter Xu (peterx@redhat.com) wrote:
> v6:
> - fix page mask to use ramblock psize [Dave]
>
> v5:
> - added one test patch for easier debugging for migration-test
> - added one fix patch [1] for another postcopy race
> - fixed a bug that could trigger when host/guest page size differs
>
> v4:
> - use "void */ulong" instead of "uint64_t" where proper in patch 3/4 [Dave]
>
> v3:
> - fix build on 32bit hosts & rebase
> - remove r-bs for the last 2 patches for Dave due to the changes
>
> v2:
> - add r-bs for Dave
> - add patch "migration: Properly destroy variables on incoming side" as patch 1
> - destroy page_request_mutex in migration_incoming_state_destroy() too [Dave]
> - use WITH_QEMU_LOCK_GUARD in two places where we can [Dave]
>
> We've seen conditional guest hangs on destination VM after postcopy recovered.
> However the hang will resolve itself after a few minutes.
>
> The problem is: after a postcopy recovery, the prioritized postcopy queue on
> the source VM is actually missing. So all the faulted threads before the
> postcopy recovery happened will keep halted until (accidentally) the page got
> copied by the background precopy migration stream.
>
> The solution is to also refresh this information after postcopy recovery. To
> achieve this, we need to maintain a list of faulted addresses on the
> destination node, so that we can resend the list when necessary. This work is
> done via patch 2-5.
>
> With that, the last thing we need to do is to send this extra information to
> source VM after recovered. Very luckily, this synchronization can be
> "emulated" by sending a bunch of page requests (although these pages have been
> sent previously!) to source VM just like when we've got a page fault. Even in
> the 1st version of the postcopy code we'll handle duplicated pages well. So
> this fix does not even need a new capability bit and it'll work smoothly on old
> QEMUs when we migrate from them to the new QEMUs.
>
> Please review, thanks.

Queued

Dave

>
> Peter Xu (6):
>   migration: Pass incoming state into qemu_ufd_copy_ioctl()
>   migration: Introduce migrate_send_rp_message_req_pages()
>   migration: Maintain postcopy faulted addresses
>   migration: Sync requested pages after postcopy recovery
>   migration/postcopy: Release fd before going into 'postcopy-pause'
>   migration-test: Only hide error if !QTEST_LOG
>
>  migration/migration.c        | 55 ++++++++++++++++++++++++++++++----
>  migration/migration.h        | 21 ++++++++++++-
>  migration/postcopy-ram.c     | 25 ++++++++++++----
>  migration/savevm.c           | 57 ++++++++++++++++++++++++++++++++++++
>  migration/trace-events       |  3 ++
>  tests/qtest/migration-test.c |  6 +++-
>  6 files changed, 154 insertions(+), 13 deletions(-)
>
> --
> 2.26.2
>
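To make the recovery step concrete: the trick described in the cover letter is simply to walk the tree of still-outstanding faulted addresses once the channel is re-established and issue an ordinary page request for each of them, exactly as if the fault had just happened; duplicates are harmless because the source has coped with repeated requests since the first postcopy version. The fragment below only illustrates that idea and is not the code from the "Sync requested pages after postcopy recovery" patch; resend_one_request() is a stand-in for the real RAMBlock lookup plus migrate_send_rp_message_req_pages() call.

    #include <glib.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for resolving the RAMBlock/offset and re-sending the request. */
    static gboolean resend_one_request(gpointer key, gpointer value,
                                       gpointer opaque)
    {
        uintptr_t host_addr = (uintptr_t)key;

        printf("re-requesting faulted page at 0x%" PRIxPTR "\n", host_addr);
        return FALSE;   /* FALSE keeps g_tree_foreach() walking the tree */
    }

    /* After postcopy recovery: replay every outstanding fault to the source. */
    static void resync_outstanding_faults(GTree *page_requested)
    {
        g_tree_foreach(page_requested, resend_one_request, NULL);
    }

g_tree_foreach() visits entries in key order, so the replayed requests would go out lowest address first; the ordering does not matter for correctness here, only that every recorded fault gets re-sent.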