Message ID | 20200908203022.341615-1-peterx@redhat.com |
---|---|
Series | migration/postcopy: Sync faulted addresses after network recovered |
* Peter Xu (peterx@redhat.com) wrote:
> In migration_incoming_state_destroy(), we've got a few variables that aren't
> destroyed properly, namely:
>
>     main_thread_load_event
>     postcopy_pause_sem_dst
>     postcopy_pause_sem_fault
>     rp_mutex
>
> Destroy them properly.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> ---
>  migration/migration.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/migration/migration.c b/migration/migration.c
> index 58a5452471..749d9b145b 100644
> --- a/migration/migration.c
> +++ b/migration/migration.c
> @@ -238,12 +238,15 @@ void migration_incoming_state_destroy(void)
>          mis->postcopy_remote_fds = NULL;
>      }
>
> -    qemu_event_reset(&mis->main_thread_load_event);
> -
>      if (mis->socket_address_list) {
>          qapi_free_SocketAddressList(mis->socket_address_list);
>          mis->socket_address_list = NULL;
>      }
> +
> +    qemu_event_destroy(&mis->main_thread_load_event);
> +    qemu_sem_destroy(&mis->postcopy_pause_sem_dst);
> +    qemu_sem_destroy(&mis->postcopy_pause_sem_fault);
> +    qemu_mutex_destroy(&mis->rp_mutex);
>  }
>
>  static void migrate_generate_event(int new_state)
> --
> 2.26.2
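The patch above pairs every synchronization object that the incoming state owns with a matching destroy call. A minimal sketch of that init/destroy pairing follows, assuming plain POSIX threads rather than QEMU's own wrappers: `IncomingState`, `incoming_state_init()` and `incoming_state_destroy()` are hypothetical names for illustration, and a pthread mutex and condition variable stand in for the QEMU mutex, semaphores and event named in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the incoming-migration state; not QEMU's struct. */
typedef struct IncomingState {
    pthread_mutex_t rp_mutex;    /* stands in for a QEMU mutex */
    pthread_cond_t  load_event;  /* stands in for a QEMU event */
    int initialized;
} IncomingState;

/* Initialize every object; on failure, undo what was already set up. */
static int incoming_state_init(IncomingState *s)
{
    assert(s != NULL);
    s->initialized = 0;
    if (pthread_mutex_init(&s->rp_mutex, NULL) != 0) {
        return -1;
    }
    if (pthread_cond_init(&s->load_event, NULL) != 0) {
        pthread_mutex_destroy(&s->rp_mutex);
        return -1;
    }
    s->initialized = 1;
    return 0;
}

/*
 * Destroy every object that init created, in reverse order.
 * Returns the number of destroy calls that failed (0 on success).
 * Calling it on a state that is not initialized is a no-op.
 */
static int incoming_state_destroy(IncomingState *s)
{
    int failures = 0;

    assert(s != NULL);
    if (!s->initialized) {
        return 0;
    }
    failures += pthread_cond_destroy(&s->load_event) != 0;
    failures += pthread_mutex_destroy(&s->rp_mutex) != 0;
    s->initialized = 0;
    return failures;
}
```

The point of the sketch is only the symmetry: resetting an object (as the removed `qemu_event_reset()` call did) leaves it allocated, whereas a destroy call releases it.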
* Peter Xu (peterx@redhat.com) wrote:
> v2:

Queued

> - add r-bs for Dave
> - add patch "migration: Properly destroy variables on incoming side" as patch 1
> - destroy page_request_mutex in migration_incoming_state_destroy() too [Dave]
> - use WITH_QEMU_LOCK_GUARD in two places where we can [Dave]
>
> We've seen conditional guest hangs on destination VM after postcopy recovered.
> However the hang will resolve itself after a few minutes.
>
> The problem is: after a postcopy recovery, the prioritized postcopy queue on
> the source VM is actually missing. So all the faulted threads before the
> postcopy recovery happened will keep halted until (accidentally) the page got
> copied by the background precopy migration stream.
>
> The solution is to also refresh this information after postcopy recovery. To
> achieve this, we need to maintain a list of faulted addresses on the
> destination node, so that we can resend the list when necessary. This work is
> done via patch 2-5.
>
> With that, the last thing we need to do is to send this extra information to
> source VM after recovered. Very luckily, this synchronization can be
> "emulated" by sending a bunch of page requests (although these pages have been
> sent previously!) to source VM just like when we've got a page fault. Even in
> the 1st version of the postcopy code we'll handle duplicated pages well. So
> this fix does not even need a new capability bit and it'll work smoothly on old
> QEMUs when we migrate from them to the new QEMUs.
>
> Please review, thanks.
>
> Peter Xu (6):
>   migration: Properly destroy variables on incoming side
>   migration: Rework migrate_send_rp_req_pages() function
>   migration: Pass incoming state into qemu_ufd_copy_ioctl()
>   migration: Introduce migrate_send_rp_message_req_pages()
>   migration: Maintain postcopy faulted addresses
>   migration: Sync requested pages after postcopy recovery
>
>  migration/migration.c    | 79 +++++++++++++++++++++++++++++++++++-----
>  migration/migration.h    | 23 +++++++++++-
>  migration/postcopy-ram.c | 46 ++++++++++-------------
>  migration/savevm.c       | 57 +++++++++++++++++++++++++++++
>  migration/trace-events   |  3 ++
>  5 files changed, 171 insertions(+), 37 deletions(-)
>
> --
> 2.26.2

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
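The idea in the cover letter (remember which addresses have faulted on the destination, forget them once the page arrives, and re-request whatever is still outstanding after a recovery) can be sketched roughly as below. This is not the series' implementation: `PendingFaults`, the `pending_*()` helpers, the fixed-size array and the `Recorder` callback are all hypothetical and chosen only to keep the example short; the real code may well use a different container and sends requests over the return path instead of recording them.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PENDING 64  /* arbitrary bound, for the sketch only */

/* Hypothetical set of faulted page addresses that are still outstanding. */
typedef struct PendingFaults {
    pthread_mutex_t lock;
    uint64_t addr[MAX_PENDING];
    size_t count;
} PendingFaults;

typedef void (*send_req_fn)(uint64_t addr, void *opaque);

static void pending_init(PendingFaults *p)
{
    assert(pthread_mutex_init(&p->lock, NULL) == 0);
    p->count = 0;
}

static void pending_destroy(PendingFaults *p)
{
    pthread_mutex_destroy(&p->lock);
}

/* On a page fault: returns 1 if newly recorded, 0 if already known, -1 if full. */
static int pending_add(PendingFaults *p, uint64_t a)
{
    int ret = 1;

    pthread_mutex_lock(&p->lock);
    for (size_t i = 0; i < p->count; i++) {
        if (p->addr[i] == a) {
            ret = 0;
            break;
        }
    }
    if (ret == 1) {
        if (p->count < MAX_PENDING) {
            p->addr[p->count++] = a;
        } else {
            ret = -1;
        }
    }
    pthread_mutex_unlock(&p->lock);
    return ret;
}

/* When the page arrives: returns 1 if the address was pending, 0 otherwise. */
static int pending_del(PendingFaults *p, uint64_t a)
{
    int ret = 0;

    pthread_mutex_lock(&p->lock);
    for (size_t i = 0; i < p->count; i++) {
        if (p->addr[i] == a) {
            p->addr[i] = p->addr[--p->count];  /* swap-remove */
            ret = 1;
            break;
        }
    }
    pthread_mutex_unlock(&p->lock);
    return ret;
}

/* After recovery: re-request every outstanding address; returns how many. */
static size_t pending_resend_all(PendingFaults *p, send_req_fn send, void *opaque)
{
    size_t n;

    pthread_mutex_lock(&p->lock);
    for (size_t i = 0; i < p->count; i++) {
        send(p->addr[i], opaque);  /* callback runs with the lock held */
    }
    n = p->count;
    pthread_mutex_unlock(&p->lock);
    return n;
}

/* Hypothetical sender that only records what would have been requested. */
typedef struct Recorder {
    uint64_t sent[MAX_PENDING];
    size_t n;
} Recorder;

static void record_request(uint64_t addr, void *opaque)
{
    Recorder *r = opaque;

    if (r->n < MAX_PENDING) {
        r->sent[r->n++] = addr;
    }
}
```

Because the source already tolerates duplicated page requests, as the cover letter notes, resending an address whose page is in flight is harmless; the sketch therefore makes no attempt to suppress such duplicates.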
* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > v2:
>
> Queued

Hi Peter,
I've had to unqueue most of this; it doesn't like building on 32bit.
I fixed up the trace_ stuff easily (_del can take a void*, add just
needs to use PRIX64) but there are other places where it doesn't like
the casting from pointers to uint64_t's etc.

I've kept the first couple of commits.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
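For readers unfamiliar with the 32-bit problem Dave describes: on a host where pointers are 32 bits wide, a direct cast between a pointer and `uint64_t` changes size and compilers typically warn about it, and a fixed `%lx`-style format does not match a 64-bit argument. A minimal sketch of the usual portable idiom follows, assuming nothing about the actual fix that was later reposted; `ptr_to_u64()`, `u64_to_ptr()` and `format_addr()` are hypothetical helper names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Widen a host pointer to 64 bits by going through uintptr_t first. */
static uint64_t ptr_to_u64(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}

/*
 * Narrow a 64-bit value back to a host pointer. On a 32-bit host the
 * value must fit in uintptr_t, which the assertion checks.
 */
static void *u64_to_ptr(uint64_t v)
{
    assert(v <= UINTPTR_MAX);
    return (void *)(uintptr_t)v;
}

/*
 * Format a 64-bit address as uppercase hex. PRIX64 expands to the
 * correct conversion specifier for uint64_t on the host at hand.
 */
static int format_addr(char *buf, size_t len, uint64_t addr)
{
    return snprintf(buf, len, "0x%" PRIX64, addr);
}
```

Going through `uintptr_t` makes the width change explicit, so the same source builds cleanly whether pointers are 32 or 64 bits wide.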
On Fri, Sep 25, 2020 at 12:50:26PM +0100, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> > * Peter Xu (peterx@redhat.com) wrote:
> > > v2:
> >
> > Queued
>
> Hi Peter,
> I've had to unqueue most of this; it doesn't like building on 32bit.
> I fixed up the trace_ stuff easily (_del can take a void*, add just
> needs to use PRIX64) but there are other places where it doesn't like
> the casting from pointers to uint64_t's etc.
>
> I've kept the first couple of commits.

Thanks, Dave. I'll have a look later and repost.