Message ID | 20201019225720.172743-6-peterx@redhat.com |
---|---|
State | Superseded |
Headers | show |
Series | migration/postcopy: Sync faulted addresses after network recovered | expand |
* Peter Xu (peterx@redhat.com) wrote: > Logically below race could trigger with the old code: > > test program migration thread > ------------ ---------------- > wait_until('postcopy-pause') > postcopy_pause() > set_state('postcopy-pause') > do_postcopy_recover() > arm s->to_dst_file with new fd > release s->to_dst_file [1] > > Here [1] could have released the just-installed recoverying channel. Then the > migration could hang without really resuming. > > Instead, it should be very safe to release the fd before setting the state into > 'postcopy-pause', because there's no reason for any other thread to touch it > during 'postcopy-active'. > > Dave reported a very rare postcopy recovery hang that the migration-test > program waited for the migration to complete in migrate_postcopy_complete(). > We do suspect it's the same thing that we're gonna fix here. Hard to tell. > However since we've noticed this, fix this irrelevant of the hang report. > > Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> > Cc: Juan Quintela <quintela@redhat.com> > Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com> > --- > migration/migration.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/migration/migration.c b/migration/migration.c > index 0b4fcff01f..50df6251b7 100644 > --- a/migration/migration.c > +++ b/migration/migration.c > @@ -3182,9 +3182,6 @@ static MigThrError postcopy_pause(MigrationState *s) > while (true) { > QEMUFile *file; > > - migrate_set_state(&s->state, s->state, > - MIGRATION_STATUS_POSTCOPY_PAUSED); > - > /* Current channel is possibly broken. Release it. */ > assert(s->to_dst_file); > qemu_mutex_lock(&s->qemu_file_lock); > @@ -3195,6 +3192,9 @@ static MigThrError postcopy_pause(MigrationState *s) > qemu_file_shutdown(file); > qemu_fclose(file); > > + migrate_set_state(&s->state, s->state, > + MIGRATION_STATUS_POSTCOPY_PAUSED); > + > error_report("Detected IO failure for postcopy. " > "Migration paused."); > > -- > 2.26.2 >
diff --git a/migration/migration.c b/migration/migration.c index 0b4fcff01f..50df6251b7 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -3182,9 +3182,6 @@ static MigThrError postcopy_pause(MigrationState *s) while (true) { QEMUFile *file; - migrate_set_state(&s->state, s->state, - MIGRATION_STATUS_POSTCOPY_PAUSED); - /* Current channel is possibly broken. Release it. */ assert(s->to_dst_file); qemu_mutex_lock(&s->qemu_file_lock); @@ -3195,6 +3192,9 @@ static MigThrError postcopy_pause(MigrationState *s) qemu_file_shutdown(file); qemu_fclose(file); + migrate_set_state(&s->state, s->state, + MIGRATION_STATUS_POSTCOPY_PAUSED); + error_report("Detected IO failure for postcopy. " "Migration paused.");
Logically below race could trigger with the old code: test program migration thread ------------ ---------------- wait_until('postcopy-pause') postcopy_pause() set_state('postcopy-pause') do_postcopy_recover() arm s->to_dst_file with new fd release s->to_dst_file [1] Here [1] could have released the just-installed recoverying channel. Then the migration could hang without really resuming. Instead, it should be very safe to release the fd before setting the state into 'postcopy-pause', because there's no reason for any other thread to touch it during 'postcopy-active'. Dave reported a very rare postcopy recovery hang that the migration-test program waited for the migration to complete in migrate_postcopy_complete(). We do suspect it's the same thing that we're gonna fix here. Hard to tell. However since we've noticed this, fix this irrelevant of the hang report. Cc: Dr. David Alan Gilbert <dgilbert@redhat.com> Cc: Juan Quintela <quintela@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> --- migration/migration.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-)