Message ID: 20240418193528.41780-1-dwmw2@infradead.org
Series: Cleaning up the KVM clock mess
On Thu, 2024-04-18 at 20:34 +0100, David Woodhouse wrote:
>
>       KVM: x86: Remove periodic global clock updates
>       KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE

Meh, I might have to put those back. They were originally introduced to
cope with NTP frequency skew, which is no longer a problem, but we now
know there's a systemic skew even between the host CLOCK_MONOTONIC_RAW
and the KVM clock as calculated via the guest's TSC.

So at least when !ka->use_master_clock I think the forced resync does
need to happen.
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> inadequate. It ignores TSC scaling, and ignores the fact that the host
> TSC may differ from one host to the next (and in fact because of the way
> the kernel calibrates it, it generally differs from one boot to the next
> even on the same hardware).
>
> Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> and attempt to document the *awful* process that we're requiring userspace
> to follow to merely preserve the TSC across migration.
>
> I may have thrown up in my mouth a little when writing that documentation.
> It's an awful API. If we do this, we should be ashamed of ourselves.
> (I also haven't tested the documented process yet).
>
> Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
>  arch/x86/include/uapi/asm/kvm.h         |   6 ++
>  arch/x86/kvm/x86.c                      |  15 ++++
>  3 files changed, 109 insertions(+), 27 deletions(-)

Reviewed-by: Paul Durrant <paul@xen.org>
On Fri, 2024-04-19 at 16:49 +0100, Paul Durrant wrote:
> On 18/04/2024 20:34, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> >
> > The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> > inadequate. It ignores TSC scaling, and ignores the fact that the host
> > TSC may differ from one host to the next (and in fact because of the way
> > the kernel calibrates it, it generally differs from one boot to the next
> > even on the same hardware).
> >
> > Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> > and attempt to document the *awful* process that we're requiring userspace
> > to follow to merely preserve the TSC across migration.
> >
> > I may have thrown up in my mouth a little when writing that documentation.
> > It's an awful API. If we do this, we should be ashamed of ourselves.
> > (I also haven't tested the documented process yet).
> >
> > Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> > https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Oops, I meant to change that commit message. I've changed my mind, as
KVM_VCPU_TSC_VALUE sets the TSC using the KVM clock as a reference...
which is utterly backwards, since the KVM clock is *defined* in terms
of the TSC. Will do so before posting it again.

May also include the "loop over KVM_GET_CLOCK until tai_offset didn't
change" part to fix the leap second problem.

I think we do need to implement a better way to get { TAI, host TSC }
than abusing KVM_GET_CLOCK when we don't even *care* about the KVM
clock. We need a way to migrate the arch counters on platforms other
than x86 too. But I'm trying to leave that part for later. At least on
x86 we *do* have a way to migrate the TSC accurately, even if it's
awful.

> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >  Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
> >  arch/x86/include/uapi/asm/kvm.h         |   6 ++
> >  arch/x86/kvm/x86.c                      |  15 ++++
> >  3 files changed, 109 insertions(+), 27 deletions(-)
> >
> > Reviewed-by: Paul Durrant <paul@xen.org>
On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_writE_lock, so in pvclock_update_vm_gtod_copy(), copy it into a

       ^ typo: capitalization of the 'E'

> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
>
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>  arch/x86/include/asm/kvm_host.h |   4 +
>  arch/x86/kvm/x86.c              | 150 ++++++++++++++++----------------
>  2 files changed, 78 insertions(+), 76 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cfac72b4aa64..13f979dd14b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1353,6 +1353,7 @@ struct kvm_arch {
>  	u64 last_tsc_write;
>  	u32 last_tsc_khz;
>  	u64 last_tsc_offset;
> +	u64 last_tsc_scaling_ratio;
>  	u64 cur_tsc_nsec;
>  	u64 cur_tsc_write;
>  	u64 cur_tsc_offset;
> @@ -1366,6 +1367,9 @@ struct kvm_arch {
>  	bool use_master_clock;
>  	u64 master_kernel_ns;
>  	u64 master_cycle_now;
> +	u64 master_tsc_scaling_ratio;
> +	u32 master_tsc_mul;
> +	s8 master_tsc_shift;
>  	struct delayed_work kvmclock_update_work;
>  	struct delayed_work kvmclock_sync_work;
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f870e29d2558..5cd92f4b4c97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>  	kvm->arch.last_tsc_nsec = ns;
>  	kvm->arch.last_tsc_write = tsc;
>  	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
> +	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
>  	kvm->arch.last_tsc_offset = offset;
>
>  	vcpu->arch.last_guest_tsc = tsc;
> @@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  {
>  #ifdef CONFIG_X86_64
>  	struct kvm_arch *ka = &kvm->arch;
> +	uint64_t last_tsc_hz;
>  	int vclock_mode;
>  	bool host_tsc_clocksource, vcpus_matched;
>
> @@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  		&& !ka->backwards_tsc_observed
>  		&& !ka->boot_vcpu_runs_old_kvmclock;
>
> +	/*
> +	 * When TSC scaling is in use (which can thankfully only happen
> +	 * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
> +	 * KVM clock precisely as the guest would, by scaling through
> +	 * the guest TSC frequency. Otherwise, differences in arithmetic
> +	 * precision lead to systemic drift between the guest's and the
> +	 * host's idea of the time.
> +	 */
> +	if (kvm_caps.has_tsc_control) {
> +		/*
> +		 * Copy from the field protected solely by ka->tsc_write_lock,
> +		 * to the field protected by the ka->pvclock_sc seqlock.
> +		 */
> +		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> +
> +		/*
> +		 * Calculate the scaling factors precisely the same way
> +		 * that kvm_guest_time_update() does.
> +		 */
> +		last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> +					    ka->last_tsc_scaling_ratio);
> +		kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	} else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1009,

1009?

> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	}
> +
>  	if (ka->use_master_clock)
>  		atomic_set(&kvm_guest_has_master_clock, 1);
>