Message ID | da83e967f5a266ca28ee54638a4a9be9@www.loen.fr
---|---
State | New
Hi Marc,

On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> Hi Anup,
>
> On 2013-08-14 15:22, Anup Patel wrote:
>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
>>> Hi Pranav,
>>>
>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote:
>>>> Systems with a large external L3-cache (a few MBs) might have dirty
>>>> content belonging to guest pages in the L3-cache. To tackle this,
>>>> we need to flush such dirty content from the d-cache so that the
>>>> guest sees the correct contents of its pages when the guest MMU is
>>>> disabled.
>>>>
>>>> The patch fixes coherent_icache_guest_page() for external L3-cache.
>>>>
>>>> Signed-off-by: Pranavkumar Sawargaonkar <pranavkumar@linaro.org>
>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org>
>>>> ---
>>>>  arch/arm64/include/asm/kvm_mmu.h | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
>>>> index efe609c..5129038 100644
>>>> --- a/arch/arm64/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h
>>>> @@ -123,6 +123,8 @@ static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
>>>>  	if (!icache_is_aliasing()) {	/* PIPT */
>>>>  		unsigned long hva = gfn_to_hva(kvm, gfn);
>>>>  		flush_icache_range(hva, hva + PAGE_SIZE);
>>>> +		/* Flush d-cache for systems with external caches. */
>>>> +		__flush_dcache_area((void *) hva, PAGE_SIZE);
>>>>  	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
>>>>  		/* any kind of VIPT cache */
>>>>  		__flush_icache_all();
>>>
>>> [adding Will to the discussion as we talked about this in the past]
>>>
>>> That's definitely an issue, but I'm not sure the fix is to hit the data
>>> cache on each page mapping. It looks overkill.
>>>
>>> Wouldn't it be enough to let userspace do the cache cleaning? kvmtools
>>> knows which bits of the guest memory have been touched, and can do a
>>> "DC DVAC" on this region.
>>
>> It seems a bit unnatural to have cache cleaning in user-space. I am sure
>> other architectures don't have such cache cleaning in user-space for KVM.
>>
>>> The alternative is do it in the kernel before running any vcpu - but
>>> that's not very nice either (have to clean the whole of the guest
>>> memory, which makes a full dcache clean more appealing).
>>
>> Actually, cleaning the full d-cache by set/way upon first run of a VCPU
>> was our second option, but the current approach seemed very simple,
>> hence we went for it.
>>
>> If more people vote for a full d-cache clean upon first run of a VCPU,
>> then we will revise this patch.
>
> Can you please give the attached patch a spin on your HW? I've
> boot-tested it on a model, but of course I can't really verify that it
> fixes your issue.
>
> As far as I can see, it should fix it without any additional flushing.
>
> Please let me know how it goes.

HCR_EL2.DC=1 means all memory accesses with the Stage-1 MMU off are
treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
Outer Write-Back Write-Allocate memory".

HCR_EL2.DC=0 means all memory accesses with the Stage-1 MMU off are
treated as "Strongly-ordered device memory".

Now, if a Guest/VM accesses hardware MMIO devices directly (such as the
VGIC CPU interface) with its MMU off, those MMIO devices will be accessed
as normal memory when HCR_EL2.DC=1.

HCR_EL2.DC=1 makes sense only if all devices for the Guest/VM are
software-emulated, which is not true for KVM ARM or KVM ARM64 because we
use the VGIC.

IMHO, this patch enforces an incorrect memory attribute for the Guest/VM
when the Stage-1 MMU is off.

Nevertheless, we will still try your patch.

Regards,
Anup
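For reference, this is roughly what the PIPT branch of coherent_icache_guest_page() looks like with the proposed hunk applied. It is reconstructed from the diff quoted above for readability, not copied from a tree:

```c
/* Reconstructed from the diff above (arch/arm64/include/asm/kvm_mmu.h),
 * with the proposed two lines applied. Shown for readability only.
 */
static inline void coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn)
{
	if (!icache_is_aliasing()) {		/* PIPT */
		unsigned long hva = gfn_to_hva(kvm, gfn);
		/* Cleans the d-cache to PoU and invalidates the i-cache. */
		flush_icache_range(hva, hva + PAGE_SIZE);
		/* Flush d-cache for systems with external caches. */
		__flush_dcache_area((void *) hva, PAGE_SIZE);
	} else if (!icache_is_aivivt()) {	/* non ASID-tagged VIVT */
		/* any kind of VIPT cache */
		__flush_icache_all();
	}
}
```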
On 2013-08-15 07:26, Anup Patel wrote:
> Hi Marc,
>
> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier <marc.zyngier@arm.com> wrote:
[...]
>
> HCR_EL2.DC=1 means all memory accesses with the Stage-1 MMU off are
> treated as "Normal Non-shareable, Inner Write-Back Write-Allocate,
> Outer Write-Back Write-Allocate memory".
>
> HCR_EL2.DC=0 means all memory accesses with the Stage-1 MMU off are
> treated as "Strongly-ordered device memory".
>
> Now, if a Guest/VM accesses hardware MMIO devices directly (such as the
> VGIC CPU interface) with its MMU off, those MMIO devices will be
> accessed as normal memory when HCR_EL2.DC=1.

I don't think so. Stage-2 still applies, and should force MMIO to be
accessed as device memory.

> HCR_EL2.DC=1 makes sense only if all devices for the Guest/VM are
> software-emulated, which is not true for KVM ARM or KVM ARM64 because we
> use the VGIC.
>
> IMHO, this patch enforces an incorrect memory attribute for the Guest/VM
> when the Stage-1 MMU is off.

See above. My understanding is that HCR.DC controls the default output of
Stage-1, and Stage-2 overrides still apply.

> Nevertheless, we will still try your patch.

Thanks.

M.
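Marc's reading can be sketched as follows (illustrative only; the enum and helper functions below are invented for the example and are not KVM or architecture code): HCR_EL2.DC only selects the default Stage-1 output while the guest MMU is off, and the final memory type is the more restrictive of the Stage-1 and Stage-2 attributes, so a Device mapping at Stage-2 still yields a device access.

```c
/* Illustrative sketch of the argument above. The enum and helpers are
 * invented for this example; they are not KVM or architecture code.
 */
enum memtype { MT_DEVICE, MT_NORMAL_WB };

/* Stage-1 output: with the guest MMU on, the guest's PTE attribute is
 * used; with it off, HCR_EL2.DC selects the default (Normal Write-Back
 * vs Device/Strongly-ordered).
 */
static enum memtype stage1_output(int guest_mmu_on, int hcr_dc,
				  enum memtype guest_pte_attr)
{
	if (guest_mmu_on)
		return guest_pte_attr;
	return hcr_dc ? MT_NORMAL_WB : MT_DEVICE;
}

/* Combining Stage-1 and Stage-2: the more restrictive type wins, so a
 * Device mapping at Stage-2 (e.g. for the VGIC CPU interface) still
 * gives a device access even when HCR_EL2.DC=1.
 */
static enum memtype combined_memtype(enum memtype s1, enum memtype s2)
{
	return (s1 == MT_DEVICE || s2 == MT_DEVICE) ? MT_DEVICE : MT_NORMAL_WB;
}
```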
Hi Marc,

On 15 August 2013 10:22, Marc Zyngier <marc.zyngier@arm.com> wrote:
[...]
> Can you please give the attached patch a spin on your HW? I've
> boot-tested it on a model, but of course I can't really verify that it
> fixes your issue.
>
> As far as I can see, it should fix it without any additional flushing.
>
> Please let me know how it goes.

Surely we will try this patch.

Just to add a few more points to this discussion:

We are already flushing the d-cache in coherent_icache_guest_page() by
calling flush_icache_range(). In my patch I am doing the same thing one
more time by calling __flush_dcache_area() just below
flush_icache_range().

The main difference between the d-cache flushing in these two routines
is: flush_icache_range() cleans the d-cache to the point of unification
(dc cvau), while __flush_dcache_area() cleans and invalidates it to the
point of coherency (dc civac).

Since the second routine works to the point of coherency, it makes the
L3-cache coherent and syncs the changes with immediate effect, which
solves the issue.

Thanks,
Pranav
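To make the dc cvau / dc civac distinction concrete, here is a rough sketch of the two maintenance loops. This is not the actual kernel code (flush_icache_range() and __flush_dcache_area() live in arch/arm64/mm/cache.S and derive the line size from CTR_EL0); the fixed 64-byte line size below is an assumption for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE	64	/* assumed line size; the kernel reads CTR_EL0 */

/* Roughly what flush_icache_range() does on the d-side: clean each line
 * to the Point of Unification (dc cvau). Enough to keep the i-cache
 * coherent, but dirty data may still sit in an external L3 cache.
 */
static void clean_dcache_to_pou(void *addr, size_t size)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)addr + size;

	for (; p < end; p += CACHE_LINE)
		asm volatile("dc cvau, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" : : : "memory");
}

/* Roughly what __flush_dcache_area() does: clean and invalidate each
 * line to the Point of Coherency (dc civac), pushing the data past the
 * L3 cache so that a non-cacheable (MMU-off) guest access sees it.
 */
static void flush_dcache_to_poc(void *addr, size_t size)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
	uintptr_t end = (uintptr_t)addr + size;

	for (; p < end; p += CACHE_LINE)
		asm volatile("dc civac, %0" : : "r" (p) : "memory");
	asm volatile("dsb sy" : : : "memory");
}
```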
On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 07:26, Anup Patel wrote:
[...]
>> Now, if a Guest/VM accesses hardware MMIO devices directly (such as the
>> VGIC CPU interface) with its MMU off, those MMIO devices will be
>> accessed as normal memory when HCR_EL2.DC=1.
>
> I don't think so. Stage-2 still applies, and should force MMIO to be
> accessed as device memory.
>
>> IMHO, this patch enforces an incorrect memory attribute for the Guest/VM
>> when the Stage-1 MMU is off.
>
> See above. My understanding is that HCR.DC controls the default output of
> Stage-1, and Stage-2 overrides still apply.

You had mentioned that the PAGE_S2_DEVICE attribute was redundant and
wanted the guest to decide the memory attribute. In other words, you did
not want to enforce any memory attribute in Stage-2.

Please refer to this patch: https://patchwork.kernel.org/patch/2543201/

Regards,
Anup
On 2013-08-15 14:31, Anup Patel wrote:
> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
[...]
>> See above. My understanding is that HCR.DC controls the default output
>> of Stage-1, and Stage-2 overrides still apply.
>
> You had mentioned that the PAGE_S2_DEVICE attribute was redundant and
> wanted the guest to decide the memory attribute. In other words, you did
> not want to enforce any memory attribute in Stage-2.
>
> Please refer to this patch: https://patchwork.kernel.org/patch/2543201/

This patch has never been merged. If you carry on following the
discussion, you will certainly notice it was dropped for a very good
reason:
https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html

So Stage-2 memory attributes are used, they are not going away, and they
are essential to the patch I sent this morning.

M.
Hi Marc,

On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 14:31, Anup Patel wrote:
[...]
> This patch has never been merged. If you carry on following the
> discussion, you will certainly notice it was dropped for a very good
> reason:
> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html
>
> So Stage-2 memory attributes are used, they are not going away, and they
> are essential to the patch I sent this morning.

HCR_EL2.DC=1 will break the Guest/VM bootloader/firmware if the Guest/VM
is given a DMA-capable device in pass-through mode. The reason is that
bootloaders/firmware typically don't enable the MMU, and such a
bootloader/firmware will programme a pass-through DMA-capable device
without any cache flushes of guest RAM (because it has not enabled the
MMU).

A good example of such a device would be a SATA AHCI controller given to
a Guest/VM as direct access (using the System MMU), with the guest
bootloader/firmware accessing this SATA AHCI controller to load kernel
images from a SATA disk. In this situation, we would have to hack the
AHCI driver in the guest bootloader/firmware to do explicit flushes to
guest RAM (because we have HCR_EL2.DC=1).

Regards,
Anup
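To spell out the failure mode described above, a purely illustrative sketch of a cache-unaware firmware driver is shown below; the register layout, addresses, and command values are invented for the example and are not taken from a real AHCI driver.

```c
#include <stdint.h>

/* Illustrative only: a cache-unaware firmware driver running with the
 * MMU off. All names, addresses and values are made up for the example.
 */
#define FW_CMD_TABLE	((volatile uint32_t *)0x80000000UL)	/* guest RAM */
#define FW_DOORBELL	((volatile uint32_t *)0x90001000UL)	/* MMIO reg  */

static void fw_issue_disk_read(uint32_t lba)
{
	/* The firmware builds a command descriptor in guest RAM. With the
	 * MMU off and HCR_EL2.DC=0 these stores bypass the caches and land
	 * in RAM; with HCR_EL2.DC=1 they are treated as cacheable
	 * Write-Back and may stay in the CPU caches.
	 */
	FW_CMD_TABLE[0] = 0x25;		/* example: READ command opcode */
	FW_CMD_TABLE[1] = lba;

	/* The controller's DMA engine reads the descriptor straight from
	 * RAM, so with HCR_EL2.DC=1 it can see stale data unless the
	 * firmware adds cache cleaning it was never written to perform.
	 */
	*FW_DOORBELL = 1;
}
```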
On 2013-08-15 16:13, Anup Patel wrote:
> Hi Marc,
>
> On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
[...]
>
> HCR_EL2.DC=1 will break the Guest/VM bootloader/firmware if the Guest/VM
> is given a DMA-capable device in pass-through mode. The reason is that
> bootloaders/firmware typically don't enable the MMU, and such a
> bootloader/firmware will programme a pass-through DMA-capable device
> without any cache flushes of guest RAM (because it has not enabled the
> MMU).
>
> A good example of such a device would be a SATA AHCI controller given to
> a Guest/VM as direct access (using the System MMU), with the guest
> bootloader/firmware accessing this SATA AHCI controller to load kernel
> images from a SATA disk. In this situation, we would have to hack the
> AHCI driver in the guest bootloader/firmware to do explicit flushes to
> guest RAM (because we have HCR_EL2.DC=1).

OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
curiosity: is that a made-up example or something you actually have?

Back to square one:
Can you please benchmark the various cache cleaning options (global at
startup time, per-page on S2 translation fault, and user-space)?

Thanks,

M.
Hi Pranav,

On 2013-08-15 09:39, Pranavkumar Sawargaonkar wrote:

[...]

> Just to add a few more points to this discussion:
>
> We are already flushing the d-cache in coherent_icache_guest_page() by
> calling flush_icache_range(). In my patch I am doing the same thing one
> more time by calling __flush_dcache_area() just below
> flush_icache_range().
>
> The main difference between the d-cache flushing in these two routines
> is: flush_icache_range() cleans the d-cache to the point of unification
> (dc cvau), while __flush_dcache_area() cleans and invalidates it to the
> point of coherency (dc civac).
>
> Since the second routine works to the point of coherency, it makes the
> L3-cache coherent and syncs the changes with immediate effect, which
> solves the issue.

I understand that. What I object to is that always cleaning to PoC is
expensive (both in terms of latency and energy), and doing so for each
page, long after the MMU has been enabled, feels like a great loss of
performance.

My gut feeling is that a single clean operation at startup time (whether
it is from the kernel or userspace) would be a lot better.

Thanks,

M.
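A rough sketch of the "single clean operation at startup" option (kernel-side variant): clean every registered memslot to PoC once, before the first VCPU runs. This is only an illustration of the idea being discussed, not a proposed patch; it assumes the guest memory is already mapped in the host and omits locking, error handling, and the hooking point.

```c
/* Sketch only: clean every registered memslot to PoC once, before the
 * first VCPU runs. Assumes the guest memory is already mapped in the
 * host; locking, error handling and the hooking point are omitted.
 */
static void clean_guest_ram_once(struct kvm *kvm)
{
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *memslot;

	kvm_for_each_memslot(memslot, slots) {
		void *start = (void *)memslot->userspace_addr;
		unsigned long size = memslot->npages << PAGE_SHIFT;

		/* Clean+invalidate to PoC so an MMU-off guest sees it. */
		__flush_dcache_area(start, size);
	}
}
```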
On Thu, Aug 15, 2013 at 9:07 PM, Marc Zyngier <marc.zyngier@arm.com> wrote:
> On 2013-08-15 16:13, Anup Patel wrote:
[...]
>> A good example of such a device would be a SATA AHCI controller given to
>> a Guest/VM as direct access (using the System MMU), with the guest
>> bootloader/firmware accessing this SATA AHCI controller to load kernel
>> images from a SATA disk. In this situation, we would have to hack the
>> AHCI driver in the guest bootloader/firmware to do explicit flushes to
>> guest RAM (because we have HCR_EL2.DC=1).
>
> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> curiosity: is that a made-up example or something you actually have?

Actually, this would be a typical use-case in embedded virtualization.
Other devices that fit this use-case would be network controllers,
security engines, or some proprietary HW that has no driver in Linux.

> Back to square one:
> Can you please benchmark the various cache cleaning options (global at
> startup time, per-page on S2 translation fault, and user-space)?

--Anup
On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote:
> On 2013-08-15 16:13, Anup Patel wrote:
[...]
> > A good example of such a device would be a SATA AHCI controller given
> > to a Guest/VM as direct access (using the System MMU), with the guest
> > bootloader/firmware accessing this SATA AHCI controller to load kernel
> > images from a SATA disk. In this situation, we would have to hack the
> > AHCI driver in the guest bootloader/firmware to do explicit flushes to
> > guest RAM (because we have HCR_EL2.DC=1).
>
> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of
> curiosity: is that a made-up example or something you actually have?
>
> Back to square one:
> Can you please benchmark the various cache cleaning options (global at
> startup time, per-page on S2 translation fault, and user-space)?

Eh, why is this a more valid argument than the vgic? The device
passthrough Stage-2 mappings would still have the Stage-2 memory
attributes to configure that memory region as device memory. Why is it
relevant if the device is DMA-capable in this context?

-Christoffer
On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: >> On 2013-08-15 16:13, Anup Patel wrote: >> > Hi Marc, >> > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> >> > wrote: >> >> On 2013-08-15 14:31, Anup Patel wrote: >> >>> >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >> >>> <marc.zyngier@arm.com> >> >>> wrote: >> >>>> >> >>>> On 2013-08-15 07:26, Anup Patel wrote: >> >>>>> >> >>>>> >> >>>>> Hi Marc, >> >>>>> >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >> >>>>> <marc.zyngier@arm.com> >> >>>>> wrote: >> >>>>>> >> >>>>>> >> >>>>>> Hi Anup, >> >>>>>> >> >>>>>> >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >> >>>>>>> <marc.zyngier@arm.com> >> >>>>>>> wrote: >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> Hi Pranav, >> >>>>>>>> >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have >> >>>>>>>>> dirty >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle >> >>>>>>>>> this, >> >>>>>>>>> we need to flush such dirty content from d-cache so that >> >>>>>>>>> guest >> >>>>>>>>> will see correct contents of guest page when guest MMU is >> >>>>>>>>> disabled. >> >>>>>>>>> >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external >> >>>>>>>>> L3-cache. >> >>>>>>>>> >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >> >>>>>>>>> <pranavkumar@linaro.org> >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> >> >>>>>>>>> --- >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >> >>>>>>>>> 1 file changed, 2 insertions(+) >> >>>>>>>>> >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> index efe609c..5129038 100644 >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >> >>>>>>>>> + /* Flush d-cache for systems with external >> >>>>>>>>> caches. */ >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non >> >>>>>>>>> ASID-tagged >> >>>>>>>>> VIVT >> >>>>>>>>> */ >> >>>>>>>>> /* any kind of VIPT cache */ >> >>>>>>>>> __flush_icache_all(); >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> >> >>>>>>>> [adding Will to the discussion as we talked about this in the >> >>>>>>>> past] >> >>>>>>>> >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >> >>>>>>>> the >> >>>>>>>> data >> >>>>>>>> cache on each page mapping. It looks overkill. >> >>>>>>>> >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >> >>>>>>>> kvmtools >> >>>>>>>> knows which bits of the guest memory have been touched, and >> >>>>>>>> can do a >> >>>>>>>> "DC >> >>>>>>>> DVAC" on this region. >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. >> >>>>>>> I am >> >>>>>>> sure >> >>>>>>> other architectures don't have such cache cleaning in >> >>>>>>> user-space for >> >>>>>>> KVM. 
>> >>>>>>> >> >>>>>>>> >> >>>>>>>> The alternative is do it in the kernel before running any vcpu >> >>>>>>>> - but >> >>>>>>>> that's not very nice either (have to clean the whole of the >> >>>>>>>> guest >> >>>>>>>> memory, which makes a full dcache clean more appealing). >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of >> >>>>>>> VCPU was >> >>>>>>> our second option but current approach seemed very simple hence >> >>>>>>> we went for this. >> >>>>>>> >> >>>>>>> If more people vote for full d-cache clean upon first run of >> >>>>>>> VCPU then >> >>>>>>> we should revise this patch. >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> >> >>>>>> Can you please give the attached patch a spin on your HW? I've >> >>>>>> boot-tested >> >>>>>> it on a model, but of course I can't really verify that it fixes >> >>>>>> your >> >>>>>> issue. >> >>>>>> >> >>>>>> As far as I can see, it should fix it without any additional >> >>>>>> flushing. >> >>>>>> >> >>>>>> Please let me know how it goes. >> >>>>> >> >>>>> >> >>>>> >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >> >>>>> treated as "Normal Non-shareable, Inner Write-Back >> >>>>> Write-Allocate, >> >>>>> Outer Write-Back Write-Allocate memory" >> >>>>> >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >> >>>>> treated as "Strongly-ordered device memory" >> >>>>> >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be >> >>>>> accessed as normal memory when HCR_EL2.DC=1. >> >>>> >> >>>> >> >>>> >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to >> >>>> be >> >>>> accessed as device memory. >> >>>> >> >>>> >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or >> >>>>> KVM ARM64 because we use VGIC. >> >>>>> >> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >> >>>>> when Stage1 MMU is off. >> >>>> >> >>>> >> >>>> >> >>>> See above. My understanding is that HCR.DC controls the default >> >>>> output of >> >>>> Stage-1, and Stage-2 overrides still apply. >> >>> >> >>> >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant >> >>> and wanted guest to decide the memory attribute. In other words, >> >>> you >> >>> did not want to enforce any memory attribute in Stage2. >> >>> >> >>> Please refer to this patch >> >>> https://patchwork.kernel.org/patch/2543201/ >> >> >> >> >> >> This patch has never been merged. If you carry on following the >> >> discussion, >> >> you will certainly notice it was dropped for a very good reason: >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html >> >> >> >> So Stage-2 memory attributes are used, they are not going away, and >> >> they are >> >> essential to the patch I sent this morning. >> >> >> >> >> >> M. >> >> -- >> >> Fast, cheap, reliable. Pick two. >> > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM >> > is >> > provided a DMA-capable device in pass-through mode. The reason being >> > bootloader/firmware typically don't enable MMU and such >> > bootloader/firmware >> > will programme a pass-through DMA-capable device without any flushes >> > to >> > guest RAM (because it has not enabled MMU). 
>> > >> > A good example of such a device would be SATA AHCI controller given >> > to a >> > Guest/VM as direct access (using SystemMMU) and Guest >> > bootloader/firmware >> > accessing this SATA AHCI controller to load kernel images from SATA >> > disk. >> > In this situation, we will have to hack Guest bootloader/firmware >> > AHCI driver to >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). >> >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of >> curiosity: is that a made up example or something you actually have? >> >> Back to square one: >> Can you please benchmark the various cache cleaning options (global at >> startup time, per-page on S2 translation fault, and user-space)? >> > Eh, why is this a more valid argument than the vgic? The device > passthrough Stage-2 mappings would still have the Stage-2 memory > attributes to configure that memory region as device memory. Why is it > relevant if the device is DMA-capable in this context? > > -Christoffer Most ARM bootloader/firmware run with MMU off hence, they will not do explicit cache flushes when programming DMA-capable device. Now If HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware will not be visible to DMA-capable device. --Anup
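To make the argument above concrete, here is a minimal sketch of the kind of code a bare-metal, MMU-off guest bootloader runs when it hands a DMA descriptor to a pass-through device. Everything in it is hypothetical: the register offsets, descriptor layout and writel() helper are invented for illustration. The point is only that the firmware builds the descriptor in ordinary guest RAM and performs no cache maintenance, which is fine while its stores bypass the cache (HCR_EL2.DC=0) but not once they become cacheable (HCR_EL2.DC=1).

#include <stdint.h>

/* Hypothetical MMIO layout, for illustration only. */
#define DEV_BASE        0x40000000UL   /* assumed device MMIO base */
#define DEV_DESC_ADDR   0x00           /* assumed "descriptor address" register */
#define DEV_START       0x04           /* assumed "start transfer" register */

struct dma_desc {
        uint64_t buf_addr;             /* guest RAM buffer the device DMAs into */
        uint32_t len;
        uint32_t flags;
};

static struct dma_desc desc;           /* lives in ordinary guest RAM */

static inline void writel(uint32_t val, unsigned long addr)
{
        *(volatile uint32_t *)addr = val;
}

void start_transfer(uint64_t buf, uint32_t len)
{
        /* Build the descriptor in RAM (assumed to sit below 4 GiB). */
        desc.buf_addr = buf;
        desc.len = len;
        desc.flags = 1;

        /*
         * No cache clean here. With HCR_EL2.DC=0 the MMU-off stores above
         * are treated as Device memory and reach RAM directly; with
         * HCR_EL2.DC=1 they are Normal Write-Back and may still sit in the
         * CPU caches when the device fetches the descriptor from RAM.
         */
        writel((uint32_t)(uintptr_t)&desc, DEV_BASE + DEV_DESC_ADDR);
        writel(1, DEV_BASE + DEV_START);
}

An AHCI controller follows the same pattern at a larger scale: its command list and command tables live in system RAM and are fetched by the controller via DMA, which is why the SATA example above is affected.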
On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > <christoffer.dall@linaro.org> wrote: >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: >>> On 2013-08-15 16:13, Anup Patel wrote: >>> > Hi Marc, >>> > >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> >>> > wrote: >>> >> On 2013-08-15 14:31, Anup Patel wrote: >>> >>> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >>> >>> <marc.zyngier@arm.com> >>> >>> wrote: >>> >>>> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: >>> >>>>> >>> >>>>> >>> >>>>> Hi Marc, >>> >>>>> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >>> >>>>> <marc.zyngier@arm.com> >>> >>>>> wrote: >>> >>>>>> >>> >>>>>> >>> >>>>>> Hi Anup, >>> >>>>>> >>> >>>>>> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >>> >>>>>>> <marc.zyngier@arm.com> >>> >>>>>>> wrote: >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> Hi Pranav, >>> >>>>>>>> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have >>> >>>>>>>>> dirty >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle >>> >>>>>>>>> this, >>> >>>>>>>>> we need to flush such dirty content from d-cache so that >>> >>>>>>>>> guest >>> >>>>>>>>> will see correct contents of guest page when guest MMU is >>> >>>>>>>>> disabled. >>> >>>>>>>>> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external >>> >>>>>>>>> L3-cache. >>> >>>>>>>>> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >>> >>>>>>>>> <pranavkumar@linaro.org> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> >>> >>>>>>>>> --- >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >>> >>>>>>>>> 1 file changed, 2 insertions(+) >>> >>>>>>>>> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >>> >>>>>>>>> index efe609c..5129038 100644 >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >>> >>>>>>>>> + /* Flush d-cache for systems with external >>> >>>>>>>>> caches. */ >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non >>> >>>>>>>>> ASID-tagged >>> >>>>>>>>> VIVT >>> >>>>>>>>> */ >>> >>>>>>>>> /* any kind of VIPT cache */ >>> >>>>>>>>> __flush_icache_all(); >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the >>> >>>>>>>> past] >>> >>>>>>>> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >>> >>>>>>>> the >>> >>>>>>>> data >>> >>>>>>>> cache on each page mapping. It looks overkill. >>> >>>>>>>> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >>> >>>>>>>> kvmtools >>> >>>>>>>> knows which bits of the guest memory have been touched, and >>> >>>>>>>> can do a >>> >>>>>>>> "DC >>> >>>>>>>> DVAC" on this region. 
>>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. >>> >>>>>>> I am >>> >>>>>>> sure >>> >>>>>>> other architectures don't have such cache cleaning in >>> >>>>>>> user-space for >>> >>>>>>> KVM. >>> >>>>>>> >>> >>>>>>>> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu >>> >>>>>>>> - but >>> >>>>>>>> that's not very nice either (have to clean the whole of the >>> >>>>>>>> guest >>> >>>>>>>> memory, which makes a full dcache clean more appealing). >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of >>> >>>>>>> VCPU was >>> >>>>>>> our second option but current approach seemed very simple hence >>> >>>>>>> we went for this. >>> >>>>>>> >>> >>>>>>> If more people vote for full d-cache clean upon first run of >>> >>>>>>> VCPU then >>> >>>>>>> we should revise this patch. >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've >>> >>>>>> boot-tested >>> >>>>>> it on a model, but of course I can't really verify that it fixes >>> >>>>>> your >>> >>>>>> issue. >>> >>>>>> >>> >>>>>> As far as I can see, it should fix it without any additional >>> >>>>>> flushing. >>> >>>>>> >>> >>>>>> Please let me know how it goes. >>> >>>>> >>> >>>>> >>> >>>>> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back >>> >>>>> Write-Allocate, >>> >>>>> Outer Write-Back Write-Allocate memory" >>> >>>>> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >>> >>>>> treated as "Strongly-ordered device memory" >>> >>>>> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. >>> >>>> >>> >>>> >>> >>>> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to >>> >>>> be >>> >>>> accessed as device memory. >>> >>>> >>> >>>> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or >>> >>>>> KVM ARM64 because we use VGIC. >>> >>>>> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >>> >>>>> when Stage1 MMU is off. >>> >>>> >>> >>>> >>> >>>> >>> >>>> See above. My understanding is that HCR.DC controls the default >>> >>>> output of >>> >>>> Stage-1, and Stage-2 overrides still apply. >>> >>> >>> >>> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant >>> >>> and wanted guest to decide the memory attribute. In other words, >>> >>> you >>> >>> did not want to enforce any memory attribute in Stage2. >>> >>> >>> >>> Please refer to this patch >>> >>> https://patchwork.kernel.org/patch/2543201/ >>> >> >>> >> >>> >> This patch has never been merged. If you carry on following the >>> >> discussion, >>> >> you will certainly notice it was dropped for a very good reason: >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html >>> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and >>> >> they are >>> >> essential to the patch I sent this morning. >>> >> >>> >> >>> >> M. >>> >> -- >>> >> Fast, cheap, reliable. Pick two. >>> > >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM >>> > is >>> > provided a DMA-capable device in pass-through mode. 
The reason being >>> > bootloader/firmware typically don't enable MMU and such >>> > bootloader/firmware >>> > will programme a pass-through DMA-capable device without any flushes >>> > to >>> > guest RAM (because it has not enabled MMU). >>> > >>> > A good example of such a device would be SATA AHCI controller given >>> > to a >>> > Guest/VM as direct access (using SystemMMU) and Guest >>> > bootloader/firmware >>> > accessing this SATA AHCI controller to load kernel images from SATA >>> > disk. >>> > In this situation, we will have to hack Guest bootloader/firmware >>> > AHCI driver to >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). >>> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of >>> curiosity: is that a made up example or something you actually have? >>> >>> Back to square one: >>> Can you please benchmark the various cache cleaning options (global at >>> startup time, per-page on S2 translation fault, and user-space)? >>> >> Eh, why is this a more valid argument than the vgic? The device >> passthrough Stage-2 mappings would still have the Stage-2 memory >> attributes to configure that memory region as device memory. Why is it >> relevant if the device is DMA-capable in this context? >> >> -Christoffer > > Most ARM bootloader/firmware run with MMU off hence, they will not do > explicit cache flushes when programming DMA-capable device. Now If > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > will not be visible to DMA-capable device. > > --Anup The approach of flushing d-cache by set/way upon first run of VCPU will not work because for set/way operations ARM ARM says: "For set/way operations, and for All (entire cache) operations, the point is defined to be to the next level of caching". In other words, set/way operations work upto point of unification. Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() at bootup time which does not work for us when L3-cache is enabled so, there is no point is adding one more __flush_dcache_all() upon first run of VCPU in KVM ARM64. IMHO, we are left with following options: 1. Flush all RAM regions of VCPU using __flush_dcache_range() upon first run of VCPU 2. Implement outer-cache framework for ARM64 and flush all caches + outer cache (i.e. L3-cache) upon first run of VCPU 3. Use an alternate version of flush_icache_range() which will flush d-cache by PoC instead of PoU. We can also ensure that coherent_icache_guest_page() function will be called upon Stage2 prefetch aborts only. Regards, Anup
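For reference, option 3 could look roughly like the sketch below. This is only a guess at the shape of such a helper, not the patch being proposed: the function name is invented, and the fixed 64-byte cache line is a simplification (a real implementation would read the line size from CTR_EL0, as __flush_dcache_area() does).

/*
 * Sketch of option (3): clean the d-cache for a guest page by VA all the
 * way to the Point of Coherency, so that dirty lines in an external
 * L3-cache are written back before the guest accesses the page with its
 * MMU off. Assumes a 64-byte cache line.
 */
static inline void clean_dcache_guest_page_poc(void *addr, unsigned long size)
{
	unsigned long line = 64;
	unsigned long cur = (unsigned long)addr & ~(line - 1);
	unsigned long end = (unsigned long)addr + size;

	for (; cur < end; cur += line)
		asm volatile("dc cvac, %0" : : "r" (cur) : "memory");
	asm volatile("dsb sy" : : : "memory");	/* complete the cleans */
}

Unlike a clean by set/way, which as quoted above only reaches the next level of cache, DC CVAC operates by VA to the PoC, which is what matters when the dirty data may be sitting in a system-level L3-cache.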
On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote: > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > <christoffer.dall@linaro.org> wrote: > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > >> On 2013-08-15 16:13, Anup Patel wrote: > >> > Hi Marc, > >> > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> > >> > wrote: > >> >> On 2013-08-15 14:31, Anup Patel wrote: > >> >>> > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > >> >>> <marc.zyngier@arm.com> > >> >>> wrote: > >> >>>> > >> >>>> On 2013-08-15 07:26, Anup Patel wrote: > >> >>>>> > >> >>>>> > >> >>>>> Hi Marc, > >> >>>>> > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > >> >>>>> <marc.zyngier@arm.com> > >> >>>>> wrote: > >> >>>>>> > >> >>>>>> > >> >>>>>> Hi Anup, > >> >>>>>> > >> >>>>>> > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > >> >>>>>>> <marc.zyngier@arm.com> > >> >>>>>>> wrote: > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> Hi Pranav, > >> >>>>>>>> > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > >> >>>>>>>>> dirty > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > >> >>>>>>>>> this, > >> >>>>>>>>> we need to flush such dirty content from d-cache so that > >> >>>>>>>>> guest > >> >>>>>>>>> will see correct contents of guest page when guest MMU is > >> >>>>>>>>> disabled. > >> >>>>>>>>> > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > >> >>>>>>>>> L3-cache. > >> >>>>>>>>> > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > >> >>>>>>>>> <pranavkumar@linaro.org> > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > >> >>>>>>>>> --- > >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > >> >>>>>>>>> 1 file changed, 2 insertions(+) > >> >>>>>>>>> > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > >> >>>>>>>>> index efe609c..5129038 100644 > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > >> >>>>>>>>> + /* Flush d-cache for systems with external > >> >>>>>>>>> caches. */ > >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > >> >>>>>>>>> ASID-tagged > >> >>>>>>>>> VIVT > >> >>>>>>>>> */ > >> >>>>>>>>> /* any kind of VIPT cache */ > >> >>>>>>>>> __flush_icache_all(); > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> > >> >>>>>>>> [adding Will to the discussion as we talked about this in the > >> >>>>>>>> past] > >> >>>>>>>> > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit > >> >>>>>>>> the > >> >>>>>>>> data > >> >>>>>>>> cache on each page mapping. It looks overkill. > >> >>>>>>>> > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > >> >>>>>>>> kvmtools > >> >>>>>>>> knows which bits of the guest memory have been touched, and > >> >>>>>>>> can do a > >> >>>>>>>> "DC > >> >>>>>>>> DVAC" on this region. 
> >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > >> >>>>>>> I am > >> >>>>>>> sure > >> >>>>>>> other architectures don't have such cache cleaning in > >> >>>>>>> user-space for > >> >>>>>>> KVM. > >> >>>>>>> > >> >>>>>>>> > >> >>>>>>>> The alternative is do it in the kernel before running any vcpu > >> >>>>>>>> - but > >> >>>>>>>> that's not very nice either (have to clean the whole of the > >> >>>>>>>> guest > >> >>>>>>>> memory, which makes a full dcache clean more appealing). > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > >> >>>>>>> VCPU was > >> >>>>>>> our second option but current approach seemed very simple hence > >> >>>>>>> we went for this. > >> >>>>>>> > >> >>>>>>> If more people vote for full d-cache clean upon first run of > >> >>>>>>> VCPU then > >> >>>>>>> we should revise this patch. > >> >>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>>> > >> >>>>>> Can you please give the attached patch a spin on your HW? I've > >> >>>>>> boot-tested > >> >>>>>> it on a model, but of course I can't really verify that it fixes > >> >>>>>> your > >> >>>>>> issue. > >> >>>>>> > >> >>>>>> As far as I can see, it should fix it without any additional > >> >>>>>> flushing. > >> >>>>>> > >> >>>>>> Please let me know how it goes. > >> >>>>> > >> >>>>> > >> >>>>> > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back > >> >>>>> Write-Allocate, > >> >>>>> Outer Write-Back Write-Allocate memory" > >> >>>>> > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > >> >>>>> treated as "Strongly-ordered device memory" > >> >>>>> > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > >> >>>>> accessed as normal memory when HCR_EL2.DC=1. > >> >>>> > >> >>>> > >> >>>> > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > >> >>>> be > >> >>>> accessed as device memory. > >> >>>> > >> >>>> > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > >> >>>>> KVM ARM64 because we use VGIC. > >> >>>>> > >> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM > >> >>>>> when Stage1 MMU is off. > >> >>>> > >> >>>> > >> >>>> > >> >>>> See above. My understanding is that HCR.DC controls the default > >> >>>> output of > >> >>>> Stage-1, and Stage-2 overrides still apply. > >> >>> > >> >>> > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > >> >>> and wanted guest to decide the memory attribute. In other words, > >> >>> you > >> >>> did not want to enforce any memory attribute in Stage2. > >> >>> > >> >>> Please refer to this patch > >> >>> https://patchwork.kernel.org/patch/2543201/ > >> >> > >> >> > >> >> This patch has never been merged. If you carry on following the > >> >> discussion, > >> >> you will certainly notice it was dropped for a very good reason: > >> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > >> >> > >> >> So Stage-2 memory attributes are used, they are not going away, and > >> >> they are > >> >> essential to the patch I sent this morning. > >> >> > >> >> > >> >> M. > >> >> -- > >> >> Fast, cheap, reliable. Pick two. 
> >> > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > >> > is > >> > provided a DMA-capable device in pass-through mode. The reason being > >> > bootloader/firmware typically don't enable MMU and such > >> > bootloader/firmware > >> > will programme a pass-through DMA-capable device without any flushes > >> > to > >> > guest RAM (because it has not enabled MMU). > >> > > >> > A good example of such a device would be SATA AHCI controller given > >> > to a > >> > Guest/VM as direct access (using SystemMMU) and Guest > >> > bootloader/firmware > >> > accessing this SATA AHCI controller to load kernel images from SATA > >> > disk. > >> > In this situation, we will have to hack Guest bootloader/firmware > >> > AHCI driver to > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > >> > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of > >> curiosity: is that a made up example or something you actually have? > >> > >> Back to square one: > >> Can you please benchmark the various cache cleaning options (global at > >> startup time, per-page on S2 translation fault, and user-space)? > >> > > Eh, why is this a more valid argument than the vgic? The device > > passthrough Stage-2 mappings would still have the Stage-2 memory > > attributes to configure that memory region as device memory. Why is it > > relevant if the device is DMA-capable in this context? > > > > Most ARM bootloader/firmware run with MMU off hence, they will not do > explicit cache flushes when programming DMA-capable device. Now If > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > will not be visible to DMA-capable device. > Read my question again: The bootloaders running with the MMU off in a VM will only disable the MMU for Stage-1 translations. Stage-2 translations are still enforced using hte Stage-2 page tables and their attributes for all mappings to devices will still enforce strongly-ordered device type memory. Again, what does it matter if the device is DMA-capable or not? -Christoffer
On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote: > On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote: > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > > <christoffer.dall@linaro.org> wrote: > > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > > >> On 2013-08-15 16:13, Anup Patel wrote: > > >> > Hi Marc, > > >> > > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com > > > > >> > wrote: > > >> >> On 2013-08-15 14:31, Anup Patel wrote: > > >> >>> > > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > > >> >>> <marc.zyngier@arm.com> > > >> >>> wrote: > > >> >>>> > > >> >>>> On 2013-08-15 07:26, Anup Patel wrote: > > >> >>>>> > > >> >>>>> > > >> >>>>> Hi Marc, > > >> >>>>> > > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > > >> >>>>> <marc.zyngier@arm.com> > > >> >>>>> wrote: > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> Hi Anup, > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > > >> >>>>>>> <marc.zyngier@arm.com> > > >> >>>>>>> wrote: > > >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> Hi Pranav, > > >> >>>>>>>> > > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> > > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > > >> >>>>>>>>> dirty > > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > > >> >>>>>>>>> this, > > >> >>>>>>>>> we need to flush such dirty content from d-cache so that > > >> >>>>>>>>> guest > > >> >>>>>>>>> will see correct contents of guest page when guest MMU is > > >> >>>>>>>>> disabled. > > >> >>>>>>>>> > > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > > >> >>>>>>>>> L3-cache. > > >> >>>>>>>>> > > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > > >> >>>>>>>>> <pranavkumar@linaro.org> > > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > > >> >>>>>>>>> --- > > >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > > >> >>>>>>>>> 1 file changed, 2 insertions(+) > > >> >>>>>>>>> > > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > > >> >>>>>>>>> index efe609c..5129038 100644 > > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > > >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > > >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > > >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > > >> >>>>>>>>> + /* Flush d-cache for systems with external > > >> >>>>>>>>> caches. */ > > >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > > >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > > >> >>>>>>>>> ASID-tagged > > >> >>>>>>>>> VIVT > > >> >>>>>>>>> */ > > >> >>>>>>>>> /* any kind of VIPT cache */ > > >> >>>>>>>>> __flush_icache_all(); > > >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> [adding Will to the discussion as we talked about this in the > > >> >>>>>>>> past] > > >> >>>>>>>> > > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to > hit > > >> >>>>>>>> the > > >> >>>>>>>> data > > >> >>>>>>>> cache on each page mapping. It looks overkill. 
> > >> >>>>>>>> > > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > > >> >>>>>>>> kvmtools > > >> >>>>>>>> knows which bits of the guest memory have been touched, and > > >> >>>>>>>> can do a > > >> >>>>>>>> "DC > > >> >>>>>>>> DVAC" on this region. > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > > >> >>>>>>> I am > > >> >>>>>>> sure > > >> >>>>>>> other architectures don't have such cache cleaning in > > >> >>>>>>> user-space for > > >> >>>>>>> KVM. > > >> >>>>>>> > > >> >>>>>>>> > > >> >>>>>>>> The alternative is do it in the kernel before running any > vcpu > > >> >>>>>>>> - but > > >> >>>>>>>> that's not very nice either (have to clean the whole of the > > >> >>>>>>>> guest > > >> >>>>>>>> memory, which makes a full dcache clean more appealing). > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> > > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > > >> >>>>>>> VCPU was > > >> >>>>>>> our second option but current approach seemed very simple > hence > > >> >>>>>>> we went for this. > > >> >>>>>>> > > >> >>>>>>> If more people vote for full d-cache clean upon first run of > > >> >>>>>>> VCPU then > > >> >>>>>>> we should revise this patch. > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> > > >> >>>>>> Can you please give the attached patch a spin on your HW? I've > > >> >>>>>> boot-tested > > >> >>>>>> it on a model, but of course I can't really verify that it > fixes > > >> >>>>>> your > > >> >>>>>> issue. > > >> >>>>>> > > >> >>>>>> As far as I can see, it should fix it without any additional > > >> >>>>>> flushing. > > >> >>>>>> > > >> >>>>>> Please let me know how it goes. > > >> >>>>> > > >> >>>>> > > >> >>>>> > > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back > > >> >>>>> Write-Allocate, > > >> >>>>> Outer Write-Back Write-Allocate memory" > > >> >>>>> > > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > > >> >>>>> treated as "Strongly-ordered device memory" > > >> >>>>> > > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > > >> >>>>> accessed as normal memory when HCR_EL2.DC=1. > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > > >> >>>> be > > >> >>>> accessed as device memory. > > >> >>>> > > >> >>>> > > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > > >> >>>>> KVM ARM64 because we use VGIC. > > >> >>>>> > > >> >>>>> IMHO, this patch enforces incorrect memory attribute for > Guest/VM > > >> >>>>> when Stage1 MMU is off. > > >> >>>> > > >> >>>> > > >> >>>> > > >> >>>> See above. My understanding is that HCR.DC controls the default > > >> >>>> output of > > >> >>>> Stage-1, and Stage-2 overrides still apply. > > >> >>> > > >> >>> > > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > > >> >>> and wanted guest to decide the memory attribute. In other words, > > >> >>> you > > >> >>> did not want to enforce any memory attribute in Stage2. > > >> >>> > > >> >>> Please refer to this patch > > >> >>> https://patchwork.kernel.org/patch/2543201/ > > >> >> > > >> >> > > >> >> This patch has never been merged. 
If you carry on following the > > >> >> discussion, > > >> >> you will certainly notice it was dropped for a very good reason: > > >> >> > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > > >> >> > > >> >> So Stage-2 memory attributes are used, they are not going away, and > > >> >> they are > > >> >> essential to the patch I sent this morning. > > >> >> > > >> >> > > >> >> M. > > >> >> -- > > >> >> Fast, cheap, reliable. Pick two. > > >> > > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > > >> > is > > >> > provided a DMA-capable device in pass-through mode. The reason being > > >> > bootloader/firmware typically don't enable MMU and such > > >> > bootloader/firmware > > >> > will programme a pass-through DMA-capable device without any flushes > > >> > to > > >> > guest RAM (because it has not enabled MMU). > > >> > > > >> > A good example of such a device would be SATA AHCI controller given > > >> > to a > > >> > Guest/VM as direct access (using SystemMMU) and Guest > > >> > bootloader/firmware > > >> > accessing this SATA AHCI controller to load kernel images from SATA > > >> > disk. > > >> > In this situation, we will have to hack Guest bootloader/firmware > > >> > AHCI driver to > > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > > >> > > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out > of > > >> curiosity: is that a made up example or something you actually have? > > >> > > >> Back to square one: > > >> Can you please benchmark the various cache cleaning options (global at > > >> startup time, per-page on S2 translation fault, and user-space)? > > >> > > > Eh, why is this a more valid argument than the vgic? The device > > > passthrough Stage-2 mappings would still have the Stage-2 memory > > > attributes to configure that memory region as device memory. Why is it > > > relevant if the device is DMA-capable in this context? > > > > > > > Most ARM bootloader/firmware run with MMU off hence, they will not do > > explicit cache flushes when programming DMA-capable device. Now If > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > > will not be visible to DMA-capable device. > > > Read my question again: The bootloaders running with the MMU off in a VM > will only disable the MMU for Stage-1 translations. Stage-2 > translations are still enforced using hte Stage-2 page tables and their > attributes for all mappings to devices will still enforce > strongly-ordered device type memory. > Please read my reply again. Also try to read-up SATA AHCI spec. > > Again, what does it matter if the device is DMA-capable or not? > > -Christoffer > _______________________________________________ > kvmarm mailing list > kvmarm@lists.cs.columbia.edu > https://lists.cs.columbia.edu/cucslists/listinfo/kvmarm >
On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote: > On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > > <christoffer.dall@linaro.org> wrote: > >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > >>> On 2013-08-15 16:13, Anup Patel wrote: > >>> > Hi Marc, > >>> > > >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> > >>> > wrote: > >>> >> On 2013-08-15 14:31, Anup Patel wrote: > >>> >>> > >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > >>> >>> <marc.zyngier@arm.com> > >>> >>> wrote: > >>> >>>> > >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: > >>> >>>>> > >>> >>>>> > >>> >>>>> Hi Marc, > >>> >>>>> > >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > >>> >>>>> <marc.zyngier@arm.com> > >>> >>>>> wrote: > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> Hi Anup, > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > >>> >>>>>>> <marc.zyngier@arm.com> > >>> >>>>>>> wrote: > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> Hi Pranav, > >>> >>>>>>>> > >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > >>> >>>>>>>>> > >>> >>>>>>>>> > >>> >>>>>>>>> > >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > >>> >>>>>>>>> dirty > >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > >>> >>>>>>>>> this, > >>> >>>>>>>>> we need to flush such dirty content from d-cache so that > >>> >>>>>>>>> guest > >>> >>>>>>>>> will see correct contents of guest page when guest MMU is > >>> >>>>>>>>> disabled. > >>> >>>>>>>>> > >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > >>> >>>>>>>>> L3-cache. > >>> >>>>>>>>> > >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > >>> >>>>>>>>> <pranavkumar@linaro.org> > >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > >>> >>>>>>>>> --- > >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > >>> >>>>>>>>> 1 file changed, 2 insertions(+) > >>> >>>>>>>>> > >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > >>> >>>>>>>>> index efe609c..5129038 100644 > >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > >>> >>>>>>>>> + /* Flush d-cache for systems with external > >>> >>>>>>>>> caches. */ > >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > >>> >>>>>>>>> ASID-tagged > >>> >>>>>>>>> VIVT > >>> >>>>>>>>> */ > >>> >>>>>>>>> /* any kind of VIPT cache */ > >>> >>>>>>>>> __flush_icache_all(); > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> [adding Will to the discussion as we talked about this in the > >>> >>>>>>>> past] > >>> >>>>>>>> > >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit > >>> >>>>>>>> the > >>> >>>>>>>> data > >>> >>>>>>>> cache on each page mapping. It looks overkill. > >>> >>>>>>>> > >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? 
> >>> >>>>>>>> kvmtools > >>> >>>>>>>> knows which bits of the guest memory have been touched, and > >>> >>>>>>>> can do a > >>> >>>>>>>> "DC > >>> >>>>>>>> DVAC" on this region. > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > >>> >>>>>>> I am > >>> >>>>>>> sure > >>> >>>>>>> other architectures don't have such cache cleaning in > >>> >>>>>>> user-space for > >>> >>>>>>> KVM. > >>> >>>>>>> > >>> >>>>>>>> > >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu > >>> >>>>>>>> - but > >>> >>>>>>>> that's not very nice either (have to clean the whole of the > >>> >>>>>>>> guest > >>> >>>>>>>> memory, which makes a full dcache clean more appealing). > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> > >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > >>> >>>>>>> VCPU was > >>> >>>>>>> our second option but current approach seemed very simple hence > >>> >>>>>>> we went for this. > >>> >>>>>>> > >>> >>>>>>> If more people vote for full d-cache clean upon first run of > >>> >>>>>>> VCPU then > >>> >>>>>>> we should revise this patch. > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> Can you please give the attached patch a spin on your HW? I've > >>> >>>>>> boot-tested > >>> >>>>>> it on a model, but of course I can't really verify that it fixes > >>> >>>>>> your > >>> >>>>>> issue. > >>> >>>>>> > >>> >>>>>> As far as I can see, it should fix it without any additional > >>> >>>>>> flushing. > >>> >>>>>> > >>> >>>>>> Please let me know how it goes. > >>> >>>>> > >>> >>>>> > >>> >>>>> > >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back > >>> >>>>> Write-Allocate, > >>> >>>>> Outer Write-Back Write-Allocate memory" > >>> >>>>> > >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > >>> >>>>> treated as "Strongly-ordered device memory" > >>> >>>>> > >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > >>> >>>> be > >>> >>>> accessed as device memory. > >>> >>>> > >>> >>>> > >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > >>> >>>>> KVM ARM64 because we use VGIC. > >>> >>>>> > >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM > >>> >>>>> when Stage1 MMU is off. > >>> >>>> > >>> >>>> > >>> >>>> > >>> >>>> See above. My understanding is that HCR.DC controls the default > >>> >>>> output of > >>> >>>> Stage-1, and Stage-2 overrides still apply. > >>> >>> > >>> >>> > >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > >>> >>> and wanted guest to decide the memory attribute. In other words, > >>> >>> you > >>> >>> did not want to enforce any memory attribute in Stage2. > >>> >>> > >>> >>> Please refer to this patch > >>> >>> https://patchwork.kernel.org/patch/2543201/ > >>> >> > >>> >> > >>> >> This patch has never been merged. 
If you carry on following the > >>> >> discussion, > >>> >> you will certainly notice it was dropped for a very good reason: > >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > >>> >> > >>> >> So Stage-2 memory attributes are used, they are not going away, and > >>> >> they are > >>> >> essential to the patch I sent this morning. > >>> >> > >>> >> > >>> >> M. > >>> >> -- > >>> >> Fast, cheap, reliable. Pick two. > >>> > > >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > >>> > is > >>> > provided a DMA-capable device in pass-through mode. The reason being > >>> > bootloader/firmware typically don't enable MMU and such > >>> > bootloader/firmware > >>> > will programme a pass-through DMA-capable device without any flushes > >>> > to > >>> > guest RAM (because it has not enabled MMU). > >>> > > >>> > A good example of such a device would be SATA AHCI controller given > >>> > to a > >>> > Guest/VM as direct access (using SystemMMU) and Guest > >>> > bootloader/firmware > >>> > accessing this SATA AHCI controller to load kernel images from SATA > >>> > disk. > >>> > In this situation, we will have to hack Guest bootloader/firmware > >>> > AHCI driver to > >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > >>> > >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of > >>> curiosity: is that a made up example or something you actually have? > >>> > >>> Back to square one: > >>> Can you please benchmark the various cache cleaning options (global at > >>> startup time, per-page on S2 translation fault, and user-space)? > >>> > >> Eh, why is this a more valid argument than the vgic? The device > >> passthrough Stage-2 mappings would still have the Stage-2 memory > >> attributes to configure that memory region as device memory. Why is it > >> relevant if the device is DMA-capable in this context? > >> > >> -Christoffer > > > > Most ARM bootloader/firmware run with MMU off hence, they will not do > > explicit cache flushes when programming DMA-capable device. Now If > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > > will not be visible to DMA-capable device. > > > > --Anup > > The approach of flushing d-cache by set/way upon first run of VCPU will > not work because for set/way operations ARM ARM says: "For set/way > operations, and for All (entire cache) operations, the point is defined to be > to the next level of caching". In other words, set/way operations work upto > point of unification. > > Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() > at bootup time which does not work for us when L3-cache is enabled so, > there is no point is adding one more __flush_dcache_all() upon first run of > VCPU in KVM ARM64. When did anyone suggest using a cache cleaning method that doesn't apply to this case. I'm a little confused about your comment here. > > IMHO, we are left with following options: > 1. Flush all RAM regions of VCPU using __flush_dcache_range() > upon first run of VCPU We could do that, but we have to ensure that no other memory regions can be added later. Is this the case? I don't remember. > 2. Implement outer-cache framework for ARM64 and flush all > caches + outer cache (i.e. L3-cache) upon first run of VCPU What's the difference between (2) and (1)? > 3. Use an alternate version of flush_icache_range() which will > flush d-cache by PoC instead of PoU. 
We can also ensure > that coherent_icache_guest_page() function will be called > upon Stage2 prefetch aborts only. > I'm sorry, but I'm not following this discussion. Are we not mixing a discussion along two different axes here? As I see it there are two separate issues: (A) When/Where to flush the memory regions (B) How to flush the memory regions, including the hardware method and the kernel software engineering aspect. -Christoffer
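On axis (B), option (1) from the previous mail would presumably amount to something like the sketch below: walk the guest's memslots once before the first KVM_RUN and clean the corresponding userspace mappings with __flush_dcache_area(), as the original patch does per page. This is an assumption about how it might be written, not an actual patch, and it also illustrates the concern above, since memslots registered after this point would not be covered.

/*
 * Sketch of option (1): on the first VCPU run, clean the d-cache for all
 * guest RAM currently described by memslots. __flush_dcache_area() cleans
 * and invalidates to the PoC on arm64. This assumes the userspace pages
 * are already mapped, and it misses memory regions added later.
 */
static void flush_all_guest_ram(struct kvm *kvm)
{
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *memslot;

	kvm_for_each_memslot(memslot, slots) {
		void *start = (void *)memslot->userspace_addr;
		unsigned long size = memslot->npages << PAGE_SHIFT;

		__flush_dcache_area(start, size);
	}
}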
On Fri, Aug 16, 2013 at 10:48:39PM +0530, Anup Patel wrote: > On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote: > > > On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote: > > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > > > <christoffer.dall@linaro.org> wrote: > > > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > > > >> On 2013-08-15 16:13, Anup Patel wrote: > > > >> > Hi Marc, > > > >> > > > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com > > > > > > >> > wrote: > > > >> >> On 2013-08-15 14:31, Anup Patel wrote: > > > >> >>> > > > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > > > >> >>> <marc.zyngier@arm.com> > > > >> >>> wrote: > > > >> >>>> > > > >> >>>> On 2013-08-15 07:26, Anup Patel wrote: > > > >> >>>>> > > > >> >>>>> > > > >> >>>>> Hi Marc, > > > >> >>>>> > > > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > > > >> >>>>> <marc.zyngier@arm.com> > > > >> >>>>> wrote: > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>>> Hi Anup, > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > > > >> >>>>>>> <marc.zyngier@arm.com> > > > >> >>>>>>> wrote: > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> Hi Pranav, > > > >> >>>>>>>> > > > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > > > >> >>>>>>>>> > > > >> >>>>>>>>> > > > >> >>>>>>>>> > > > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > > > >> >>>>>>>>> dirty > > > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > > > >> >>>>>>>>> this, > > > >> >>>>>>>>> we need to flush such dirty content from d-cache so that > > > >> >>>>>>>>> guest > > > >> >>>>>>>>> will see correct contents of guest page when guest MMU is > > > >> >>>>>>>>> disabled. > > > >> >>>>>>>>> > > > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > > > >> >>>>>>>>> L3-cache. > > > >> >>>>>>>>> > > > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > > > >> >>>>>>>>> <pranavkumar@linaro.org> > > > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > > > >> >>>>>>>>> --- > > > >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > > > >> >>>>>>>>> 1 file changed, 2 insertions(+) > > > >> >>>>>>>>> > > > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > > > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > > > >> >>>>>>>>> index efe609c..5129038 100644 > > > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > > > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > > > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > > > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > > > >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > > > >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > > > >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > > > >> >>>>>>>>> + /* Flush d-cache for systems with external > > > >> >>>>>>>>> caches. 
*/ > > > >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > > > >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > > > >> >>>>>>>>> ASID-tagged > > > >> >>>>>>>>> VIVT > > > >> >>>>>>>>> */ > > > >> >>>>>>>>> /* any kind of VIPT cache */ > > > >> >>>>>>>>> __flush_icache_all(); > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> [adding Will to the discussion as we talked about this in the > > > >> >>>>>>>> past] > > > >> >>>>>>>> > > > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to > > hit > > > >> >>>>>>>> the > > > >> >>>>>>>> data > > > >> >>>>>>>> cache on each page mapping. It looks overkill. > > > >> >>>>>>>> > > > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > > > >> >>>>>>>> kvmtools > > > >> >>>>>>>> knows which bits of the guest memory have been touched, and > > > >> >>>>>>>> can do a > > > >> >>>>>>>> "DC > > > >> >>>>>>>> DVAC" on this region. > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > > > >> >>>>>>> I am > > > >> >>>>>>> sure > > > >> >>>>>>> other architectures don't have such cache cleaning in > > > >> >>>>>>> user-space for > > > >> >>>>>>> KVM. > > > >> >>>>>>> > > > >> >>>>>>>> > > > >> >>>>>>>> The alternative is do it in the kernel before running any > > vcpu > > > >> >>>>>>>> - but > > > >> >>>>>>>> that's not very nice either (have to clean the whole of the > > > >> >>>>>>>> guest > > > >> >>>>>>>> memory, which makes a full dcache clean more appealing). > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> > > > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > > > >> >>>>>>> VCPU was > > > >> >>>>>>> our second option but current approach seemed very simple > > hence > > > >> >>>>>>> we went for this. > > > >> >>>>>>> > > > >> >>>>>>> If more people vote for full d-cache clean upon first run of > > > >> >>>>>>> VCPU then > > > >> >>>>>>> we should revise this patch. > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>>> > > > >> >>>>>> Can you please give the attached patch a spin on your HW? I've > > > >> >>>>>> boot-tested > > > >> >>>>>> it on a model, but of course I can't really verify that it > > fixes > > > >> >>>>>> your > > > >> >>>>>> issue. > > > >> >>>>>> > > > >> >>>>>> As far as I can see, it should fix it without any additional > > > >> >>>>>> flushing. > > > >> >>>>>> > > > >> >>>>>> Please let me know how it goes. > > > >> >>>>> > > > >> >>>>> > > > >> >>>>> > > > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > > > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back > > > >> >>>>> Write-Allocate, > > > >> >>>>> Outer Write-Back Write-Allocate memory" > > > >> >>>>> > > > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > > > >> >>>>> treated as "Strongly-ordered device memory" > > > >> >>>>> > > > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > > > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > > > >> >>>>> accessed as normal memory when HCR_EL2.DC=1. > > > >> >>>> > > > >> >>>> > > > >> >>>> > > > >> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > > > >> >>>> be > > > >> >>>> accessed as device memory. 
> > > >> >>>> > > > >> >>>> > > > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > > > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > > > >> >>>>> KVM ARM64 because we use VGIC. > > > >> >>>>> > > > >> >>>>> IMHO, this patch enforces incorrect memory attribute for > > Guest/VM > > > >> >>>>> when Stage1 MMU is off. > > > >> >>>> > > > >> >>>> > > > >> >>>> > > > >> >>>> See above. My understanding is that HCR.DC controls the default > > > >> >>>> output of > > > >> >>>> Stage-1, and Stage-2 overrides still apply. > > > >> >>> > > > >> >>> > > > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > > > >> >>> and wanted guest to decide the memory attribute. In other words, > > > >> >>> you > > > >> >>> did not want to enforce any memory attribute in Stage2. > > > >> >>> > > > >> >>> Please refer to this patch > > > >> >>> https://patchwork.kernel.org/patch/2543201/ > > > >> >> > > > >> >> > > > >> >> This patch has never been merged. If you carry on following the > > > >> >> discussion, > > > >> >> you will certainly notice it was dropped for a very good reason: > > > >> >> > > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > > > >> >> > > > >> >> So Stage-2 memory attributes are used, they are not going away, and > > > >> >> they are > > > >> >> essential to the patch I sent this morning. > > > >> >> > > > >> >> > > > >> >> M. > > > >> >> -- > > > >> >> Fast, cheap, reliable. Pick two. > > > >> > > > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > > > >> > is > > > >> > provided a DMA-capable device in pass-through mode. The reason being > > > >> > bootloader/firmware typically don't enable MMU and such > > > >> > bootloader/firmware > > > >> > will programme a pass-through DMA-capable device without any flushes > > > >> > to > > > >> > guest RAM (because it has not enabled MMU). > > > >> > > > > >> > A good example of such a device would be SATA AHCI controller given > > > >> > to a > > > >> > Guest/VM as direct access (using SystemMMU) and Guest > > > >> > bootloader/firmware > > > >> > accessing this SATA AHCI controller to load kernel images from SATA > > > >> > disk. > > > >> > In this situation, we will have to hack Guest bootloader/firmware > > > >> > AHCI driver to > > > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > > > >> > > > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out > > of > > > >> curiosity: is that a made up example or something you actually have? > > > >> > > > >> Back to square one: > > > >> Can you please benchmark the various cache cleaning options (global at > > > >> startup time, per-page on S2 translation fault, and user-space)? > > > >> > > > > Eh, why is this a more valid argument than the vgic? The device > > > > passthrough Stage-2 mappings would still have the Stage-2 memory > > > > attributes to configure that memory region as device memory. Why is it > > > > relevant if the device is DMA-capable in this context? > > > > > > > > > > Most ARM bootloader/firmware run with MMU off hence, they will not do > > > explicit cache flushes when programming DMA-capable device. Now If > > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > > > will not be visible to DMA-capable device. > > > > > Read my question again: The bootloaders running with the MMU off in a VM > > will only disable the MMU for Stage-1 translations. 
Stage-2 > > translations are still enforced using the Stage-2 page tables and their > > attributes for all mappings to devices will still enforce > > strongly-ordered device type memory. > > > > Please read my reply again. Also try to read-up SATA AHCI spec. > > Can you please explain how the specs relate to my question? > The only case I can see is if the guest puts data in a DMA region that > it expects the device to read. Is this the case? > > Just so we're clear, this is quite different from programming the device > through the device memory. > -Christoffer
On Fri, Aug 16, 2013 at 10:28:06AM -0700, Christoffer Dall wrote: > On Fri, Aug 16, 2013 at 10:48:39PM +0530, Anup Patel wrote: > > On 16 August 2013 22:44, Christoffer Dall <christoffer.dall@linaro.org>wrote: > > > > > On Fri, Aug 16, 2013 at 10:32:30AM +0530, Anup Patel wrote: > > > > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > > > > <christoffer.dall@linaro.org> wrote: > > > > > On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > > > > >> On 2013-08-15 16:13, Anup Patel wrote: > > > > >> > Hi Marc, > > > > >> > > > > > >> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com > > > > > > > > >> > wrote: > > > > >> >> On 2013-08-15 14:31, Anup Patel wrote: > > > > >> >>> > > > > >> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > > > > >> >>> <marc.zyngier@arm.com> > > > > >> >>> wrote: > > > > >> >>>> > > > > >> >>>> On 2013-08-15 07:26, Anup Patel wrote: > > > > >> >>>>> > > > > >> >>>>> > > > > >> >>>>> Hi Marc, > > > > >> >>>>> > > > > >> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > > > > >> >>>>> <marc.zyngier@arm.com> > > > > >> >>>>> wrote: > > > > >> >>>>>> > > > > >> >>>>>> > > > > >> >>>>>> Hi Anup, > > > > >> >>>>>> > > > > >> >>>>>> > > > > >> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > > > > >> >>>>>>> <marc.zyngier@arm.com> > > > > >> >>>>>>> wrote: > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> Hi Pranav, > > > > >> >>>>>>>> > > > > >> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > > > > >> >>>>>>>>> dirty > > > > >> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > > > > >> >>>>>>>>> this, > > > > >> >>>>>>>>> we need to flush such dirty content from d-cache so that > > > > >> >>>>>>>>> guest > > > > >> >>>>>>>>> will see correct contents of guest page when guest MMU is > > > > >> >>>>>>>>> disabled. > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > > > > >> >>>>>>>>> L3-cache. > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > > > > >> >>>>>>>>> <pranavkumar@linaro.org> > > > > >> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > > > > >> >>>>>>>>> --- > > > > >> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > > > > >> >>>>>>>>> 1 file changed, 2 insertions(+) > > > > >> >>>>>>>>> > > > > >> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > > > > >> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > > > > >> >>>>>>>>> index efe609c..5129038 100644 > > > > >> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > > > > >> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > > > > >> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > > > > >> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > > > > >> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > > > > >> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > > > > >> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > > > > >> >>>>>>>>> + /* Flush d-cache for systems with external > > > > >> >>>>>>>>> caches. 
*/ > > > > >> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > > > > >> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > > > > >> >>>>>>>>> ASID-tagged > > > > >> >>>>>>>>> VIVT > > > > >> >>>>>>>>> */ > > > > >> >>>>>>>>> /* any kind of VIPT cache */ > > > > >> >>>>>>>>> __flush_icache_all(); > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> [adding Will to the discussion as we talked about this in the > > > > >> >>>>>>>> past] > > > > >> >>>>>>>> > > > > >> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to > > > hit > > > > >> >>>>>>>> the > > > > >> >>>>>>>> data > > > > >> >>>>>>>> cache on each page mapping. It looks overkill. > > > > >> >>>>>>>> > > > > >> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > > > > >> >>>>>>>> kvmtools > > > > >> >>>>>>>> knows which bits of the guest memory have been touched, and > > > > >> >>>>>>>> can do a > > > > >> >>>>>>>> "DC > > > > >> >>>>>>>> DVAC" on this region. > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > > > > >> >>>>>>> I am > > > > >> >>>>>>> sure > > > > >> >>>>>>> other architectures don't have such cache cleaning in > > > > >> >>>>>>> user-space for > > > > >> >>>>>>> KVM. > > > > >> >>>>>>> > > > > >> >>>>>>>> > > > > >> >>>>>>>> The alternative is do it in the kernel before running any > > > vcpu > > > > >> >>>>>>>> - but > > > > >> >>>>>>>> that's not very nice either (have to clean the whole of the > > > > >> >>>>>>>> guest > > > > >> >>>>>>>> memory, which makes a full dcache clean more appealing). > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> > > > > >> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > > > > >> >>>>>>> VCPU was > > > > >> >>>>>>> our second option but current approach seemed very simple > > > hence > > > > >> >>>>>>> we went for this. > > > > >> >>>>>>> > > > > >> >>>>>>> If more people vote for full d-cache clean upon first run of > > > > >> >>>>>>> VCPU then > > > > >> >>>>>>> we should revise this patch. > > > > >> >>>>>> > > > > >> >>>>>> > > > > >> >>>>>> > > > > >> >>>>>> > > > > >> >>>>>> Can you please give the attached patch a spin on your HW? I've > > > > >> >>>>>> boot-tested > > > > >> >>>>>> it on a model, but of course I can't really verify that it > > > fixes > > > > >> >>>>>> your > > > > >> >>>>>> issue. > > > > >> >>>>>> > > > > >> >>>>>> As far as I can see, it should fix it without any additional > > > > >> >>>>>> flushing. > > > > >> >>>>>> > > > > >> >>>>>> Please let me know how it goes. > > > > >> >>>>> > > > > >> >>>>> > > > > >> >>>>> > > > > >> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > > > > >> >>>>> treated as "Normal Non-shareable, Inner Write-Back > > > > >> >>>>> Write-Allocate, > > > > >> >>>>> Outer Write-Back Write-Allocate memory" > > > > >> >>>>> > > > > >> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > > > > >> >>>>> treated as "Strongly-ordered device memory" > > > > >> >>>>> > > > > >> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > > > > >> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > > > > >> >>>>> accessed as normal memory when HCR_EL2.DC=1. > > > > >> >>>> > > > > >> >>>> > > > > >> >>>> > > > > >> >>>> I don't think so. 
Stage-2 still applies, and should force MMIO to > > > > >> >>>> be > > > > >> >>>> accessed as device memory. > > > > >> >>>> > > > > >> >>>> > > > > >> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > > > > >> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > > > > >> >>>>> KVM ARM64 because we use VGIC. > > > > >> >>>>> > > > > >> >>>>> IMHO, this patch enforces incorrect memory attribute for > > > Guest/VM > > > > >> >>>>> when Stage1 MMU is off. > > > > >> >>>> > > > > >> >>>> > > > > >> >>>> > > > > >> >>>> See above. My understanding is that HCR.DC controls the default > > > > >> >>>> output of > > > > >> >>>> Stage-1, and Stage-2 overrides still apply. > > > > >> >>> > > > > >> >>> > > > > >> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > > > > >> >>> and wanted guest to decide the memory attribute. In other words, > > > > >> >>> you > > > > >> >>> did not want to enforce any memory attribute in Stage2. > > > > >> >>> > > > > >> >>> Please refer to this patch > > > > >> >>> https://patchwork.kernel.org/patch/2543201/ > > > > >> >> > > > > >> >> > > > > >> >> This patch has never been merged. If you carry on following the > > > > >> >> discussion, > > > > >> >> you will certainly notice it was dropped for a very good reason: > > > > >> >> > > > https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > > > > >> >> > > > > >> >> So Stage-2 memory attributes are used, they are not going away, and > > > > >> >> they are > > > > >> >> essential to the patch I sent this morning. > > > > >> >> > > > > >> >> > > > > >> >> M. > > > > >> >> -- > > > > >> >> Fast, cheap, reliable. Pick two. > > > > >> > > > > > >> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > > > > >> > is > > > > >> > provided a DMA-capable device in pass-through mode. The reason being > > > > >> > bootloader/firmware typically don't enable MMU and such > > > > >> > bootloader/firmware > > > > >> > will programme a pass-through DMA-capable device without any flushes > > > > >> > to > > > > >> > guest RAM (because it has not enabled MMU). > > > > >> > > > > > >> > A good example of such a device would be SATA AHCI controller given > > > > >> > to a > > > > >> > Guest/VM as direct access (using SystemMMU) and Guest > > > > >> > bootloader/firmware > > > > >> > accessing this SATA AHCI controller to load kernel images from SATA > > > > >> > disk. > > > > >> > In this situation, we will have to hack Guest bootloader/firmware > > > > >> > AHCI driver to > > > > >> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > > > > >> > > > > >> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out > > > of > > > > >> curiosity: is that a made up example or something you actually have? > > > > >> > > > > >> Back to square one: > > > > >> Can you please benchmark the various cache cleaning options (global at > > > > >> startup time, per-page on S2 translation fault, and user-space)? > > > > >> > > > > > Eh, why is this a more valid argument than the vgic? The device > > > > > passthrough Stage-2 mappings would still have the Stage-2 memory > > > > > attributes to configure that memory region as device memory. Why is it > > > > > relevant if the device is DMA-capable in this context? > > > > > > > > > > > > > Most ARM bootloader/firmware run with MMU off hence, they will not do > > > > explicit cache flushes when programming DMA-capable device. 
Now If > > > > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > > > > will not be visible to DMA-capable device. > > > > > > > Read my question again: The bootloaders running with the MMU off in a VM > > > will only disable the MMU for Stage-1 translations. Stage-2 > > > translations are still enforced using hte Stage-2 page tables and their > > > attributes for all mappings to devices will still enforce > > > strongly-ordered device type memory. > > > > > > > Please read my reply again. Also try to read-up SATA AHCI spec. > > > > > > Can you please explain how the specs realate to my question? > > The only case I can see is if the guest puts data in a DMA region that > it expects the device to read. Is this the case? > > Just so we're clear, this is quite different from programming the device > through the device memory. > ok, I see from the specs. It's not just DMA-capable, but actually DMA-programmable. In that case setting DC=1 would be problematic. Thanks for being so helpful in explaining this ;) -Christoffer
On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote: >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall >> > <christoffer.dall@linaro.org> wrote: >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: >> >>> On 2013-08-15 16:13, Anup Patel wrote: >> >>> > Hi Marc, >> >>> > >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> >> >>> > wrote: >> >>> >> On 2013-08-15 14:31, Anup Patel wrote: >> >>> >>> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >> >>> >>> <marc.zyngier@arm.com> >> >>> >>> wrote: >> >>> >>>> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> Hi Marc, >> >>> >>>>> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >> >>> >>>>> <marc.zyngier@arm.com> >> >>> >>>>> wrote: >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>> Hi Anup, >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >> >>> >>>>>>> <marc.zyngier@arm.com> >> >>> >>>>>>> wrote: >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> Hi Pranav, >> >>> >>>>>>>> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >> >>> >>>>>>>>> >> >>> >>>>>>>>> >> >>> >>>>>>>>> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have >> >>> >>>>>>>>> dirty >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle >> >>> >>>>>>>>> this, >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that >> >>> >>>>>>>>> guest >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is >> >>> >>>>>>>>> disabled. >> >>> >>>>>>>>> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external >> >>> >>>>>>>>> L3-cache. >> >>> >>>>>>>>> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >> >>> >>>>>>>>> <pranavkumar@linaro.org> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> >> >>> >>>>>>>>> --- >> >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >> >>> >>>>>>>>> 1 file changed, 2 insertions(+) >> >>> >>>>>>>>> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >> >>> >>>>>>>>> index efe609c..5129038 100644 >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >> >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >> >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >> >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >> >>> >>>>>>>>> + /* Flush d-cache for systems with external >> >>> >>>>>>>>> caches. 
*/ >> >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >> >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non >> >>> >>>>>>>>> ASID-tagged >> >>> >>>>>>>>> VIVT >> >>> >>>>>>>>> */ >> >>> >>>>>>>>> /* any kind of VIPT cache */ >> >>> >>>>>>>>> __flush_icache_all(); >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the >> >>> >>>>>>>> past] >> >>> >>>>>>>> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >> >>> >>>>>>>> the >> >>> >>>>>>>> data >> >>> >>>>>>>> cache on each page mapping. It looks overkill. >> >>> >>>>>>>> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >> >>> >>>>>>>> kvmtools >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and >> >>> >>>>>>>> can do a >> >>> >>>>>>>> "DC >> >>> >>>>>>>> DVAC" on this region. >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. >> >>> >>>>>>> I am >> >>> >>>>>>> sure >> >>> >>>>>>> other architectures don't have such cache cleaning in >> >>> >>>>>>> user-space for >> >>> >>>>>>> KVM. >> >>> >>>>>>> >> >>> >>>>>>>> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu >> >>> >>>>>>>> - but >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the >> >>> >>>>>>>> guest >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing). >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of >> >>> >>>>>>> VCPU was >> >>> >>>>>>> our second option but current approach seemed very simple hence >> >>> >>>>>>> we went for this. >> >>> >>>>>>> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of >> >>> >>>>>>> VCPU then >> >>> >>>>>>> we should revise this patch. >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've >> >>> >>>>>> boot-tested >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes >> >>> >>>>>> your >> >>> >>>>>> issue. >> >>> >>>>>> >> >>> >>>>>> As far as I can see, it should fix it without any additional >> >>> >>>>>> flushing. >> >>> >>>>>> >> >>> >>>>>> Please let me know how it goes. >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back >> >>> >>>>> Write-Allocate, >> >>> >>>>> Outer Write-Back Write-Allocate memory" >> >>> >>>>> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >> >>> >>>>> treated as "Strongly-ordered device memory" >> >>> >>>>> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. >> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to >> >>> >>>> be >> >>> >>>> accessed as device memory. >> >>> >>>> >> >>> >>>> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or >> >>> >>>>> KVM ARM64 because we use VGIC. >> >>> >>>>> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >> >>> >>>>> when Stage1 MMU is off. 
>> >>> >>>> >> >>> >>>> >> >>> >>>> >> >>> >>>> See above. My understanding is that HCR.DC controls the default >> >>> >>>> output of >> >>> >>>> Stage-1, and Stage-2 overrides still apply. >> >>> >>> >> >>> >>> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant >> >>> >>> and wanted guest to decide the memory attribute. In other words, >> >>> >>> you >> >>> >>> did not want to enforce any memory attribute in Stage2. >> >>> >>> >> >>> >>> Please refer to this patch >> >>> >>> https://patchwork.kernel.org/patch/2543201/ >> >>> >> >> >>> >> >> >>> >> This patch has never been merged. If you carry on following the >> >>> >> discussion, >> >>> >> you will certainly notice it was dropped for a very good reason: >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html >> >>> >> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and >> >>> >> they are >> >>> >> essential to the patch I sent this morning. >> >>> >> >> >>> >> >> >>> >> M. >> >>> >> -- >> >>> >> Fast, cheap, reliable. Pick two. >> >>> > >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM >> >>> > is >> >>> > provided a DMA-capable device in pass-through mode. The reason being >> >>> > bootloader/firmware typically don't enable MMU and such >> >>> > bootloader/firmware >> >>> > will programme a pass-through DMA-capable device without any flushes >> >>> > to >> >>> > guest RAM (because it has not enabled MMU). >> >>> > >> >>> > A good example of such a device would be SATA AHCI controller given >> >>> > to a >> >>> > Guest/VM as direct access (using SystemMMU) and Guest >> >>> > bootloader/firmware >> >>> > accessing this SATA AHCI controller to load kernel images from SATA >> >>> > disk. >> >>> > In this situation, we will have to hack Guest bootloader/firmware >> >>> > AHCI driver to >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). >> >>> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of >> >>> curiosity: is that a made up example or something you actually have? >> >>> >> >>> Back to square one: >> >>> Can you please benchmark the various cache cleaning options (global at >> >>> startup time, per-page on S2 translation fault, and user-space)? >> >>> >> >> Eh, why is this a more valid argument than the vgic? The device >> >> passthrough Stage-2 mappings would still have the Stage-2 memory >> >> attributes to configure that memory region as device memory. Why is it >> >> relevant if the device is DMA-capable in this context? >> >> >> >> -Christoffer >> > >> > Most ARM bootloader/firmware run with MMU off hence, they will not do >> > explicit cache flushes when programming DMA-capable device. Now If >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware >> > will not be visible to DMA-capable device. >> > >> > --Anup >> >> The approach of flushing d-cache by set/way upon first run of VCPU will >> not work because for set/way operations ARM ARM says: "For set/way >> operations, and for All (entire cache) operations, the point is defined to be >> to the next level of caching". In other words, set/way operations work upto >> point of unification. >> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() >> at bootup time which does not work for us when L3-cache is enabled so, >> there is no point is adding one more __flush_dcache_all() upon first run of >> VCPU in KVM ARM64. > > When did anyone suggest using a cache cleaning method that doesn't apply > to this case. 
I'm a little confused about your comment here. Please refer to the last reply from Marc Z where he says: "Can you please benchmark the various cache cleaning options (global at startup time, per-page on S2 translation fault, and user-space)?". Rather than doing __flush_dcache_range() on each page in coherent_icache_guest_page(), we could also flush the entire d-cache upon the first VCPU run. The only issue with flushing the d-cache upon the first VCPU run is that we cannot flush the d-cache by set/way because, as per the ARM ARM, all operations by set/way only work up to the PoU and not the PoC. In the presence of an L3-cache we need to flush the d-cache up to the PoC, so we have to use __flush_dcache_range(). > >> >> IMHO, we are left with following options: >> 1. Flush all RAM regions of VCPU using __flush_dcache_range() >> upon first run of VCPU > > We could do that, but we have to ensure that no other memory regions can > be added later. Is this the case? I don't remember. Yes, memory regions being added later would be a problem for this option. > >> 2. Implement outer-cache framework for ARM64 and flush all >> caches + outer cache (i.e. L3-cache) upon first run of VCPU > > What's the difference between (2) and (1)? Linux ARM (i.e. the 32-bit port) has a framework for outer caches (i.e. caches that are not part of the CPU but exist as a separate entity between the CPU and memory for performance improvement). Using this framework we can flush all CPU caches and the outer cache. > >> 3. Use an alternate version of flush_icache_range() which will >> flush d-cache by PoC instead of PoU. We can also ensure >> that coherent_icache_guest_page() function will be called >> upon Stage2 prefetch aborts only. >> > > I'm sorry, but I'm not following this discussion. Are we not mixing a > discussion along two different axis here? As I see it there are two > separate issues: > > (A) When/Where to flush the memory regions > (B) How to flush the memory regions, including the hardware method and > the kernel software engineering aspect. The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. one that is not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so in the presence of a huge L3-cache (a few MBs) some of the contents of guest RAM will still be sitting in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it will bypass the L3-cache and see incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU. > > -Christoffer --Anup
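For illustration only, option 1 above (clean every registered guest RAM region by VA to the PoC before the first VCPU run) could be sketched roughly as follows. The function name and call site are assumptions; only the memslot walk and __flush_dcache_area() are existing kernel interfaces, and memslot locking is omitted.

/* Sketch of option 1: clean all guest RAM by VA to PoC before the first
 * VCPU run, so dirty lines sitting in an external L3-cache reach RAM. */
static void kvm_flush_guest_ram_to_poc(struct kvm *kvm)	/* hypothetical */
{
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *memslot;

	kvm_for_each_memslot(memslot, slots) {
		void *start = (void *)memslot->userspace_addr;
		size_t size = memslot->npages << PAGE_SHIFT;

		/* DC CIVAC by VA operates to PoC, so it also covers L3 */
		__flush_dcache_area(start, size);
	}
}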
On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote: > On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall > <christoffer.dall@linaro.org> wrote: > > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote: > >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: > >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > >> > <christoffer.dall@linaro.org> wrote: > >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > >> >>> On 2013-08-15 16:13, Anup Patel wrote: > >> >>> > Hi Marc, > >> >>> > > >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> > >> >>> > wrote: > >> >>> >> On 2013-08-15 14:31, Anup Patel wrote: > >> >>> >>> > >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > >> >>> >>> <marc.zyngier@arm.com> > >> >>> >>> wrote: > >> >>> >>>> > >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: > >> >>> >>>>> > >> >>> >>>>> > >> >>> >>>>> Hi Marc, > >> >>> >>>>> > >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > >> >>> >>>>> <marc.zyngier@arm.com> > >> >>> >>>>> wrote: > >> >>> >>>>>> > >> >>> >>>>>> > >> >>> >>>>>> Hi Anup, > >> >>> >>>>>> > >> >>> >>>>>> > >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > >> >>> >>>>>>> <marc.zyngier@arm.com> > >> >>> >>>>>>> wrote: > >> >>> >>>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> Hi Pranav, > >> >>> >>>>>>>> > >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > >> >>> >>>>>>>>> dirty > >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > >> >>> >>>>>>>>> this, > >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that > >> >>> >>>>>>>>> guest > >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is > >> >>> >>>>>>>>> disabled. > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > >> >>> >>>>>>>>> L3-cache. > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > >> >>> >>>>>>>>> <pranavkumar@linaro.org> > >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > >> >>> >>>>>>>>> --- > >> >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > >> >>> >>>>>>>>> 1 file changed, 2 insertions(+) > >> >>> >>>>>>>>> > >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > >> >>> >>>>>>>>> index efe609c..5129038 100644 > >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > >> >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > >> >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > >> >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > >> >>> >>>>>>>>> + /* Flush d-cache for systems with external > >> >>> >>>>>>>>> caches. 
*/ > >> >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > >> >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > >> >>> >>>>>>>>> ASID-tagged > >> >>> >>>>>>>>> VIVT > >> >>> >>>>>>>>> */ > >> >>> >>>>>>>>> /* any kind of VIPT cache */ > >> >>> >>>>>>>>> __flush_icache_all(); > >> >>> >>>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the > >> >>> >>>>>>>> past] > >> >>> >>>>>>>> > >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit > >> >>> >>>>>>>> the > >> >>> >>>>>>>> data > >> >>> >>>>>>>> cache on each page mapping. It looks overkill. > >> >>> >>>>>>>> > >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > >> >>> >>>>>>>> kvmtools > >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and > >> >>> >>>>>>>> can do a > >> >>> >>>>>>>> "DC > >> >>> >>>>>>>> DVAC" on this region. > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > >> >>> >>>>>>> I am > >> >>> >>>>>>> sure > >> >>> >>>>>>> other architectures don't have such cache cleaning in > >> >>> >>>>>>> user-space for > >> >>> >>>>>>> KVM. > >> >>> >>>>>>> > >> >>> >>>>>>>> > >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu > >> >>> >>>>>>>> - but > >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the > >> >>> >>>>>>>> guest > >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing). > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> > >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > >> >>> >>>>>>> VCPU was > >> >>> >>>>>>> our second option but current approach seemed very simple hence > >> >>> >>>>>>> we went for this. > >> >>> >>>>>>> > >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of > >> >>> >>>>>>> VCPU then > >> >>> >>>>>>> we should revise this patch. > >> >>> >>>>>> > >> >>> >>>>>> > >> >>> >>>>>> > >> >>> >>>>>> > >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've > >> >>> >>>>>> boot-tested > >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes > >> >>> >>>>>> your > >> >>> >>>>>> issue. > >> >>> >>>>>> > >> >>> >>>>>> As far as I can see, it should fix it without any additional > >> >>> >>>>>> flushing. > >> >>> >>>>>> > >> >>> >>>>>> Please let me know how it goes. > >> >>> >>>>> > >> >>> >>>>> > >> >>> >>>>> > >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back > >> >>> >>>>> Write-Allocate, > >> >>> >>>>> Outer Write-Back Write-Allocate memory" > >> >>> >>>>> > >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > >> >>> >>>>> treated as "Strongly-ordered device memory" > >> >>> >>>>> > >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. > >> >>> >>>> > >> >>> >>>> > >> >>> >>>> > >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > >> >>> >>>> be > >> >>> >>>> accessed as device memory. 
> >> >>> >>>> > >> >>> >>>> > >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > >> >>> >>>>> KVM ARM64 because we use VGIC. > >> >>> >>>>> > >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM > >> >>> >>>>> when Stage1 MMU is off. > >> >>> >>>> > >> >>> >>>> > >> >>> >>>> > >> >>> >>>> See above. My understanding is that HCR.DC controls the default > >> >>> >>>> output of > >> >>> >>>> Stage-1, and Stage-2 overrides still apply. > >> >>> >>> > >> >>> >>> > >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > >> >>> >>> and wanted guest to decide the memory attribute. In other words, > >> >>> >>> you > >> >>> >>> did not want to enforce any memory attribute in Stage2. > >> >>> >>> > >> >>> >>> Please refer to this patch > >> >>> >>> https://patchwork.kernel.org/patch/2543201/ > >> >>> >> > >> >>> >> > >> >>> >> This patch has never been merged. If you carry on following the > >> >>> >> discussion, > >> >>> >> you will certainly notice it was dropped for a very good reason: > >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > >> >>> >> > >> >>> >> So Stage-2 memory attributes are used, they are not going away, and > >> >>> >> they are > >> >>> >> essential to the patch I sent this morning. > >> >>> >> > >> >>> >> > >> >>> >> M. > >> >>> >> -- > >> >>> >> Fast, cheap, reliable. Pick two. > >> >>> > > >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > >> >>> > is > >> >>> > provided a DMA-capable device in pass-through mode. The reason being > >> >>> > bootloader/firmware typically don't enable MMU and such > >> >>> > bootloader/firmware > >> >>> > will programme a pass-through DMA-capable device without any flushes > >> >>> > to > >> >>> > guest RAM (because it has not enabled MMU). > >> >>> > > >> >>> > A good example of such a device would be SATA AHCI controller given > >> >>> > to a > >> >>> > Guest/VM as direct access (using SystemMMU) and Guest > >> >>> > bootloader/firmware > >> >>> > accessing this SATA AHCI controller to load kernel images from SATA > >> >>> > disk. > >> >>> > In this situation, we will have to hack Guest bootloader/firmware > >> >>> > AHCI driver to > >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > >> >>> > >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of > >> >>> curiosity: is that a made up example or something you actually have? > >> >>> > >> >>> Back to square one: > >> >>> Can you please benchmark the various cache cleaning options (global at > >> >>> startup time, per-page on S2 translation fault, and user-space)? > >> >>> > >> >> Eh, why is this a more valid argument than the vgic? The device > >> >> passthrough Stage-2 mappings would still have the Stage-2 memory > >> >> attributes to configure that memory region as device memory. Why is it > >> >> relevant if the device is DMA-capable in this context? > >> >> > >> >> -Christoffer > >> > > >> > Most ARM bootloader/firmware run with MMU off hence, they will not do > >> > explicit cache flushes when programming DMA-capable device. Now If > >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > >> > will not be visible to DMA-capable device. 
> >> > > >> > --Anup > >> > >> The approach of flushing d-cache by set/way upon first run of VCPU will > >> not work because for set/way operations ARM ARM says: "For set/way > >> operations, and for All (entire cache) operations, the point is defined to be > >> to the next level of caching". In other words, set/way operations work upto > >> point of unification. > >> > >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() > >> at bootup time which does not work for us when L3-cache is enabled so, > >> there is no point is adding one more __flush_dcache_all() upon first run of > >> VCPU in KVM ARM64. > > > > When did anyone suggest using a cache cleaning method that doesn't apply > > to this case. I'm a little confused about your comment here. > > Please refer last reply from Marc Z where he says: > "Can you please benchmark the various cache cleaning options (global at > startup time, per-page on S2 translation fault, and user-space)?". > > Rather doing __flush_dcache_range() on each page in > coherent_icache_guest_page() we could also flush entire d-cache upon > first VCPU run. This only issue in flushing d-cache upon first VCPU run > is that we cannot flush d-cache by set/way because as per ARM ARM > all operations by set/way are upto PoU and not PoC. In presence of > L3-Cache we need to flush d-cache upto PoC so we have to use > __flush_dache_range() > > > > >> > >> IMHO, we are left with following options: > >> 1. Flush all RAM regions of VCPU using __flush_dcache_range() > >> upon first run of VCPU > > > > We could do that, but we have to ensure that no other memory regions can > > be added later. Is this the case? I don't remember. > > Yes, memory regions being added later be problem for this option. > > > > >> 2. Implement outer-cache framework for ARM64 and flush all > >> caches + outer cache (i.e. L3-cache) upon first run of VCPU > > > > What's the difference between (2) and (1)? > > Linux ARM (i.e. 32-bit port) has a framework for having outer > caches (i.e. caches that are not part of the CPU but exist as > separate entity between CPU and Memory for performance > improvement). Using this framework we can flush all CPU caches > and outer cache. > And we don't have such a framework in arm64? But __flush_dcache_range does nevertheless flush outer caches as well? > > > >> 3. Use an alternate version of flush_icache_range() which will > >> flush d-cache by PoC instead of PoU. We can also ensure > >> that coherent_icache_guest_page() function will be called > >> upon Stage2 prefetch aborts only. > >> > > > > I'm sorry, but I'm not following this discussion. Are we not mixing a > > discussion along two different axis here? As I see it there are two > > separate issues: > > > > (A) When/Where to flush the memory regions > > (B) How to flush the memory regions, including the hardware method and > > the kernel software engineering aspect. > > Discussion here is about getting KVM ARM64 working in-presence > of an external L3-cache (i.e. not part of CPU). Before starting a VCPU > user-space typically loads images to guest RAM so, in-presence of > huge L3-cache (few MBs). When the VCPU starts running some of the > contents guest RAM will be still in L3-cache and VCPU runs with > MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and > see incorrect contents. To solve this problem we need to flush the > guest RAM contents before they are accessed by first time by VCPU. > ok, I'm with you that far. 
But is it also not true that we need to decide between:

A.1: Flush the entire guest RAM before running the VCPU
A.2: Flush the pages as we fault them in

And (independently):

B.1: Use __flush_dcache_range
B.2: Use something else + outer cache framework for arm64
B.3: Use flush_icache_range

Or do these all interleave somehow? If so, I don't understand why. Can you help?

-Christoffer
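For reference, option B.2 mirrors what the 32-bit ARM port can already express with its outercache hooks; a minimal sketch of that combination (32-bit kernel APIs only, shown to illustrate what an arm64 equivalent would have to provide) is:

#include <asm/cacheflush.h>
#include <asm/outercache.h>

/* 32-bit ARM only: clean/invalidate the CPU caches by set/way, then ask
 * the outer cache controller (e.g. a PL310 or L3) to flush itself too. */
static void flush_inner_and_outer(void)
{
	flush_cache_all();	/* inner caches, by set/way */
	outer_flush_all();	/* outer cache, via its controller registers */
}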
On Fri, Aug 16, 2013 at 10:50:34AM -0700, Christoffer Dall wrote: > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote: > > On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall > > <christoffer.dall@linaro.org> wrote: > > > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote: > > >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: > > >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall > > >> > <christoffer.dall@linaro.org> wrote: > > >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: > > >> >>> On 2013-08-15 16:13, Anup Patel wrote: > > >> >>> > Hi Marc, > > >> >>> > > > >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> > > >> >>> > wrote: > > >> >>> >> On 2013-08-15 14:31, Anup Patel wrote: > > >> >>> >>> > > >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier > > >> >>> >>> <marc.zyngier@arm.com> > > >> >>> >>> wrote: > > >> >>> >>>> > > >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: > > >> >>> >>>>> > > >> >>> >>>>> > > >> >>> >>>>> Hi Marc, > > >> >>> >>>>> > > >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier > > >> >>> >>>>> <marc.zyngier@arm.com> > > >> >>> >>>>> wrote: > > >> >>> >>>>>> > > >> >>> >>>>>> > > >> >>> >>>>>> Hi Anup, > > >> >>> >>>>>> > > >> >>> >>>>>> > > >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier > > >> >>> >>>>>>> <marc.zyngier@arm.com> > > >> >>> >>>>>>> wrote: > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> Hi Pranav, > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have > > >> >>> >>>>>>>>> dirty > > >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle > > >> >>> >>>>>>>>> this, > > >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that > > >> >>> >>>>>>>>> guest > > >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is > > >> >>> >>>>>>>>> disabled. > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external > > >> >>> >>>>>>>>> L3-cache. > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar > > >> >>> >>>>>>>>> <pranavkumar@linaro.org> > > >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> > > >> >>> >>>>>>>>> --- > > >> >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ > > >> >>> >>>>>>>>> 1 file changed, 2 insertions(+) > > >> >>> >>>>>>>>> > > >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h > > >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h > > >> >>> >>>>>>>>> index efe609c..5129038 100644 > > >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h > > >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h > > >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void > > >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) > > >> >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ > > >> >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); > > >> >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); > > >> >>> >>>>>>>>> + /* Flush d-cache for systems with external > > >> >>> >>>>>>>>> caches. 
*/ > > >> >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); > > >> >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non > > >> >>> >>>>>>>>> ASID-tagged > > >> >>> >>>>>>>>> VIVT > > >> >>> >>>>>>>>> */ > > >> >>> >>>>>>>>> /* any kind of VIPT cache */ > > >> >>> >>>>>>>>> __flush_icache_all(); > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the > > >> >>> >>>>>>>> past] > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit > > >> >>> >>>>>>>> the > > >> >>> >>>>>>>> data > > >> >>> >>>>>>>> cache on each page mapping. It looks overkill. > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? > > >> >>> >>>>>>>> kvmtools > > >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and > > >> >>> >>>>>>>> can do a > > >> >>> >>>>>>>> "DC > > >> >>> >>>>>>>> DVAC" on this region. > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. > > >> >>> >>>>>>> I am > > >> >>> >>>>>>> sure > > >> >>> >>>>>>> other architectures don't have such cache cleaning in > > >> >>> >>>>>>> user-space for > > >> >>> >>>>>>> KVM. > > >> >>> >>>>>>> > > >> >>> >>>>>>>> > > >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu > > >> >>> >>>>>>>> - but > > >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the > > >> >>> >>>>>>>> guest > > >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing). > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> > > >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of > > >> >>> >>>>>>> VCPU was > > >> >>> >>>>>>> our second option but current approach seemed very simple hence > > >> >>> >>>>>>> we went for this. > > >> >>> >>>>>>> > > >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of > > >> >>> >>>>>>> VCPU then > > >> >>> >>>>>>> we should revise this patch. > > >> >>> >>>>>> > > >> >>> >>>>>> > > >> >>> >>>>>> > > >> >>> >>>>>> > > >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've > > >> >>> >>>>>> boot-tested > > >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes > > >> >>> >>>>>> your > > >> >>> >>>>>> issue. > > >> >>> >>>>>> > > >> >>> >>>>>> As far as I can see, it should fix it without any additional > > >> >>> >>>>>> flushing. > > >> >>> >>>>>> > > >> >>> >>>>>> Please let me know how it goes. > > >> >>> >>>>> > > >> >>> >>>>> > > >> >>> >>>>> > > >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are > > >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back > > >> >>> >>>>> Write-Allocate, > > >> >>> >>>>> Outer Write-Back Write-Allocate memory" > > >> >>> >>>>> > > >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are > > >> >>> >>>>> treated as "Strongly-ordered device memory" > > >> >>> >>>>> > > >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as > > >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be > > >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. > > >> >>> >>>> > > >> >>> >>>> > > >> >>> >>>> > > >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to > > >> >>> >>>> be > > >> >>> >>>> accessed as device memory. 
> > >> >>> >>>> > > >> >>> >>>> > > >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software > > >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or > > >> >>> >>>>> KVM ARM64 because we use VGIC. > > >> >>> >>>>> > > >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM > > >> >>> >>>>> when Stage1 MMU is off. > > >> >>> >>>> > > >> >>> >>>> > > >> >>> >>>> > > >> >>> >>>> See above. My understanding is that HCR.DC controls the default > > >> >>> >>>> output of > > >> >>> >>>> Stage-1, and Stage-2 overrides still apply. > > >> >>> >>> > > >> >>> >>> > > >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant > > >> >>> >>> and wanted guest to decide the memory attribute. In other words, > > >> >>> >>> you > > >> >>> >>> did not want to enforce any memory attribute in Stage2. > > >> >>> >>> > > >> >>> >>> Please refer to this patch > > >> >>> >>> https://patchwork.kernel.org/patch/2543201/ > > >> >>> >> > > >> >>> >> > > >> >>> >> This patch has never been merged. If you carry on following the > > >> >>> >> discussion, > > >> >>> >> you will certainly notice it was dropped for a very good reason: > > >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html > > >> >>> >> > > >> >>> >> So Stage-2 memory attributes are used, they are not going away, and > > >> >>> >> they are > > >> >>> >> essential to the patch I sent this morning. > > >> >>> >> > > >> >>> >> > > >> >>> >> M. > > >> >>> >> -- > > >> >>> >> Fast, cheap, reliable. Pick two. > > >> >>> > > > >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM > > >> >>> > is > > >> >>> > provided a DMA-capable device in pass-through mode. The reason being > > >> >>> > bootloader/firmware typically don't enable MMU and such > > >> >>> > bootloader/firmware > > >> >>> > will programme a pass-through DMA-capable device without any flushes > > >> >>> > to > > >> >>> > guest RAM (because it has not enabled MMU). > > >> >>> > > > >> >>> > A good example of such a device would be SATA AHCI controller given > > >> >>> > to a > > >> >>> > Guest/VM as direct access (using SystemMMU) and Guest > > >> >>> > bootloader/firmware > > >> >>> > accessing this SATA AHCI controller to load kernel images from SATA > > >> >>> > disk. > > >> >>> > In this situation, we will have to hack Guest bootloader/firmware > > >> >>> > AHCI driver to > > >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). > > >> >>> > > >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of > > >> >>> curiosity: is that a made up example or something you actually have? > > >> >>> > > >> >>> Back to square one: > > >> >>> Can you please benchmark the various cache cleaning options (global at > > >> >>> startup time, per-page on S2 translation fault, and user-space)? > > >> >>> > > >> >> Eh, why is this a more valid argument than the vgic? The device > > >> >> passthrough Stage-2 mappings would still have the Stage-2 memory > > >> >> attributes to configure that memory region as device memory. Why is it > > >> >> relevant if the device is DMA-capable in this context? > > >> >> > > >> >> -Christoffer > > >> > > > >> > Most ARM bootloader/firmware run with MMU off hence, they will not do > > >> > explicit cache flushes when programming DMA-capable device. Now If > > >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware > > >> > will not be visible to DMA-capable device. 
> > >> > > > >> > --Anup > > >> > > >> The approach of flushing d-cache by set/way upon first run of VCPU will > > >> not work because for set/way operations ARM ARM says: "For set/way > > >> operations, and for All (entire cache) operations, the point is defined to be > > >> to the next level of caching". In other words, set/way operations work upto > > >> point of unification. > > >> > > >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() > > >> at bootup time which does not work for us when L3-cache is enabled so, > > >> there is no point is adding one more __flush_dcache_all() upon first run of > > >> VCPU in KVM ARM64. > > > > > > When did anyone suggest using a cache cleaning method that doesn't apply > > > to this case. I'm a little confused about your comment here. > > > > Please refer last reply from Marc Z where he says: > > "Can you please benchmark the various cache cleaning options (global at > > startup time, per-page on S2 translation fault, and user-space)?". > > > > Rather doing __flush_dcache_range() on each page in > > coherent_icache_guest_page() we could also flush entire d-cache upon > > first VCPU run. This only issue in flushing d-cache upon first VCPU run > > is that we cannot flush d-cache by set/way because as per ARM ARM > > all operations by set/way are upto PoU and not PoC. In presence of > > L3-Cache we need to flush d-cache upto PoC so we have to use > > __flush_dache_range() > > > > > > > >> > > >> IMHO, we are left with following options: > > >> 1. Flush all RAM regions of VCPU using __flush_dcache_range() > > >> upon first run of VCPU > > > > > > We could do that, but we have to ensure that no other memory regions can > > > be added later. Is this the case? I don't remember. > > > > Yes, memory regions being added later be problem for this option. > > > > > > > >> 2. Implement outer-cache framework for ARM64 and flush all > > >> caches + outer cache (i.e. L3-cache) upon first run of VCPU > > > > > > What's the difference between (2) and (1)? > > > > Linux ARM (i.e. 32-bit port) has a framework for having outer > > caches (i.e. caches that are not part of the CPU but exist as > > separate entity between CPU and Memory for performance > > improvement). Using this framework we can flush all CPU caches > > and outer cache. > > > > And we don't have such a framework in arm64? But __flush_dcache_range > does nevertheless flush outer caches as well? > > > > > > >> 3. Use an alternate version of flush_icache_range() which will > > >> flush d-cache by PoC instead of PoU. We can also ensure > > >> that coherent_icache_guest_page() function will be called > > >> upon Stage2 prefetch aborts only. > > >> > > > > > > I'm sorry, but I'm not following this discussion. Are we not mixing a > > > discussion along two different axis here? As I see it there are two > > > separate issues: > > > > > > (A) When/Where to flush the memory regions > > > (B) How to flush the memory regions, including the hardware method and > > > the kernel software engineering aspect. > > > > Discussion here is about getting KVM ARM64 working in-presence > > of an external L3-cache (i.e. not part of CPU). Before starting a VCPU > > user-space typically loads images to guest RAM so, in-presence of > > huge L3-cache (few MBs). When the VCPU starts running some of the > > contents guest RAM will be still in L3-cache and VCPU runs with > > MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and > > see incorrect contents. 
To solve this problem we need to flush the > > guest RAM contents before they are accessed by first time by VCPU. > > > ok, I'm with you that far. > > But is it also not true that we need to decide between: > > A.1: Flush the entire guest RAM before running the VCPU > A.2: Flush the pages as we fault them in > > And (independently): > > B.1: Use __flush_dcache_range > B.2: Use something else + outer cache framework for arm64 > B.3: Use flush_icache_range > > Or do these all interleave somehow? If so, I don't understand why. Can > you help? > Oh, I think I understand your point now. You mean that if we use flush_cache_all before we run the vcpu, then it's not sufficient? I assumed we would simply be using __flush_dcache_area for the guest RAM regions, which would flush to PoC. For the record, I think we need a solution that also covers the case where a memory region is registered later, and I therefore prefer having this functionality in the stage-2 fault handler, where we are already taking care of similar issues (like your patch suggested). Especially if we can limit ourselves to doing so when the guest MMU is disabled, I really think this is going to be the least overhead, and measuring the performance of this penalty vs. doing it at first VCPU execution is a bit overkill IMHO, since we are only talking about boot time for any reasonable guest (which would presumably run with the MMU enabled for real workloads). The only caveat is the previously discussed issue of user space loading code after the first VCPU execution; only user space would know if that happens, which would argue for user space doing the cleaning... Hmm. I also still have my worries about swapping, since user space is free to map guest RAM as non-executable. Did I miss anything here? -Christoffer
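A minimal sketch of the approach described here (clean the faulting page to the PoC from the Stage-2 fault path, but only while the guest's Stage-1 MMU is still off) might look like the following. vcpu_mmu_disabled() is a hypothetical helper standing in for a check of the guest's SCTLR_EL1.M bit; the rest reuses interfaces already quoted in this thread.

/* Sketch: clean the faulting page to PoC, but only while the guest runs
 * with caching off; once the guest MMU is on, normal maintenance applies. */
static void coherent_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);

	if (vcpu_mmu_disabled(vcpu))	/* hypothetical SCTLR_EL1.M check */
		__flush_dcache_area((void *)hva, PAGE_SIZE);

	/* i-cache maintenance as in coherent_icache_guest_page() ... */
}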
On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote: > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote: >> On Fri, Aug 16, 2013 at 10:49 PM, Christoffer Dall >> <christoffer.dall@linaro.org> wrote: >> > On Fri, Aug 16, 2013 at 12:27:55PM +0530, Anup Patel wrote: >> >> On Fri, Aug 16, 2013 at 10:32 AM, Anup Patel <anup@brainfault.org> wrote: >> >> > On Thu, Aug 15, 2013 at 10:23 PM, Christoffer Dall >> >> > <christoffer.dall@linaro.org> wrote: >> >> >> On Thu, Aug 15, 2013 at 04:37:27PM +0100, Marc Zyngier wrote: >> >> >>> On 2013-08-15 16:13, Anup Patel wrote: >> >> >>> > Hi Marc, >> >> >>> > >> >> >>> > On Thu, Aug 15, 2013 at 8:17 PM, Marc Zyngier <marc.zyngier@arm.com> >> >> >>> > wrote: >> >> >>> >> On 2013-08-15 14:31, Anup Patel wrote: >> >> >>> >>> >> >> >>> >>> On Thu, Aug 15, 2013 at 2:01 PM, Marc Zyngier >> >> >>> >>> <marc.zyngier@arm.com> >> >> >>> >>> wrote: >> >> >>> >>>> >> >> >>> >>>> On 2013-08-15 07:26, Anup Patel wrote: >> >> >>> >>>>> >> >> >>> >>>>> >> >> >>> >>>>> Hi Marc, >> >> >>> >>>>> >> >> >>> >>>>> On Thu, Aug 15, 2013 at 10:22 AM, Marc Zyngier >> >> >>> >>>>> <marc.zyngier@arm.com> >> >> >>> >>>>> wrote: >> >> >>> >>>>>> >> >> >>> >>>>>> >> >> >>> >>>>>> Hi Anup, >> >> >>> >>>>>> >> >> >>> >>>>>> >> >> >>> >>>>>> On 2013-08-14 15:22, Anup Patel wrote: >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> On Wed, Aug 14, 2013 at 5:34 PM, Marc Zyngier >> >> >>> >>>>>>> <marc.zyngier@arm.com> >> >> >>> >>>>>>> wrote: >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> Hi Pranav, >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> On 2013-08-14 12:47, Pranavkumar Sawargaonkar wrote: >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> Systems with large external L3-cache (few MBs), might have >> >> >>> >>>>>>>>> dirty >> >> >>> >>>>>>>>> content belonging to the guest page in L3-cache. To tackle >> >> >>> >>>>>>>>> this, >> >> >>> >>>>>>>>> we need to flush such dirty content from d-cache so that >> >> >>> >>>>>>>>> guest >> >> >>> >>>>>>>>> will see correct contents of guest page when guest MMU is >> >> >>> >>>>>>>>> disabled. >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> The patch fixes coherent_icache_guest_page() for external >> >> >>> >>>>>>>>> L3-cache. >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> Signed-off-by: Pranavkumar Sawargaonkar >> >> >>> >>>>>>>>> <pranavkumar@linaro.org> >> >> >>> >>>>>>>>> Signed-off-by: Anup Patel <anup.patel@linaro.org> >> >> >>> >>>>>>>>> --- >> >> >>> >>>>>>>>> arch/arm64/include/asm/kvm_mmu.h | 2 ++ >> >> >>> >>>>>>>>> 1 file changed, 2 insertions(+) >> >> >>> >>>>>>>>> >> >> >>> >>>>>>>>> diff --git a/arch/arm64/include/asm/kvm_mmu.h >> >> >>> >>>>>>>>> b/arch/arm64/include/asm/kvm_mmu.h >> >> >>> >>>>>>>>> index efe609c..5129038 100644 >> >> >>> >>>>>>>>> --- a/arch/arm64/include/asm/kvm_mmu.h >> >> >>> >>>>>>>>> +++ b/arch/arm64/include/asm/kvm_mmu.h >> >> >>> >>>>>>>>> @@ -123,6 +123,8 @@ static inline void >> >> >>> >>>>>>>>> coherent_icache_guest_page(struct kvm *kvm, gfn_t gfn) >> >> >>> >>>>>>>>> if (!icache_is_aliasing()) { /* PIPT */ >> >> >>> >>>>>>>>> unsigned long hva = gfn_to_hva(kvm, gfn); >> >> >>> >>>>>>>>> flush_icache_range(hva, hva + PAGE_SIZE); >> >> >>> >>>>>>>>> + /* Flush d-cache for systems with external >> >> >>> >>>>>>>>> caches. 
*/ >> >> >>> >>>>>>>>> + __flush_dcache_area((void *) hva, PAGE_SIZE); >> >> >>> >>>>>>>>> } else if (!icache_is_aivivt()) { /* non >> >> >>> >>>>>>>>> ASID-tagged >> >> >>> >>>>>>>>> VIVT >> >> >>> >>>>>>>>> */ >> >> >>> >>>>>>>>> /* any kind of VIPT cache */ >> >> >>> >>>>>>>>> __flush_icache_all(); >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> [adding Will to the discussion as we talked about this in the >> >> >>> >>>>>>>> past] >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> That's definitely an issue, but I'm not sure the fix is to hit >> >> >>> >>>>>>>> the >> >> >>> >>>>>>>> data >> >> >>> >>>>>>>> cache on each page mapping. It looks overkill. >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> Wouldn't it be enough to let userspace do the cache cleaning? >> >> >>> >>>>>>>> kvmtools >> >> >>> >>>>>>>> knows which bits of the guest memory have been touched, and >> >> >>> >>>>>>>> can do a >> >> >>> >>>>>>>> "DC >> >> >>> >>>>>>>> DVAC" on this region. >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> It seems a bit unnatural to have cache cleaning is user-space. >> >> >>> >>>>>>> I am >> >> >>> >>>>>>> sure >> >> >>> >>>>>>> other architectures don't have such cache cleaning in >> >> >>> >>>>>>> user-space for >> >> >>> >>>>>>> KVM. >> >> >>> >>>>>>> >> >> >>> >>>>>>>> >> >> >>> >>>>>>>> The alternative is do it in the kernel before running any vcpu >> >> >>> >>>>>>>> - but >> >> >>> >>>>>>>> that's not very nice either (have to clean the whole of the >> >> >>> >>>>>>>> guest >> >> >>> >>>>>>>> memory, which makes a full dcache clean more appealing). >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> >> >> >>> >>>>>>> Actually, cleaning full d-cache by set/way upon first run of >> >> >>> >>>>>>> VCPU was >> >> >>> >>>>>>> our second option but current approach seemed very simple hence >> >> >>> >>>>>>> we went for this. >> >> >>> >>>>>>> >> >> >>> >>>>>>> If more people vote for full d-cache clean upon first run of >> >> >>> >>>>>>> VCPU then >> >> >>> >>>>>>> we should revise this patch. >> >> >>> >>>>>> >> >> >>> >>>>>> >> >> >>> >>>>>> >> >> >>> >>>>>> >> >> >>> >>>>>> Can you please give the attached patch a spin on your HW? I've >> >> >>> >>>>>> boot-tested >> >> >>> >>>>>> it on a model, but of course I can't really verify that it fixes >> >> >>> >>>>>> your >> >> >>> >>>>>> issue. >> >> >>> >>>>>> >> >> >>> >>>>>> As far as I can see, it should fix it without any additional >> >> >>> >>>>>> flushing. >> >> >>> >>>>>> >> >> >>> >>>>>> Please let me know how it goes. >> >> >>> >>>>> >> >> >>> >>>>> >> >> >>> >>>>> >> >> >>> >>>>> HCR_EL2.DC=1 means all memory access for Stage1 MMU off are >> >> >>> >>>>> treated as "Normal Non-shareable, Inner Write-Back >> >> >>> >>>>> Write-Allocate, >> >> >>> >>>>> Outer Write-Back Write-Allocate memory" >> >> >>> >>>>> >> >> >>> >>>>> HCR_EL2.DC=0 means all memory access for Stage1 MMU off are >> >> >>> >>>>> treated as "Strongly-ordered device memory" >> >> >>> >>>>> >> >> >>> >>>>> Now if Guest/VM access hardware MMIO devices directly (such as >> >> >>> >>>>> VGIC CPU interface) with MMU off then MMIO devices will be >> >> >>> >>>>> accessed as normal memory when HCR_EL2.DC=1. >> >> >>> >>>> >> >> >>> >>>> >> >> >>> >>>> >> >> >>> >>>> I don't think so. Stage-2 still applies, and should force MMIO to >> >> >>> >>>> be >> >> >>> >>>> accessed as device memory. 
>> >> >>> >>>> >> >> >>> >>>> >> >> >>> >>>>> The HCR_EL2.DC=1 makes sense only if we have all software >> >> >>> >>>>> emulated devices for Guest/VM which is not true for KVM ARM or >> >> >>> >>>>> KVM ARM64 because we use VGIC. >> >> >>> >>>>> >> >> >>> >>>>> IMHO, this patch enforces incorrect memory attribute for Guest/VM >> >> >>> >>>>> when Stage1 MMU is off. >> >> >>> >>>> >> >> >>> >>>> >> >> >>> >>>> >> >> >>> >>>> See above. My understanding is that HCR.DC controls the default >> >> >>> >>>> output of >> >> >>> >>>> Stage-1, and Stage-2 overrides still apply. >> >> >>> >>> >> >> >>> >>> >> >> >>> >>> You had mentioned that PAGE_S2_DEVICE attribute was redundant >> >> >>> >>> and wanted guest to decide the memory attribute. In other words, >> >> >>> >>> you >> >> >>> >>> did not want to enforce any memory attribute in Stage2. >> >> >>> >>> >> >> >>> >>> Please refer to this patch >> >> >>> >>> https://patchwork.kernel.org/patch/2543201/ >> >> >>> >> >> >> >>> >> >> >> >>> >> This patch has never been merged. If you carry on following the >> >> >>> >> discussion, >> >> >>> >> you will certainly notice it was dropped for a very good reason: >> >> >>> >> https://lists.cs.columbia.edu/pipermail/kvmarm/2013-May/005827.html >> >> >>> >> >> >> >>> >> So Stage-2 memory attributes are used, they are not going away, and >> >> >>> >> they are >> >> >>> >> essential to the patch I sent this morning. >> >> >>> >> >> >> >>> >> >> >> >>> >> M. >> >> >>> >> -- >> >> >>> >> Fast, cheap, reliable. Pick two. >> >> >>> > >> >> >>> > HCR_EL2.DC=1 will break Guest/VM bootloader/firmware if the Guest/VM >> >> >>> > is >> >> >>> > provided a DMA-capable device in pass-through mode. The reason being >> >> >>> > bootloader/firmware typically don't enable MMU and such >> >> >>> > bootloader/firmware >> >> >>> > will programme a pass-through DMA-capable device without any flushes >> >> >>> > to >> >> >>> > guest RAM (because it has not enabled MMU). >> >> >>> > >> >> >>> > A good example of such a device would be SATA AHCI controller given >> >> >>> > to a >> >> >>> > Guest/VM as direct access (using SystemMMU) and Guest >> >> >>> > bootloader/firmware >> >> >>> > accessing this SATA AHCI controller to load kernel images from SATA >> >> >>> > disk. >> >> >>> > In this situation, we will have to hack Guest bootloader/firmware >> >> >>> > AHCI driver to >> >> >>> > explicitly flushes to Guest RAM (because have HCR_EL2.DC=1). >> >> >>> >> >> >>> OK. So *that* looks like a valid argument against HCR_EL2.DC==1. Out of >> >> >>> curiosity: is that a made up example or something you actually have? >> >> >>> >> >> >>> Back to square one: >> >> >>> Can you please benchmark the various cache cleaning options (global at >> >> >>> startup time, per-page on S2 translation fault, and user-space)? >> >> >>> >> >> >> Eh, why is this a more valid argument than the vgic? The device >> >> >> passthrough Stage-2 mappings would still have the Stage-2 memory >> >> >> attributes to configure that memory region as device memory. Why is it >> >> >> relevant if the device is DMA-capable in this context? >> >> >> >> >> >> -Christoffer >> >> > >> >> > Most ARM bootloader/firmware run with MMU off hence, they will not do >> >> > explicit cache flushes when programming DMA-capable device. Now If >> >> > HCR_EL2.DC=1 then Guest RAM updates done by bootloader/firmware >> >> > will not be visible to DMA-capable device. 
>> >> > >> >> > --Anup >> >> >> >> The approach of flushing d-cache by set/way upon first run of VCPU will >> >> not work because for set/way operations ARM ARM says: "For set/way >> >> operations, and for All (entire cache) operations, the point is defined to be >> >> to the next level of caching". In other words, set/way operations work upto >> >> point of unification. >> >> >> >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() >> >> at bootup time which does not work for us when L3-cache is enabled so, >> >> there is no point is adding one more __flush_dcache_all() upon first run of >> >> VCPU in KVM ARM64. >> > >> > When did anyone suggest using a cache cleaning method that doesn't apply >> > to this case. I'm a little confused about your comment here. >> >> Please refer last reply from Marc Z where he says: >> "Can you please benchmark the various cache cleaning options (global at >> startup time, per-page on S2 translation fault, and user-space)?". >> >> Rather doing __flush_dcache_range() on each page in >> coherent_icache_guest_page() we could also flush entire d-cache upon >> first VCPU run. This only issue in flushing d-cache upon first VCPU run >> is that we cannot flush d-cache by set/way because as per ARM ARM >> all operations by set/way are upto PoU and not PoC. In presence of >> L3-Cache we need to flush d-cache upto PoC so we have to use >> __flush_dache_range() >> >> > >> >> >> >> IMHO, we are left with following options: >> >> 1. Flush all RAM regions of VCPU using __flush_dcache_range() >> >> upon first run of VCPU >> > >> > We could do that, but we have to ensure that no other memory regions can >> > be added later. Is this the case? I don't remember. >> >> Yes, memory regions being added later be problem for this option. >> >> > >> >> 2. Implement outer-cache framework for ARM64 and flush all >> >> caches + outer cache (i.e. L3-cache) upon first run of VCPU >> > >> > What's the difference between (2) and (1)? >> >> Linux ARM (i.e. 32-bit port) has a framework for having outer >> caches (i.e. caches that are not part of the CPU but exist as >> separate entity between CPU and Memory for performance >> improvement). Using this framework we can flush all CPU caches >> and outer cache. >> > > And we don't have such a framework in arm64? But __flush_dcache_range Yes, for now we don't have outer-cache framework in arm64. > does nevertheless flush outer caches as well? Yes, __flush_dcache_range() flushes the outer cache too. The __flush_dcache_range() works for us because it flushes (i.e. clean & invalidate) by VA upto PoC. All operation upto PoC on a cache line will force eviction of the cache line from all CPU caches (i.e. both L1 & L2) and also from external L3-cache to ensure that RAM has updated contents of cache line. Now, the outer caches (such as L3-cache) typically provide a mechanism to flush entire L3-cache using L3-cache registers. If we have an outer-cache framework then we can flush all caches using __flush_dcache_all() + outer cache flush all. > >> > >> >> 3. Use an alternate version of flush_icache_range() which will >> >> flush d-cache by PoC instead of PoU. We can also ensure >> >> that coherent_icache_guest_page() function will be called >> >> upon Stage2 prefetch aborts only. >> >> >> > >> > I'm sorry, but I'm not following this discussion. Are we not mixing a >> > discussion along two different axis here? 
As I see it there are two >> > separate issues: >> > >> > (A) When/Where to flush the memory regions >> > (B) How to flush the memory regions, including the hardware method and >> > the kernel software engineering aspect. >> >> Discussion here is about getting KVM ARM64 working in-presence >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU >> user-space typically loads images to guest RAM so, in-presence of >> huge L3-cache (few MBs). When the VCPU starts running some of the >> contents guest RAM will be still in L3-cache and VCPU runs with >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and >> see incorrect contents. To solve this problem we need to flush the >> guest RAM contents before they are accessed by first time by VCPU. >> > ok, I'm with you that far. > > But is it also not true that we need to decide between: > > A.1: Flush the entire guest RAM before running the VCPU > A.2: Flush the pages as we fault them in Yes, thats the decision we have to make. > > And (independently): > > B.1: Use __flush_dcache_range > B.2: Use something else + outer cache framework for arm64 This would be __flush_dcache_all() + outer cache flush all. > B.3: Use flush_icache_range Actually modified version of flush_icache_range() which uses "dc cvac" and not "dc cvau". > > Or do these all interleave somehow? If so, I don't understand why. Can > you help? > > -Christoffer --Anup
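For illustration, here is a rough sketch of the kind of helper Anup describes above: a flush_icache_range()-style routine that cleans the D-cache by VA with "dc cvac" (to the PoC) instead of "dc cvau" (to the PoU), and then invalidates the I-cache as usual. The function name and details are assumptions for this thread, not the actual kernel implementation.

/*
 * Sketch only: clean [start, end) to the Point of Coherency so that
 * dirty lines are pushed past L1/L2 and any external L3, then do the
 * usual I-cache invalidation to the PoU.
 */
static void __kvm_flush_dcache_to_poc(unsigned long start, unsigned long end)
{
	unsigned long ctr, dline, iline, addr;

	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
	dline = 4UL << ((ctr >> 16) & 0xf);	/* CTR_EL0.DminLine, in bytes */
	iline = 4UL << (ctr & 0xf);		/* CTR_EL0.IminLine, in bytes */

	/* Clean D-cache lines by VA to PoC ("dc cvac", not "dc cvau"). */
	for (addr = start & ~(dline - 1); addr < end; addr += dline)
		asm volatile("dc cvac, %0" : : "r" (addr) : "memory");
	asm volatile("dsb sy" : : : "memory");

	/* Invalidate I-cache to PoU, as flush_icache_range() already does. */
	for (addr = start & ~(iline - 1); addr < end; addr += iline)
		asm volatile("ic ivau, %0" : : "r" (addr) : "memory");
	asm volatile("dsb sy\n\tisb" : : : "memory");
}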
On Fri, Aug 16, 2013 at 11:41:51PM +0530, Anup Patel wrote:
> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall
> <christoffer.dall@linaro.org> wrote:
> > But is it also not true that we need to decide between:
> >
> > A.1: Flush the entire guest RAM before running the VCPU
> > A.2: Flush the pages as we fault them in
>
> Yes, thats the decision we have to make.
>
> > And (independently):
> >
> > B.1: Use __flush_dcache_range
> > B.2: Use something else + outer cache framework for arm64
>
> This would be __flush_dcache_all() + outer cache flush all.
>
> > B.3: Use flush_icache_range
>
> Actually modified version of flush_icache_range() which uses
> "dc cvac" and not "dc cvau".
>
> > Or do these all interleave somehow? If so, I don't understand why. Can
> > you help?

Hi Anup,

Thanks for the explanation, it's crystal now.

-Christoffer
On Fri, Aug 16, 2013 at 11:36 PM, Christoffer Dall
<christoffer.dall@linaro.org> wrote:
> On Fri, Aug 16, 2013 at 10:50:34AM -0700, Christoffer Dall wrote:
>> But is it also not true that we need to decide between:
>>
>> A.1: Flush the entire guest RAM before running the VCPU
>> A.2: Flush the pages as we fault them in
>>
>> And (independently):
>>
>> B.1: Use __flush_dcache_range
>> B.2: Use something else + outer cache framework for arm64
>> B.3: Use flush_icache_range
>>
>> Or do these all interleave somehow? If so, I don't understand why. Can
>> you help?
>>
> Oh, I think I understand your point now. You mean if we use
> flush_cache_all before we run the vcpu, then it's not sufficient? I
> assumed we would simply be using __flush_dcache_area for the guest RAM
> regions which would flush to PoC.
>
> For the record, I think we need a solution that also covers the case
> where a memory region is registered later, and I therefore prefer having
> this functionality in the stage-2 fault handler, where we are already
> taking care of similar issues (like your patch suggested).
>
> Especially if we can limit ourselves to doing so when the guest MMU is
> disabled, then I really think this is going to be the least overhead and
> measuring the performance of this penalty vs. at first CPU execution is
> a bit overkill IMHO, since we are only talking about boot time for any
> reasonable guest (which would run with the MMU enabled for real
> workloads presumably).

Yes, currently we call coherent_icache_guest_page() upon all Stage2
translation faults, but we can improve a bit here along the lines of your
suggestion. Maybe we should call coherent_icache_guest_page() only when the
VCPU MMU is off and we get a Stage2 instruction prefetch translation fault.

> The only caveat is the previously discussed issue if user space loads
> code after the first VCPU execution, and only user space would know if
> that happens, which would argue for user space doing the cleaning...
> Hmm.
>
> I also still have my worries about swapping, since user space is free to
> map guest RAM as non-executable.
>
> Did I miss anything here?

I don't know much about Linux swapping, but for the rest we are in sync.

> -Christoffer

--Anup
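A minimal sketch of the gating discussed above, cleaning to PoC from the Stage-2 fault path only while the guest's Stage-1 MMU is still off, might look as follows. The helper names and the raw SCTLR_EL1 parameter are assumptions for illustration; this is not existing KVM code.

#define SCTLR_EL1_M	(1UL << 0)	/* Stage-1 MMU enable for EL1&0 */

/*
 * Sketch only: skip the expensive clean to PoC once the guest has
 * turned its MMU (and therefore its caches) on.
 */
static void clean_guest_page_if_needed(unsigned long guest_sctlr_el1,
				       unsigned long hva, unsigned long size)
{
	if (!(guest_sctlr_el1 & SCTLR_EL1_M))
		__kvm_flush_dcache_to_poc(hva, hva + size);	/* earlier sketch */
}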
Hi Anup,

Jumping late into this thread (holidays).

On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote:
> The approach of flushing d-cache by set/way upon first run of VCPU will
> not work because for set/way operations ARM ARM says: "For set/way
> operations, and for All (entire cache) operations, the point is defined to be
> to the next level of caching". In other words, set/way operations work upto
> point of unification.

I don't understand where you got the idea that set/way operations work
up to the point of unification. This is incorrect, the set/way
operations work on the level of cache specified by bits 3:1 in the
register passed to the DC CISW instruction. For your L3 cache, those
bits would be 2 (and __flush_dcache_all() implementation does this
dynamically).

For the I-cache all operation, that's correct, it only invalidates to
the PoU but it doesn't need to go any further, you do the D-cache
maintenance for this (no point in duplicating functionality).

> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup()
> at bootup time which does not work for us when L3-cache is enabled so,
> there is no point is adding one more __flush_dcache_all() upon first run of
> VCPU in KVM ARM64.

Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report
an L3 cache on your implementation?

It's not clear to me whether your L3 cache is inner or outer (or a mix).
You say that D-cache maintenance to PoC flushes the L3 which looks to me
like an inner cache, in which case it should be reported in the LoC
bits in CLIDR_EL1.

> IMHO, we are left with following options:
> 1. Flush all RAM regions of VCPU using __flush_dcache_range()
> upon first run of VCPU
> 2. Implement outer-cache framework for ARM64 and flush all
> caches + outer cache (i.e. L3-cache) upon first run of VCPU

Do you have specific instructions for flushing the L3 cache only? It's
not clear to me what an outer-cache framework would do on AArch64. It
was added on AArch32 for the L2x0/PL310 which need separate instructions
by physical address for flushing the cache. I really hope we don't get
these again on ARMv8 hardware.

> 3. Use an alternate version of flush_icache_range() which will
> flush d-cache by PoC instead of PoU. We can also ensure
> that coherent_icache_guest_page() function will be called
> upon Stage2 prefetch aborts only.

flush_icache_range() is meant to flush only to the PoU to ensure the I-D
cache coherency. I don't think we should change this.
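For reference, the level/set/way encoding Catalin mentions can be sketched as below: the cache level goes in bits [3:1] of the DC CISW operand, while the set and way shifts are derived from CCSIDR_EL1 for the selected level. The helper is illustrative only.

/*
 * Illustrative encoding of the set/way operand for DC CISW/CSW/ISW.
 * level is 1-based (1 = L1); line_shift = log2(line length in bytes);
 * way_shift = 32 - ceil(log2(number of ways)); both come from CCSIDR_EL1.
 */
static inline unsigned long dc_sw_operand(unsigned int level, unsigned int set,
					  unsigned int way, unsigned int line_shift,
					  unsigned int way_shift)
{
	return ((unsigned long)way << way_shift) |
	       ((unsigned long)set << line_shift) |
	       ((unsigned long)(level - 1) << 1);	/* level field, bits [3:1] */
}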
On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > Hi Anup, > > Jumping late into this thread (holidays). > > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote: >> The approach of flushing d-cache by set/way upon first run of VCPU will >> not work because for set/way operations ARM ARM says: "For set/way >> operations, and for All (entire cache) operations, the point is defined to be >> to the next level of caching". In other words, set/way operations work upto >> point of unification. > > I don't understand where you got the idea that set/way operations work > up to the point of unification. This is incorrect, the set/way > operations work on the level of cache specified by bits 3:1 in the > register passed to the DC CISW instruction. For your L3 cache, those > bits would be 2 (and __flush_dcache_all() implementation does this > dynamically). The L3-cache is not visible to CPU. It is totally independent and transparent to CPU. > > For the I-cache all operation, that's correct, it only invalidates to > the PoU but it doesn't need to go any further, you do the D-cache > maintenance for this (no point in duplicating functionality). > >> Also, Guest Linux already does __flush_dcache_all() from __cpu_setup() >> at bootup time which does not work for us when L3-cache is enabled so, >> there is no point is adding one more __flush_dcache_all() upon first run of >> VCPU in KVM ARM64. > > Do you mean the CPU is not aware of the L3 cache? Does CLIDR_EL1 report > an L3 cache on your implementation? Yes, CPU is not aware of L3-cache and its state. In our case, CLIDR_EL1 does not report L3-cache. > > It's not clear to me whether your L3 cache is inner or outer (or a mix). > You say that D-cache maintenance to PoC flushes the L3 which looks to me > like an inner cache, in which cache it should be reported in the LoC > bits in CLIDR_EL1. > >> IMHO, we are left with following options: >> 1. Flush all RAM regions of VCPU using __flush_dcache_range() >> upon first run of VCPU >> 2. Implement outer-cache framework for ARM64 and flush all >> caches + outer cache (i.e. L3-cache) upon first run of VCPU > > Do you have specific instructions for flushing the L3 cache only? It's > not clear to me what an outer-cache framework would to on AArch64. It > was added on AArch32 for the L2x0/PL310 which need separate instructions > by physical address for flushing the cache. I really hope we don't get > these again on ARMv8 hardware. > >> 3. Use an alternate version of flush_icache_range() which will >> flush d-cache by PoC instead of PoU. We can also ensure >> that coherent_icache_guest_page() function will be called >> upon Stage2 prefetch aborts only. > > flush_icache_range() is meant to flush only to the PoU to ensure the I-D > cache coherency. I don't think we should change this. We did not mean to change flush_icache_range() instead we suggest to have separate flush_icache_range_kvm() for KVM ARM64 which flushes d-cache to PoC. > > -- > Catalin --Anup
On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote: > On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote: > >> The approach of flushing d-cache by set/way upon first run of VCPU will > >> not work because for set/way operations ARM ARM says: "For set/way > >> operations, and for All (entire cache) operations, the point is defined to be > >> to the next level of caching". In other words, set/way operations work upto > >> point of unification. > > > > I don't understand where you got the idea that set/way operations work > > up to the point of unification. This is incorrect, the set/way > > operations work on the level of cache specified by bits 3:1 in the > > register passed to the DC CISW instruction. For your L3 cache, those > > bits would be 2 (and __flush_dcache_all() implementation does this > > dynamically). > > The L3-cache is not visible to CPU. It is totally independent and transparent > to CPU. OK. But you say that operations like DC CIVAC actually flush the L3? So I don't see it as completely transparent to the CPU. Do you have any configuration bits which would make the L3 completely transparent like always caching even when accesses are non-cacheable and DC ops to PoC ignoring it? Now, back to the idea of outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote: >> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas >> <catalin.marinas@arm.com> wrote: >> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote: >> >> The approach of flushing d-cache by set/way upon first run of VCPU will >> >> not work because for set/way operations ARM ARM says: "For set/way >> >> operations, and for All (entire cache) operations, the point is defined to be >> >> to the next level of caching". In other words, set/way operations work upto >> >> point of unification. >> > >> > I don't understand where you got the idea that set/way operations work >> > up to the point of unification. This is incorrect, the set/way >> > operations work on the level of cache specified by bits 3:1 in the >> > register passed to the DC CISW instruction. For your L3 cache, those >> > bits would be 2 (and __flush_dcache_all() implementation does this >> > dynamically). >> >> The L3-cache is not visible to CPU. It is totally independent and transparent >> to CPU. > > OK. But you say that operations like DC CIVAC actually flush the L3? So > I don't see it as completely transparent to the CPU. It is transparent from CPU perspective. In other words, there is nothing in CPU for controlling/monitoring L3-cache. > > Do you have any configuration bits which would make the L3 completely > transparent like always caching even when accesses are non-cacheable and > DC ops to PoC ignoring it? Actually, L3-cache monitors the types of read/write generated by CPU (i.e. whether the request is cacheable/non-cacheable or whether the request is due to DC ops to PoC, or ...). To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC. > > Now, back to the idea of outer_cache framework for arm64. Does your CPU > have separate instructions for flushing this L3 cache? No, CPU does not have separate instruction for flushing L3-cache. On the other hand, L3-cache has MMIO registers which can be use to explicitly flush L3-cache. > > -- > Catalin --Anup
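Purely as an illustration of what an arm64 outer-cache style hook driven by such MMIO registers could look like, here is a sketch. Every register name, offset and bit in it is invented; the real L3 controller's programming model is not described in this thread.

#include <linux/io.h>

/* Hypothetical MMIO layout of an external L3 controller (made up). */
#define L3C_FLUSH_CTRL		0x00	/* write 1 to start a clean+invalidate of all lines */
#define L3C_FLUSH_STATUS	0x04	/* bit 0 stays set while the flush is in progress */

struct l3_cache {
	void __iomem *base;
};

/* Flush-all hook in the spirit of the 32-bit outer_cache framework. */
static void l3_cache_flush_all(struct l3_cache *l3)
{
	writel_relaxed(1, l3->base + L3C_FLUSH_CTRL);
	while (readl_relaxed(l3->base + L3C_FLUSH_STATUS) & 0x1)
		cpu_relax();
	dsb(sy);	/* complete before any uncached/DMA access relies on it */
}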
On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote: > On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas > <catalin.marinas@arm.com> wrote: > > On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote: > >> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas > >> <catalin.marinas@arm.com> wrote: > >> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote: > >> >> The approach of flushing d-cache by set/way upon first run of VCPU will > >> >> not work because for set/way operations ARM ARM says: "For set/way > >> >> operations, and for All (entire cache) operations, the point is defined to be > >> >> to the next level of caching". In other words, set/way operations work upto > >> >> point of unification. > >> > > >> > I don't understand where you got the idea that set/way operations work > >> > up to the point of unification. This is incorrect, the set/way > >> > operations work on the level of cache specified by bits 3:1 in the > >> > register passed to the DC CISW instruction. For your L3 cache, those > >> > bits would be 2 (and __flush_dcache_all() implementation does this > >> > dynamically). > >> > >> The L3-cache is not visible to CPU. It is totally independent and transparent > >> to CPU. > > > > OK. But you say that operations like DC CIVAC actually flush the L3? So > > I don't see it as completely transparent to the CPU. > > It is transparent from CPU perspective. In other words, there is nothing in > CPU for controlling/monitoring L3-cache. We probably have a different understanding of "transparent". It doesn't look to me like any more transparent than the L1 or L2 cache. Basically, from a software perspective, it needs maintenance. Whether the CPU explicitly asks the L3 cache for this or the L3 cache figures it on its own based on the L1/L2 operations is irrelevant. It would have been transparent if the software didn't need to know about it at all, but it's not the case. > > Do you have any configuration bits which would make the L3 completely > > transparent like always caching even when accesses are non-cacheable and > > DC ops to PoC ignoring it? > > Actually, L3-cache monitors the types of read/write generated by CPU (i.e. > whether the request is cacheable/non-cacheable or whether the request is > due to DC ops to PoC, or ...). > > To answer your query, there is no configuration to have L3 caching when > accesses are non-cacheable and DC ops to PoC. So it's an outer cache with some "improvements" to handle DC ops to PoC. I think it was a pretty bad decision on the hardware side as we really try to get rid of outer caches for many reasons: 1. Non-standard cache flushing operations (MMIO-based) 2. It may require cache maintenance by physical address - something hard to get in a guest OS (unless you virtualise L3 cache maintenance) 3. Are barriers like DSB propagated correctly? Does a DC op to PoC followed by DSB ensure that the L3 drained the cachelines to RAM? I think point 2 isn't required because your L3 detects DC ops to PoC. I hope point 3 is handled correctly (otherwise look how "nice" the mb() macro on arm is to cope with L2x0). If only 1 is left, we don't need the full outer_cache framework but it still needs to be addressed since the assumption is that flush_cache_all (or __flush_dcache_all) flushes all cache levels. These are not used in generic code but are used during kernel booting, KVM and cpuidle drivers. > > Now, back to the idea of outer_cache framework for arm64. Does your CPU > > have separate instructions for flushing this L3 cache? 
> > No, CPU does not have separate instruction for flushing L3-cache. On the > other hand, L3-cache has MMIO registers which can be use to explicitly > flush L3-cache. I guess you use those in your firmware or boot loader since Linux requires clean/invalidated caches at boot (and I plan to push a patch which removes kernel D-cache cleaning during boot to spot such problems early). A cpuidle driver would probably need this as well.
On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote: > On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall > <christoffer.dall@linaro.org> wrote: > > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote: > >> Discussion here is about getting KVM ARM64 working in-presence > >> of an external L3-cache (i.e. not part of CPU). Before starting a VCPU > >> user-space typically loads images to guest RAM so, in-presence of > >> huge L3-cache (few MBs). When the VCPU starts running some of the > >> contents guest RAM will be still in L3-cache and VCPU runs with > >> MMU off (i.e. cacheing off) hence VCPU will bypass L3-cache and > >> see incorrect contents. To solve this problem we need to flush the > >> guest RAM contents before they are accessed by first time by VCPU. > >> > > ok, I'm with you that far. > > > > But is it also not true that we need to decide between: > > > > A.1: Flush the entire guest RAM before running the VCPU > > A.2: Flush the pages as we fault them in > > Yes, thats the decision we have to make. > > > > > And (independently): > > > > B.1: Use __flush_dcache_range > > B.2: Use something else + outer cache framework for arm64 > > This would be __flush_dcache_all() + outer cache flush all. We need to be careful here since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this). So I think we are left to the range operation where the DC ops to PoC would be enough for your L3. An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
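To make the "local to a CPU" point concrete: a set/way based flush-all would have to be cross-called to every online CPU, roughly as sketched below, and even then a dirty line can migrate to a CPU that has already finished its local pass. flush_dcache_all_local() is an assumed helper, not an existing symbol; the by-VA range clean to PoC avoids the problem entirely.

#include <linux/smp.h>

static void flush_dcache_all_local(void *unused)
{
	/*
	 * Would walk CLIDR_EL1/CCSIDR_EL1 and issue DC CISW for every
	 * set/way on the calling CPU (omitted here). The key point is
	 * that it only affects the caches of the CPU executing it.
	 */
}

static void flush_dcache_all_on_all_cpus(void)
{
	/* Run the local flush everywhere and wait; still racy against
	 * cache lines migrating between CPUs around the cross-call. */
	on_each_cpu(flush_dcache_all_local, NULL, 1);
}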
On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas <catalin.marinas@arm.com> wrote: > On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote: >> On Thu, Aug 29, 2013 at 6:23 PM, Catalin Marinas >> <catalin.marinas@arm.com> wrote: >> > On Thu, Aug 29, 2013 at 01:31:43PM +0100, Anup Patel wrote: >> >> On Thu, Aug 29, 2013 at 4:22 PM, Catalin Marinas >> >> <catalin.marinas@arm.com> wrote: >> >> > On Fri, Aug 16, 2013 at 07:57:55AM +0100, Anup Patel wrote: >> >> >> The approach of flushing d-cache by set/way upon first run of VCPU will >> >> >> not work because for set/way operations ARM ARM says: "For set/way >> >> >> operations, and for All (entire cache) operations, the point is defined to be >> >> >> to the next level of caching". In other words, set/way operations work upto >> >> >> point of unification. >> >> > >> >> > I don't understand where you got the idea that set/way operations work >> >> > up to the point of unification. This is incorrect, the set/way >> >> > operations work on the level of cache specified by bits 3:1 in the >> >> > register passed to the DC CISW instruction. For your L3 cache, those >> >> > bits would be 2 (and __flush_dcache_all() implementation does this >> >> > dynamically). >> >> >> >> The L3-cache is not visible to CPU. It is totally independent and transparent >> >> to CPU. >> > >> > OK. But you say that operations like DC CIVAC actually flush the L3? So >> > I don't see it as completely transparent to the CPU. >> >> It is transparent from CPU perspective. In other words, there is nothing in >> CPU for controlling/monitoring L3-cache. > > We probably have a different understanding of "transparent". It doesn't > look to me like any more transparent than the L1 or L2 cache. Basically, > from a software perspective, it needs maintenance. Whether the CPU > explicitly asks the L3 cache for this or the L3 cache figures it on its > own based on the L1/L2 operations is irrelevant. Yes, we have a different understanding regarding "transparent". The goal behind our L3 is to have external cache such that CPU is not aware of it and CPU would work fine even if L3 is turned-off on-the-fly. Further such L3 has to be smarter than normal outer-caches because it will be maintained automatically by observing CPU operations. > > It would have been transparent if the software didn't need to know about > it at all, but it's not the case. Currently, KVM is the only place where we need L3 maintenance because Guest starts with MMU turned-off otherwise we don't need L3 maintenance in kernel for any kind of IO or Subsystem. > >> > Do you have any configuration bits which would make the L3 completely >> > transparent like always caching even when accesses are non-cacheable and >> > DC ops to PoC ignoring it? >> >> Actually, L3-cache monitors the types of read/write generated by CPU (i.e. >> whether the request is cacheable/non-cacheable or whether the request is >> due to DC ops to PoC, or ...). >> >> To answer your query, there is no configuration to have L3 caching when >> accesses are non-cacheable and DC ops to PoC. > > So it's an outer cache with some "improvements" to handle DC ops to PoC. > I think it was a pretty bad decision on the hardware side as we really > try to get rid of outer caches for many reasons: Getting rid off outer-caches (such as in this context) may make sense for Embedded/Low-end systems but for Servers L3-cache makes lot of sense. Claiming this to be a bad decision all depends on what end application you are looking at. > > 1. 
Non-standard cache flushing operations (MMIO-based) > 2. It may require cache maintenance by physical address - something > hard to get in a guest OS (unless you virtualise L3 cache > maintenance) We don't have cache maintenance by physical address in our L3-cache. > 3. Are barriers like DSB propagated correctly? Does a DC op to PoC > followed by DSB ensure that the L3 drained the cachelines to RAM? Yes, DSB works perfectly fine for us with L3. Yes, DC op to PoC followed by DSB works fine with L3 too. > > I think point 2 isn't required because your L3 detects DC ops to PoC. I > hope point 3 is handled correctly (otherwise look how "nice" the mb() > macro on arm is to cope with L2x0). > > If only 1 is left, we don't need the full outer_cache framework but it > still needs to be addressed since the assumption is that flush_cache_all > (or __flush_dcache_all) flushes all cache levels. These are not used in > generic code but are used during kernel booting, KVM and cpuidle > drivers. > >> > Now, back to the idea of outer_cache framework for arm64. Does your CPU >> > have separate instructions for flushing this L3 cache? >> >> No, CPU does not have separate instruction for flushing L3-cache. On the >> other hand, L3-cache has MMIO registers which can be use to explicitly >> flush L3-cache. > > I guess you use those in your firmware or boot loader since Linux > requires clean/invalidated caches at boot (and I plan to push a patch > which removes kernel D-cache cleaning during boot to spot such problems > early). A cpuidle driver would probably need this as well. Yes, cpuidle would be another place where we may need L3-cache maintenance. In other words, if we need to power-off L3-cache from kernel then we need to first flush it. > > -- > Catalin --Anup
On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
>> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>> >> The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so with a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it bypasses the L3-cache and sees incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
>> >
>> > ok, I'm with you that far.
>> >
>> > But is it also not true that we need to decide between:
>> >
>> > A.1: Flush the entire guest RAM before running the VCPU
>> > A.2: Flush the pages as we fault them in
>>
>> Yes, that's the decision we have to make.
>>
>> > And (independently):
>> >
>> > B.1: Use __flush_dcache_range
>> > B.2: Use something else + outer cache framework for arm64
>>
>> This would be __flush_dcache_all() + outer cache flush all.
>
> We need to be careful here, since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
>
> So I think we are left with the range operation, where DC ops to PoC would be enough for your L3.

If __flush_dcache_all() is *local* to a CPU, then I guess DC ops to PoC by range would be the only option.

> An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).

Yes, cpuidle/suspend would definitely require outer cache maintenance.

For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.

> --
> Catalin

--Anup
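For reference, the "DC ops to PoC by range" that the thread converges on is essentially what the kernel's __flush_dcache_area() assembly already does. Below is a simplified C rendering of such a by-VA clean to PoC, assuming CTR_EL0.DminLine describes the smallest D-cache line; it is a sketch for illustration, not a replacement for the existing routine (which uses DC CIVAC, clean+invalidate, rather than DC CVAC).

/* Clean a VA range to the Point of Coherency, one cache line at a time. */
static inline void clean_to_poc(void *addr, unsigned long size)
{
	unsigned long ctr, line;
	unsigned long cur = (unsigned long)addr;
	unsigned long end = cur + size;

	asm volatile("mrs %0, ctr_el0" : "=r" (ctr));
	line = 4UL << ((ctr >> 16) & 0xf);	/* DminLine = log2(words) per line */

	cur &= ~(line - 1);			/* align down to a cache line */
	for (; cur < end; cur += line)
		asm volatile("dc cvac, %0" : : "r" (cur) : "memory");

	asm volatile("dsb sy" : : : "memory");	/* wait for the clean to reach PoC */
}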
On Fri, Aug 30, 2013 at 11:36:30AM +0100, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:14 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Thu, Aug 29, 2013 at 05:02:50PM +0100, Anup Patel wrote:
> >> Actually, the L3-cache monitors the type of read/write generated by the CPU (i.e. whether the request is cacheable/non-cacheable, whether the request is due to DC ops to PoC, or ...).
> >>
> >> To answer your query, there is no configuration to have L3 caching when accesses are non-cacheable and DC ops to PoC.
> >
> > So it's an outer cache with some "improvements" to handle DC ops to PoC. I think it was a pretty bad decision on the hardware side as we really try to get rid of outer caches for many reasons:
>
> Getting rid of outer caches (such as in this context) may make sense for embedded/low-end systems, but for servers an L3-cache makes a lot of sense.
>
> Claiming this to be a bad decision all depends on what end application you are looking at.

It's not a bad decision to have a big L3 cache; that's a very *good* decision for server applications. The bad part is that it is not fully integrated with the CPU (like allowing set/way operations to flush this cache level).

> > 1. Non-standard cache flushing operations (MMIO-based)
> > 2. It may require cache maintenance by physical address - something hard to get in a guest OS (unless you virtualise L3 cache maintenance)
>
> We don't have cache maintenance by physical address in our L3-cache.

Great.

> > 3. Are barriers like DSB propagated correctly? Does a DC op to PoC followed by DSB ensure that the L3 drained the cachelines to RAM?
>
> Yes, DSB works perfectly fine for us with L3.
> Yes, a DC op to PoC followed by DSB works fine with L3 too.

Even better. See my other comment on __flush_dcache_all() use in the kernel/KVM and why I don't think it is suitable (which leaves us with DC ops to PoC in KVM).

> > I think point 2 isn't required because your L3 detects DC ops to PoC. I hope point 3 is handled correctly (otherwise look how "nice" the mb() macro on arm is to cope with L2x0).
> >
> > If only 1 is left, we don't need the full outer_cache framework, but it still needs to be addressed since the assumption is that flush_cache_all (or __flush_dcache_all) flushes all cache levels. These are not used in generic code, but they are used during kernel booting, in KVM and in cpuidle drivers.
> >
> >> > Now, back to the idea of an outer_cache framework for arm64. Does your CPU have separate instructions for flushing this L3 cache?
> >>
> >> No, the CPU does not have a separate instruction for flushing the L3-cache. On the other hand, the L3-cache has MMIO registers which can be used to explicitly flush it.
> >
> > I guess you use those in your firmware or boot loader, since Linux requires clean/invalidated caches at boot (and I plan to push a patch which removes kernel D-cache cleaning during boot to spot such problems early). A cpuidle driver would probably need this as well.
>
> Yes, cpuidle would be another place where we may need L3-cache maintenance. In other words, if we need to power off the L3-cache from the kernel, we need to flush it first.

The problem is if you need to disable the MMU in the kernel: you would need to flush the L3 cache first. Normally this would be done in the firmware with PSCI, but that is most likely not the case for the Applied hardware.
On Fri, Aug 30, 2013 at 11:44:30AM +0100, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
> >> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote:
> >> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
> >> >> The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so with a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it bypasses the L3-cache and sees incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
> >> >
> >> > ok, I'm with you that far.
> >> >
> >> > But is it also not true that we need to decide between:
> >> >
> >> > A.1: Flush the entire guest RAM before running the VCPU
> >> > A.2: Flush the pages as we fault them in
> >>
> >> Yes, that's the decision we have to make.
> >>
> >> > And (independently):
> >> >
> >> > B.1: Use __flush_dcache_range
> >> > B.2: Use something else + outer cache framework for arm64
> >>
> >> This would be __flush_dcache_all() + outer cache flush all.
> >
> > We need to be careful here, since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
> >
> > So I think we are left with the range operation, where DC ops to PoC would be enough for your L3.
>
> If __flush_dcache_all() is *local* to a CPU, then I guess DC ops to PoC by range would be the only option.

Yes. In the (upcoming) ARM ARM for ARMv8 there is a clear note that set/way operations to flush the whole cache must not be used for the maintenance of large buffers, only during power-down/power-up code sequences.

> > An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
>
> Yes, cpuidle/suspend would definitely require outer cache maintenance.
>
> For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.

That's for the KVM guys to decide ;)
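A rough sketch of Anup's "only flush while the Guest MMU is off" idea follows, checking the shadow copy of the guest's SCTLR_EL1 that KVM keeps via vcpu_sys_reg(). The helper name and the vcpu-based signature are illustrative (the existing hook takes a struct kvm, not a vcpu), and the later messages refine the condition to cover the C bit as well.

static bool vcpu_stage1_mmu_on(struct kvm_vcpu *vcpu)
{
	return vcpu_sys_reg(vcpu, SCTLR_EL1) & (1 << 0);	/* SCTLR_EL1.M */
}

static void coherent_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	unsigned long hva = gfn_to_hva(vcpu->kvm, gfn);

	/* I-side maintenance stays as it is today ... */

	/*
	 * The D-side clean to PoC is only needed while the guest runs
	 * uncached; once its stage-1 MMU is on, its own cacheable
	 * accesses hit L1/L2/L3 and see the up-to-date data.
	 */
	if (!vcpu_stage1_mmu_on(vcpu))
		__flush_dcache_area((void *)hva, PAGE_SIZE);
}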
On 2013-08-30 11:44, Anup Patel wrote:
> On Fri, Aug 30, 2013 at 3:22 PM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>> On Fri, Aug 16, 2013 at 07:11:51PM +0100, Anup Patel wrote:
>>> On Fri, Aug 16, 2013 at 11:20 PM, Christoffer Dall <christoffer.dall@linaro.org> wrote:
>>> > On Fri, Aug 16, 2013 at 11:12:08PM +0530, Anup Patel wrote:
>>> >> The discussion here is about getting KVM ARM64 working in the presence of an external L3-cache (i.e. not part of the CPU). Before starting a VCPU, user-space typically loads images into guest RAM, so with a huge L3-cache (a few MBs) some of the contents of guest RAM will still be in the L3-cache when the VCPU starts running. The VCPU runs with the MMU off (i.e. caching off), hence it bypasses the L3-cache and sees incorrect contents. To solve this problem we need to flush the guest RAM contents before they are accessed for the first time by the VCPU.
>>> >
>>> > ok, I'm with you that far.
>>> >
>>> > But is it also not true that we need to decide between:
>>> >
>>> > A.1: Flush the entire guest RAM before running the VCPU
>>> > A.2: Flush the pages as we fault them in
>>>
>>> Yes, that's the decision we have to make.
>>>
>>> > And (independently):
>>> >
>>> > B.1: Use __flush_dcache_range
>>> > B.2: Use something else + outer cache framework for arm64
>>>
>>> This would be __flush_dcache_all() + outer cache flush all.
>>
>> We need to be careful here, since the __flush_dcache_all() operation uses cache maintenance by set/way and these are *local* to a CPU (IOW not broadcast). Do you have any guarantee that dirty cache lines don't migrate between CPUs and __flush_dcache_all() wouldn't miss them? Architecturally we don't, so this is not a safe operation that would guarantee L1 cache flushing (we probably need to revisit some of the __flush_dcache_all() calls in KVM, I haven't looked into this).
>>
>> So I think we are left with the range operation, where DC ops to PoC would be enough for your L3.
>
> If __flush_dcache_all() is *local* to a CPU, then I guess DC ops to PoC by range would be the only option.
>
>> An outer cache flush all is probably only needed for cpuidle/suspend (the booting part should be handled by the boot loader).
>
> Yes, cpuidle/suspend would definitely require outer cache maintenance.
>
> For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.

What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?

M.
On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
> On 2013-08-30 11:44, Anup Patel wrote:
> > For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.
>
> What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?

I would think so; it probably cares about how the transactions are presented at the bus level by the CPU.
On 2013-08-30 15:04, Catalin Marinas wrote:
> On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 11:44, Anup Patel wrote:
>> > For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.
>>
>> What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
>
> I would think so; it probably cares about how the transactions are presented at the bus level by the CPU.

That'd make sense indeed. So we need to track SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache. Mumble mumble...

M.
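The check Marc describes could look something like the sketch below. The bit positions are architectural (SCTLR_EL1: M = bit 0, C = bit 2, I = bit 12); the helper name is illustrative, and the I bit would matter for the separate i-cache side rather than for the clean to PoC shown here.

#define SCTLR_EL1_M	(1UL << 0)	/* stage-1 MMU enable */
#define SCTLR_EL1_C	(1UL << 2)	/* data/unified cache enable */
#define SCTLR_EL1_I	(1UL << 12)	/* instruction cache enable */

static bool vcpu_guest_data_cached(struct kvm_vcpu *vcpu)
{
	unsigned long sctlr = vcpu_sys_reg(vcpu, SCTLR_EL1);

	/* With either M or C clear, guest data accesses bypass the caches. */
	return (sctlr & (SCTLR_EL1_M | SCTLR_EL1_C)) ==
	       (SCTLR_EL1_M | SCTLR_EL1_C);
}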
On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
> On 2013-08-30 15:04, Catalin Marinas wrote:
> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
> >> On 2013-08-30 11:44, Anup Patel wrote:
> >> > For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.
> >>
> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
> >
> > I would think so; it probably cares about how the transactions are presented at the bus level by the CPU.
>
> That'd make sense indeed. So we need to track SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.

<horrible hack>
Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR?
</horrible hack>

Will
Hi Will,

On Fri, Aug 30, 2013 at 8:00 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 15:04, Catalin Marinas wrote:
>> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> >> On 2013-08-30 11:44, Anup Patel wrote:
>> >> > For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.
>> >>
>> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
>> >
>> > I would think so; it probably cares about how the transactions are presented at the bus level by the CPU.
>>
>> That'd make sense indeed. So we need to track SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
>
> <horrible hack>
> Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR?
> </horrible hack>

We would have to use an ATS12xxx operation, which can only be executed from EL2 or EL3. The host kernel runs at EL1, hence we would need an HVC call to execute the ATS12xxx operation at EL2, which in turn means a world-switch overhead just to execute that one operation.

> Will

--Anup
Hi Will,

On 2013-08-30 15:30, Will Deacon wrote:
> On Fri, Aug 30, 2013 at 03:22:11PM +0100, Marc Zyngier wrote:
>> On 2013-08-30 15:04, Catalin Marinas wrote:
>> > On Fri, Aug 30, 2013 at 02:21:57PM +0100, Marc Zyngier wrote:
>> >> On 2013-08-30 11:44, Anup Patel wrote:
>> >> > For KVM, we can avoid flushing the d-cache to PoC every time in coherent_icache_guest_page() by only doing it when the Guest MMU is turned off. This may reduce the performance penalty.
>> >>
>> >> What about the I and C bits in SCTLR_EL1? Does L3 also honour these bits?
>> >
>> > I would think so; it probably cares about how the transactions are presented at the bus level by the CPU.
>>
>> That'd make sense indeed. So we need to track SCTLR_EL1.{I,C,M} in order to find out whether or not we need to clean the cache.
>
> <horrible hack>
> Could you use a dummy ATS_* operation, then read the cacheability attributes out of the PAR?
> </horrible hack>

That'd be doable, and maybe not that expensive (using ATS1, the Stage-1 translation is probably still cached after the translation fault).

Actually, I suspect this is more correct than just looking at SCTLR_EL1, as it doesn't assume that the guest always uses cacheable memory once it enables the caches.

Shame HPFAR_EL2 doesn't report this information though...

M.
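A very rough sketch of this "dummy AT, then read PAR" idea is shown below, under the assumption that it runs at EL2 while the guest's stage-1 context is loaded (e.g. from the world-switch/abort path), so AT S12E1R performs the guest's own stage-1+2 read translation. The helper name and the attribute test are illustrative simplifications, not a worked-out implementation.

static bool guest_va_is_cacheable(unsigned long gva)
{
	unsigned long par;

	asm volatile("at s12e1r, %0" : : "r" (gva));
	asm volatile("isb" : : : "memory");
	asm volatile("mrs %0, par_el1" : "=r" (par));

	if (par & 1)			/* PAR_EL1.F: translation aborted */
		return false;		/* be conservative: treat as uncached */

	/*
	 * PAR_EL1[63:56] holds MAIR-style attributes: 0x0x is Device,
	 * 0x44 is Normal Non-cacheable. Anything else is treated as
	 * cacheable for the purpose of skipping the clean to PoC.
	 */
	par >>= 56;
	return (par & 0xf0) != 0x00 && par != 0x44;
}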
diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h
index a5f28e2..a71a9bf 100644
--- a/arch/arm64/include/asm/kvm_arm.h
+++ b/arch/arm64/include/asm/kvm_arm.h
@@ -60,10 +60,12 @@
 /*
  * The bits we set in HCR:
  * RW:		64bit by default, can be overriden for 32bit VMs
+ * TVM:		Trap writes to VM registers (until MMU is on)
  * TAC:		Trap ACTLR
  * TSC:		Trap SMC
  * TSW:		Trap cache operations by set/way
  * TWI:		Trap WFI
+ * DC:		Default Cacheable (until MMU is on)
  * TIDCP:	Trap L2CTLR/L2ECTLR
  * BSU_IS:	Upgrade barriers to the inner shareable domain
  * FB:		Force broadcast of all maintainance operations
@@ -74,7 +76,7 @@
  */
 #define HCR_GUEST_FLAGS (HCR_TSC | HCR_TSW | HCR_TWI | HCR_VM | HCR_BSU_IS | \
 			 HCR_FB | HCR_TAC | HCR_AMO | HCR_IMO | HCR_FMO | \
-			 HCR_SWIO | HCR_TIDCP | HCR_RW)
+			 HCR_DC | HCR_TVM | HCR_SWIO | HCR_TIDCP | HCR_RW)
 #define HCR_VIRT_EXCP_MASK (HCR_VA | HCR_VI | HCR_VF)

 /* Hyp System Control Register (SCTLR_EL2) bits */
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 9492360..6d81d87 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -121,6 +121,42 @@ done:
 }

 /*
+ * Generic accessor for VM registers. Only called as long as HCR_TVM
+ * is set.
+ */
+static bool access_vm_reg(struct kvm_vcpu *vcpu,
+			  const struct sys_reg_params *p,
+			  const struct sys_reg_desc *r)
+{
+	BUG_ON(!p->is_write);
+
+	vcpu_sys_reg(vcpu, r->reg) = *vcpu_reg(vcpu, p->Rt);
+	return true;
+}
+
+/*
+ * SCTLR_EL1 accessor. Only called as long as HCR_TVM is set. If the
+ * guest enables the MMU, we stop trapping the VM sys_regs and leave
+ * it in complete control of the caches.
+ */
+static bool access_sctlr_el1(struct kvm_vcpu *vcpu,
+			     const struct sys_reg_params *p,
+			     const struct sys_reg_desc *r)
+{
+	unsigned long val;
+
+	BUG_ON(!p->is_write);
+
+	val = *vcpu_reg(vcpu, p->Rt);
+	vcpu_sys_reg(vcpu, r->reg) = val;
+
+	if (val & (1 << 0))	/* MMU Enabled? */
+		vcpu->arch.hcr_el2 &= ~(HCR_DC | HCR_TVM);
+
+	return true;
+}
+
+/*
  * We could trap ID_DFR0 and tell the guest we don't support performance
  * monitoring. Unfortunately the patch to make the kernel check ID_DFR0 was
  * NAKed, so it will read the PMCR anyway.
@@ -185,32 +221,32 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	  NULL, reset_mpidr, MPIDR_EL1 },
 	/* SCTLR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b0000), Op2(0b000),
-	  NULL, reset_val, SCTLR_EL1, 0x00C50078 },
+	  access_sctlr_el1, reset_val, SCTLR_EL1, 0x00C50078 },
 	/* CPACR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0001), CRm(0b0000), Op2(0b010),
 	  NULL, reset_val, CPACR_EL1, 0 },
 	/* TTBR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b000),
-	  NULL, reset_unknown, TTBR0_EL1 },
+	  access_vm_reg, reset_unknown, TTBR0_EL1 },
 	/* TTBR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b001),
-	  NULL, reset_unknown, TTBR1_EL1 },
+	  access_vm_reg, reset_unknown, TTBR1_EL1 },
 	/* TCR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0010), CRm(0b0000), Op2(0b010),
-	  NULL, reset_val, TCR_EL1, 0 },
+	  access_vm_reg, reset_val, TCR_EL1, 0 },
 	/* AFSR0_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b000),
-	  NULL, reset_unknown, AFSR0_EL1 },
+	  access_vm_reg, reset_unknown, AFSR0_EL1 },
 	/* AFSR1_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0001), Op2(0b001),
-	  NULL, reset_unknown, AFSR1_EL1 },
+	  access_vm_reg, reset_unknown, AFSR1_EL1 },
 	/* ESR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0101), CRm(0b0010), Op2(0b000),
-	  NULL, reset_unknown, ESR_EL1 },
+	  access_vm_reg, reset_unknown, ESR_EL1 },
 	/* FAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b0110), CRm(0b0000), Op2(0b000),
-	  NULL, reset_unknown, FAR_EL1 },
+	  access_vm_reg, reset_unknown, FAR_EL1 },
 	/* PMINTENSET_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1001), CRm(0b1110), Op2(0b001),
@@ -221,17 +257,17 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 	/* MAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0010), Op2(0b000),
-	  NULL, reset_unknown, MAIR_EL1 },
+	  access_vm_reg, reset_unknown, MAIR_EL1 },
 	/* AMAIR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1010), CRm(0b0011), Op2(0b000),
-	  NULL, reset_amair_el1, AMAIR_EL1 },
+	  access_vm_reg, reset_amair_el1, AMAIR_EL1 },
 	/* VBAR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1100), CRm(0b0000), Op2(0b000),
 	  NULL, reset_val, VBAR_EL1, 0 },
 	/* CONTEXTIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b001),
-	  NULL, reset_val, CONTEXTIDR_EL1, 0 },
+	  access_vm_reg, reset_val, CONTEXTIDR_EL1, 0 },
 	/* TPIDR_EL1 */
 	{ Op0(0b11), Op1(0b000), CRn(0b1101), CRm(0b0000), Op2(0b100),
 	  NULL, reset_unknown, TPIDR_EL1 },