Message ID: 20240819070827.3620020-4-kirill.shutemov@linux.intel.com
State: Superseded
Series: x86: Reduce code duplication on page table initialization
On Mon, 2024-08-19 at 10:08 +0300, Kirill A. Shutemov wrote:
> The init_transition_pgtable() function sets up transitional page tables.
> It ensures that the relocate_kernel() function is present in the
> identity mapping at the same location as in the kernel page tables.
> relocate_kernel() switches to the identity mapping, and the function
> must be present at the same location in the virtual address space before
> and after switching page tables.
>
> init_transition_pgtable() maps a copy of relocate_kernel() in
> image->control_code_page at the relocate_kernel() virtual address, but
> the original physical address of relocate_kernel() would also work.
>
> It is safe to use the original relocate_kernel() physical address: it
> cannot be overwritten until swap_pages() is called, and the
> relocate_kernel() virtual address will not be used by then.
>
> Map the original relocate_kernel() at the relocate_kernel() virtual
> address in the identity mapping. It is preparation to replace the
> init_transition_pgtable() implementation with a call to
> kernel_ident_mapping_init().
>
> Note that while relocate_kernel() switches to the identity mapping, it
> does not flush global TLB entries (CR4.PGE is not cleared). This means
> that in most cases, the kernel still runs relocate_kernel() from the
> original physical address before the change.
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  arch/x86/kernel/machine_kexec_64.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 9c9ac606893e..645690e81c2d 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
>  	pte_t *pte;
>
>  	vaddr = (unsigned long)relocate_kernel;
> -	paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
> +	paddr = __pa(relocate_kernel);
>  	pgd += pgd_index(vaddr);
>  	if (!pgd_present(*pgd)) {
>  		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);

IIUC, this breaks KEXEC_JUMP (image->preserve_context is true).

relocate_kernel() first saves a couple of registers and some other data,
such as the PA of the swap page, to the control page. Note that
VA_CONTROL_PAGE is used here to access the control page, so this data is
saved to the control page:

	SYM_CODE_START_NOALIGN(relocate_kernel)
		UNWIND_HINT_END_OF_STACK
		ANNOTATE_NOENDBR
		/*
		 * %rdi indirection_page
		 * %rsi page_list
		 * %rdx start address
		 * %rcx preserve_context
		 * %r8  bare_metal
		 */

		...

		movq	PTR(VA_CONTROL_PAGE)(%rsi), %r11
		movq	%rsp, RSP(%r11)
		movq	%cr0, %rax
		movq	%rax, CR0(%r11)
		movq	%cr3, %rax
		movq	%rax, CR3(%r11)
		movq	%cr4, %rax
		movq	%rax, CR4(%r11)

		...

		/*
		 * get physical address of control page now
		 * this is impossible after page table switch
		 */
		movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8

		/* get physical address of page table now too */
		movq	PTR(PA_TABLE_PAGE)(%rsi), %r9

		/* get physical address of swap page now */
		movq	PTR(PA_SWAP_PAGE)(%rsi), %r10

		/* save some information for jumping back */
		movq	%r9, CP_PA_TABLE_PAGE(%r11)
		movq	%r10, CP_PA_SWAP_PAGE(%r11)
		movq	%rdi, CP_PA_BACKUP_PAGES_MAP(%r11)

		...

And after jumping back from the second kernel, relocate_kernel() tries to
restore the saved data:

	...

	/* get the re-entry point of the peer system */
	movq	0(%rsp), %rbp
	leaq	relocate_kernel(%rip), %r8    <--------- (*)
	movq	CP_PA_SWAP_PAGE(%r8), %r10
	movq	CP_PA_BACKUP_PAGES_MAP(%r8), %rdi
	movq	CP_PA_TABLE_PAGE(%r8), %rax
	movq	%rax, %cr3
	lea	PAGE_SIZE(%r8), %rsp
	call	swap_pages
	movq	$virtual_mapped, %rax
	pushq	%rax
	ANNOTATE_UNRET_SAFE
	ret
	int3
	SYM_CODE_END(identity_mapped)

Note that the above code (*) uses the VA of relocate_kernel() to access
the control page. IIUC, that means if we map the VA of relocate_kernel()
to the original PA where the relocate_kernel() code resides, then the
above code will never be able to read that data back, since it was saved
to the control page.

Did I miss anything?
On Mon, Aug 19, 2024 at 11:16:52AM +0000, Huang, Kai wrote:
> On Mon, 2024-08-19 at 10:08 +0300, Kirill A. Shutemov wrote:
> > The init_transition_pgtable() function sets up transitional page tables.
> >
> > [...]
>
> IIUC, this breaks KEXEC_JUMP (image->preserve_context is true).
>
> [...]
>
> Note that the above code (*) uses the VA of relocate_kernel() to access
> the control page. IIUC, that means if we map the VA of relocate_kernel()
> to the original PA where the relocate_kernel() code resides, then the
> above code will never be able to read that data back, since it was saved
> to the control page.
>
> Did I miss anything?

Note that the relocate_kernel() usage at (*) is inside identity_mapped().
We run from the identity mapping there. Nothing changed in the identity
mapping around relocate_kernel(); only the top mapping (at
__START_KERNEL_map) is affected.

But I didn't test the kexec jump thing. Do you (or anybody else) have a
setup to test it?
On Mon, 2024-08-19 at 14:57 +0300, kirill.shutemov@linux.intel.com wrote:
> On Mon, Aug 19, 2024 at 11:16:52AM +0000, Huang, Kai wrote:
> >
> > [...]
> >
> > Did I miss anything?
>
> Note that the relocate_kernel() usage at (*) is inside identity_mapped().
> We run from the identity mapping there. Nothing changed in the identity
> mapping around relocate_kernel(); only the top mapping (at
> __START_KERNEL_map) is affected.

Yes, but before this patch the VA of relocate_kernel() is mapped to the
copied one, which resides in the control page:

	control_page = page_address(image->control_code_page) + PAGE_SIZE;
	__memcpy(control_page, relocate_kernel, KEXEC_CONTROL_CODE_MAX_SIZE);

	page_list[PA_CONTROL_PAGE] = virt_to_phys(control_page);
	page_list[VA_CONTROL_PAGE] = (unsigned long)control_page;

So (*) can actually access the control page, IIUC.

Now, if we change to map the VA of relocate_kernel() to the original one,
then (*) won't be able to access the control page.

> But I didn't test the kexec jump thing. Do you (or anybody else) have a
> setup to test it?

No, I don't know how to test it either; this is just my understanding of
the code :-(

Git blame says Ying is the original author, so +Ying here, hoping he can
provide some insight.

Anyway, my opinion is that we should do patch 4 first but still map the VA
of relocate_kernel() to the control page, so there is no functional
change. This patchset is about reducing duplicated code anyway.
On Mon, Aug 19, 2024 at 12:39:23PM +0000, Huang, Kai wrote:
> On Mon, 2024-08-19 at 14:57 +0300, kirill.shutemov@linux.intel.com wrote:
> >
> > [...]
>
> So (*) can actually access the control page, IIUC.
>
> Now, if we change to map the VA of relocate_kernel() to the original one,
> then (*) won't be able to access the control page.

No, it will still be able to access the control page.

So we call relocate_kernel() in normal kernel text (within
__START_KERNEL_map).

relocate_kernel() switches to the identity mapping; the VA is still the
same.

relocate_kernel() jumps to identity_mapped() in the control page:

	/*
	 * get physical address of control page now
	 * this is impossible after page table switch
	 */
	movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8

	...

	/* jump to identity mapped page */
	addq	$(identity_mapped - relocate_kernel), %r8
	pushq	%r8
	ANNOTATE_UNRET_SAFE
	ret

The ADDQ finds the offset of identity_mapped() in the control page.

identity_mapped() finds the start of the control page from the *relative*
position of relocate_kernel() to the current RIP in the control page:

	leaq	relocate_kernel(%rip), %r8

It looks like this in my kernel binary:

	lea	-0xfa(%rip),%r8

What PA is mapped at the normal kernel-text VA of relocate_kernel() has
zero effect on the calculation.

Does it make sense?
> > So (*) can actually access the control page, IIUC.
> >
> > Now, if we change to map the VA of relocate_kernel() to the original one,
> > then (*) won't be able to access the control page.
>
> No, it will still be able to access the control page.
>
> So we call relocate_kernel() in normal kernel text (within
> __START_KERNEL_map).
>
> relocate_kernel() switches to the identity mapping; the VA is still the
> same.
>
> relocate_kernel() jumps to identity_mapped() in the control page:
>
> 	/*
> 	 * get physical address of control page now
> 	 * this is impossible after page table switch
> 	 */
> 	movq	PTR(PA_CONTROL_PAGE)(%rsi), %r8
>
> 	...
>
> 	/* jump to identity mapped page */
> 	addq	$(identity_mapped - relocate_kernel), %r8
> 	pushq	%r8
> 	ANNOTATE_UNRET_SAFE
> 	ret
>
> The ADDQ finds the offset of identity_mapped() in the control page.
>
> identity_mapped() finds the start of the control page from the *relative*
> position of relocate_kernel() to the current RIP in the control page:
>
> 	leaq	relocate_kernel(%rip), %r8
>
> It looks like this in my kernel binary:
>
> 	lea	-0xfa(%rip),%r8

Ah, I see. I missed the *relative* addressing. :-)

> What PA is mapped at the normal kernel-text VA of relocate_kernel() has
> zero effect on the calculation.

Yeah.

> Does it make sense?

Yes. Thanks for the explanation.

At a later time:

	call	swap_pages
	movq	$virtual_mapped, %rax    <---- (1)
	pushq	%rax
	ANNOTATE_UNRET_SAFE
	ret    <---- (2)

(1) will load the VA, which has __START_KERNEL_map, into %rax, and after
(2) the kernel will run at the VA of the original relocate_kernel(), which
maps to the PA of the original relocate_kernel(). But I think the memory
page of the original relocate_kernel() won't get corrupted after returning
from the second kernel, so it should be safe to use?
On Tue, Aug 20, 2024 at 11:06:34AM +0000, Huang, Kai wrote:
> At a later time:
>
> 	call	swap_pages
> 	movq	$virtual_mapped, %rax    <---- (1)
> 	pushq	%rax
> 	ANNOTATE_UNRET_SAFE
> 	ret    <---- (2)
>
> (1) will load the VA, which has __START_KERNEL_map, into %rax, and after
> (2) the kernel will run at the VA of the original relocate_kernel(), which
> maps to the PA of the original relocate_kernel(). But I think the memory
> page of the original relocate_kernel() won't get corrupted after returning
> from the second kernel, so it should be safe to use?

Yes.
On Tue, 2024-08-20 at 14:14 +0300, kirill.shutemov@linux.intel.com wrote:
> On Tue, Aug 20, 2024 at 11:06:34AM +0000, Huang, Kai wrote:
> >
> > [...]
>
> Yes.

Reviewed-by: Kai Huang <kai.huang@intel.com>
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 9c9ac606893e..645690e81c2d 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -157,7 +157,7 @@ static int init_transition_pgtable(struct kimage *image, pgd_t *pgd)
 	pte_t *pte;
 
 	vaddr = (unsigned long)relocate_kernel;
-	paddr = __pa(page_address(image->control_code_page)+PAGE_SIZE);
+	paddr = __pa(relocate_kernel);
 	pgd += pgd_index(vaddr);
 	if (!pgd_present(*pgd)) {
 		p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
The init_transition_pgtable() function sets up transitional page tables.
It ensures that the relocate_kernel() function is present in the identity
mapping at the same location as in the kernel page tables.
relocate_kernel() switches to the identity mapping, and the function must
be present at the same location in the virtual address space before and
after switching page tables.

init_transition_pgtable() maps a copy of relocate_kernel() in
image->control_code_page at the relocate_kernel() virtual address, but the
original physical address of relocate_kernel() would also work.

It is safe to use the original relocate_kernel() physical address: it
cannot be overwritten until swap_pages() is called, and the
relocate_kernel() virtual address will not be used by then.

Map the original relocate_kernel() at the relocate_kernel() virtual
address in the identity mapping. This is preparation for replacing the
init_transition_pgtable() implementation with a call to
kernel_ident_mapping_init().

Note that while relocate_kernel() switches to the identity mapping, it
does not flush global TLB entries (CR4.PGE is not cleared). This means
that in most cases, the kernel still runs relocate_kernel() from the
original physical address before the change.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/machine_kexec_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)