Message ID | 20170609170100.3599-4-alex.bennee@linaro.org |
---|---|
State | New |
Headers | show |
Series | debug patch for lookup-ptr hang | expand |
On 06/09/2017 10:01 AM, Alex Bennée wrote: > THIS IS A DEBUG PATCH DO NOT MERGE > > I include all the comments to show my working. I was trying to > isolate which instructions cause the problem. It turns out it is the > RET instruction. I don't understand why because AFAICT it is a > pretty much a BR instruction. Yeah, same thing for Alpha. It has been my guess that not chaining through RET means that we get back to the main loop regularly and often, letting interrupts be recognized in a timely manner. I can't figure out why that would be, however, since interrupts *ought* to be setting icount_decr, and the TB to which we chain *is* checking that to return to the main loop. Since changing the timing affects the outcome (e.g. -d exec), it follows that this *must* be some sort of race condition. But since this still happens with single-threaded mode, I can't imagine what sort of race condition it might be. More data points. I removed the tb_htable_lookup, and that by itself is enough to fix Alpha booting. But it doesn't help the aarch64 kernel+image that I have. Which does still boot with -d nochain (which, along with disabling goto_tb chaining, also disables all goto_ptr). Not really sure where to go from here. r~
Richard Henderson <rth@twiddle.net> writes: > On 06/09/2017 10:01 AM, Alex Bennée wrote: >> THIS IS A DEBUG PATCH DO NOT MERGE >> >> I include all the comments to show my working. I was trying to >> isolate which instructions cause the problem. It turns out it is the >> RET instruction. I don't understand why because AFAICT it is a >> pretty much a BR instruction. > > Yeah, same thing for Alpha. > > It has been my guess that not chaining through RET means that we get > back to the main loop regularly and often, letting interrupts be > recognized in a timely manner. > > I can't figure out why that would be, however, since interrupts > *ought* to be setting icount_decr, and the TB to which we chain *is* > checking that to return to the main loop. Indeed - if that was broken a lot more stuff wouldn't work. > Since changing the timing affects the outcome (e.g. -d exec), it > follows that this *must* be some sort of race condition. But since > this still happens with single-threaded mode, I can't imagine what > sort of race condition it might be. Apart from timer expiry I can't think what other interactions the other threads have on the main TCG thread. I guess there is IO but my test hangs way before the kernel starts poking the disk. Is there an interaction between IRQs and QEMU's serial driver? > > More data points. I removed the tb_htable_lookup, and that by itself > is enough to fix Alpha booting. But it doesn't help the aarch64 > kernel+image that I have. Which does still boot with -d nochain > (which, along with disabling goto_tb chaining, also disables all > goto_ptr). I wonder what is different about your aarch64 image and mine then? Because mine works just with suppressing the chaining for RET. > > Not really sure where to go from here. I would agree with Emilio that we revert but I can't quite shake the feeling we are missing an underlying problem. Would just skipping the htable lookup (but keeping the tb_jmp_cache) be an OK fix for now? Have we just been lucky that whatever mechanism causes the "hang" wasn't due to? > > > r~ -- Alex Bennée
On 06/10/2017 01:51 AM, Alex Bennée wrote: > > Richard Henderson <rth@twiddle.net> writes: > >> On 06/09/2017 10:01 AM, Alex Bennée wrote: >>> THIS IS A DEBUG PATCH DO NOT MERGE >>> >>> I include all the comments to show my working. I was trying to >>> isolate which instructions cause the problem. It turns out it is the >>> RET instruction. I don't understand why because AFAICT it is a >>> pretty much a BR instruction. >> >> Yeah, same thing for Alpha. >> >> It has been my guess that not chaining through RET means that we get >> back to the main loop regularly and often, letting interrupts be >> recognized in a timely manner. >> >> I can't figure out why that would be, however, since interrupts >> *ought* to be setting icount_decr, and the TB to which we chain *is* >> checking that to return to the main loop. > > Indeed - if that was broken a lot more stuff wouldn't work. > >> Since changing the timing affects the outcome (e.g. -d exec), it >> follows that this *must* be some sort of race condition. But since >> this still happens with single-threaded mode, I can't imagine what >> sort of race condition it might be. > > Apart from timer expiry I can't think what other interactions the other > threads have on the main TCG thread. I guess there is IO but my test > hangs way before the kernel starts poking the disk. Is there an > interaction between IRQs and QEMU's serial driver? The Alpha hang appears to be timer expiry. In that it happens as soon as the kernel spawns some kthreads to finish up the boot process. The kernel then sits in the idle loop for an unreasonably long time. But, bizarrely, it will complete the boot eventually. But it takes ~5 minutes to do so, when we ought to be able to boot to prompt in seconds. >> More data points. I removed the tb_htable_lookup, and that by itself >> is enough to fix Alpha booting. But it doesn't help the aarch64 >> kernel+image that I have. Which does still boot with -d nochain >> (which, along with disabling goto_tb chaining, also disables all >> goto_ptr). > > I wonder what is different about your aarch64 image and mine then? > Because mine works just with suppressing the chaining for RET. Oh I just tried -d nochain because it doesn't require source modification. >> Not really sure where to go from here. > > I would agree with Emilio that we revert but I can't quite shake the > feeling we are missing an underlying problem. Would just skipping the > htable lookup (but keeping the tb_jmp_cache) be an OK fix for now? I agree. It seems like there's some real problem that this is uncovering. Dropping the htable lookup is certainly ok by me. If that's enough to un-stick your regression testing for aarch64 guest. r~
On Sat, Jun 10, 2017 at 09:59:19 -0700, Richard Henderson wrote: > On 06/10/2017 01:51 AM, Alex Bennée wrote: > >I would agree with Emilio that we revert but I can't quite shake the > >feeling we are missing an underlying problem. Would just skipping the > >htable lookup (but keeping the tb_jmp_cache) be an OK fix for now? > > I agree. It seems like there's some real problem that this is uncovering. > > Dropping the htable lookup is certainly ok by me. If that's enough to > un-stick your regression testing for aarch64 guest. Unfortunately not checking the htable isn't enough to fix it for me. I spent a few hours on Friday trying (and failing) to understand what's going on. Regarding interrupts, I see *more* TB_EXIT_INTERRUPT exits than without the patch, so my current best guess is that this is a livelock kind of thing. I'll keep digging, but regardless I think we should revert e75449a346 ("target/aarch64: optimize indirect branches") ASAP, and then reenable the feature ("revert the revert") once we get to the bottom of this. E.
Emilio G. Cota <cota@braap.org> writes: > On Sat, Jun 10, 2017 at 09:59:19 -0700, Richard Henderson wrote: >> On 06/10/2017 01:51 AM, Alex Bennée wrote: >> >I would agree with Emilio that we revert but I can't quite shake the >> >feeling we are missing an underlying problem. Would just skipping the >> >htable lookup (but keeping the tb_jmp_cache) be an OK fix for now? >> >> I agree. It seems like there's some real problem that this is uncovering. >> >> Dropping the htable lookup is certainly ok by me. If that's enough to >> un-stick your regression testing for aarch64 guest. > > Unfortunately not checking the htable isn't enough to fix it for me. Interesting. It was for me on my more complex command line (i.e. with network and disk hardware) but while checking on my plain kernel+initrd image I can reproduce the "hang" purely with the jmp_cache lookup.... > I spent a few hours on Friday trying (and failing) to understand > what's going on. Regarding interrupts, I see *more* TB_EXIT_INTERRUPT > exits than without the patch, so my current best guess is that this is > a livelock kind of thing. > > I'll keep digging, but regardless I think we should revert > e75449a346 ("target/aarch64: optimize indirect branches") ASAP, and > then reenable the feature ("revert the revert") once we get to the > bottom of this. > > E. -- Alex Bennée
On 06/13/2017 03:53 PM, Emilio G. Cota wrote: > The appended fixes it for me. Can you please test? > [ apply with `git am --scissors' ] > > Thanks, > > Emilio > > ---- 8< ---- > > Commit e75449a3 ("target/aarch64: optimize indirect branches") causes > a regression by which aarch64 guests freeze under TCG with -smp > 1, > even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). > > I isolated the problem to the MSR handler. This patch forces an exit > after the handler is executed, which fixes the regression. Why would that be? The cpu_get_tb_cpu_state within helper_lookup_tb_ptr is supposed to read the new state that the msr handler would have installed. r~
Emilio G. Cota <cota@braap.org> writes: > The appended fixes it for me. Can you please test? > [ apply with `git am --scissors' ] > > Thanks, > > Emilio > > ---- 8< ---- > > Commit e75449a3 ("target/aarch64: optimize indirect branches") causes > a regression by which aarch64 guests freeze under TCG with -smp > 1, > even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). > > I isolated the problem to the MSR handler. This patch forces an exit > after the handler is executed, which fixes the regression. > > Signed-off-by: Emilio G. Cota <cota@braap.org> Tested-by: Alex Bennée <alex.bennee@linaro.org> But what exactly is the mechanism here? DISAS_UPDATE should have ensured that the PC was updated before we get to the helper. Is this a case of msr_i_pstate somehow getting missed or not causing a flag update which confuses the next TB calculation? > --- > target/arm/translate-a64.c | 6 +++++- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c > index 860e279..5a609a0 100644 > --- a/target/arm/translate-a64.c > +++ b/target/arm/translate-a64.c > @@ -1422,7 +1422,7 @@ static void handle_msr_i(DisasContext *s, uint32_t insn, > gen_helper_msr_i_pstate(cpu_env, tcg_op, tcg_imm); > tcg_temp_free_i32(tcg_imm); > tcg_temp_free_i32(tcg_op); > - s->is_jmp = DISAS_UPDATE; > + s->is_jmp = DISAS_EXIT; > break; > } > default: > @@ -11362,6 +11362,10 @@ void gen_intermediate_code_a64(ARMCPU *cpu, TranslationBlock *tb) > case DISAS_NEXT: > gen_goto_tb(dc, 1, dc->pc); > break; > + case DISAS_EXIT: > + gen_a64_set_pc_im(dc->pc); > + tcg_gen_exit_tb(0); > + break; > default: > case DISAS_UPDATE: > gen_a64_set_pc_im(dc->pc); -- Alex Bennée
On 14/06/2017 06:48, Richard Henderson wrote: >> >> Commit e75449a3 ("target/aarch64: optimize indirect branches") causes >> a regression by which aarch64 guests freeze under TCG with -smp > 1, >> even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). >> >> I isolated the problem to the MSR handler. This patch forces an exit >> after the handler is executed, which fixes the regression. > > Why would that be? The cpu_get_tb_cpu_state within helper_lookup_tb_ptr > is supposed to read the new state that the msr handler would have > installed. Could some of these cause an interrupt, or some other change in the cpu_exec flow? Thanks, Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > On 14/06/2017 06:48, Richard Henderson wrote: >>> >>> Commit e75449a3 ("target/aarch64: optimize indirect branches") causes >>> a regression by which aarch64 guests freeze under TCG with -smp > 1, >>> even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). >>> >>> I isolated the problem to the MSR handler. This patch forces an exit >>> after the handler is executed, which fixes the regression. >> >> Why would that be? The cpu_get_tb_cpu_state within helper_lookup_tb_ptr >> is supposed to read the new state that the msr handler would have >> installed. > > Could some of these cause an interrupt, or some other change in the > cpu_exec flow? Well what I was observing was the secondary_start_kernel stalling and leaving the main cpu spinning. The msr is actually: local_irq_enable(); local_fiq_enable(); Which I assume would re-enable IRQs if they are ready to go. However I guess if we sink into our cpu_idle without exiting the main loop we never set any pending IRQs? > > Thanks, > > Paolo -- Alex Bennée
On 14/06/2017 13:45, Alex Bennée wrote: > > Paolo Bonzini <pbonzini@redhat.com> writes: > >> On 14/06/2017 06:48, Richard Henderson wrote: >>>> >>>> Commit e75449a3 ("target/aarch64: optimize indirect branches") causes >>>> a regression by which aarch64 guests freeze under TCG with -smp > 1, >>>> even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). >>>> >>>> I isolated the problem to the MSR handler. This patch forces an exit >>>> after the handler is executed, which fixes the regression. >>> >>> Why would that be? The cpu_get_tb_cpu_state within helper_lookup_tb_ptr >>> is supposed to read the new state that the msr handler would have >>> installed. >> >> Could some of these cause an interrupt, or some other change in the >> cpu_exec flow? > > Well what I was observing was the secondary_start_kernel stalling and > leaving the main cpu spinning. The msr is actually: > > local_irq_enable(); > local_fiq_enable(); > > Which I assume would re-enable IRQs if they are ready to go. However I > guess if we sink into our cpu_idle without exiting the main loop we > never set any pending IRQs? Then Emilio's patch, if a bit of a heavy hammer, is correct. After aa64_daif_write needs you need an exit_tb so that arm_cpu_exec_interrupt is executed again. Compare with this from the x86 front-end: /* if irq were inhibited with HF_INHIBIT_IRQ_MASK, we clear the flag and abort the translation to give the irqs a change to be happen */ if (dc->tf || dc->singlestep_enabled || (flags & HF_INHIBIT_IRQ_MASK)) { gen_jmp_im(pc_ptr - dc->cs_base); gen_eob(dc); break; } (This triggers one instruction after a STI instruction, due to how x86's "interrupt shadow" work, so it doesn't happen immediately after helper_sti; but the idea is the same). Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > On 14/06/2017 13:45, Alex Bennée wrote: >> >> Paolo Bonzini <pbonzini@redhat.com> writes: >> >>> On 14/06/2017 06:48, Richard Henderson wrote: >>>>> >>>>> Commit e75449a3 ("target/aarch64: optimize indirect branches") causes >>>>> a regression by which aarch64 guests freeze under TCG with -smp > 1, >>>>> even with `-accel accel=tcg,thread=single' (i.e. MTTCG disabled). >>>>> >>>>> I isolated the problem to the MSR handler. This patch forces an exit >>>>> after the handler is executed, which fixes the regression. >>>> >>>> Why would that be? The cpu_get_tb_cpu_state within helper_lookup_tb_ptr >>>> is supposed to read the new state that the msr handler would have >>>> installed. >>> >>> Could some of these cause an interrupt, or some other change in the >>> cpu_exec flow? >> >> Well what I was observing was the secondary_start_kernel stalling and >> leaving the main cpu spinning. The msr is actually: >> >> local_irq_enable(); >> local_fiq_enable(); >> >> Which I assume would re-enable IRQs if they are ready to go. However I >> guess if we sink into our cpu_idle without exiting the main loop we >> never set any pending IRQs? > > Then Emilio's patch, if a bit of a heavy hammer, is correct. After > aa64_daif_write needs you need an exit_tb so that arm_cpu_exec_interrupt > is executed again. This is a case of cpu->interrupt_request being pending but not having set cpu->icount_decr yet to signal the exit. Wouldn't another approach (that didn't involve futzing with each front-end) to be to check cpu->interrupt_request and force the exit in lookup_tb_ptr? > > Compare with this from the x86 front-end: > > /* if irq were inhibited with HF_INHIBIT_IRQ_MASK, we clear > the flag and abort the translation to give the irqs a > change to be happen */ > if (dc->tf || dc->singlestep_enabled || > (flags & HF_INHIBIT_IRQ_MASK)) { > gen_jmp_im(pc_ptr - dc->cs_base); > gen_eob(dc); > break; > } > > (This triggers one instruction after a STI instruction, due to how x86's > "interrupt shadow" work, so it doesn't happen immediately after > helper_sti; but the idea is the same). > > Paolo -- Alex Bennée
On 14/06/2017 14:14, Alex Bennée wrote: >> Then Emilio's patch, if a bit of a heavy hammer, is correct. After >> aa64_daif_write needs you need an exit_tb so that arm_cpu_exec_interrupt >> is executed again. > > This is a case of cpu->interrupt_request being pending but not having > set cpu->icount_decr yet to signal the exit. Rather than "yet", "anymore". So far it has always been an invariant that anything that re-enabled an interrupt had to do exit_tb. > Wouldn't another approach > (that didn't involve futzing with each front-end) to be to check > cpu->interrupt_request and force the exit in lookup_tb_ptr? That would cause an unnecessary slowdown in code that runs with interrupts disabled but does a lot of indirect jumps... ppc's SLOF firmware probably qualifies. Paolo
Paolo Bonzini <pbonzini@redhat.com> writes: > On 14/06/2017 14:14, Alex Bennée wrote: >>> Then Emilio's patch, if a bit of a heavy hammer, is correct. After >>> aa64_daif_write needs you need an exit_tb so that arm_cpu_exec_interrupt >>> is executed again. >> >> This is a case of cpu->interrupt_request being pending but not having >> set cpu->icount_decr yet to signal the exit. > > Rather than "yet", "anymore". So far it has always been an invariant > that anything that re-enabled an interrupt had to do exit_tb. > >> Wouldn't another approach >> (that didn't involve futzing with each front-end) to be to check >> cpu->interrupt_request and force the exit in lookup_tb_ptr? > > That would cause an unnecessary slowdown in code that runs with > interrupts disabled but does a lot of indirect jumps... ppc's SLOF > firmware probably qualifies. Really? I'd have to measure the change it makes. Is there a benchmark stanza for measuring the PPC slof firmware time? I have 3 patches now which all fix the same thing so we can pick and choose which we should apply. Patches incoming... -- Alex Bennée
On 14/06/2017 14:35, Alex Bennée wrote: >> That would cause an unnecessary slowdown in code that runs with >> interrupts disabled but does a lot of indirect jumps... ppc's SLOF >> firmware probably qualifies. > > Really? Yes. :) SLOF basically runs a Forth interpreter. If you run "qemu-system-ppc64 -d in_asm", you'll see a lot of "bctr" and "bctrl" instruction (respectively ARM's "br" and "blr" IIRC). > I'd have to measure the change it makes. Is there a benchmark > stanza for measuring the PPC slof firmware time? Just booting. PPC doesn't have tcg_gen_lookup_and_goto_ptr support yet, so it would be a theoretical slowdown at this time. Paolo > I have 3 patches now which all fix the same thing so we can pick and > choose which we should apply. Patches incoming...
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h index 87ae10bcc9..6c0c4825aa 100644 --- a/include/exec/exec-all.h +++ b/include/exec/exec-all.h @@ -326,6 +326,8 @@ struct TranslationBlock { uint16_t invalid; + bool is_magic; + void *tc_ptr; /* pointer to the translated code */ uint8_t *tc_search; /* pointer to search data */ /* original tb when cflags has CF_NOCACHE */ diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c index 860e279658..6dd6fd70d0 100644 --- a/target/arm/translate-a64.c +++ b/target/arm/translate-a64.c @@ -1394,6 +1394,7 @@ static void handle_sync(DisasContext *s, uint32_t insn, * any pending interrupts immediately. */ s->is_jmp = DISAS_UPDATE; + /* s->is_magic = true; */ return; default: unallocated_encoding(s); @@ -1423,6 +1424,7 @@ static void handle_msr_i(DisasContext *s, uint32_t insn, tcg_temp_free_i32(tcg_imm); tcg_temp_free_i32(tcg_op); s->is_jmp = DISAS_UPDATE; + /* s->is_magic = true; */ break; } default: @@ -1592,12 +1594,14 @@ static void handle_sys(DisasContext *s, uint32_t insn, bool isread, /* I/O operations must end the TB here (whether read or write) */ gen_io_end(); s->is_jmp = DISAS_UPDATE; + /* s->is_magic = true; */ } else if (!isread && !(ri->type & ARM_CP_SUPPRESS_TB_END)) { /* We default to ending the TB on a coprocessor register write, * but allow this to be suppressed by the register definition * (usually only necessary to work around guest bugs). */ s->is_jmp = DISAS_UPDATE; + /* s->is_magic = true; */ } } @@ -1772,13 +1776,18 @@ static void disas_uncond_b_reg(DisasContext *s, uint32_t insn) switch (opc) { case 0: /* BR */ + /* s->is_magic = true; */ + gen_a64_set_pc(s, cpu_reg(s, rn)); + break; case 1: /* BLR */ - case 2: /* RET */ + /* s->is_magic = true; */ gen_a64_set_pc(s, cpu_reg(s, rn)); /* BLR also needs to load return address */ - if (opc == 1) { - tcg_gen_movi_i64(cpu_reg(s, 30), s->pc); - } + tcg_gen_movi_i64(cpu_reg(s, 30), s->pc); + break; + case 2: /* RET */ + s->is_magic = true; + gen_a64_set_pc(s, cpu_reg(s, rn)); break; case 4: /* ERET */ if (s->current_el == 0) { @@ -1787,6 +1796,7 @@ static void disas_uncond_b_reg(DisasContext *s, uint32_t insn) } gen_helper_exception_return(cpu_env); s->is_jmp = DISAS_JUMP; + /* s->is_magic = true; */ return; case 5: /* DRPS */ if (rn != 0x1f) { @@ -11250,6 +11260,7 @@ void gen_intermediate_code_a64(ARMCPU *cpu, TranslationBlock *tb) dc->pstate_ss = ARM_TBFLAG_PSTATE_SS(tb->flags); dc->is_ldex = false; dc->ss_same_el = (arm_debug_target_el(env) == dc->current_el); + dc->is_magic = false; init_tmp_a64_array(dc); @@ -11281,6 +11292,7 @@ void gen_intermediate_code_a64(ARMCPU *cpu, TranslationBlock *tb) gen_helper_check_breakpoints(cpu_env); /* End the TB early; it likely won't be executed */ dc->is_jmp = DISAS_UPDATE; + /* dc->is_magic = true; */ } else { gen_exception_internal_insn(dc, 0, EXCP_DEBUG); /* The address covered by the breakpoint must be @@ -11367,6 +11379,7 @@ void gen_intermediate_code_a64(ARMCPU *cpu, TranslationBlock *tb) gen_a64_set_pc_im(dc->pc); /* fall through */ case DISAS_JUMP: + tb->is_magic = dc->is_magic; tcg_gen_lookup_and_goto_ptr(cpu_pc); break; case DISAS_TB_JUMP: diff --git a/target/arm/translate.h b/target/arm/translate.h index 15d383d9af..786eb19335 100644 --- a/target/arm/translate.h +++ b/target/arm/translate.h @@ -62,6 +62,8 @@ typedef struct DisasContext { bool ss_same_el; /* Bottom two bits of XScale c15_cpar coprocessor access control reg */ int c15_cpar; + + bool is_magic; /* TCG op index of the current insn_start. */ int insn_start_idx; #define TMP_A64_MAX 16 diff --git a/tcg-runtime.c b/tcg-runtime.c index e987c1f6bb..d9d7c52ae9 100644 --- a/tcg-runtime.c +++ b/tcg-runtime.c @@ -168,7 +168,9 @@ void *HELPER(lookup_tb_ptr)(CPUArchState *env, target_ulong addr) tb = tb_htable_lookup(cpu, addr, cs_base, flags); if (likely(tb)) { atomic_set(&cpu->tb_jmp_cache[addr_hash], tb); - code_ptr = tb->tc_ptr; + if (!tb->is_magic) { + code_ptr = tb->tc_ptr; + } } } }
THIS IS A DEBUG PATCH DO NOT MERGE I include all the comments to show my working. I was trying to isolate which instructions cause the problem. It turns out it is the RET instruction. I don't understand why because AFAICT it is a pretty much a BR instruction. Signed-off-by: Alex Bennée <alex.bennee@linaro.org> --- include/exec/exec-all.h | 2 ++ target/arm/translate-a64.c | 21 +++++++++++++++++---- target/arm/translate.h | 2 ++ tcg-runtime.c | 4 +++- 4 files changed, 24 insertions(+), 5 deletions(-) -- 2.13.0