Message ID: 20181005154910.3099-1-alex.bennee@linaro.org
Series: Trace updates and plugin RFC
On Fri, Oct 05, 2018 at 16:48:49 +0100, Alex Bennée wrote:
(snip)
> ==Known Limitations==
>
> Currently there is only one hook allowed per trace event. We could
> make this more flexible or simply just error out if two plugins try
> and hook to the same point. What are the expectations of running
> multiple plugins hooking into the same point in QEMU?

It's very common. All popular instrumentation tools (e.g. PANDA,
DynamoRIO, Pin) support multiple plugins.

> ==TCG Hooks==
>
> Thanks to Lluís' work the trace API already splits up TCG events into
> translation time and exec time events and provides the machinery for
> hooking a trace helper into the translation stream. Currently that
> helper is unconditionally added although perhaps we could expand the
> call convention a little for TCG events to allow the translation time
> event to filter out planting the execution time helper?

A TCG helper is suboptimal for these kinds of events, e.g.
instruction/TB/mem callbacks, because (1) these events happen *very*
often, and (2) a helper then has to iterate over a list of callbacks
(assuming we support multiple plugins). That is, one TCG helper call,
plus cache misses for the callback pointers, plus function calls to
call the callbacks. That adds up to a 2x average slowdown for
SPEC06int, instead of a 1.5x slowdown when embedding the callbacks
directly into the generated code. Yes, you have to flush the code when
unsubscribing from the event, but that cost is amortized by the
savings you get when the callbacks occur, which are way more frequent.

Besides performance, to provide a pleasant plugin experience we need
something better than the current tracing callbacks.

> ===Instruction Tracing===
>
> Pavel's series had a specific hook for instrumenting individual
> instructions. I have not yet added it to this series but I think it
> can be done in a slightly cleaner way now we have the ability to
> insert TCG ops into the instruction stream.

I thought Peter explicitly disallowed TCG generation from plugins.
Also, IIRC others also mentioned that exposing QEMU internals
(e.g. "struct TranslationBlock", or "struct CPUState") to plugins
was not on the table.

> If we add a tracepoint for post instruction generation which passes
> a buffer with the instruction translated and a method to insert a
> helper before or after the instruction. This would avoid exposing
> the cpu_ldst macros to the plugins.

Again, for performance you'd avoid the tracepoint (i.e. calling a
helper to call another function) and embed the callback directly from
TCG. The same applies to TBs.

> So what do people think? Could this be a viable way to extend QEMU
> with plugins?

For frequent events such as the ones mentioned above, I am not sure
plugins can be efficiently implemented under tracing. For others
(e.g. cpu_init events), sure, they could. But still, differently from
tracers, plugins can come and go anytime, so I am not convinced that
merging the two features is a good idea.

Thanks,

Emilio
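[To make the dispatch cost Emilio describes concrete, here is a minimal, hypothetical C sketch contrasting the two strategies: one TCG helper that walks a list of subscribed callbacks versus a call embedded directly in the generated code. None of these names are QEMU APIs; the counters only exist to make the dispatch observable.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: models the dispatch strategies discussed above,
 * not actual QEMU/TCG code. */
typedef void (*plugin_cb)(unsigned long pc);

static unsigned long exec_count;
static void count_cb(unsigned long pc) { (void)pc; exec_count++; }

/* Strategy 1: one TCG helper iterates over subscribed callbacks.
 * Costs a helper call, plus a cache miss per callback pointer, plus
 * an indirect call per callback. */
#define MAX_CBS 8
static plugin_cb cb_list[MAX_CBS];
static size_t n_cbs;

static void helper_dispatch_list(unsigned long pc)
{
    for (size_t i = 0; i < n_cbs; i++) {
        cb_list[i](pc);
    }
}

/* Strategy 2: the translator embeds a call to the plugin callback
 * directly into the generated code, so there is no list walk at run
 * time; unsubscribing then requires flushing translated code. */
static void embedded_call_site(unsigned long pc)
{
    count_cb(pc); /* stands in for a direct call planted in a TB */
}
```

[With a single callback the two are equivalent; the difference Emilio quantifies (2x vs 1.5x on SPEC06int) comes from the extra per-event indirection once such events fire billions of times.]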
Emilio G. Cota <cota@braap.org> writes:

> On Fri, Oct 05, 2018 at 16:48:49 +0100, Alex Bennée wrote:
> (snip)
>> ==Known Limitations==
>>
>> Currently there is only one hook allowed per trace event. We could
>> make this more flexible or simply just error out if two plugins try
>> and hook to the same point. What are the expectations of running
>> multiple plugins hooking into the same point in QEMU?
>
> It's very common. All popular instrumentation tools (e.g. PANDA,
> DynamoRIO, Pin) support multiple plugins.

Fair enough.

>> ==TCG Hooks==
>>
>> Thanks to Lluís' work the trace API already splits up TCG events into
>> translation time and exec time events and provides the machinery for
>> hooking a trace helper into the translation stream. Currently that
>> helper is unconditionally added although perhaps we could expand the
>> call convention a little for TCG events to allow the translation time
>> event to filter out planting the execution time helper?
>
> A TCG helper is suboptimal for these kinds of events, e.g.
> instruction/TB/mem callbacks, because (1) these events happen *very*
> often, and (2) a helper then has to iterate over a list of callbacks
> (assuming we support multiple plugins). That is, one TCG helper call,
> plus cache misses for the callback pointers, plus function calls
> to call the callbacks. That adds up to 2x average slowdown
> for SPEC06int, instead of 1.5x slowdown when embedding the
> callbacks directly into the generated code. Yes, you have to
> flush the code when unsubscribing from the event, but that cost
> is amortized by the savings you get when the callbacks occur,
> which are way more frequent.

What would you want instead of a TCG helper? But certainly being able
to be selective about which instances of each trace point are
instrumented will be valuable.

> Besides performance, to provide a pleasant plugin experience we need
> something better than the current tracing callbacks.

What I hope to avoid in re-using trace points is having a whole bunch
of additional hook points just for plugins. However nothing stops us
adding more tracepoints at more useful places for instrumentation. We
could also do it on a whitelist basis similar to the way the tcg
events are marked.

>> ===Instruction Tracing===
>>
>> Pavel's series had a specific hook for instrumenting individual
>> instructions. I have not yet added it to this series but I think it
>> can be done in a slightly cleaner way now we have the ability to
>> insert TCG ops into the instruction stream.
>
> I thought Peter explicitly disallowed TCG generation from plugins.
> Also, IIRC others also mentioned that exposing QEMU internals
> (e.g. "struct TranslationBlock", or "struct CPUState") to plugins
> was not on the table.

We definitely want to avoid plugin controlled code generation but the
tcg tracepoint mechanism is transparent to the plugin itself. I think
the pointers should really be treated as anonymous handles rather than
windows into QEMU's internals. Arguably some of the tracepoints should
be exporting more useful numbers (I used cpu->cpu_index in the TLB
trace points) but I don't know if we can change existing trace point
definitions to clean that up.

Again if we whitelist tracepoints for plugins we can be more careful
about the data exported.

>> If we add a tracepoint for post instruction generation which passes
>> a buffer with the instruction translated and a method to insert a
>> helper before or after the instruction. This would avoid exposing
>> the cpu_ldst macros to the plugins.
>
> Again, for performance you'd avoid the tracepoint (i.e. calling
> a helper to call another function) and embed directly the
> callback from TCG. Same thing applies to TB's.

OK I see what you mean. I think that is doable although it might take
a bit more tcg plumbing.

>> So what do people think? Could this be a viable way to extend QEMU
>> with plugins?
>
> For frequent events such as the ones mentioned above, I am
> not sure plugins can be efficiently implemented under
> tracing.

I assume some form of helper-per-instrumented-event/insn is still
going to be needed though? We are not considering some sort of eBPF
craziness?

> For others (e.g. cpu_init events), sure, they could.
> But still, differently from tracers, plugins can come and go
> anytime, so I am not convinced that merging the two features
> is a good idea.

I don't think we have to mirror tracepoints and plugin points but I'm
in favour of sharing the general mechanism and tooling rather than
having a whole separate set of hooks. We certainly don't want anything
like:

    trace_exec_tb(tb, pc);
    plugin_exec_tb(tb, pc);

scattered throughout the code where the two do align.

> Thanks,
>
> Emilio

--
Alex Bennée
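[The "expand the call convention" idea from the quoted TCG Hooks section can be sketched in a few lines: the translation-time event returns whether the exec-time helper should be planted at all. This is a hypothetical illustration, not the actual trace API; all names are made up.]

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative only: a translation-time hook that filters out
 * planting the execution-time helper, as floated in the RFC. */
static int helpers_planted;

/* Plugin-supplied predicate: here, only instrument PCs in a window. */
static bool plugin_wants_exec_event(unsigned long pc)
{
    return pc >= 0x1000 && pc < 0x2000;
}

/* Called while translating each instruction; the exec-time helper is
 * only emitted when the translation-time event asks for it. */
static void translate_insn(unsigned long pc)
{
    if (plugin_wants_exec_event(pc)) {
        helpers_planted++;   /* stands in for a gen_helper_...() call */
    }
}
```

[The point of the filter is that uninstrumented instructions pay nothing at execution time, which matters given how hot these events are.]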
Hi Alex,

On 08/10/2018 12:28, Alex Bennée wrote:
>
> Emilio G. Cota <cota@braap.org> writes:
>
>> On Fri, Oct 05, 2018 at 16:48:49 +0100, Alex Bennée wrote:
>> (snip)
>>> ==Known Limitations==
>>>
>>> Currently there is only one hook allowed per trace event. We could
>>> make this more flexible or simply just error out if two plugins try
>>> and hook to the same point. What are the expectations of running
>>> multiple plugins hooking into the same point in QEMU?
>>
>> It's very common. All popular instrumentation tools (e.g. PANDA,
>> DynamoRIO, Pin) support multiple plugins.
>
> Fair enough.
>
>>> ==TCG Hooks==
>>>
>>> Thanks to Lluís' work the trace API already splits up TCG events into
>>> translation time and exec time events and provides the machinery for
>>> hooking a trace helper into the translation stream. Currently that
>>> helper is unconditionally added although perhaps we could expand the
>>> call convention a little for TCG events to allow the translation time
>>> event to filter out planting the execution time helper?
>>
>> A TCG helper is suboptimal for these kinds of events, e.g.
>> instruction/TB/mem callbacks, because (1) these events happen *very*
>> often, and (2) a helper then has to iterate over a list of callbacks
>> (assuming we support multiple plugins). That is, one TCG helper call,
>> plus cache misses for the callback pointers, plus function calls
>> to call the callbacks. That adds up to 2x average slowdown
>> for SPEC06int, instead of 1.5x slowdown when embedding the
>> callbacks directly into the generated code. Yes, you have to
>> flush the code when unsubscribing from the event, but that cost
>> is amortized by the savings you get when the callbacks occur,
>> which are way more frequent.
>
> What would you want instead of a TCG helper? But certainly being able
> to be selective about which instances of each trace point are
> instrumented will be valuable.
>
>> Besides performance, to provide a pleasant plugin experience we need
>> something better than the current tracing callbacks.
>
> What I hope to avoid in re-using trace points is having a whole bunch
> of additional hook points just for plugins. However nothing stops us
> adding more tracepoints at more useful places for instrumentation. We
> could also do it on a whitelist basis similar to the way the tcg
> events are marked.
>
>>> ===Instruction Tracing===
>>>
>>> Pavel's series had a specific hook for instrumenting individual
>>> instructions. I have not yet added it to this series but I think it
>>> can be done in a slightly cleaner way now we have the ability to
>>> insert TCG ops into the instruction stream.
>>
>> I thought Peter explicitly disallowed TCG generation from plugins.
>> Also, IIRC others also mentioned that exposing QEMU internals
>> (e.g. "struct TranslationBlock", or "struct CPUState") to plugins
>> was not on the table.
>
> We definitely want to avoid plugin controlled code generation but the
> tcg tracepoint mechanism is transparent to the plugin itself. I think
> the pointers should really be treated as anonymous handles rather than
> windows into QEMU's internals. Arguably some of the tracepoints should
> be exporting more useful numbers (I used cpu->cpu_index in the TLB
> trace points) but I don't know if we can change existing trace point
> definitions to clean that up.
>
> Again if we whitelist tracepoints for plugins we can be more careful
> about the data exported.
>
>>> If we add a tracepoint for post instruction generation which passes
>>> a buffer with the instruction translated and a method to insert a
>>> helper before or after the instruction. This would avoid exposing
>>> the cpu_ldst macros to the plugins.
>>
>> Again, for performance you'd avoid the tracepoint (i.e. calling
>> a helper to call another function) and embed directly the
>> callback from TCG. Same thing applies to TB's.
>
> OK I see what you mean. I think that is doable although it might take
> a bit more tcg plumbing.
>
>>> So what do people think? Could this be a viable way to extend QEMU
>>> with plugins?
>>
>> For frequent events such as the ones mentioned above, I am
>> not sure plugins can be efficiently implemented under
>> tracing.
>
> I assume some form of helper-per-instrumented-event/insn is still
> going to be needed though? We are not considering some sort of eBPF
> craziness?
>
>> For others (e.g. cpu_init events), sure, they could.
>> But still, differently from tracers, plugins can come and go
>> anytime, so I am not convinced that merging the two features
>> is a good idea.
>
> I don't think we have to mirror tracepoints and plugin points but I'm
> in favour of sharing the general mechanism and tooling rather than
> having a whole separate set of hooks. We certainly don't want anything
> like:
>
>     trace_exec_tb(tb, pc);
>     plugin_exec_tb(tb, pc);
>
> scattered throughout the code where the two do align.

What about turning the tracepoints into the default instrumentation
plugin? (the first of Emilio's list of plugins).

>> Thanks,
>>
>> Emilio
>
> --
> Alex Bennée
On Mon, Oct 08, 2018 at 11:28:38 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > Again, for performance you'd avoid the tracepoint (i.e. calling
> > a helper to call another function) and embed directly the
> > callback from TCG. Same thing applies to TB's.
>
> OK I see what you mean. I think that is doable although it might take a
> bit more tcg plumbing.

I have patches to do it, it's not complicated.

> >> So what do people think? Could this be a viable way to extend QEMU
> >> with plugins?
> >
> > For frequent events such as the ones mentioned above, I am
> > not sure plugins can be efficiently implemented under
> > tracing.
>
> I assume some form of helper-per-instrumented-event/insn is still going
> to be needed though? We are not considering some sort of eBPF craziness?

Helper, yes. But one that points directly to plugin code.

> > For others (e.g. cpu_init events), sure, they could.
> > But still, differently from tracers, plugins can come and go
> > anytime, so I am not convinced that merging the two features
> > is a good idea.
>
> I don't think we have to mirror tracepoints and plugin points but I'm in
> favour of sharing the general mechanism and tooling rather than having a
> whole separate set of hooks. We certainly don't want anything like:
>
>     trace_exec_tb(tb, pc);
>     plugin_exec_tb(tb, pc);
>
> scattered throughout the code where the two do align.

We could have something like

    plugin_trace_exec_tb(tb, pc);

that would expand to the two lines above. Or similar.

So I agree with you that in some cases the "trace points" for both
tracing and plugins might be the same, perhaps identical. But that
doesn't necessarily mean that making plugins a subset of tracing is a
good idea.

I think sharing my plugin implementation will help the discussion.
I'll share it as soon as I can (my QEMU plate is full already trying
to merge a couple of other features first).

Thanks,

Emilio
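[Emilio's suggestion can be sketched as a trivial wrapper macro. The `trace_exec_tb`/`plugin_exec_tb` names mirror the snippet in the mail, but the wrapper itself is hypothetical, not existing QEMU code; the counters only make the fan-out observable.]

```c
#include <assert.h>

/* Illustrative only: a single call site that fans out to both the
 * tracepoint and the plugin hook, as suggested in the mail above. */
static int trace_hits, plugin_hits;

static void trace_exec_tb(void *tb, unsigned long pc)
{
    (void)tb; (void)pc;
    trace_hits++;
}

static void plugin_exec_tb(void *tb, unsigned long pc)
{
    (void)tb; (void)pc;
    plugin_hits++;
}

/* One line in the translator expands to both hooks, so the call sites
 * stay aligned without duplicating every trace_*() line. */
#define plugin_trace_exec_tb(tb, pc) \
    do {                             \
        trace_exec_tb(tb, pc);       \
        plugin_exec_tb(tb, pc);      \
    } while (0)
```

[A real version would presumably also compile away whichever half is disabled, which is one reason the two features need not share one subscription mechanism.]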
Emilio G. Cota <cota@braap.org> writes:

> On Mon, Oct 08, 2018 at 11:28:38 +0100, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
>> > Again, for performance you'd avoid the tracepoint (i.e. calling
>> > a helper to call another function) and embed directly the
>> > callback from TCG. Same thing applies to TB's.
>>
>> OK I see what you mean. I think that is doable although it might take a
>> bit more tcg plumbing.
>
> I have patches to do it, it's not complicated.

Right, that would be useful.

>> >> So what do people think? Could this be a viable way to extend QEMU
>> >> with plugins?
>> >
>> > For frequent events such as the ones mentioned above, I am
>> > not sure plugins can be efficiently implemented under
>> > tracing.
>>
>> I assume some form of helper-per-instrumented-event/insn is still going
>> to be needed though? We are not considering some sort of eBPF craziness?
>
> Helper, yes. But one that points directly to plugin code.

It would be nice if the logic that inserts the trace helper vs a
direct call could be shared. I guess I'd have to see the
implementation to see how ugly it gets.

>> > For others (e.g. cpu_init events), sure, they could.
>> > But still, differently from tracers, plugins can come and go
>> > anytime, so I am not convinced that merging the two features
>> > is a good idea.
>>
>> I don't think we have to mirror tracepoints and plugin points but I'm in
>> favour of sharing the general mechanism and tooling rather than having a
>> whole separate set of hooks. We certainly don't want anything like:
>>
>>     trace_exec_tb(tb, pc);
>>     plugin_exec_tb(tb, pc);
>>
>> scattered throughout the code where the two do align.
>
> We could have something like
>
>     plugin_trace_exec_tb(tb, pc);
>
> that would expand to the two lines above. Or similar.
>
> So I agree with you that in some cases the "trace points"
> for both tracing and plugins might be the same, perhaps
> identical. But that doesn't necessarily mean that making
> plugins a subset of tracing is a good idea.

But we can avoid having plugin-points and trace-events duplicating
stuff as well? I guess you want to avoid having the generated code
fragments for plugins?

The other nice property was avoiding re-duplicating output logic for
"filter" style operations. However I didn't actually include such an
example in the series. I was pondering a QEMU powered PLT/library call
tracer to demonstrate that sort of thing.

> I think sharing my plugin implementation will help the
> discussion. I'll share it as soon as I can (my QEMU plate
> is full already trying to merge a couple of other features
> first).

Sounds good.

> Thanks,
>
> Emilio

--
Alex Bennée
> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> Any serious analysis tool should allow for us to track all memory
> accesses so I think the guest_mem_before trace point should probably
> be split into guest_mem_before_store and guest_mem_after_load. We
> could go the whole hog and add potential trace points for start/end of
> all memory operations.

I wanted to ask about memory tracing and found this one.
Is it possible to use tracepoints for capturing all memory accesses?
In our implementation we insert helpers before and after tcg
read/write operations.

Pavel Dovgalyuk
Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:

>> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> Any serious analysis tool should allow for us to track all memory
>> accesses so I think the guest_mem_before trace point should probably
>> be split into guest_mem_before_store and guest_mem_after_load. We
>> could go the whole hog and add potential trace points for start/end of
>> all memory operations.
>
> I wanted to ask about memory tracing and found this one.
> Is it possible to use tracepoints for capturing all memory accesses?
> In our implementation we insert helpers before and after tcg
> read/write operations.

The current tracepoint isn't enough but yes I think we could. The
first thing I need to do is de-macrofy the atomic helpers a little
just to make it a bit simpler to add the before/after tracepoints.

> Pavel Dovgalyuk

--
Alex Bennée
> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>
> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> >> Any serious analysis tool should allow for us to track all memory
> >> accesses so I think the guest_mem_before trace point should probably
> >> be split into guest_mem_before_store and guest_mem_after_load. We
> >> could go the whole hog and add potential trace points for start/end of
> >> all memory operations.
> >
> > I wanted to ask about memory tracing and found this one.
> > Is it possible to use tracepoints for capturing all memory accesses?
> > In our implementation we insert helpers before and after tcg
> > read/write operations.
>
> The current tracepoint isn't enough but yes I think we could. The first
> thing I need to do is de-macrofy the atomic helpers a little just to
> make it a bit simpler to add the before/after tracepoints.

But memory accesses can use the 'fast path' without the helpers.
Thus you still need to insert the new helper for that case.

Pavel Dovgalyuk
Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:

>> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>>
>> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> >> Any serious analysis tool should allow for us to track all memory
>> >> accesses so I think the guest_mem_before trace point should probably
>> >> be split into guest_mem_before_store and guest_mem_after_load. We
>> >> could go the whole hog and add potential trace points for start/end of
>> >> all memory operations.
>> >
>> > I wanted to ask about memory tracing and found this one.
>> > Is it possible to use tracepoints for capturing all memory accesses?
>> > In our implementation we insert helpers before and after tcg
>> > read/write operations.
>>
>> The current tracepoint isn't enough but yes I think we could. The first
>> thing I need to do is de-macrofy the atomic helpers a little just to
>> make it a bit simpler to add the before/after tracepoints.
>
> But memory accesses can use 'fast path' without the helpers.
> Thus you still need inserting the new helper for that case.

trace_guest_mem_before_tcg in tcg-op.c already does this but currently
only before operations. That's why I want to split it and pass the
load/store value register values into the helpers.

> Pavel Dovgalyuk

--
Alex Bennée
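[The split Alex describes can be sketched as separate before-store and after-load hooks that also receive the value involved. This is a hedged illustration with made-up names, not the actual trace API: a load's value only exists after the access completes, while a store's value is known beforehand, which is why the two hooks sit on opposite sides of the memory operation.]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: stand-ins for split guest_mem_before_store /
 * guest_mem_after_load hooks that receive the value register. */
static uint64_t last_store_val, last_load_val;

static void guest_mem_before_store(uint64_t vaddr, uint64_t val)
{
    (void)vaddr;
    last_store_val = val;   /* value is known before the store happens */
}

static void guest_mem_after_load(uint64_t vaddr, uint64_t val)
{
    (void)vaddr;
    last_load_val = val;    /* value only exists once the load completes */
}

/* A toy "guest" store/load pair instrumented the way the translator
 * would plant the helpers around the real memory ops. */
static uint64_t guest_mem[16];

static void instrumented_store(unsigned idx, uint64_t val)
{
    guest_mem_before_store((uint64_t)idx, val);
    guest_mem[idx] = val;
}

static uint64_t instrumented_load(unsigned idx)
{
    uint64_t val = guest_mem[idx];
    guest_mem_after_load((uint64_t)idx, val);
    return val;
}
```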
> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>
> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> >> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
> >>
> >> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> >> >> Any serious analysis tool should allow for us to track all memory
> >> >> accesses so I think the guest_mem_before trace point should probably
> >> >> be split into guest_mem_before_store and guest_mem_after_load. We
> >> >> could go the whole hog and add potential trace points for start/end of
> >> >> all memory operations.
> >> >
> >> > I wanted to ask about memory tracing and found this one.
> >> > Is it possible to use tracepoints for capturing all memory accesses?
> >> > In our implementation we insert helpers before and after tcg
> >> > read/write operations.
> >>
> >> The current tracepoint isn't enough but yes I think we could. The first
> >> thing I need to do is de-macrofy the atomic helpers a little just to
> >> make it a bit simpler to add the before/after tracepoints.
> >
> > But memory accesses can use 'fast path' without the helpers.
> > Thus you still need inserting the new helper for that case.
>
> trace_guest_mem_before_tcg in tcg-op.c already does this but currently
> only before operations. That's why I want to split it and pass the
> load/store value register values into the helpers.

One more question about your trace points. When using a trace point on
every instruction execution, we may need access to the vCPU registers
(including the flags). Are they valid in such cases? I'm asking
because at least the i386 translation optimizes writebacks.

Pavel Dovgalyuk
Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:

>> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>>
>> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> >> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>> >>
>> >> >> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> >> >> Any serious analysis tool should allow for us to track all memory
>> >> >> accesses so I think the guest_mem_before trace point should probably
>> >> >> be split into guest_mem_before_store and guest_mem_after_load. We
>> >> >> could go the whole hog and add potential trace points for start/end of
>> >> >> all memory operations.
>> >> >
>> >> > I wanted to ask about memory tracing and found this one.
>> >> > Is it possible to use tracepoints for capturing all memory accesses?
>> >> > In our implementation we insert helpers before and after tcg
>> >> > read/write operations.
>> >>
>> >> The current tracepoint isn't enough but yes I think we could. The first
>> >> thing I need to do is de-macrofy the atomic helpers a little just to
>> >> make it a bit simpler to add the before/after tracepoints.
>> >
>> > But memory accesses can use 'fast path' without the helpers.
>> > Thus you still need inserting the new helper for that case.
>>
>> trace_guest_mem_before_tcg in tcg-op.c already does this but currently
>> only before operations. That's why I want to split it and pass the
>> load/store value register values into the helpers.
>
> One more question about your trace points.
> In case of using trace point on every instruction execution, we may need
> accessing vCPU registers (including the flags). Are they valid in such
> cases?

They are probably valid but the tricky bit will be doing it in a way
that doesn't expose the internals of the TCG. Maybe we could exploit
the GDB interface for this or come up with a named reference API.

> I'm asking, because at least i386 translation optimizes writebacks.

How so? I have to admit the i386 translation code is the most opaque
to me but I wouldn't have thought changing the semantics of the
guest's load/store operations would be a sensible idea.

Of course now you mention it my thoughts about memory tracing have
been influenced by nice clean RISCy load/store architectures where
it's rare to have calculation ops working directly with memory.

> Pavel Dovgalyuk

--
Alex Bennée
> From: Alex Bennée [mailto:alex.bennee@linaro.org]
> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
> > One more question about your trace points.
> > In case of using trace point on every instruction execution, we may need
> > accessing vCPU registers (including the flags). Are they valid in such
> > cases?
>
> They are probably valid but the tricky bit will be doing it in a way
> that doesn't expose the internals of the TCG. Maybe we could exploit the
> GDB interface for this or come up with a named reference API.
>
> > I'm asking, because at least i386 translation optimizes writebacks.
>
> How so? I have to admit the i386 translation code is the most opaque to
> me but I wouldn't have thought changing the semantics of the guest's
> load/store operations would be a sensible idea.

Writeback to the registers (say EFLAGS), not to the memory.

Pavel Dovgalyuk
Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:

>> From: Alex Bennée [mailto:alex.bennee@linaro.org]
>> Pavel Dovgalyuk <dovgaluk@ispras.ru> writes:
>> > One more question about your trace points.
>> > In case of using trace point on every instruction execution, we may need
>> > accessing vCPU registers (including the flags). Are they valid in such
>> > cases?
>>
>> They are probably valid but the tricky bit will be doing it in a way
>> that doesn't expose the internals of the TCG. Maybe we could exploit the
>> GDB interface for this or come up with a named reference API.
>>
>> > I'm asking, because at least i386 translation optimizes writebacks.
>>
>> How so? I have to admit the i386 translation code is the most opaque to
>> me but I wouldn't have thought changing the semantics of the guest's
>> load/store operations would be a sensible idea.
>
> Writeback to the registers (say EFLAGS), not to the memory.

Ahh, lazy evaluation of (status) flags. Well, having dug around the
gdbstub it looks like we may get that wrong for eflags anyway.

I think what is probably required is:

 - a hook in TranslatorOps, resolve_all?
 - an interface to named things registered with tcg_global_mem_new_*

And a way for the plugin to assert that any register accessed via this
is consistent at the point in the runtime the plugin hook is called.

I wonder what other front ends might have this sort of lazy/partial
evaluation?

--
Alex Bennée
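[The "named reference" interface floated above can be sketched as a simple name-to-register lookup: plugins ask for a register by the name it was registered under (as happens today via tcg_global_mem_new_*) instead of seeing TCG internals, and a front end with lazy flag evaluation would resolve its state (the resolve_all idea) before the values are read. Every name and type below is hypothetical.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Speculative sketch only: not a real QEMU plugin API. */
struct named_reg {
    const char *name;
    uint64_t value;       /* stand-in for a pointer into CPU state */
};

/* A front end would populate this from its tcg_global_mem_new_*
 * registrations; values assume flags have already been resolved. */
static struct named_reg regs[] = {
    { "pc",     0x400000 },
    { "eflags", 0x202    },
};

/* Returns NULL for unknown names, so a plugin degrades gracefully
 * when a given front end does not provide a register. */
static const struct named_reg *plugin_find_reg(const char *name)
{
    for (size_t i = 0; i < sizeof(regs) / sizeof(regs[0]); i++) {
        if (strcmp(regs[i].name, name) == 0) {
            return &regs[i];
        }
    }
    return NULL;
}
```

[Keeping the handle opaque like this is what lets the internals (lazy eflags included) change without breaking plugins.]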