Message ID: 52274838.7010902@linaro.org
State: Rejected
On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>
> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
>
You do not need to do this. We should instead randomize the addresses used, which handles this problem.

> ChangeLog:
>
> 2013-09-03  Will Newton  <will.newton@linaro.org>
>
> 	* benchtests/Makefile (BENCH_RUNS): New variable.
> 	(bench-set): Run tests BENCH_RUNS times and back up old run
> 	results.
> ---
>  benchtests/Makefile | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 6037e5c..d72c98f 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -118,6 +118,11 @@ ifndef BENCH_DURATION
>  BENCH_DURATION := 10
>  endif
>
> +# The default number of runs of a benchmark: 4.
> +ifndef BENCH_RUNS
> +BENCH_RUNS := 4
> +endif
> +
>  CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)
>
>  # Use clock_gettime to measure performance of functions. The default is to use
> @@ -146,8 +151,15 @@ bench: bench-set bench-func
>
>  bench-set: $(binaries-benchset)
>  	for run in $^; do \
> -	  echo "Running $${run}"; \
> -	  $(run-bench) > $${run}.out; \
> +	  for old in $${run}.*.out; do \
> +	    if [ -f $$old ]; then \
> +	      mv $$old $${old}.old; \
> +	    fi; \
> +	  done; \
> +	  for count in $$(seq 1 $(BENCH_RUNS)); do \
> +	    echo "Running $${run} ($${count})"; \
> +	    $(run-bench) > $${run}.$${count}.out; \
> +	  done; \
>  	done
>
>  bench-func: $(binaries-bench)
> --
> 1.8.1.4
On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>>
>> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
>>
> You do not need to do this. We should instead randomize the addresses used, which handles this problem.

That seems like it would be considerably more complicated. Do you have a reason why your approach is better?
On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
> >>
> >> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
> >>
> > You do not need to do this. We should instead randomize the addresses used, which handles this problem.
>
> That seems like it would be considerably more complicated. Do you have a reason why your approach is better?
>
It does not matter if it is complicated; it is required to get reasonable results.
On Wed, Sep 4, 2013 at 11:52 AM, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
>> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>> >>
>> >> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
>> >>
>> > You do not need to do this. We should instead randomize the addresses used, which handles this problem.

What is the objective of randomizing the addresses? What are we trying to accomplish by different page placement?

>> That seems like it would be considerably more complicated. Do you have a reason why your approach is better?
>>
> It does not matter if it is complicated; it is required to get reasonable results.

Your proposed solution (in a later email) is in fact not overly complicated, but for the record, there is a very good reason to avoid complexity in a project like glibc, which is already extremely complex. Some poor soul will need to maintain all of the complicated solutions long into the future.

Ryan
On 4 September 2013 17:52, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Wed, Sep 04, 2013 at 05:20:23PM +0100, Will Newton wrote:
>> On 4 September 2013 17:17, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>> >>
>> >> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
>> >>
>> > You do not need to do this. We should instead randomize the addresses used, which handles this problem.
>>
>> That seems like it would be considerably more complicated. Do you have a reason why your approach is better?
>>
> It does not matter if it is complicated; it is required to get reasonable results.

How do you define "reasonable results"?

The intention of my patch - which I may not have made completely clear in the commit message - is to improve test stability. What I mean by this is that with a physically indexed cache, the physical pages allocated to the test can have a significant effect on performance at large buffer sizes (e.g. cache size / ways and above), and this will cause variation when running the same test multiple times. My aim is to average out these differences, as it is hard to control for them without understanding the details of the cache subsystem of the system you are running on.

Your test appears to be addressing concerns of test validity by running a wider range of buffer alignments, which is an important but separate concern IMO.
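To make the physically indexed cache effect described above concrete, here is a small standalone C sketch. It is not part of the patch or of the glibc harness; the cache geometry (2 MiB, 16-way, 64-byte lines) and the two page frame numbers are invented purely for illustration.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical cache geometry, chosen only for illustration: 2 MiB,
   16-way set associative, 64-byte lines, 4 KiB pages.  */
#define CACHE_SIZE (2 * 1024 * 1024)
#define LINE_SIZE  64
#define WAYS       16
#define NUM_SETS   (CACHE_SIZE / (LINE_SIZE * WAYS))  /* 2048 sets */
#define PAGE_SIZE  4096

/* Cache set that a physical address maps to in a physically indexed
   cache.  */
static unsigned int
cache_set (uint64_t paddr)
{
  return (unsigned int) ((paddr / LINE_SIZE) % NUM_SETS);
}

int
main (void)
{
  /* Two invented page frame numbers that the OS might hand to two
     different runs of the same benchmark.  The sets span
     NUM_SETS * LINE_SIZE = 128 KiB of physical address space, so a
     4 KiB page covers 64 consecutive sets and the window it lands in
     depends on its frame number modulo 32.  The amount of
     self-conflict in a large working set therefore changes from run
     to run with physical page placement.  */
  uint64_t pfn_run1 = 0x12340;
  uint64_t pfn_run2 = 0x12353;

  printf ("run 1: page maps to sets starting at %u\n",
          cache_set (pfn_run1 * PAGE_SIZE));
  printf ("run 2: page maps to sets starting at %u\n",
          cache_set (pfn_run2 * PAGE_SIZE));
  return 0;
}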
On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>
> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.

I wonder if this would make a lot of difference. On an otherwise idle system (which is essentially a pre-condition for benchmark runs), I would expect more or less similar page placement (i.e. reachable via comparable indices) for consecutive executions of a program. You would probably need to have something spawned in parallel to actually make a difference.

Siddhesh
On 5 September 2013 11:31, Siddhesh Poyarekar <siddhesh@redhat.com> wrote:
> On Wed, Sep 04, 2013 at 03:48:24PM +0100, Will Newton wrote:
>>
>> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.
>
> I wonder if this would make a lot of difference. On an otherwise idle system (which is essentially a pre-condition for benchmark runs), I would expect more or less similar page placement (i.e. reachable via comparable indices) for consecutive executions of a program. You would probably need to have something spawned in parallel to actually make a difference.

There is no guarantee the OS will give you a different set of physical pages, but it does seem to have the desired effect in practice.
On Thu, Sep 05, 2013 at 11:36:00AM +0100, Will Newton wrote:
> > I wonder if this would make a lot of difference. On an otherwise idle system (which is essentially a pre-condition for benchmark runs), I would expect more or less similar page placement (i.e. reachable via comparable indices) for consecutive executions of a program. You would probably need to have something spawned in parallel to actually make a difference.
>
> There is no guarantee the OS will give you a different set of physical pages, but it does seem to have the desired effect in practice.

OK, multiple runs are harmless, so I don't mind this change if it smooths out any inconsistencies in practice. The patch also looks OK to me, but I'd like a consolidated output instead of the four output files, maybe as a separate change.

I'd like to see if anyone else has objections to this, so please wait for another ack before you commit it.

Siddhesh
On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
> The intention of my patch - which I may not have made completely clear in the commit message - is to improve test stability. What I mean by this is that with a physically indexed cache, the physical pages allocated to the test can have a significant effect on performance at large buffer sizes (e.g. cache size / ways and above), and this will cause variation when running the same test multiple times. My aim is to average out these differences, as it is hard to control for them without understanding the details of the cache subsystem of the system you are running on.
>
This can be explained just by having more data. Simply multiplying the iteration count by four would then do the same job.

Please run your patch ten times and calculate the variance. Compare that to the variance when the iteration count is increased 4 times and show whether there is an improvement.

> Your test appears to be addressing concerns of test validity by running a wider range of buffer alignments, which is an important but separate concern IMO.
>
No, your patch will pick the src pointer from 4 different physical pages (one allocated in each run) and calculate average performance.

Mine will pick src pointers in 2000000/4096 = 488 different pages and calculate the average.

As this makes your patch redundant, you will add unnecessary complexity to glibc, and there is a very good reason to avoid complexity in a project like glibc, which is already extremely complex: some poor soul will need to maintain all of the complicated solutions long into the future.
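For reference, here is a minimal standalone C sketch of the kind of source-address randomization being argued for above. The pool size mirrors the 2000000-byte figure in the mail, but the function name, the use of rand(), and the omission of the timing loop and alignment variation are simplifications for illustration, not the actual benchtests code or Ondřej's patch.

#include <stdlib.h>
#include <string.h>

/* A sketch only: exercise memcpy with the source taken from a
   randomly chosen page inside a large pool, so the measurement
   averages over ~488 different pages rather than the few pages a
   single small allocation happens to receive.  */
#define POOL_SIZE 2000000
#define PAGE_SIZE 4096

static unsigned char pool[POOL_SIZE];
static unsigned char dest[POOL_SIZE];

static void
copy_from_random_page (size_t len)
{
  /* Pick a random page-aligned offset that leaves room for LEN
     bytes.  A real harness would also vary the alignment within the
     page and read the timer around the call; both are omitted
     here.  */
  size_t max_page = (POOL_SIZE - len) / PAGE_SIZE;
  size_t off = ((size_t) rand () % (max_page + 1)) * PAGE_SIZE;

  memcpy (dest, pool + off, len);
}

int
main (void)
{
  srand (1);  /* Deterministic, so runs remain comparable.  */
  for (size_t len = 64; len <= 65536; len *= 2)
    for (int i = 0; i < 1000; i++)
      copy_from_random_page (len);
  return 0;
}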
On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
>> The intention of my patch - which I may not have made completely clear in the commit message - is to improve test stability. What I mean by this is that with a physically indexed cache, the physical pages allocated to the test can have a significant effect on performance at large buffer sizes (e.g. cache size / ways and above), and this will cause variation when running the same test multiple times. My aim is to average out these differences, as it is hard to control for them without understanding the details of the cache subsystem of the system you are running on.
>>
> This can be explained just by having more data. Simply multiplying the iteration count by four would then do the same job.

No, it wouldn't. That would just mean four times as much data, resulting in a reduced variance but the same systematic error.

>> Your test appears to be addressing concerns of test validity by running a wider range of buffer alignments, which is an important but separate concern IMO.
>>
> No, your patch will pick the src pointer from 4 different physical pages (one allocated in each run) and calculate average performance.
>
> Mine will pick src pointers in 2000000/4096 = 488 different pages and calculate the average.

Yes, this would work too. But it has a number of flaws:

1. It does not allow one to analyze the performance of the code across alignments; everything gets averaged together.

2. It has no mechanism for showing variance, whereas with multiple runs of the same test the variance of the means can at least be seen.

3. It only works for one test (memcpy).
On Thu, Sep 05, 2013 at 04:18:18PM +0100, Will Newton wrote:
> On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
> > On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
> >> The intention of my patch - which I may not have made completely clear in the commit message - is to improve test stability. What I mean by this is that with a physically indexed cache, the physical pages allocated to the test can have a significant effect on performance at large buffer sizes (e.g. cache size / ways and above), and this will cause variation when running the same test multiple times. My aim is to average out these differences, as it is hard to control for them without understanding the details of the cache subsystem of the system you are running on.
> >>
> > This can be explained just by having more data. Simply multiplying the iteration count by four would then do the same job.
>
> No, it wouldn't. That would just mean four times as much data, resulting in a reduced variance but the same systematic error.
>
That is your claim. I am now asking you a second time to prove it.

As I wrote in my previous mail, in the same place:

Please run your patch ten times and calculate the variance. Compare that to the variance when the iteration count is increased 4 times and show whether there is an improvement.

> >> Your test appears to be addressing concerns of test validity by running a wider range of buffer alignments, which is an important but separate concern IMO.
> >>
> > No, your patch will pick the src pointer from 4 different physical pages (one allocated in each run) and calculate average performance.
> >
> > Mine will pick src pointers in 2000000/4096 = 488 different pages and calculate the average.
>
> Yes, this would work too. But it has a number of flaws:
>
> 1. It does not allow one to analyze the performance of the code across alignments; everything gets averaged together.

You cannot analyse performance across alignments now, as the benchmarks do not print the necessary data. For such an analysis to make sense you would need much more data. This would make the benchmarks as they are slower, so it would need to be an option.

If you would like the analysis done in a separate program, you would need to print much more data. As this would make the .out files harder for humans to read, a better way would be to print the results in a separate part with a separate format. If you would do the analysis inside the benchmark, then you would implement your own logic, making this issue also moot.

> 2. It has no mechanism for showing variance, whereas with multiple runs of the same test the variance of the means can at least be seen.

There is a pretty good mechanism for showing variance, and it is called calculating the variance. However, adding variance calculation is a separate issue.

> 3. It only works for one test (memcpy).
>
It is a first step. Randomization is needed for all string functions, and it is better to start with a concrete example.

> --
> Will Newton
> Toolchain Working Group, Linaro
On 5 September 2013 17:04, Ondřej Bílka <neleai@seznam.cz> wrote:
> On Thu, Sep 05, 2013 at 04:18:18PM +0100, Will Newton wrote:
>> On 5 September 2013 16:03, Ondřej Bílka <neleai@seznam.cz> wrote:
>> > On Thu, Sep 05, 2013 at 08:51:53AM +0100, Will Newton wrote:
>> >> The intention of my patch - which I may not have made completely clear in the commit message - is to improve test stability. What I mean by this is that with a physically indexed cache, the physical pages allocated to the test can have a significant effect on performance at large buffer sizes (e.g. cache size / ways and above), and this will cause variation when running the same test multiple times. My aim is to average out these differences, as it is hard to control for them without understanding the details of the cache subsystem of the system you are running on.
>> >>
>> > This can be explained just by having more data. Simply multiplying the iteration count by four would then do the same job.
>>
>> No, it wouldn't. That would just mean four times as much data, resulting in a reduced variance but the same systematic error.
>>
> That is your claim. I am now asking you a second time to prove it.
>
> As I wrote in my previous mail, in the same place:
>
> Please run your patch ten times and calculate the variance. Compare that to the variance when the iteration count is increased 4 times and show whether there is an improvement.

The benchmarks do not currently have any measure of variance, so it's not possible to do this with the benchmarks as they stand. I have seen this effect with other benchmarks, however.

>> >> Your test appears to be addressing concerns of test validity by running a wider range of buffer alignments, which is an important but separate concern IMO.
>> >>
>> > No, your patch will pick the src pointer from 4 different physical pages (one allocated in each run) and calculate average performance.
>> >
>> > Mine will pick src pointers in 2000000/4096 = 488 different pages and calculate the average.
>>
>> Yes, this would work too. But it has a number of flaws:
>>
>> 1. It does not allow one to analyze the performance of the code across alignments; everything gets averaged together.
>
> You cannot analyse performance across alignments now, as the benchmarks do not print the necessary data.

It currently prints the alignments of the buffers, which is all that is required. The alignments chosen are a rather poor selection, though, I would agree.

>> 2. It has no mechanism for showing variance, whereas with multiple runs of the same test the variance of the means can at least be seen.
>
> There is a pretty good mechanism for showing variance, and it is called calculating the variance. However, adding variance calculation is a separate issue.

I think you misunderstand me. The benchmarks as they stand do not output any measure of variance. Multiple runs are a quick and easy way to get a measure of variance without modifying the benchmarks or their output.

>> 3. It only works for one test (memcpy).
>>
> It is a first step. Randomization is needed for all string functions, and it is better to start with a concrete example.

I agree completely; let's start by finding the best way to fix the benchmarks, but once we have consensus I think it would be best to fix all the benchmarks rather than leave some unfixed.
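Since both sides agree that some measure of variance would help, here is a small self-contained C sketch showing how a mean and variance could be computed from timing samples with Welford's one-pass algorithm. The sample values and the function name are placeholders; the current benchmark output does not provide this.

#include <math.h>
#include <stdio.h>

/* Welford's online algorithm: a single pass over the timing samples
   yields the mean and an unbiased sample variance, which could be
   printed alongside the mean that the benchmarks report today.  */
static void
mean_and_variance (const double *samples, size_t n,
                   double *mean, double *variance)
{
  double m = 0.0, m2 = 0.0;

  for (size_t i = 0; i < n; i++)
    {
      double delta = samples[i] - m;
      m += delta / (double) (i + 1);
      m2 += delta * (samples[i] - m);
    }

  *mean = m;
  *variance = (n > 1) ? m2 / (double) (n - 1) : 0.0;
}

int
main (void)
{
  /* Placeholder timings, e.g. mean cycles per call from four runs.  */
  double samples[] = { 101.0, 99.5, 100.2, 103.8 };
  double mean, variance;

  mean_and_variance (samples, sizeof samples / sizeof samples[0],
                     &mean, &variance);
  printf ("mean = %.2f  variance = %.2f  stddev = %.2f\n",
          mean, variance, sqrt (variance));
  return 0;
}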
On 09/05/2013 01:06 PM, Will Newton wrote:
> I agree completely; let's start by finding the best way to fix the benchmarks, but once we have consensus I think it would be best to fix all the benchmarks rather than leave some unfixed.

No, please fix things as soon as you have the fixes ready. There is no need to artificially fix all of them for the sake of perfection :-)

Cheers,
Carlos.
On 09/04/2013 10:48 AM, Will Newton wrote:
>
> The string benchmarks can be affected by physical page placement, so running them multiple times is required to account for this. Also back up the results of the previous run, as is done for the other benchmarks.

Siddhesh already commented on what I was going to say, which was that you have no guarantee that what you want is going to actually happen. The OS might give you back exactly the same physical pages on the subsequent run (though unlikely).

I agree on many of Ondrej's points. Driving engineering rigour into these benchmarks is important.

> ChangeLog:
>
> 2013-09-03  Will Newton  <will.newton@linaro.org>
>
> 	* benchtests/Makefile (BENCH_RUNS): New variable.
> 	(bench-set): Run tests BENCH_RUNS times and back up old run
> 	results.
> ---
>  benchtests/Makefile | 16 ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/benchtests/Makefile b/benchtests/Makefile
> index 6037e5c..d72c98f 100644
> --- a/benchtests/Makefile
> +++ b/benchtests/Makefile
> @@ -118,6 +118,11 @@ ifndef BENCH_DURATION
>  BENCH_DURATION := 10
>  endif
>
> +# The default number of runs of a benchmark: 4.

This needs a lengthy comment on the purpose of BENCH_RUNS. Why do we have it? What do we use it for? Why is 4 a good default?

> +ifndef BENCH_RUNS
> +BENCH_RUNS := 4
> +endif
> +
>  CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)
>
>  # Use clock_gettime to measure performance of functions. The default is to use
> @@ -146,8 +151,15 @@ bench: bench-set bench-func
>
>  bench-set: $(binaries-benchset)
>  	for run in $^; do \
> -	  echo "Running $${run}"; \
> -	  $(run-bench) > $${run}.out; \
> +	  for old in $${run}.*.out; do \
> +	    if [ -f $$old ]; then \
> +	      mv $$old $${old}.old; \
> +	    fi; \
> +	  done; \
> +	  for count in $$(seq 1 $(BENCH_RUNS)); do \
> +	    echo "Running $${run} ($${count})"; \
> +	    $(run-bench) > $${run}.$${count}.out; \
> +	  done; \
>  	done
>
>  bench-func: $(binaries-bench)
>

Given that extra runs have no adverse effects, I'm OK with this patch in principle.

Cheers,
Carlos.
On Thu, Sep 05, 2013 at 07:06:53PM -0400, Carlos O'Donell wrote:
> On 09/05/2013 01:06 PM, Will Newton wrote:
> > I agree completely; let's start by finding the best way to fix the benchmarks, but once we have consensus I think it would be best to fix all the benchmarks rather than leave some unfixed.
>
> No, please fix things as soon as you have the fixes ready. There is no need to artificially fix all of them for the sake of perfection :-)

The context here is that Ondrej posted a patch to allocate a larger buffer just for memcpy when it should have been done in the common header.

Siddhesh
diff --git a/benchtests/Makefile b/benchtests/Makefile
index 6037e5c..d72c98f 100644
--- a/benchtests/Makefile
+++ b/benchtests/Makefile
@@ -118,6 +118,11 @@ ifndef BENCH_DURATION
 BENCH_DURATION := 10
 endif
 
+# The default number of runs of a benchmark: 4.
+ifndef BENCH_RUNS
+BENCH_RUNS := 4
+endif
+
 CPPFLAGS-nonlib += -DDURATION=$(BENCH_DURATION)
 
 # Use clock_gettime to measure performance of functions. The default is to use
@@ -146,8 +151,15 @@ bench: bench-set bench-func
 
 bench-set: $(binaries-benchset)
 	for run in $^; do \
-	  echo "Running $${run}"; \
-	  $(run-bench) > $${run}.out; \
+	  for old in $${run}.*.out; do \
+	    if [ -f $$old ]; then \
+	      mv $$old $${old}.old; \
+	    fi; \
+	  done; \
+	  for count in $$(seq 1 $(BENCH_RUNS)); do \
+	    echo "Running $${run} ($${count})"; \
+	    $(run-bench) > $${run}.$${count}.out; \
+	  done; \
 	done
 
 bench-func: $(binaries-bench)