Message ID | 1483628157-25585-2-git-send-email-bill.fischofer@linaro.org
---|---
State | New
Headers | show
This patch is now merged, although I had doubts about it since it has a bad impact on performance. Here are some performance results for a couple of simple, single-threaded packet alloc/free test cases. No references involved, just plain packets as before.

Test results before and after "linux-generic: packet: implement reference apis":

CPU cycles per operation
                           before     after
packet_alloc                127.8     250.9     +96 %
packet_alloc_multi          873.7    1538.8     +76 %
packet_free                  31       116.8    +277 %
packet_free_multi           214.5    1369.2    +538 %
packet_alloc_free            73.4     193.7    +164 %
packet_alloc_free_multi     286.1    1228.8    +330 %

Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.

-Petri

> -----Original Message-----
> From: lng-odp [mailto:lng-odp-bounces@lists.linaro.org] On Behalf Of Bill
> Fischofer
> Sent: Thursday, January 05, 2017 4:56 PM
> To: lng-odp@lists.linaro.org
> Subject: [lng-odp] [API-NEXT PATCHv6 2/5] linux-generic: packet: implement
> reference apis
>
> Implement the APIs:
> - odp_packet_ref_static()
> - odp_packet_ref()
> - odp_packet_ref_pkt()
> - odp_packet_is_ref()
> - odp_packet_unshared_len()
>
> This also involves functional upgrades to the existing packet manipulation
> APIs to work with packet references as input arguments.
>
> Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
> ---
>  .../linux-generic/include/odp_packet_internal.h |  73 ++-
>  platform/linux-generic/odp_packet.c             | 542
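[Editor's note: for readers following along, here is a minimal usage sketch of the APIs the patch implements, based on the api-next definitions of these calls at the time; the offset value and error handling are illustrative only.]

    #include <stdio.h>
    #include <inttypes.h>
    #include <odp_api.h>

    void ref_example(odp_packet_t pkt)
    {
    	/* Share the whole packet, read-only */
    	odp_packet_t r1 = odp_packet_ref_static(pkt);
    	/* Share pkt's data starting at byte offset 16 */
    	odp_packet_t r2 = odp_packet_ref(pkt, 16);

    	if (r2 != ODP_PACKET_INVALID && odp_packet_is_ref(r2))
    		printf("bytes unique to r2: %" PRIu32 "\n",
    		       odp_packet_unshared_len(r2));

    	/* References are freed like any other packet; the shared data
    	 * lives until the last reference to it is freed. */
    	odp_packet_free(r2);
    	odp_packet_free(r1);
    }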
On Thu, Jan 12, 2017 at 6:12 AM, Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com> wrote:
> This patch is now merged, although I had doubts about it since it has a bad impact on performance. Here are some performance results for a couple of simple, single-threaded packet alloc/free test cases. No references involved, just plain packets as before.
>
> Test results before and after "linux-generic: packet: implement reference apis":
>
> CPU cycles per operation
>                            before     after
> packet_alloc                127.8     250.9     +96 %
> packet_alloc_multi          873.7    1538.8     +76 %
> packet_free                  31       116.8    +277 %
> packet_free_multi           214.5    1369.2    +538 %
> packet_alloc_free            73.4     193.7    +164 %
> packet_alloc_free_multi     286.1    1228.8    +330 %
>
> Huge performance degradation. Numbers are now many times worse than before or after my optimizations. To me this shows that almost a complete rewrite (or revert) is needed.

My guess is this is due to the atomics needed for reference counting not being properly inlined internally. I know you did similar optimizations for ticketlocks. Let me look into this and post a patch that does the same for the atomics.

> -Petri
> -----Original Message-----
> From: Bill Fischofer [mailto:bill.fischofer@linaro.org]
> Sent: Thursday, January 12, 2017 2:22 PM
> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com>
> Cc: lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] [API-NEXT PATCHv6 2/5] linux-generic: packet: implement reference apis
>
> My guess is this is due to the atomics needed for reference counting
> not being properly inlined internally. I know you did similar
> optimizations for ticketlocks. Let me look into this and post a patch
> that does the same for the atomics.

Atomics are inlined already. Also, atomic operations should not be required if there's a single reference.

-Petri
On Thu, Jan 12, 2017 at 6:27 AM, Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com> wrote:
> Atomics are inlined already. Also, atomic operations should not be
> required if there's a single reference.

The ref_count still needs to be initialized on alloc and decremented on free to see if the packet should really be freed. The only way to know whether you're dealing with a single reference or not is to check. Do you have a breakdown of where the hotspots are? These numbers do seem high.
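[Editor's note: to make the two points above concrete, here is a minimal sketch, not the merged code, of a free path that still initializes and checks the count but skips the atomic read-modify-write for a sole owner. It assumes the ref_count helpers and buffer_handle()/buffer_free_multi() conventions quoted later in this thread, and single-segment packets for brevity.]

    static inline void packet_free_sketch(odp_packet_hdr_t *pkt_hdr)
    {
    	odp_buffer_t buf = pkt_hdr->buf_hdr.handle.handle;

    	/* A plain load is safe as a fast path: if the count is 1, the
    	 * caller holds the only reference, so no other thread can be
    	 * changing it concurrently. Otherwise fall back to an atomic
    	 * decrement; the last holder (fetch_dec returns 1) frees. */
    	if (odp_atomic_load_u32(&pkt_hdr->ref_count) == 1 ||
    	    odp_atomic_fetch_dec_u32(&pkt_hdr->ref_count) == 1)
    		buffer_free_multi(&buf, 1);
    }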
> The ref_count still needs to be initialized on alloc and decremented
> on free to see if the packet should really be freed. The only way to
> know whether you're dealing with a single reference or not is to
> check. Do you have a breakdown of where the hotspots are? These
> numbers do seem high.

No, I didn't look into it any deeper.

-Petri
On Thu, Jan 12, 2017 at 6:52 AM, Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com> wrote:
> No, I didn't look into it any deeper.

If we look at the comparative pathlength for packet_alloc(), the only changes are to the internal routines packet_init() and init_segments(). The rest of the code is identical.

For packet_init() the delta is these added two lines:

	/* By default packet has no references */
	pkt_hdr->unshared_len = len;
	pkt_hdr->ref_hdr = NULL;

For init_segments() the delta is similarly tiny:

	packet_ref_count_set(hdr, 1);

where this is an inline function that expands to

	odp_atomic_init_u32(&hdr->ref_count, 1);

The changes on the packet_free() side are similarly tiny. Basically we decrement the ref_count and only actually free the buffer if the ref_count is now zero.

I don't see how these translate into a near doubling of cycles for these tests. It seems there must be something else going on.

> -Petri
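[Editor's note: for reference, the complete set of ref_count wrappers the patch adds, reproduced from the diff at the end of this thread:]

    static inline uint32_t packet_ref_count(odp_packet_hdr_t *pkt_hdr)
    {
    	/* Current number of references to this packet/segment */
    	return odp_atomic_load_u32(&pkt_hdr->ref_count);
    }

    static inline void packet_ref_count_set(odp_packet_hdr_t *pkt_hdr, uint32_t n)
    {
    	/* Used at alloc time, so init rather than store is sufficient */
    	odp_atomic_init_u32(&pkt_hdr->ref_count, n);
    }

    static inline uint32_t packet_ref_inc(odp_packet_hdr_t *pkt_hdr)
    {
    	return odp_atomic_fetch_inc_u32(&pkt_hdr->ref_count);
    }

    static inline uint32_t packet_ref_dec(odp_packet_hdr_t *pkt_hdr)
    {
    	/* Returns the value before the decrement: 1 means last holder */
    	return odp_atomic_fetch_dec_u32(&pkt_hdr->ref_count);
    }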
On Thu, Jan 12, 2017 at 6:58 AM, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> I don't see how these translate into a near doubling of cycles for
> these tests. It seems there must be something else going on.

I think I see some potential larger deltas in the packet free path. I'll dig into this further and post a patch.
I've posted patch http://patches.opendataplane.org/patch/7857/ as an initial step towards resolving this issue. Petri: can you make your test program available to me so I can try to reproduce your results locally? I don't have quite the scale or controlled environment that you do, but this should save iteration steps.

This patch adds a fastpath to odp_packet_free() that should more closely match the pre-refs code for non-reference packets. odp_packet_free_multi() is difficult because there is no guarantee that all of the packets passed to it are non-references, so I'm not sure how to match the prior code exactly in that case. This code just calls packet_free() for each of the buffers, so these individual calls will be optimized, but this is still done via individual calls rather than a single batched call.

On the alloc side I'm not sure what could be going on other than the fact that supporting references requires additional packet metadata, which probably spills onto an extra cache line compared to the original code. As noted earlier, the alloc() changes simply initialize this data, which shouldn't be more than a few extra cycles.

Thanks.

On Thu, Jan 12, 2017 at 7:22 AM, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> I think I see some potential larger deltas in the packet free path.
> I'll dig into this further and post a patch.
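[Editor's note: a sketch of the batching difficulty described above. Helper names are taken from the code snippets quoted elsewhere in this thread; the actual patch may differ, and single-segment packets are assumed for brevity. The point is that a multi-free can still batch whatever drops to a zero reference count.]

    void odp_packet_free_multi_sketch(const odp_packet_t pkt[], int num)
    {
    	odp_buffer_t buf[num];
    	int i, nfree = 0;

    	for (i = 0; i < num; i++) {
    		odp_packet_hdr_t *hdr = odp_packet_hdr(pkt[i]);

    		/* Only the last reference holder returns the buffer */
    		if (packet_ref_dec(hdr) == 1)
    			buf[nfree++] = buffer_handle(hdr);
    	}

    	/* One batched call for everything that actually died */
    	if (nfree > 0)
    		buffer_free_multi(buf, nfree);
    }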
Hi Petri,

I've posted v2 of the tuning patch at http://patches.opendataplane.org/patch/7878/ which adds optimization to the odp_packet_free_multi() routine as well as odp_packet_free(). So both of these should now be very similar for freeing reference and non-reference packets.

I'd still like access to the program that you were using to generate your cycle stats. The odp_scheduling test in the performance test dir seems too erratic from run to run to enable this sort of comparison.

One restructure included in this patch should benefit non-reference paths as well as reference paths. I noticed that the buffer_free_multi() routine is prepared to handle buffer lists coming from different pools and does the checking for pool crossings in that routine. This is wasteful because this routine is used internally when dealing with multi-segment packets, and these internal uses never mix buffers from different pools. The only place that can happen is in the external odp_packet_free_multi() and odp_buffer_free_multi() APIs, so I've moved that checking up to those APIs; buffer_free_multi() now just front-ends buffer_free_to_pool(), and the optimized internal packet free routines call buffer_free_to_pool() directly.

If you can confirm that these are heading in the right direction we can see what else might be needed from a tuning perspective.

Thanks.

On Thu, Jan 12, 2017 at 4:15 PM, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> I've posted patch http://patches.opendataplane.org/patch/7857/ as an
> initial step towards resolving this issue.
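[Editor's note: a sketch of the split Bill describes, with hypothetical helper names, since the mail does not show the actual code: buf_pool() is invented here, and the buffer_free_to_pool() signature is assumed. The idea is that only the external API pays for the pool-crossing check; internal single-pool callers go straight to the fast path.]

    void odp_buffer_free_multi(const odp_buffer_t buf[], int num)
    {
    	int i, first = 0;
    	pool_t *pool = buf_pool(buf[0]);	/* buf_pool() is hypothetical */

    	/* Split the array into runs of buffers from the same pool and
    	 * free each run with one call. */
    	for (i = 1; i < num; i++) {
    		pool_t *p = buf_pool(buf[i]);

    		if (p != pool) {		/* pool crossing detected */
    			buffer_free_to_pool(pool, &buf[first], i - first);
    			pool  = p;
    			first = i;
    		}
    	}

    	buffer_free_to_pool(pool, &buf[first], num - first);
    }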
Please ignore v2. It had a memory leak that Matias' excellent odp_bench_packet test found. A corrected v3 is posted at http://patches.opendataplane.org/patch/7879/

On Sat, Jan 14, 2017 at 10:04 AM, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> I've posted v2 of the tuning patch at
> http://patches.opendataplane.org/patch/7878/ which adds optimization
> to the odp_packet_free_multi() routine as well as odp_packet_free().
I've been playing around with odp_bench_packet and there are two problems with using this test:

1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT times, which is good; however, it only reports the average cycle time for these runs. There is considerable variability when running in a non-isolated environment, so what we really want to report is not just the average times, but also the minimum and maximum times encountered over those runs. From a microbenchmarking standpoint, the minimum times are of most interest since those are the "pure" measures of the functions being tested, with the least amount of interference from other sources. Statistically speaking, with a large TEST_SIZE_RUN_COUNT the minimums should get close to the actual isolated time for each of the tests without requiring the system to be fully isolated.

2. Within each test, the function being tested is performed TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The reason this is bad is that longer run times give more opportunity for "cut aways" and other interruptions that distort the measurement trying to be made.

I tried reducing TEST_REPEAT_COUNT to 1 but that doesn't seem to work (need to debug that further); TEST_REPEAT_COUNT = 2 does work, and we'll consider that good enough for now. With TEST_REPEAT_COUNT set to 2 and TEST_SIZE_RUN_COUNT set to 1000, and adding the accumulation and reporting of min/max as well as average times, we get closer to what we're really trying to measure here.

So here are the results I'm seeing for the alloc/free tests. I'm only looking at the 64-byte runs here (after "warmup") as that's the most relevant to what we're trying to measure and tune in this case.

For the "baseline" without packet references support added we see this:

Packet length: 64 bytes
---------------------------
bench_empty                   :    18 min     59 max  11.4 avg
bench_packet_alloc            :    92 min   1230 max  57.0 avg
bench_packet_alloc_multi      :   374 min   5440 max 205.0 avg
bench_packet_free             :    54 min   1360 max  39.6 avg
bench_packet_free_multi       :   178 min  36343 max 116.0 avg
bench_packet_alloc_free       :   140 min  41347 max 123.0 avg
bench_packet_alloc_free_multi :   540 min  41632 max 330.4 avg

The original reference code (what's currently in api-next) shows this:

Packet length: 64 bytes
---------------------------
bench_empty                   :    18 min    788 max  11.5 avg
bench_packet_alloc            :    93 min   1084 max  61.9 avg
bench_packet_alloc_multi      :   416 min   6874 max 242.5 avg
bench_packet_free             :   102 min    744 max  74.6 avg
bench_packet_free_multi       :   951 min   1513 max 499.4 avg
bench_packet_alloc_free       :   223 min  37317 max 164.6 avg
bench_packet_alloc_free_multi :  1403 min  52753 max 795.1 avg

With the v1 optimization patch applied:

Packet length: 64 bytes
---------------------------
bench_empty                   :    18 min    519 max  11.7 avg
bench_packet_alloc            :    93 min  35733 max  97.2 avg
bench_packet_alloc_multi      :   419 min  50064 max 249.9 avg
bench_packet_free             :    66 min  35564 max  62.2 avg
bench_packet_free_multi       :   538 min    717 max 283.2 avg
bench_packet_alloc_free       :   178 min  40694 max 135.5 avg
bench_packet_alloc_free_multi :  1000 min  41731 max 614.6 avg

And finally, with the v3 optimization patch applied:

Packet length: 64 bytes
---------------------------
bench_empty                   :    18 min    305 max  11.6 avg
bench_packet_alloc            :    93 min   1076 max  62.8 avg
bench_packet_alloc_multi      :   413 min 122717 max 364.2 avg
bench_packet_free             :    57 min    522 max  39.4 avg
bench_packet_free_multi       :   268 min   1852 max 159.9 avg
bench_packet_alloc_free       :   181 min   1760 max 128.2 avg
bench_packet_alloc_free_multi :   812 min  34288 max 476.4 avg

A few things to note:

1. The min/max reports are raw deltas from odp_cpu_cycles_diff() while the averages are computed results that are scaled differently. As a result, they cannot be compared directly with the min/max numbers, which is why we have anomalies like the "average" being lower than the "min". I should normalize these to be on the same scales but I haven't done that yet. So for now ignore the averages except to note that they move around a lot.

2. bench_empty, which is the "do nothing" test used to calibrate the fixed overhead of these tests, reports the same minimums across all of these while the maxes and averages differ significantly. This proves the point about the need to use minimums for these sorts of tests.

3. The mins for bench_packet_alloc are pretty much the same across all of these (92 vs 93). This is reassuring since, as noted in my original response, the code deltas here are just a couple of extra metadata fields being initialized, so the widely differing averages are an artifact of the problems noted at the top of this post.

4. The bench_packet_alloc_multi test shows a bit more overhead (374 vs. 413-419). This warrants a bit more investigation, but I haven't done that yet. Note, however, that this is nowhere near the doubling that was originally reported.

5. The bench_packet_free test shows the problem originally identified with the new code (54 vs 102), but the optimizations (v1 = 66, v3 = 57) have brought that back down to what should be insignificant levels.

6. The bench_packet_free_multi test also shows the original problem (178 vs 951) as well as the improvements made by the optimizations (v1 = 538, v3 = 268). There's probably still some room for improvement here, but clearly this is a tractable problem at this point.

I'll work my experiments with odp_bench_packet into a patch set that can be used to improve this very worthwhile framework.

On Sat, Jan 14, 2017 at 2:02 PM, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> Please ignore v2. It had a memory leak that Matias' excellent
> odp_bench_packet test found. A corrected v3 is posted at
> http://patches.opendataplane.org/patch/7879/
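[Editor's note: a minimal sketch, not the actual odp_bench_packet code, of the min/max/avg accounting Bill describes above, using the ODP cycle-count API. The macro values mirror the settings he reports.]

    #include <stdio.h>
    #include <inttypes.h>
    #include <odp_api.h>

    #define TEST_REPEAT_COUNT   2
    #define TEST_SIZE_RUN_COUNT 1000

    /* Run op() a small number of times per sample so interruptions have
     * less chance to land inside a sample; take many samples and track
     * min/max alongside the running average. */
    static void bench_min_max_avg(void (*op)(void))
    {
    	uint64_t min = UINT64_MAX, max = 0, sum = 0;
    	int run, i;

    	for (run = 0; run < TEST_SIZE_RUN_COUNT; run++) {
    		uint64_t c1 = odp_cpu_cycles();
    		uint64_t diff;

    		for (i = 0; i < TEST_REPEAT_COUNT; i++)
    			op();

    		diff = odp_cpu_cycles_diff(odp_cpu_cycles(), c1);
    		sum += diff;
    		if (diff < min)
    			min = diff;
    		if (diff > max)
    			max = diff;
    	}

    	printf("%" PRIu64 " min %" PRIu64 " max %.1f avg\n",
    	       min, max, (double)sum / TEST_SIZE_RUN_COUNT);
    }

With many samples, the min approximates the interference-free cost of op(), while a max far above the min signals that the run was not isolated.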
Might be off-topic, but do we need to check somewhere in the odp tests that CPU frequency scaling is turned off? My laptop likes to play with frequencies, and even warm-up does not work because it goes to maximum speed, then overheats, and the CPU frequency is lowered. If we count time in CPU cycles, measurements should give about the same number of cycles, but if we count real time, then there might be different results.

Maxim.

On 15 January 2017 at 18:08, Bill Fischofer <bill.fischofer@linaro.org> wrote:
> I've been playing around with odp_bench_packet and there are two
> problems with using this test:
> -----Original Message-----
> From: Bill Fischofer [mailto:bill.fischofer@linaro.org]
> Sent: Sunday, January 15, 2017 5:09 PM
> To: Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com>
> Cc: lng-odp@lists.linaro.org
> Subject: Re: [lng-odp] [API-NEXT PATCHv6 2/5] linux-generic: packet: implement reference apis
>
> I've been playing around with odp_bench_packet and there are two
> problems with using this test:
>
> 1. The test performs multiple runs of each test TEST_SIZE_RUN_COUNT
> times, which is good; however, it only reports the average cycle time
> for these runs. There is considerable variability when running in a
> non-isolated environment, so what we really want to report is not just
> the average times, but also the minimum and maximum times encountered
> over those runs. From a microbenchmarking standpoint, the minimum
> times are of most interest since those are the "pure" measures of the
> functions being tested, with the least amount of interference from
> other sources. Statistically speaking, with a large TEST_SIZE_RUN_COUNT
> the minimums should get close to the actual isolated time for each of
> the tests without requiring the system to be fully isolated.

Performance should be tested in isolation; at a minimum you need enough CPUs so that the OS does not use the CPU under test for anything else. Also, the minimum is not a good measure, since it reports only the (potentially rare) case when all cache and TLB accesses hit. Cache hit rate affects performance a lot. E.g. an L1 hit rate of 98% may give you tens of percent worse performance than a perfect 100% hit rate. So, if we print out only one figure, an average over many runs in isolation is the best we can get (easily).

> 2. Within each test, the function being tested is performed
> TEST_REPEAT_COUNT times, which is set at 1000. This is bad. The reason
> this is bad is that longer run times give more opportunity for "cut
> aways" and other interruptions that distort the measurement trying to
> be made.

The repeat count could be even higher, but it would require more packets in the pool; we used 1000 as a compromise. This needs to be large enough to hide one-time initialization and CPU cycle measurement overheads, which may add cycles and measurement variation for these simple test cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a 10-cycle operation must be repeated many times to hide it: 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.

-Petri
On Mon, Jan 16, 2017 at 2:45 AM, Savolainen, Petri (Nokia - FI/Espoo) <petri.savolainen@nokia-bell-labs.com> wrote:
> Performance should be tested in isolation; at a minimum you need enough
> CPUs so that the OS does not use the CPU under test for anything else.
> Also, the minimum is not a good measure, since it reports only the
> (potentially rare) case when all cache and TLB accesses hit. Cache hit
> rate affects performance a lot. E.g. an L1 hit rate of 98% may give you
> tens of percent worse performance than a perfect 100% hit rate. So, if
> we print out only one figure, an average over many runs in isolation is
> the best we can get (easily).

Ideally yes, however that's not always possible or even necessary for basic tuning. Moreover, this is a microbenchmark, which means it's really designed to identify small deltas in pathlengths in individual API implementations rather than to assess overall performance at an application level. We're not printing only one figure; however, just printing an average is inadequate. Recording and reporting the min, max, and avg values gives a fuller picture of what's going on. The mins show ideal performance and can be used to measure small pathlength differences, while the max and avg give an insight as to how "isolated" the system really is. You may think you're running in an isolated environment, but without these additional data you really don't know that. On a truly isolated system the max values should be consistent and not orders of magnitude larger than the mins. Only then can the averages be compared fairly.

> The repeat count could be even higher, but it would require more packets
> in the pool; we used 1000 as a compromise. This needs to be large enough
> to hide one-time initialization and CPU cycle measurement overheads,
> which may add cycles and measurement variation for these simple test
> cases. E.g. if it takes 100 cycles to read a CPU cycle counter, a
> 10-cycle operation must be repeated many times to hide it:
> 100 cycles / (1000 * 10 cycles) => 1% measurement overhead per operation.

That's what the bench_empty test does--it shows the fixed overhead imposed by the benchmark framework on each test. So those values can be subtracted from the individual tests to give a closer approximation to the relative impact of proposed changes. The goal here isn't exact cycle counts but rather a good and consistent approximation that can be used to assess the impact of proposed changes, which is what we're doing here.

The problem with repeats is that the longer the test runs, the more likely the isolation assumption becomes invalid. When that happens, as reported here, the variability obscures anything else you're trying to measure.

Ideally I'd like to see these be command-line arguments to the test rather than #defines so that different measures can be made in different environments.

> -Petri
diff --git a/platform/linux-generic/include/odp_packet_internal.h b/platform/linux-generic/include/odp_packet_internal.h
index d09231e..ccb59b9 100644
--- a/platform/linux-generic/include/odp_packet_internal.h
+++ b/platform/linux-generic/include/odp_packet_internal.h
@@ -167,7 +167,7 @@ typedef struct {
  * packet_init(). Because of this any new fields added must be reviewed for
  * initialization requirements.
  */
-typedef struct {
+typedef struct odp_packet_hdr_t {
 	/* common buffer header */
 	odp_buffer_hdr_t buf_hdr;
 
@@ -178,6 +178,13 @@ typedef struct {
 	uint32_t headroom;
 	uint32_t tailroom;
 
+	/* Fields used to support packet references */
+	odp_atomic_u32_t ref_count; /* Number of refs to this pkt/seg */
+	uint32_t unshared_len;      /* Offset that sharing starts at */
+	uint32_t ref_offset;        /* Offset into base pkt for this ref */
+	uint32_t ref_len;           /* frame_len at time this ref created */
+	struct odp_packet_hdr_t *ref_hdr; /* Ptr to the base pkt for this ref */
+
 	odp_pktio_t input;
 
 	/* Members below are not initialized by packet_init() */
@@ -200,6 +207,55 @@ static inline odp_packet_hdr_t *odp_packet_hdr(odp_packet_t pkt)
 	return (odp_packet_hdr_t *)buf_hdl_to_hdr((odp_buffer_t)pkt);
 }
 
+static inline odp_packet_hdr_t *odp_packet_last_hdr(odp_packet_t pkt,
+						    uint32_t *offset)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+	uint32_t ref_offset = 0;
+
+	while (pkt_hdr->ref_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr = pkt_hdr;
+		pkt_hdr = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_hdr_t *odp_packet_prev_hdr(odp_packet_hdr_t *pkt_hdr,
+						    odp_packet_hdr_t *cur_hdr,
+						    uint32_t *offset)
+{
+	uint32_t ref_offset = 0;
+	odp_packet_hdr_t *prev_hdr = pkt_hdr;
+
+	while (pkt_hdr->ref_hdr != cur_hdr) {
+		ref_offset = pkt_hdr->ref_offset;
+		prev_hdr = pkt_hdr;
+		pkt_hdr = pkt_hdr->ref_hdr;
+	}
+
+	if (offset) {
+		if (prev_hdr != pkt_hdr)
+			ref_offset += pkt_hdr->frame_len - prev_hdr->ref_len;
+		*offset = ref_offset;
+	}
+
+	return pkt_hdr;
+}
+
+static inline odp_packet_t _odp_packet_hdl(odp_packet_hdr_t *pkt_hdr)
+{
+	return (odp_packet_t)odp_hdr_to_buf(&pkt_hdr->buf_hdr);
+}
+
 static inline void copy_packet_parser_metadata(odp_packet_hdr_t *src_hdr,
 					       odp_packet_hdr_t *dst_hdr)
 {
@@ -222,12 +278,25 @@ static inline void pull_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[last].len -= len;
 }
 
 static inline uint32_t packet_len(odp_packet_hdr_t *pkt_hdr)
 {
-	return pkt_hdr->frame_len;
+	uint32_t pkt_len = 0;
+	uint32_t offset = 0;
+
+	do {
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset = pkt_hdr->ref_offset;
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 static inline void packet_set_len(odp_packet_hdr_t *pkt_hdr, uint32_t len)
diff --git a/platform/linux-generic/odp_packet.c b/platform/linux-generic/odp_packet.c
index 58b6f32..5de3b8e 100644
--- a/platform/linux-generic/odp_packet.c
+++ b/platform/linux-generic/odp_packet.c
@@ -30,13 +30,34 @@ static inline odp_buffer_t buffer_handle(odp_packet_hdr_t *pkt_hdr)
 	return pkt_hdr->buf_hdr.handle.handle;
 }
 
+static inline uint32_t packet_ref_count(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_load_u32(&pkt_hdr->ref_count);
+}
+
+static inline void packet_ref_count_set(odp_packet_hdr_t *pkt_hdr, uint32_t n)
+{
+	odp_atomic_init_u32(&pkt_hdr->ref_count, n);
+}
+
+static inline uint32_t packet_ref_inc(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_inc_u32(&pkt_hdr->ref_count);
+}
+
+static inline uint32_t packet_ref_dec(odp_packet_hdr_t *pkt_hdr)
+{
+	return odp_atomic_fetch_dec_u32(&pkt_hdr->ref_count);
+}
+
 static inline uint32_t packet_seg_len(odp_packet_hdr_t *pkt_hdr,
 				      uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].len;
 }
 
-static inline void *packet_seg_data(odp_packet_hdr_t *pkt_hdr, uint32_t seg_idx)
+static inline uint8_t *packet_seg_data(odp_packet_hdr_t *pkt_hdr,
+				       uint32_t seg_idx)
 {
 	return pkt_hdr->buf_hdr.seg[seg_idx].data;
 }
@@ -49,6 +70,11 @@ static inline int packet_last_seg(odp_packet_hdr_t *pkt_hdr)
 	return pkt_hdr->buf_hdr.segcount - 1;
 }
 
+static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
+{
+	return pkt_hdr->buf_hdr.seg[0].data;
+}
+
 static inline uint32_t packet_first_seg_len(odp_packet_hdr_t *pkt_hdr)
 {
 	return packet_seg_len(pkt_hdr, 0);
@@ -61,11 +87,6 @@ static inline uint32_t packet_last_seg_len(odp_packet_hdr_t *pkt_hdr)
 	return packet_seg_len(pkt_hdr, last);
 }
 
-static inline void *packet_data(odp_packet_hdr_t *pkt_hdr)
-{
-	return pkt_hdr->buf_hdr.seg[0].data;
-}
-
 static inline void *packet_tail(odp_packet_hdr_t *pkt_hdr)
 {
 	int last = packet_last_seg(pkt_hdr);
@@ -96,6 +117,7 @@ static inline void push_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[0].data -= len;
 	pkt_hdr->buf_hdr.seg[0].len += len;
 }
@@ -104,6 +126,7 @@ static inline void pull_head(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 {
 	pkt_hdr->headroom += len;
 	pkt_hdr->frame_len -= len;
+	pkt_hdr->unshared_len -= len;
 	pkt_hdr->buf_hdr.seg[0].data += len;
 	pkt_hdr->buf_hdr.seg[0].len -= len;
 }
@@ -114,6 +137,7 @@ static inline void push_tail(odp_packet_hdr_t *pkt_hdr, uint32_t len)
 
 	pkt_hdr->tailroom -= len;
 	pkt_hdr->frame_len += len;
+	pkt_hdr->unshared_len += len;
 	pkt_hdr->buf_hdr.seg[last].len += len;
 }
 
@@ -141,6 +165,10 @@ static inline void packet_seg_copy_md(odp_packet_hdr_t *dst,
 	dst->buf_hdr.uarea_addr = src->buf_hdr.uarea_addr;
 	dst->buf_hdr.uarea_size = src->buf_hdr.uarea_size;
 
+	/* reference related metadata */
+	dst->ref_len = src->ref_len;
+	dst->unshared_len = src->unshared_len;
+
 	/* segmentation data is not copied:
 	 *   buf_hdr.seg[]
 	 *   buf_hdr.segcount
@@ -155,7 +183,15 @@ static inline void *packet_map(odp_packet_hdr_t *pkt_hdr,
 	int seg = 0;
 	int seg_count = pkt_hdr->buf_hdr.segcount;
 
-	if (odp_unlikely(offset >= pkt_hdr->frame_len))
+	/* Special processing for references */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr = pkt_hdr->ref_hdr;
+		seg_count = pkt_hdr->buf_hdr.segcount;
+	}
+
+	if (odp_unlikely(offset > pkt_hdr->frame_len))
 		return NULL;
 
 	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || seg_count == 1)) {
@@ -249,6 +285,10 @@ static inline void packet_init(odp_packet_hdr_t *pkt_hdr, uint32_t len,
 					    CONFIG_PACKET_TAILROOM;
 
 	pkt_hdr->input = ODP_PKTIO_INVALID;
+
+	/* By default packet has no references */
+	pkt_hdr->ref_hdr = NULL;
+	pkt_hdr->unshared_len = len;
 }
 
 static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
@@ -261,6 +301,7 @@ static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 
 	pkt_hdr->buf_hdr.seg[0].data = pkt_hdr->buf_hdr.base_data;
 	pkt_hdr->buf_hdr.seg[0].len = pkt_hdr->buf_hdr.base_len;
+	packet_ref_count_set(pkt_hdr, 1);
 
 	/* Link segments */
 	if (odp_unlikely(CONFIG_PACKET_MAX_SEGS != 1)) {
@@ -272,6 +313,7 @@ static inline odp_packet_hdr_t *init_segments(odp_buffer_t buf[], int num)
 			odp_buffer_hdr_t *b_hdr;
 
 			hdr = odp_packet_hdr((odp_packet_t)buf[i]);
+			packet_ref_count_set(hdr, 1);
 			b_hdr = &hdr->buf_hdr;
 
 			pkt_hdr->buf_hdr.seg[i].hdr = hdr;
@@ -375,9 +417,10 @@ static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		new_hdr->buf_hdr.seg[0].len = seg_len;
 
 		packet_seg_copy_md(new_hdr, pkt_hdr);
-		new_hdr->frame_len = pkt_hdr->frame_len + len;
-		new_hdr->headroom = pool->headroom + offset;
-		new_hdr->tailroom = pkt_hdr->tailroom;
+		new_hdr->frame_len = pkt_hdr->frame_len + len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len + len;
+		new_hdr->headroom = pool->headroom + offset;
+		new_hdr->tailroom = pkt_hdr->tailroom;
 
 		pkt_hdr = new_hdr;
 	} else {
@@ -390,8 +433,9 @@ static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 		last = packet_last_seg(pkt_hdr);
 		pkt_hdr->buf_hdr.seg[last].len = seg_len;
 
-		pkt_hdr->frame_len += len;
-		pkt_hdr->tailroom = pool->tailroom + offset;
+		pkt_hdr->frame_len += len;
+		pkt_hdr->unshared_len += len;
+		pkt_hdr->tailroom = pool->tailroom + offset;
 	}
 
 	return pkt_hdr;
@@ -399,13 +443,18 @@ static inline odp_packet_hdr_t *add_segments(odp_packet_hdr_t *pkt_hdr,
 
 static inline void free_bufs(odp_packet_hdr_t *pkt_hdr, int first, int num)
 {
-	int i;
+	int i, nfree;
 	odp_buffer_t buf[num];
 
-	for (i = 0; i < num; i++)
-		buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[first + i].hdr);
+	for (i = 0, nfree = 0; i < num; i++) {
+		odp_packet_hdr_t *hdr = pkt_hdr->buf_hdr.seg[first + i].hdr;
 
-	buffer_free_multi(buf, num);
+		if (packet_ref_dec(hdr) == 1)
+			buf[nfree++] = buffer_handle(hdr);
+	}
+
+	if (nfree > 0)
+		buffer_free_multi(buf, nfree);
 }
 
static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
@@ -416,11 +465,15 @@ static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 
 	if (head) {
 		odp_packet_hdr_t *new_hdr;
-		int i;
+		int i, nfree;
 		odp_buffer_t buf[num];
 
-		for (i = 0; i < num; i++)
-			buf[i] = buffer_handle(pkt_hdr->buf_hdr.seg[i].hdr);
+		for (i = 0, nfree = 0; i < num; i++) {
+			new_hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+
+			if (packet_ref_dec(new_hdr) == 1)
+				buf[nfree++] = buffer_handle(new_hdr);
+		}
 
 		/* First remaining segment is the new packet descriptor */
 		new_hdr = pkt_hdr->buf_hdr.seg[num].hdr;
@@ -429,15 +482,17 @@ static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		packet_seg_copy_md(new_hdr, pkt_hdr);
 
 		/* Tailroom not changed */
-		new_hdr->tailroom = pkt_hdr->tailroom;
-		new_hdr->headroom = seg_headroom(new_hdr, 0);
-		new_hdr->frame_len = pkt_hdr->frame_len - free_len;
+		new_hdr->tailroom = pkt_hdr->tailroom;
+		new_hdr->headroom = seg_headroom(new_hdr, 0);
+		new_hdr->frame_len = pkt_hdr->frame_len - free_len;
+		new_hdr->unshared_len = pkt_hdr->unshared_len - free_len;
 
 		pull_head(new_hdr, pull_len);
 
 		pkt_hdr = new_hdr;
 
-		buffer_free_multi(buf, num);
+		if (nfree > 0)
+			buffer_free_multi(buf, nfree);
 	} else {
 		/* Free last 'num' bufs */
 		free_bufs(pkt_hdr, num_remain, num);
@@ -446,6 +501,7 @@ static inline odp_packet_hdr_t *free_segments(odp_packet_hdr_t *pkt_hdr,
 		 * of the metadata.
 		 */
 		pkt_hdr->buf_hdr.segcount = num_remain;
 		pkt_hdr->frame_len -= free_len;
+		pkt_hdr->unshared_len -= free_len;
 		pkt_hdr->tailroom = seg_tailroom(pkt_hdr, num_remain - 1);
 
 		pull_tail(pkt_hdr, pull_len);
@@ -546,45 +602,34 @@ int odp_packet_alloc_multi(odp_pool_t pool_hdl, uint32_t len,
 	return num;
 }
 
-void odp_packet_free(odp_packet_t pkt)
+static inline void packet_free(odp_packet_hdr_t *pkt_hdr)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	int num_seg = pkt_hdr->buf_hdr.segcount;
+	odp_packet_hdr_t *ref_hdr;
+	uint32_t ref_count;
 
-	if (odp_likely(CONFIG_PACKET_MAX_SEGS == 1 || num_seg == 1))
-		buffer_free_multi((odp_buffer_t *)&pkt, 1);
-	else
-		free_bufs(pkt_hdr, 0, num_seg);
-}
+	do {
+		ref_hdr = pkt_hdr->ref_hdr;
+		ref_count = packet_ref_count(pkt_hdr) - 1;
+		free_bufs(pkt_hdr, 0, pkt_hdr->buf_hdr.segcount);
 
-void odp_packet_free_multi(const odp_packet_t pkt[], int num)
-{
-	if (CONFIG_PACKET_MAX_SEGS == 1) {
-		buffer_free_multi((const odp_buffer_t * const)pkt, num);
-	} else {
-		odp_buffer_t buf[num * CONFIG_PACKET_MAX_SEGS];
-		int i, j;
-		int bufs = 0;
-
-		for (i = 0; i < num; i++) {
-			odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt[i]);
-			int num_seg = pkt_hdr->buf_hdr.segcount;
-			odp_buffer_hdr_t *buf_hdr = &pkt_hdr->buf_hdr;
+		if (ref_count == 1)
+			pkt_hdr->unshared_len = pkt_hdr->frame_len;
 
-			buf[bufs] = (odp_buffer_t)pkt[i];
-			bufs++;
+		pkt_hdr = ref_hdr;
+	} while (pkt_hdr);
+}
 
-			if (odp_likely(num_seg == 1))
-				continue;
+void odp_packet_free(odp_packet_t pkt)
+{
+	packet_free(odp_packet_hdr(pkt));
+}
 
-			for (j = 1; j < num_seg; j++) {
-				buf[bufs] = buffer_handle(buf_hdr->seg[j].hdr);
-				bufs++;
-			}
-		}
+void odp_packet_free_multi(const odp_packet_t pkt[], int num)
+{
+	int i;
 
-		buffer_free_multi(buf, bufs);
-	}
+	for (i = 0; i < num; i++)
+		packet_free(odp_packet_hdr(pkt[i]));
 }
 
 int odp_packet_reset(odp_packet_t pkt, uint32_t len)
@@ -595,6 +640,9 @@ int odp_packet_reset(odp_packet_t pkt, uint32_t len)
 	if (len > pool->headroom + pool->data_size + pool->tailroom)
 		return -1;
 
+	if (pkt_hdr->ref_hdr)
+		packet_free(pkt_hdr->ref_hdr);
+
 	packet_init(pkt_hdr, len, 0);
 
 	return 0;
@@ -637,15 +685,21 @@ void *odp_packet_head(odp_packet_t pkt)
 uint32_t odp_packet_buf_len(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t buf_len = 0;
 
-	return pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+	do {
+		buf_len += pkt_hdr->buf_hdr.size * pkt_hdr->buf_hdr.segcount;
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return buf_len;
 }
 
 void *odp_packet_data(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	return packet_data(pkt_hdr);
+	return packet_map(pkt_hdr, 0, NULL, NULL);
 }
 
 uint32_t odp_packet_seg_len(odp_packet_t pkt)
@@ -657,7 +711,32 @@ uint32_t odp_packet_seg_len(odp_packet_t pkt)
 
 uint32_t odp_packet_len(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->frame_len;
+	return packet_len(odp_packet_hdr(pkt));
+}
+
+uint32_t odp_packet_unshared_len(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t pkt_len = 0, offset = 0;
+
+	do {
+		if (packet_ref_count(pkt_hdr) > 1) {
+			if (offset == 0)
+				pkt_len += pkt_hdr->unshared_len;
+			break;
+		}
+
+		pkt_len += pkt_hdr->frame_len - offset;
+		offset = pkt_hdr->ref_offset;
+
+		if (pkt_hdr->ref_hdr)
+			offset += (pkt_hdr->ref_hdr->frame_len -
+				   pkt_hdr->ref_len);
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+
+	return pkt_len;
 }
 
 uint32_t odp_packet_headroom(odp_packet_t pkt)
@@ -667,12 +746,12 @@ uint32_t odp_packet_headroom(odp_packet_t pkt)
 
 uint32_t odp_packet_tailroom(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->tailroom;
+	return odp_packet_last_hdr(pkt, NULL)->tailroom;
 }
 
 void *odp_packet_tail(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 
 	return packet_tail(pkt_hdr);
 }
@@ -871,7 +950,7 @@ int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
 	uint32_t frame_len = pkt_hdr->frame_len;
-	uint32_t headroom = pkt_hdr->headroom;
+	uint32_t headroom  = pkt_hdr->headroom;
 	int ret = 0;
 
 	if (len > headroom) {
@@ -886,6 +965,46 @@ int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		segs = pkt_hdr->buf_hdr.segcount;
 
 		if (odp_unlikely((segs + num) > CONFIG_PACKET_MAX_SEGS)) {
+			/* Handle recursively via references when
+			 * working with referenced packets since another
+			 * thread may be accessing it concurrently via
+			 * its reference to it. */
+			if (packet_ref_count(pkt_hdr) > 1) {
+				odp_packet_t ref;
+				uint32_t unshared_len;
+
+				push_head(pkt_hdr, headroom);
+				unshared_len = pkt_hdr->unshared_len;
+				ref = odp_packet_ref(*pkt, 0);
+
+				if (ref == ODP_PACKET_INVALID) {
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				ret = odp_packet_extend_head(&ref,
+							     len - headroom,
+							     data_ptr,
+							     seg_len);
+
+				if (ret < 0) {
+					odp_packet_free(ref);
+					pull_head(pkt_hdr, headroom);
+					return -1;
+				}
+
+				/* Since this is a special ref, the
+				 * base pkt's unshared len is unchanged */
+				pkt_hdr->unshared_len = unshared_len;
+
+				/* Remove extra ref to the base pkt */
+				odp_packet_free(*pkt);
+
+				/* Return the ref as the extension result */
+				*pkt = ref;
+				return 1;
+			}
+
 			/* Cannot directly add new segments */
 			odp_packet_hdr_t *new_hdr;
 			int new_segs = 0;
@@ -938,6 +1057,7 @@ int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 
 			pkt_hdr->buf_hdr.segcount = segs;
 			pkt_hdr->frame_len = frame_len;
+			pkt_hdr->unshared_len = frame_len;
 			pkt_hdr->headroom = offset + pool->headroom;
 			pkt_hdr->tailroom = pool->tailroom;
 
@@ -963,11 +1083,16 @@ int odp_packet_extend_head(odp_packet_t *pkt, uint32_t len,
 		push_head(pkt_hdr, len);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len) {
+		uint32_t seg_ln = 0;
+		void *data = packet_map(pkt_hdr, 0, &seg_ln, NULL);
 
-	if (seg_len)
-		*seg_len = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data;
+
+		if (seg_len)
+			*seg_len = seg_ln;
+	}
 
 	return ret;
 }
@@ -979,6 +1104,8 @@ void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 	if (len > pkt_hdr->frame_len)
 		return NULL;
 
+	ODP_ASSERT(len <= pkt_hdr->unshared_len);
+
 	pull_head(pkt_hdr, len);
 	return packet_data(pkt_hdr);
 }
@@ -986,15 +1113,35 @@ void *odp_packet_pull_head(odp_packet_t pkt, uint32_t len)
 int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 			  void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt), *nxt_hdr;
 	uint32_t seg_len = packet_first_seg_len(pkt_hdr);
+	int ret = 0;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(pkt_hdr))
 		return -1;
 
-	if (len < seg_len) {
+	ODP_ASSERT(len <= odp_packet_unshared_len(*pkt));
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+		nxt_hdr = pkt_hdr->ref_hdr;
+		len -= pkt_hdr->frame_len;
+		len += pkt_hdr->ref_offset +
+			(nxt_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = nxt_hdr;
+		seg_len = packet_first_seg_len(pkt_hdr);
+		*pkt = packet_handle(pkt_hdr);
+		ret = 1;
+	}
+
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_head(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
@@ -1009,23 +1156,29 @@ int odp_packet_trunc_head(odp_packet_t *pkt, uint32_t len,
 		*pkt = packet_handle(pkt_hdr);
 	}
 
-	if (data_ptr)
-		*data_ptr = packet_data(pkt_hdr);
+	if (data_ptr || seg_len_out) {
+		void *data_head = packet_map(pkt_hdr, 0, &seg_len, NULL);
 
-	if (seg_len_out)
-		*seg_len_out = packet_first_seg_len(pkt_hdr);
+		if (data_ptr)
+			*data_ptr = data_head;
 
-	return 0;
+		if (seg_len_out)
+			*seg_len_out = seg_len;
+	}
+
+	return ret;
 }
 
 void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(pkt, NULL);
 	void *old_tail;
 
 	if (len > pkt_hdr->tailroom)
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	old_tail = packet_tail(pkt_hdr);
 	push_tail(pkt_hdr, len);
 
@@ -1035,12 +1188,14 @@ void *odp_packet_push_tail(odp_packet_t pkt, uint32_t len)
 int odp_packet_extend_tail(odp_packet_t *pkt, uint32_t len,
 			   void **data_ptr, uint32_t *seg_len_out)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr = odp_packet_last_hdr(*pkt, NULL);
 	uint32_t frame_len = pkt_hdr->frame_len;
 	uint32_t tailroom = pkt_hdr->tailroom;
 	uint32_t tail_off = frame_len;
 	int ret = 0;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
+
 	if (len > tailroom) {
 		pool_t *pool = pool_entry_from_hdl(pkt_hdr->buf_hdr.pool_hdl);
 		int num;
@@ -1132,6 +1287,7 @@ void *odp_packet_pull_tail(odp_packet_t pkt, uint32_t len)
 	if (len > packet_last_seg_len(pkt_hdr))
 		return NULL;
 
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	pull_tail(pkt_hdr, len);
 
 	return packet_tail(pkt_hdr);
@@ -1142,17 +1298,34 @@ int odp_packet_trunc_tail(odp_packet_t *pkt, uint32_t len,
 {
 	int last;
 	uint32_t seg_len;
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(*pkt);
+	uint32_t offset;
+	odp_packet_hdr_t *first_hdr = odp_packet_hdr(*pkt);
+	odp_packet_hdr_t *pkt_hdr, *prev_hdr;
 
-	if (len > pkt_hdr->frame_len)
+	if (len > packet_len(first_hdr))
 		return -1;
 
+	pkt_hdr = odp_packet_last_hdr(*pkt, &offset);
+
+	/* Special processing for references */
+	while (len >= pkt_hdr->frame_len - offset && first_hdr->ref_hdr) {
+		len -= (pkt_hdr->frame_len - offset);
+		prev_hdr = odp_packet_prev_hdr(first_hdr, pkt_hdr, &offset);
+		ODP_ASSERT(packet_ref_count(prev_hdr) == 1);
+		prev_hdr->ref_hdr = NULL;
+		packet_free(pkt_hdr);
+		pkt_hdr = prev_hdr;
+	}
+
+	ODP_ASSERT(packet_ref_count(pkt_hdr) == 1);
 	last = packet_last_seg(pkt_hdr);
 	seg_len = packet_seg_len(pkt_hdr, last);
 
-	if (len < seg_len) {
+	if (CONFIG_PACKET_MAX_SEGS == 1 ||
+	    len < seg_len ||
+	    pkt_hdr->buf_hdr.segcount == 1) {
 		pull_tail(pkt_hdr, len);
-	} else if (CONFIG_PACKET_MAX_SEGS != 1) {
+	} else {
 		int num = 0;
 		uint32_t pull_len = 0;
@@ -1359,35 +1532,50 @@ void odp_packet_ts_set(odp_packet_t pkt, odp_time_t timestamp)
 
 int odp_packet_is_segmented(odp_packet_t pkt)
 {
-	return odp_packet_hdr(pkt)->buf_hdr.segcount > 1;
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->buf_hdr.segcount > 1 || pkt_hdr->ref_hdr != NULL;
 }
 
 int odp_packet_num_segs(odp_packet_t pkt)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t segcount = 0, i;
+	uint32_t seg_offset = 0, offset;
+
+	do {
+		segcount += pkt_hdr->buf_hdr.segcount - seg_offset;
+		offset = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		if (pkt_hdr) {
+			for (i = 0, seg_offset = 0;
+			     i < pkt_hdr->buf_hdr.segcount;
+			     i++, seg_offset++) {
+				if (offset < pkt_hdr->buf_hdr.seg[i].len)
+					break;
+				offset -= pkt_hdr->buf_hdr.seg[i].len;
+			}
+		}
+	} while (pkt_hdr);
 
-	return pkt_hdr->buf_hdr.segcount;
+	return segcount;
 }
 
-odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt)
+odp_packet_seg_t odp_packet_first_seg(odp_packet_t pkt ODP_UNUSED)
 {
-	(void)pkt;
-
 	return 0;
 }
 
 odp_packet_seg_t odp_packet_last_seg(odp_packet_t pkt)
 {
-	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-
-	return packet_last_seg(pkt_hdr);
+	return (odp_packet_seg_t)(odp_packet_num_segs(pkt) - 1);
 }
 
 odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (odp_unlikely(seg >= (odp_packet_seg_t)packet_last_seg(pkt_hdr)))
+	if (odp_unlikely(seg >= packet_last_seg(pkt_hdr)))
 		return ODP_PACKET_SEG_INVALID;
 
 	return seg + 1;
@@ -1403,21 +1591,51 @@ odp_packet_seg_t odp_packet_next_seg(odp_packet_t pkt, odp_packet_seg_t seg)
 
 void *odp_packet_seg_data(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return NULL;
 
-	return packet_seg_data(pkt_hdr, seg);
+	return packet_seg_data(pkt_hdr, seg + seg_offset) + offset;
 }
 
 uint32_t odp_packet_seg_data_len(odp_packet_t pkt, odp_packet_seg_t seg)
 {
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+	uint32_t seg_offset = 0, offset = 0, i;
+
+	while (seg >= pkt_hdr->buf_hdr.segcount - seg_offset &&
+	       pkt_hdr->ref_hdr) {
+		seg -= (pkt_hdr->buf_hdr.segcount - seg_offset);
+		offset = pkt_hdr->ref_offset;
+		pkt_hdr = pkt_hdr->ref_hdr;
+		for (i = 0, seg_offset = 0;
+		     i < pkt_hdr->buf_hdr.segcount;
+		     i++, seg_offset++) {
+			if (offset < pkt_hdr->buf_hdr.seg[i].len)
+				break;
+			offset -= pkt_hdr->buf_hdr.seg[i].len;
+		}
+	}
 
-	if (odp_unlikely(seg >= pkt_hdr->buf_hdr.segcount))
+	if (odp_unlikely(seg + seg_offset >= pkt_hdr->buf_hdr.segcount))
 		return 0;
 
-	return packet_seg_len(pkt_hdr, seg);
+	return packet_seg_len(pkt_hdr, seg + seg_offset) - offset;
 }
 
 /*
@@ -1431,12 +1649,14 @@ int odp_packet_add_data(odp_packet_t *pkt_ptr, uint32_t offset, uint32_t len)
 {
 	odp_packet_t pkt = *pkt_ptr;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = pkt_hdr->frame_len;
+	uint32_t pktlen = packet_len(pkt_hdr);
 	odp_packet_t newpkt;
 
 	if (offset > pktlen)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt_ptr) >= offset);
+
 	newpkt = odp_packet_alloc(pkt_hdr->buf_hdr.pool_hdl, pktlen + len);
 
 	if (newpkt == ODP_PACKET_INVALID)
@@ -1499,6 +1719,8 @@ int odp_packet_align(odp_packet_t *pkt, uint32_t offset, uint32_t len,
 	if (align > ODP_CACHE_LINE_SIZE)
 		return -1;
 
+	ODP_ASSERT(odp_packet_is_ref(*pkt) == 0);
+
 	if (seglen >= len) {
 		misalign = align <= 1 ? 0 :
 			ODP_ALIGN_ROUNDUP(uaddr, align) - uaddr;
@@ -1538,10 +1760,13 @@ int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 	uint32_t dst_len = dst_hdr->frame_len;
 	uint32_t src_len = src_hdr->frame_len;
 
+	ODP_ASSERT(packet_ref_count(dst_hdr) == 1);
+
 	/* Do a copy if resulting packet would be out of segments or packets
-	 * are from different pools. */
+	 * are from different pools or src is a reference. */
 	if (odp_unlikely((dst_segs + src_segs) > CONFIG_PACKET_MAX_SEGS) ||
-	    odp_unlikely(dst_pool != src_pool)) {
+	    odp_unlikely(dst_pool != src_pool) ||
+	    odp_unlikely(packet_ref_count(src_hdr)) > 1) {
 		if (odp_packet_extend_tail(dst, src_len, NULL, NULL) >= 0) {
 			(void)odp_packet_copy_from_pkt(*dst, dst_len,
 						       src, 0, src_len);
@@ -1556,8 +1781,9 @@ int odp_packet_concat(odp_packet_t *dst, odp_packet_t src)
 
 	add_all_segs(dst_hdr, src_hdr);
 
-	dst_hdr->frame_len = dst_len + src_len;
-	dst_hdr->tailroom = src_hdr->tailroom;
+	dst_hdr->frame_len = dst_len + src_len;
+	dst_hdr->unshared_len = dst_len + src_len;
+	dst_hdr->tailroom = src_hdr->tailroom;
 
 	/* Data was not moved in memory */
 	return 0;
@@ -1570,6 +1796,7 @@ int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 	if (len >= pktlen || tail == NULL)
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(*pkt) >= len);
 	*tail = odp_packet_copy_part(*pkt, len, pktlen - len,
 				     odp_packet_pool(*pkt));
 
@@ -1580,6 +1807,108 @@ int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 }
 
 /*
+ * References
+ */
+
+static inline void packet_ref(odp_packet_hdr_t *pkt_hdr)
+{
+	uint32_t i;
+	odp_packet_hdr_t *hdr;
+
+	do {
+		for (i = 0; i < pkt_hdr->buf_hdr.segcount; i++) {
+			hdr = pkt_hdr->buf_hdr.seg[i].hdr;
+			packet_ref_inc(hdr);
+		}
+
+		pkt_hdr = pkt_hdr->ref_hdr;
+	} while (pkt_hdr);
+}
+
+static inline odp_packet_t packet_splice(odp_packet_hdr_t *pkt_hdr,
+					 uint32_t offset,
+					 odp_packet_hdr_t *ref_hdr)
+{
+	/* Catch attempted references to stale handles in debug builds */
+	ODP_ASSERT(packet_ref_count(pkt_hdr) > 0);
+
+	/* Splicing is from the last section of src pkt */
+	while (ref_hdr->ref_hdr)
+		ref_hdr = ref_hdr->ref_hdr;
+
+	/* Find section where splice begins */
+	while (offset >= pkt_hdr->frame_len && pkt_hdr->ref_hdr) {
+		offset -= (pkt_hdr->frame_len - pkt_hdr->ref_offset);
+		offset += (pkt_hdr->ref_hdr->frame_len - pkt_hdr->ref_len);
+		pkt_hdr = pkt_hdr->ref_hdr;
+	}
+
+	ref_hdr->ref_hdr = pkt_hdr;
+	ref_hdr->ref_offset = offset;
+	ref_hdr->ref_len = pkt_hdr->frame_len;
+
+	if (offset < pkt_hdr->unshared_len)
+		pkt_hdr->unshared_len = offset;
+
+	packet_ref(pkt_hdr);
+	return _odp_packet_hdl(ref_hdr);
+}
+
+odp_packet_t odp_packet_ref_static(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	pkt_hdr->unshared_len = 0;
+	packet_ref(pkt_hdr);
+	return pkt;
+}
+
+odp_packet_t odp_packet_ref(odp_packet_t pkt, uint32_t offset)
+{
+	odp_packet_t hdr;
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	hdr = odp_packet_alloc(odp_packet_pool(pkt), 0);
+
+	if (hdr == ODP_PACKET_INVALID)
+		return ODP_PACKET_INVALID;
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+odp_packet_t odp_packet_ref_pkt(odp_packet_t pkt, uint32_t offset,
+				odp_packet_t hdr)
+{
+	odp_packet_hdr_t *pkt_hdr;
+
+	if (pkt == ODP_PACKET_INVALID || pkt == hdr)
+		return ODP_PACKET_INVALID;
+
+	pkt_hdr = odp_packet_hdr(pkt);
+	if (offset >= packet_len(pkt_hdr))
+		return ODP_PACKET_INVALID;
+
+	if (hdr == ODP_PACKET_INVALID)
+		return odp_packet_ref(pkt, offset);
+
+	return packet_splice(pkt_hdr, offset, odp_packet_hdr(hdr));
+}
+
+int odp_packet_is_ref(odp_packet_t pkt)
+{
+	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
+
+	return pkt_hdr->ref_hdr || packet_ref_count(pkt_hdr) > 1;
+}
+
+/*
  *
  * Copy
  * ********************************************************
@@ -1588,8 +1917,7 @@ int odp_packet_split(odp_packet_t *pkt, uint32_t len, odp_packet_t *tail)
 
 odp_packet_t odp_packet_copy(odp_packet_t pkt, odp_pool_t pool)
 {
-	odp_packet_hdr_t *srchdr = odp_packet_hdr(pkt);
-	uint32_t pktlen = srchdr->frame_len;
+	uint32_t pktlen = odp_packet_len(pkt);
 	odp_packet_t newpkt = odp_packet_alloc(pool, pktlen);
 
 	if (newpkt != ODP_PACKET_INVALID) {
@@ -1628,7 +1956,7 @@ int odp_packet_copy_to_mem(odp_packet_t pkt, uint32_t offset,
 	uint8_t *dstaddr = (uint8_t *)dst;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
 	while (len > 0) {
@@ -1652,9 +1980,11 @@ int odp_packet_copy_from_mem(odp_packet_t pkt, uint32_t offset,
 	const uint8_t *srcaddr = (const uint8_t *)src;
 	odp_packet_hdr_t *pkt_hdr = odp_packet_hdr(pkt);
 
-	if (offset + len > pkt_hdr->frame_len)
+	if (offset + len > packet_len(pkt_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(pkt) >= offset + len);
+
 	while (len > 0) {
 		mapaddr = packet_map(pkt_hdr, offset, &seglen, NULL);
 		cpylen = len > seglen ? seglen : len;
@@ -1680,10 +2010,12 @@ int odp_packet_copy_from_pkt(odp_packet_t dst, uint32_t dst_offset,
 	uint32_t src_seglen = 0; /* GCC */
 	int overlap;
 
-	if (dst_offset + len > dst_hdr->frame_len ||
-	    src_offset + len > src_hdr->frame_len)
+	if (dst_offset + len > packet_len(dst_hdr) ||
+	    src_offset + len > packet_len(src_hdr))
 		return -1;
 
+	ODP_ASSERT(odp_packet_unshared_len(dst) >= dst_offset + len);
+
 	overlap = (dst_hdr == src_hdr &&
 		   ((dst_offset <= src_offset &&
 		     dst_offset + len >= src_offset) ||
@@ -1767,7 +2099,7 @@ void odp_packet_print(odp_packet_t pkt)
 	len += snprintf(&str[len], n - len,
 			"  l4_offset %" PRIu32 "\n", hdr->p.l4_offset);
 	len += snprintf(&str[len], n - len,
-			"  frame_len %" PRIu32 "\n", hdr->frame_len);
+			"  frame_len %" PRIu32 "\n", packet_len(hdr));
 	len += snprintf(&str[len], n - len,
 			"  input %" PRIu64 "\n",
 			odp_pktio_to_u64(hdr->input));
Implement the APIs:
- odp_packet_ref_static()
- odp_packet_ref()
- odp_packet_ref_pkt()
- odp_packet_is_ref()
- odp_packet_unshared_len()

This also involves functional upgrades to the existing packet manipulation
APIs to work with packet references as input arguments.

Signed-off-by: Bill Fischofer <bill.fischofer@linaro.org>
---
 .../linux-generic/include/odp_packet_internal.h |  73 ++-
 platform/linux-generic/odp_packet.c             | 542 +++++++++++++++++----
 2 files changed, 508 insertions(+), 107 deletions(-)

-- 
2.7.4