Message ID: 20250513113817.11962-1-00107082@163.com
State: New
Series: [v2,1/2] USB: core: add a memory pool to urb for host-controller private data
On Tue, May 13, 2025 at 07:38:17PM +0800, David Wang wrote: > --- > Changes: > 1. Use caller's mem_flags if a larger memory is needed. > Thanks to Oliver Neukum <oneukum@suse.com>'s review. > --- > URB objects have long lifecycle, an urb can be reused between > enqueue-dequeue loops; The private data needed by some host controller > has very short lifecycle, the memory is alloced when enqueue, and > released when dequeue. For example, on a system with xhci, several > minutes of usage of webcam/keyboard/mouse have memory alloc counts: > drivers/usb/core/urb.c:75 [usbcore] func:usb_alloc_urb 661 > drivers/usb/host/xhci.c:1555 [xhci_hcd] func:xhci_urb_enqueue 424863 > Memory allocation frequency for host-controller private data can reach > ~1k/s and above. > > High frequent allocations for host-controller private data can be > avoided if urb take over the ownership of memory, the memory then shares > the longer lifecycle with urb objects. > > Add a mempool to urb for hcpriv usage, the mempool only holds one block > of memory and grows when larger size is requested. > > Signed-off-by: David Wang <00107082@163.com> It should be possible to do what you want without touching the USB core code at all, changing only xhci-hcd. That's what Mathias is suggesting. Instead of having an URB keep ownership of its extra memory after it completes, xhci-hcd can put the memory area onto a free list. Then memory areas on the free list can be reused with almost no overhead when URBs are enqueued later on. This idea can become more elaborate if you maintain separate free lists for different devices, or even for different endpoints, or sort the free list by the size of the memory areas. But the basic idea is always the same: Don't change usbcore (including struct urb), and make getting and releasing the extra memory areas have extremely low overhead. Alan Stern
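For concreteness, the free-list idea described above might look roughly like the sketch below. The struct and function names (hc_priv_cache, hc_priv_get, hc_priv_put) are hypothetical, and the caller is assumed to already hold the HCD's existing lock (e.g. xhci->lock), so this is an illustration of the approach rather than actual xhci-hcd code.

#include <linux/slab.h>
#include <linux/string.h>

/* Hypothetical single-slot cache kept in the HCD's private state. */
struct hc_priv_cache {
	void	*buf;	/* most recently released private-data area */
	size_t	 size;	/* its allocated size */
};

/* Caller is assumed to hold the HCD's lock while touching the cache. */
static void *hc_priv_get(struct hc_priv_cache *cache, size_t size,
			 gfp_t mem_flags)
{
	void *priv;

	if (cache->buf && cache->size >= size) {
		/* Reuse the cached area instead of allocating a new one. */
		priv = cache->buf;
		cache->buf = NULL;
		memset(priv, 0, size);
		return priv;
	}
	return kzalloc(size, mem_flags);
}

static void hc_priv_put(struct hc_priv_cache *cache, void *priv, size_t size)
{
	/* Keep only the most recently released area, dropping smaller ones. */
	if (!cache->buf || cache->size <= size) {
		kfree(cache->buf);
		cache->buf = priv;
		cache->size = size;
	} else {
		kfree(priv);
	}
}

The enqueue path would call hc_priv_get() where it currently allocates the private data, and the giveback path would call hc_priv_put() where it currently frees it.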
At 2025-05-13 22:25:50, "Alan Stern" <stern@rowland.harvard.edu> wrote: >On Tue, May 13, 2025 at 07:38:17PM +0800, David Wang wrote: >> --- >> Changes: >> 1. Use caller's mem_flags if a larger memory is needed. >> Thanks to Oliver Neukum <oneukum@suse.com>'s review. >> --- >> URB objects have long lifecycle, an urb can be reused between >> enqueue-dequeue loops; The private data needed by some host controller >> has very short lifecycle, the memory is alloced when enqueue, and >> released when dequeue. For example, on a system with xhci, several >> minutes of usage of webcam/keyboard/mouse have memory alloc counts: >> drivers/usb/core/urb.c:75 [usbcore] func:usb_alloc_urb 661 >> drivers/usb/host/xhci.c:1555 [xhci_hcd] func:xhci_urb_enqueue 424863 >> Memory allocation frequency for host-controller private data can reach >> ~1k/s and above. >> >> High frequent allocations for host-controller private data can be >> avoided if urb take over the ownership of memory, the memory then shares >> the longer lifecycle with urb objects. >> >> Add a mempool to urb for hcpriv usage, the mempool only holds one block >> of memory and grows when larger size is requested. >> >> Signed-off-by: David Wang <00107082@163.com> > >It should be possible to do what you want without touching the USB core >code at all, changing only xhci-hcd. That's what Mathias is suggesting. > >Instead of having an URB keep ownership of its extra memory after it >completes, xhci-hcd can put the memory area onto a free list. Then >memory areas on the free list can be reused with almost no overhead when >URBs are enqueued later on. I have to disagree, your suggestion has no much difference from requesting memory from system, locks and memory pool managements, all the same are needed, why bother? The reason I choose URB is that URB life cycle contains the private data's lifecycle, and no two HCD can take over the same URB as the same time. I think the memory pool here is not actually a pool, or I should say the mempool consists of pool of URBs, and each URB have only "single one" slot of mem pool in it. > >This idea can become more elaborate if you maintain separate free lists >for different devices, or even for different endpoints, or sort the free >list by the size of the memory areas. But the basic idea is always the >same: Don't change usbcore (including struct urb), and make getting and >releasing the extra memory areas have extremely low overhead. Why implements a device level memory pool would have extremely low overhead? Why making changes to usb core is bad? The idea in this thread is meant for all kinds of host controllers, xhci is what I have in my system; i think similar changes would benefit other HCs > >Alan Stern
On Tue, May 13, 2025 at 10:41:45PM +0800, David Wang wrote: > > > At 2025-05-13 22:25:50, "Alan Stern" <stern@rowland.harvard.edu> wrote: > >On Tue, May 13, 2025 at 07:38:17PM +0800, David Wang wrote: > >> --- > >> Changes: > >> 1. Use caller's mem_flags if a larger memory is needed. > >> Thanks to Oliver Neukum <oneukum@suse.com>'s review. > >> --- > >> URB objects have long lifecycle, an urb can be reused between > >> enqueue-dequeue loops; The private data needed by some host controller > >> has very short lifecycle, the memory is alloced when enqueue, and > >> released when dequeue. For example, on a system with xhci, several > >> minutes of usage of webcam/keyboard/mouse have memory alloc counts: > >> drivers/usb/core/urb.c:75 [usbcore] func:usb_alloc_urb 661 > >> drivers/usb/host/xhci.c:1555 [xhci_hcd] func:xhci_urb_enqueue 424863 > >> Memory allocation frequency for host-controller private data can reach > >> ~1k/s and above. > >> > >> High frequent allocations for host-controller private data can be > >> avoided if urb take over the ownership of memory, the memory then shares > >> the longer lifecycle with urb objects. > >> > >> Add a mempool to urb for hcpriv usage, the mempool only holds one block > >> of memory and grows when larger size is requested. > >> > >> Signed-off-by: David Wang <00107082@163.com> > > > >It should be possible to do what you want without touching the USB core > >code at all, changing only xhci-hcd. That's what Mathias is suggesting. > > > >Instead of having an URB keep ownership of its extra memory after it > >completes, xhci-hcd can put the memory area onto a free list. Then > >memory areas on the free list can be reused with almost no overhead when > >URBs are enqueued later on. > > I have to disagree, your suggestion has no much difference from requesting memory from > system, locks and memory pool managements, all the same are needed, why bother? There are two differences. First, xhci-hcd already has its own lock that it acquires when enqueuing or dequeuing URBs, so no additional locks would be needed. Second, complicated memory pool management isn't necessary; the management can be extremely simple. (For example, Mathias suggested just reusing the most recently released memory area unless it is too small.) > The reason I choose URB is that URB life cycle contains the private data's lifecycle, > and no two HCD can take over the same URB as the same time. > > I think the memory pool here is not actually a pool, or I should say the mempool consists of > pool of URBs, and each URB have only "single one" slot of mem pool in it. > > > > > >This idea can become more elaborate if you maintain separate free lists > >for different devices, or even for different endpoints, or sort the free > >list by the size of the memory areas. But the basic idea is always the > >same: Don't change usbcore (including struct urb), and make getting and > >releasing the extra memory areas have extremely low overhead. > > Why implements a device level memory pool would have extremely low overhead? Because then getting or releasing memory areas from the pool could be carried out just by manipulating a couple of pointers. Some fraction of the time, URBs are created on demand and destroyed upon completion. Your approach would not save any time for these URBs, whereas my suggested approach would. (This fraction probably isn't very large, although I don't know how big it is.) > Why making changes to usb core is bad? 
The idea in this thread is meant for all kinds of > host controllers, xhci is what I have in my system; i think similar changes would benefit other > HCs Those other HC drivers would still require changes. They could be changed to support my scheme as easily as your scheme. If I were redesigning Linux's entire USB stack from the beginning, I would change it so that URBs would be dedicated to particular host controllers and endpoint types -- maybe even to particular endpoints. They would contain all the additional memory required for the HCD to use them, all pre-allocated, so that dynamic allocation would not be needed during normal use. (The gadget subsystem behaves this way already.) Such a drastic change isn't feasible at this point, although what you are suggesting is a step in that direction. In the end it comes down to a time/space tradeoff, and it's very difficult to know what the best balance is. I'm not saying that your approach is bad or wrong, just that there are other possibilities to consider. Alan Stern Alan Stern
At 2025-05-13 23:37:20, "Alan Stern" <stern@rowland.harvard.edu> wrote: >On Tue, May 13, 2025 at 10:41:45PM +0800, David Wang wrote: >> >> >> At 2025-05-13 22:25:50, "Alan Stern" <stern@rowland.harvard.edu> wrote: >> >On Tue, May 13, 2025 at 07:38:17PM +0800, David Wang wrote: >> >> --- >> >> Changes: >> >> 1. Use caller's mem_flags if a larger memory is needed. >> >> Thanks to Oliver Neukum <oneukum@suse.com>'s review. >> >> --- >> >> URB objects have long lifecycle, an urb can be reused between >> >> enqueue-dequeue loops; The private data needed by some host controller >> >> has very short lifecycle, the memory is alloced when enqueue, and >> >> released when dequeue. For example, on a system with xhci, several >> >> minutes of usage of webcam/keyboard/mouse have memory alloc counts: >> >> drivers/usb/core/urb.c:75 [usbcore] func:usb_alloc_urb 661 >> >> drivers/usb/host/xhci.c:1555 [xhci_hcd] func:xhci_urb_enqueue 424863 >> >> Memory allocation frequency for host-controller private data can reach >> >> ~1k/s and above. >> >> >> >> High frequent allocations for host-controller private data can be >> >> avoided if urb take over the ownership of memory, the memory then shares >> >> the longer lifecycle with urb objects. >> >> >> >> Add a mempool to urb for hcpriv usage, the mempool only holds one block >> >> of memory and grows when larger size is requested. >> >> >> >> Signed-off-by: David Wang <00107082@163.com> >> > >> >It should be possible to do what you want without touching the USB core >> >code at all, changing only xhci-hcd. That's what Mathias is suggesting. >> > >> >Instead of having an URB keep ownership of its extra memory after it >> >completes, xhci-hcd can put the memory area onto a free list. Then >> >memory areas on the free list can be reused with almost no overhead when >> >URBs are enqueued later on. >> >> I have to disagree, your suggestion has no much difference from requesting memory from >> system, locks and memory pool managements, all the same are needed, why bother? > >There are two differences. First, xhci-hcd already has its own lock >that it acquires when enqueuing or dequeuing URBs, so no additional >locks would be needed. Second, complicated memory pool management isn't >necessary; the management can be extremely simple. (For example, >Mathias suggested just reusing the most recently released memory area >unless it is too small.) My patch also just reuse memory, in a simpler way I would argue.... > >> The reason I choose URB is that URB life cycle contains the private data's lifecycle, >> and no two HCD can take over the same URB as the same time. >> >> I think the memory pool here is not actually a pool, or I should say the mempool consists of >> pool of URBs, and each URB have only "single one" slot of mem pool in it. >> >> >> > >> >This idea can become more elaborate if you maintain separate free lists >> >for different devices, or even for different endpoints, or sort the free >> >list by the size of the memory areas. But the basic idea is always the >> >same: Don't change usbcore (including struct urb), and make getting and >> >releasing the extra memory areas have extremely low overhead. >> >> Why implements a device level memory pool would have extremely low overhead? > >Because then getting or releasing memory areas from the pool could be >carried out just by manipulating a couple of pointers. A couple of pointers manipulation? and it would be simpler than a reusable buffer in URB? I doubt that. 
There would be lots of details needs to consider, detail is devil and that why we prefer simpler solution, I just don't understand, you seems imply that my patch is not simple, could you elaborate more on it, or it is just that in my mind, make changes to "usb core" is a big no-no! > >Some fraction of the time, URBs are created on demand and destroyed upon >completion. Your approach would not save any time for these URBs, >whereas my suggested approach would. (This fraction probably isn't very >large, although I don't know how big it is.) I am aiming to void extra tons of alloctions for "private data", not URB allocation. When I use my USB webcam, I would observer 1k+ allocation per seconds for private data, while urb allocation's frequency is already very low, people have already done the optimization I guess. The memory profiling feature introduced in 6.12 is a very good start for identifying where improvement could be made for memory behavior. > >> Why making changes to usb core is bad? The idea in this thread is meant for all kinds of >> host controllers, xhci is what I have in my system; i think similar changes would benefit other >> HCs > >Those other HC drivers would still require changes. They could be >changed to support my scheme as easily as your scheme. > >If I were redesigning Linux's entire USB stack from the beginning, I >would change it so that URBs would be dedicated to particular host >controllers and endpoint types -- maybe even to particular endpoints. >They would contain all the additional memory required for the HCD to use >them, all pre-allocated, so that dynamic allocation would not be needed >during normal use. (The gadget subsystem behaves this way already.) > >Such a drastic change isn't feasible at this point, although what you >are suggesting is a step in that direction. In the end it comes down >to a time/space tradeoff, and it's very difficult to know what the best >balance is. Drastic change? Do you mean make change to USB core? Because counting by lines of changes in this patch, I feel my patch is quite simple and efficient. I also don't get your time/space tradeoff here, are you talking about my patch? or you were just talking a solution in your mind.... This patch only needs a pointer and a size in URB, and URB object allocated in a slow pace... > >I'm not saying that your approach is bad or wrong, just that there are >other possibilities to consider. > >Alan Stern > >Alan Stern David
On Wed, May 14, 2025 at 12:35:29AM +0800, David Wang wrote: > > At 2025-05-13 23:37:20, "Alan Stern" <stern@rowland.harvard.edu> wrote: > >> I have to disagree, your suggestion has no much difference from requesting memory from > >> system, locks and memory pool managements, all the same are needed, why bother? > > > >There are two differences. First, xhci-hcd already has its own lock > >that it acquires when enqueuing or dequeuing URBs, so no additional > >locks would be needed. Second, complicated memory pool management isn't > >necessary; the management can be extremely simple. (For example, > >Mathias suggested just reusing the most recently released memory area > >unless it is too small.) > > My patch also just reuse memory, in a simpler way I would argue.... I didn't say your approach wasn't simple. I said that my approach has very low overhead, a lot lower than the existing code, whereas you said my approach "has no much difference" from the existing code. > >> >This idea can become more elaborate if you maintain separate free lists > >> >for different devices, or even for different endpoints, or sort the free > >> >list by the size of the memory areas. But the basic idea is always the > >> >same: Don't change usbcore (including struct urb), and make getting and > >> >releasing the extra memory areas have extremely low overhead. > >> > >> Why implements a device level memory pool would have extremely low overhead? > > > >Because then getting or releasing memory areas from the pool could be > >carried out just by manipulating a couple of pointers. > > A couple of pointers manipulation? and it would be simpler than a reusable buffer in URB? > I doubt that. I didn't say pointer manipulation was simpler than a reusable buffer. I said that it was very low overhead, in order to answer your question: "Why implements a device level memory pool would have extremely low overhead?" > There would be lots of details needs to consider, detail is devil and that why we prefer simpler solution, > I just don't understand, you seems imply that my patch is not simple, could you elaborate more on it, > or it is just that in my mind, make changes to "usb core" is a big no-no! Neither one. However, you have forgotten about one thing: With your patch, each URB maintains ownership of a memory area even when the URB is not in use. With the existing code, that memory is freed when the URB is not in use, and with my approach the memory is shared among URBs. In this way, your patch will use more memory than the existing code or my approach. The question to answer is which is better: Using more memory (your patch) or using more time (the allocations and deallocations in the existing code or my approach)? > >If I were redesigning Linux's entire USB stack from the beginning, I > >would change it so that URBs would be dedicated to particular host > >controllers and endpoint types -- maybe even to particular endpoints. > >They would contain all the additional memory required for the HCD to use > >them, all pre-allocated, so that dynamic allocation would not be needed > >during normal use. (The gadget subsystem behaves this way already.) > > > >Such a drastic change isn't feasible at this point, although what you > >are suggesting is a step in that direction. In the end it comes down > >to a time/space tradeoff, and it's very difficult to know what the best > >balance is. > Drastic change? Do you mean make change to USB core? No, I meant redesigning the entire USB stack. 
It would require changing all the USB drivers to allocate URBs differently from the way they do now. And changing every HC driver to preallocate the memory it needs to perform transfers. I think you can agree that would be a pretty drastic change. > Because counting by > lines of changes in this patch, I feel my patch is quite simple and efficient. > I also don't get your time/space tradeoff here, are you talking about my patch? > or you were just talking a solution in your mind.... Both. See my comment above. > This patch only needs a pointer and a size in URB, and URB object allocated in a slow pace... It also needs an extra memory area that is allocated the first time the URB is used, and is not deallocated until the URB is destroyed. You seem to be ignoring this fact. Alan Stern
At 2025-05-14 02:21:15, "Alan Stern" <stern@rowland.harvard.edu> wrote: >On Wed, May 14, 2025 at 12:35:29AM +0800, David Wang wrote: >> >> At 2025-05-13 23:37:20, "Alan Stern" <stern@rowland.harvard.edu> wrote: >> >> I have to disagree, your suggestion has no much difference from requesting memory from >> >> system, locks and memory pool managements, all the same are needed, why bother? >> > >> >There are two differences. First, xhci-hcd already has its own lock >> >that it acquires when enqueuing or dequeuing URBs, so no additional >> >locks would be needed. Second, complicated memory pool management isn't >> >necessary; the management can be extremely simple. (For example, >> >Mathias suggested just reusing the most recently released memory area >> >unless it is too small.) >> >> My patch also just reuse memory, in a simpler way I would argue.... > >I didn't say your approach wasn't simple. I said that my approach has >very low overhead, a lot lower than the existing code, whereas you said >my approach "has no much difference" from the existing code. > >> >> >This idea can become more elaborate if you maintain separate free lists >> >> >for different devices, or even for different endpoints, or sort the free >> >> >list by the size of the memory areas. But the basic idea is always the >> >> >same: Don't change usbcore (including struct urb), and make getting and >> >> >releasing the extra memory areas have extremely low overhead. >> >> >> >> Why implements a device level memory pool would have extremely low overhead? >> > >> >Because then getting or releasing memory areas from the pool could be >> >carried out just by manipulating a couple of pointers. >> >> A couple of pointers manipulation? and it would be simpler than a reusable buffer in URB? >> I doubt that. > >I didn't say pointer manipulation was simpler than a reusable buffer. I >said that it was very low overhead, in order to answer your question: >"Why implements a device level memory pool would have extremely low >overhead?" Now I feels it becomes a wording game ..... > >> There would be lots of details needs to consider, detail is devil and that why we prefer simpler solution, >> I just don't understand, you seems imply that my patch is not simple, could you elaborate more on it, >> or it is just that in my mind, make changes to "usb core" is a big no-no! > >Neither one. > >However, you have forgotten about one thing: With your patch, each URB >maintains ownership of a memory area even when the URB is not in use. >With the existing code, that memory is freed when the URB is not in use, >and with my approach the memory is shared among URBs. > >In this way, your patch will use more memory than the existing code or >my approach. The question to answer is which is better: Using more >memory (your patch) or using more time (the allocations and >deallocations in the existing code or my approach)? > >> >If I were redesigning Linux's entire USB stack from the beginning, I >> >would change it so that URBs would be dedicated to particular host >> >controllers and endpoint types -- maybe even to particular endpoints. >> >They would contain all the additional memory required for the HCD to use >> >them, all pre-allocated, so that dynamic allocation would not be needed >> >during normal use. (The gadget subsystem behaves this way already.) >> > >> >Such a drastic change isn't feasible at this point, although what you >> >are suggesting is a step in that direction. 
In the end it comes down >> >to a time/space tradeoff, and it's very difficult to know what the best >> >balance is. >> Drastic change? Do you mean make change to USB core? > >No, I meant redesigning the entire USB stack. It would require changing >all the USB drivers to allocate URBs differently from the way they do >now. And changing every HC driver to preallocate the memory it needs to >perform transfers. I think you can agree that would be a pretty drastic >change. > >> Because counting by >> lines of changes in this patch, I feel my patch is quite simple and efficient. >> I also don't get your time/space tradeoff here, are you talking about my patch? >> or you were just talking a solution in your mind.... > >Both. See my comment above. > >> This patch only needs a pointer and a size in URB, and URB object allocated in a slow pace... > >It also needs an extra memory area that is allocated the first time the >URB is used, and is not deallocated until the URB is destroyed. You >seem to be ignoring this fact. > >Alan Stern It is not an "extra" memory area, the memory is needed by HC anyway, the memory pool just cache it. And about not freeing memory until URB released, you seems forgot that we are talking about "memory pool" . A URB only used once could be considered a memory pool never used. If your memory pool approach would not "waste" memory, I would rather happy to learn. I want to mention the purpose of this patch again: A lot of "private data" allocation could be avoided if we use a "mempool" to cache and reuse those memory. And use URB as the holder is a very simple way to implement this,. And to add , base on my memory profiling, URB usage is very efficient. I think it is a very good candidate to hold private data cache for HCs. David
On Wed, May 14, 2025 at 02:48:21AM +0800, David Wang wrote: > It is not an "extra" memory area, the memory is needed by HC anyway, the memory pool just cache it. > And about not freeing memory until URB released, you seems forgot that we are talking > about "memory pool" . A URB only used once could be considered a memory pool never used. > > If your memory pool approach would not "waste" memory, I would rather happy to learn. Here's a simple example to illustrate the point. Suppose a driver uses two URBs, call them A and B, but it never has more than one URB active at a time. Thus, when A completes B is submitted, and when B completes A is submitted. With your approach A and B each have their own memory area. With my approach, a single memory area is shared between A and B. Therefore my approach uses less total memory. Now, I admit this pattern is probably not at all common. Usually if a driver is going to reuse an URB, it resubmits the URB as soon as the URB completes rather than waiting for some other URB to complete. Drivers generally don't keep many unused URBs just sitting around -- although there may be exceptions, like a driver for a media device when the device isn't running. > I want to mention the purpose of this patch again: > A lot of "private data" allocation could be avoided if we use a "mempool" to cache and reuse those memory. > And use URB as the holder is a very simple way to implement this,. > > And to add , base on my memory profiling, URB usage is very efficient. I think it is a very good candidate to hold > private data cache for HCs. All right. I withdraw any objection to your patches. Alan Stern
Hi, Update memory footprints after hours of USB devices usage on my system: (I have webcam/mic/keyboard/mouse/harddisk connected via USB, a full picture of memory footprints is attached below) +----------------------+----------------+-------------------------------------------+-----------------------+ | active memory(bytes) | active objects | alloc location | total objects created | +----------------------+----------------+-------------------------------------------+-----------------------+ | 22912 | 24 | core/urb.c:1054:urb_hcpriv_mempool_zalloc | 10523 | | 11776 | 31 | core/urb.c:76:usb_alloc_urb | 11027 | +----------------------+----------------+-------------------------------------------+-----------------------+ The count for active URB objects remain at low level, its peak is about 12KB when I copied 10G file to my harddisk. The memory pool in this patch takes about 22KB, its peak is 23KB. The patch meant to reuse memory via a mempool, the memory kept in pool is indeed the "tradeoff" when the system is idle. (Well, we are talking about mempool anyway.) How balance the tradeoff is depends on how well the mempool is managed. This patch takes a easy approach: put faith in URB objects management and put a single slot of mempool in URB on demands. And the changes, by counting lines in this patch, are very simple. Base on the profiling, the number of active URB objects are kept at a very low scale, only several could have a very long lifecycle. I think URB is a good candidate for caching those memory needed for private data. But I could be very wrong, due simply to the lack of knowledge. And before, without the patch, a 10 minutes webcam usage and copying 10G file to harddisk would yield high rate of memory allocation for priviate data in xhci_urb_enqueue: +----------------------+----------------+-----------------------------------+-----------------------+ | active memory(bytes) | active objects | alloc location | total objects created | +----------------------+----------------+-----------------------------------+-----------------------+ | 22784 | 23 | host/xhci.c:1555:xhci_urb_enqueue | 894281 << grow|ing very quick | 10880 | 31 | core/urb.c:75:usb_alloc_urb | 4028 | +----------------------+----------------+-----------------------------------+-----------------------+ I observe a highest allocation rate of 1.5K/s in xhci_urb_enqueue when I was copying 10G file, and had my webcam opened at the same time. And again, to be honest, I did not observe any observable performance improvement from an enduser's point of view with this patch. The only significant improvement is memory footprint _numbers_. I guess memory allocation is indeed "_really damn fast_", but I still have the mindset of "the less allocation the better". 
Full drivers/usb memory footprint on my system with this patch: (The data was collected via memory profiling: https://docs.kernel.org/mm/allocation-profiling.html) +----------------------+----------------+------------------------------------------------------+-----------------------+ | active memory(bytes) | active objects | alloc location | total objects created | +----------------------+----------------+------------------------------------------------------+-----------------------+ | 40960 | 5 | host/xhci-mem.c:951:xhci_alloc_virt_device | 10 | | 26624 | 26 | core/port.c:742:usb_hub_create_port_device | 26 | | 24576 | 2 | host/xhci-mem.c:2170:xhci_setup_port_arrays | 3 | | 22912 | 24 | core/urb.c:1054:urb_hcpriv_mempool_zalloc | 10523 | | 21504 | 21 | core/endpoint.c:157:usb_create_ep_devs | 31 | | 18432 | 9 | core/usb.c:650:usb_alloc_dev | 10 | | 16384 | 4 | core/hcd.c:2553:__usb_create_hcd | 4 | | 13312 | 13 | core/message.c:2037:usb_set_configuration | 14 | | 11776 | 31 | core/urb.c:76:usb_alloc_urb | 11027 | | 9216 | 9 | core/config.c:930:usb_get_configuration | 10 | | 5120 | 5 | core/hub.c:1938:hub_probe | 5 | | 5120 | 2 | host/xhci-mem.c:2156:xhci_setup_port_arrays | 3 | | 2560 | 5 | host/xhci-debugfs.c:578:xhci_debugfs_create_slot | 10 | | 2416 | 16 | host/xhci-mem.c:49:xhci_segment_alloc | 44 | | 2176 | 17 | host/xhci-mem.c:377:xhci_ring_alloc | 38 | | 2112 | 22 | core/port.c:746:usb_hub_create_port_device | 26 | | 2048 | 32 | host/xhci-mem.c:38:xhci_segment_alloc | 73 | | 1728 | 18 | host/xhci-debugfs.c:90:xhci_debugfs_alloc_regset | 26 | | 1632 | 17 | core/config.c:619:usb_parse_interface | 18 | | 1504 | 9 | core/config.c:967:usb_get_configuration | 10 | | 1312 | 13 | core/config.c:820:usb_parse_configuration | 14 | | 1024 | 1 | host/xhci-mem.c:812:xhci_alloc_tt_info | 2 | | 704 | 24 | core/message.c:1032:usb_cache_string | 27 | | 512 | 8 | host/xhci-debugfs.c:438:xhci_debugfs_create_endpoint | 22 | | 320 | 10 | host/xhci-mem.c:461:xhci_alloc_container_ctx | 30 | | 272 | 2 | host/xhci-mem.c:1635:scratchpad_alloc | 3 | | 224 | 5 | core/hub.c:1502:hub_configure | 5 | | 192 | 4 | host/xhci-mem.c:2121:xhci_create_rhub_port_array | 6 | | 192 | 2 | host/xhci-mem.c:2263:xhci_alloc_interrupter | 3 | | 160 | 2 | host/xhci-mem.c:2198:xhci_setup_port_arrays | 3 | | 128 | 2 | host/xhci-mem.c:2506:xhci_mem_init | 3 | | 128 | 2 | core/config.c:1063:usb_get_bos_descriptor | 3 | | 128 | 2 | core/config.c:1058:usb_get_bos_descriptor | 3 | | 80 | 5 | core/hub.c:1454:hub_configure | 5 | | 72 | 9 | core/config.c:935:usb_get_configuration | 10 | | 64 | 2 | host/xhci-mem.c:1624:scratchpad_alloc | 3 | | 64 | 2 | core/hcd.c:2565:__usb_create_hcd | 2 | | 64 | 2 | core/hcd.c:2557:__usb_create_hcd | 2 | | 40 | 5 | host/xhci-mem.c:2039:xhci_add_in_port | 6 | | 40 | 5 | core/hub.c:1447:hub_configure | 5 | | 40 | 5 | core/hub.c:1441:hub_configure | 5 | | 0 | 0 | core/message.c:529:usb_sg_init | 10137 | | 0 | 0 | core/message.c:144:usb_control_msg | 793 | | 0 | 0 | core/hcd.c:491:rh_call_control | 449 | | 0 | 0 | host/xhci-mem.c:1705:xhci_alloc_command | 234 | | 0 | 0 | host/xhci-mem.c:1711:xhci_alloc_command | 97 | | 0 | 0 | core/message.c:980:usb_string | 31 | | 0 | 0 | core/message.c:1028:usb_cache_string | 27 | | 0 | 0 | core/message.c:1145:usb_get_status | 19 | | 0 | 0 | core/message.c:1061:usb_get_device_descriptor | 14 | | 0 | 0 | core/message.c:2031:usb_set_configuration | 10 | | 0 | 0 | core/hub.c:4883:hub_port_init | 10 | | 0 | 0 | core/config.c:939:usb_get_configuration | 10 | | 0 | 0 | 
core/hub.c:5311:descriptors_changed | 4 | | 0 | 0 | core/config.c:1037:usb_get_bos_descriptor | 3 | | 0 | 0 | storage/usb.c:540:associate_dev | 1 | +----------------------+----------------+------------------------------------------------------+-----------------------+ FYI David
On Wed, May 14, 2025 at 02:44:55PM +0800, David Wang wrote: > Hi, > > Update memory footprints after hours of USB devices usage > on my system: > (I have webcam/mic/keyboard/mouse/harddisk connected via USB, > a full picture of memory footprints is attached below) > +----------------------+----------------+-------------------------------------------+-----------------------+ > | active memory(bytes) | active objects | alloc location | total objects created | > +----------------------+----------------+-------------------------------------------+-----------------------+ > | 22912 | 24 | core/urb.c:1054:urb_hcpriv_mempool_zalloc | 10523 | > | 11776 | 31 | core/urb.c:76:usb_alloc_urb | 11027 | > +----------------------+----------------+-------------------------------------------+-----------------------+ > > The count for active URB objects remain at low level, > its peak is about 12KB when I copied 10G file to my harddisk. > The memory pool in this patch takes about 22KB, its peak is 23KB. > The patch meant to reuse memory via a mempool, the memory kept in pool is indeed > the "tradeoff" when the system is idle. (Well, we are talking about mempool anyway.) > How balance the tradeoff is depends on how well the mempool is managed. > This patch takes a easy approach: put faith in URB objects management and put > a single slot of mempool in URB on demands. And the changes, by counting lines > in this patch, are very simple. > Base on the profiling, the number of active URB objects are kept at a very low scale, > only several could have a very long lifecycle. > I think URB is a good candidate for caching those memory needed for private data. > But I could be very wrong, due simply to the lack of knowledge. > > And before, without the patch, a 10 minutes webcam usage and copying 10G file to harddisk > would yield high rate of memory allocation for priviate data in xhci_urb_enqueue: > +----------------------+----------------+-----------------------------------+-----------------------+ > | active memory(bytes) | active objects | alloc location | total objects created | > +----------------------+----------------+-----------------------------------+-----------------------+ > | 22784 | 23 | host/xhci.c:1555:xhci_urb_enqueue | 894281 << grow|ing very quick > | 10880 | 31 | core/urb.c:75:usb_alloc_urb | 4028 | > +----------------------+----------------+-----------------------------------+-----------------------+ > I observe a highest allocation rate of 1.5K/s in xhci_urb_enqueue > when I was copying 10G file, and had my webcam opened at the same time. > > And again, to be honest, I did not observe any observable performance improvement from > an enduser's point of view with this patch. The only significant improvement is memory footprint > _numbers_. > I guess memory allocation is indeed "_really damn fast_", but I still have the mindset of > "the less allocation the better". No, this isn't necessarily true at all. Allocations are fast, and if we free/allocate things quickly, it's even faster. USB is limited by the hardware throughput, which is _very_ slow compared to memory accesses of the allocator. So unless you can show that we are using less CPU time, or something else "real" that is measurable in a real way in userspace, that would justify the extra complexity, it's going to be hard to get me to agree that this is something that needs to be addressed at all. Also, I'm totally confused as to what the "latest" version of this patchset is... thanks, greg k-h
At 2025-05-14 15:29:42, "Greg KH" <gregkh@linuxfoundation.org> wrote: >On Wed, May 14, 2025 at 02:44:55PM +0800, David Wang wrote: >> Hi, >> >> Update memory footprints after hours of USB devices usage >> on my system: >> (I have webcam/mic/keyboard/mouse/harddisk connected via USB, >> a full picture of memory footprints is attached below) >> +----------------------+----------------+-------------------------------------------+-----------------------+ >> | active memory(bytes) | active objects | alloc location | total objects created | >> +----------------------+----------------+-------------------------------------------+-----------------------+ >> | 22912 | 24 | core/urb.c:1054:urb_hcpriv_mempool_zalloc | 10523 | >> | 11776 | 31 | core/urb.c:76:usb_alloc_urb | 11027 | >> +----------------------+----------------+-------------------------------------------+-----------------------+ >> >> The count for active URB objects remain at low level, >> its peak is about 12KB when I copied 10G file to my harddisk. >> The memory pool in this patch takes about 22KB, its peak is 23KB. >> The patch meant to reuse memory via a mempool, the memory kept in pool is indeed >> the "tradeoff" when the system is idle. (Well, we are talking about mempool anyway.) >> How balance the tradeoff is depends on how well the mempool is managed. >> This patch takes a easy approach: put faith in URB objects management and put >> a single slot of mempool in URB on demands. And the changes, by counting lines >> in this patch, are very simple. >> Base on the profiling, the number of active URB objects are kept at a very low scale, >> only several could have a very long lifecycle. >> I think URB is a good candidate for caching those memory needed for private data. >> But I could be very wrong, due simply to the lack of knowledge. >> >> And before, without the patch, a 10 minutes webcam usage and copying 10G file to harddisk >> would yield high rate of memory allocation for priviate data in xhci_urb_enqueue: >> +----------------------+----------------+-----------------------------------+-----------------------+ >> | active memory(bytes) | active objects | alloc location | total objects created | >> +----------------------+----------------+-----------------------------------+-----------------------+ >> | 22784 | 23 | host/xhci.c:1555:xhci_urb_enqueue | 894281 << grow|ing very quick >> | 10880 | 31 | core/urb.c:75:usb_alloc_urb | 4028 | >> +----------------------+----------------+-----------------------------------+-----------------------+ >> I observe a highest allocation rate of 1.5K/s in xhci_urb_enqueue >> when I was copying 10G file, and had my webcam opened at the same time. >> >> And again, to be honest, I did not observe any observable performance improvement from >> an enduser's point of view with this patch. The only significant improvement is memory footprint >> _numbers_. >> I guess memory allocation is indeed "_really damn fast_", but I still have the mindset of >> "the less allocation the better". > >No, this isn't necessarily true at all. Allocations are fast, and if we >free/allocate things quickly, it's even faster. USB is limited by the >hardware throughput, which is _very_ slow compared to memory accesses of >the allocator. > >So unless you can show that we are using less CPU time, or something >else "real" that is measurable in a real way in userspace, that would >justify the extra complexity, it's going to be hard to get me to agree >that this is something that needs to be addressed at all. 
Thanks for the feedback~! That's very reasonable to me. I have been pondering how to profile USB performance, but still have no clue. I will keep thinking about it; hopefully this 1k+/s allocation rate will show up somewhere, or I can conclude that it really has no significant impact at all. Thanks David > >Also, I'm totally confused as to what the "latest" version of this >patchset is... > Sorry, I think I messed up the mails when I added a "reply-to" header to the newer patches >thanks, > >greg k-h
On 14.05.25 09:29, Greg KH wrote: > No, this isn't necessarily true at all. Allocations are fast, and if we > free/allocate things quickly, it's even faster. USB is limited by the > hardware throughput, which is _very_ slow compared to memory accesses of > the allocator. If and only if we do not trigger disk IO. If you really want to give this patch a good performance testing you'd have to do it under memory pressure. Regards Oliver
On 13.05.25 13:38, David Wang wrote: > --- Hi, still an issue after a second review. I should have noticed earlier. > --- a/drivers/usb/core/urb.c > +++ b/drivers/usb/core/urb.c > @@ -23,6 +23,7 @@ static void urb_destroy(struct kref *kref) > > if (urb->transfer_flags & URB_FREE_BUFFER) > kfree(urb->transfer_buffer); > + kfree(urb->hcpriv_mempool); What if somebody uses usb_init_urb()? > kfree(urb); > } > @@ -1037,3 +1038,25 @@ int usb_anchor_empty(struct usb_anchor *anchor) > > EXPORT_SYMBOL_GPL(usb_anchor_empty); > > +/** > + * urb_hcpriv_mempool_zalloc - alloc memory from mempool for hcpriv > + * @urb: pointer to URB being used > + * @size: memory size requested by current host controller > + * @mem_flags: the type of memory to allocate > + * > + * Return: NULL if out of memory, otherwise memory are zeroed > + */ > +void *urb_hcpriv_mempool_zalloc(struct urb *urb, size_t size, gfp_t mem_flags) > +{ > + if (urb->hcpriv_mempool_size < size) { > + kfree(urb->hcpriv_mempool); > + urb->hcpriv_mempool_size = size; > + urb->hcpriv_mempool = kmalloc(size, mem_flags); That could use kzalloc(). Regards Oliver
On 13.05.25 16:25, Alan Stern wrote: > Instead of having an URB keep ownership of its extra memory after it > completes, xhci-hcd can put the memory area onto a free list. Then > memory areas on the free list can be reused with almost no overhead when > URBs are enqueued later on. The number of URBs in flight is basically unlimited. When would you shrink the number of cached privates? That would seem to be the basic question if you do not link this with URBs. Regards Oliver
At 2025-05-14 19:23:02, "Oliver Neukum" <oneukum@suse.com> wrote: > > >On 13.05.25 13:38, David Wang wrote: >> --- > >Hi, > >still an issue after a second review. >I should have noticed earlier. > >> --- a/drivers/usb/core/urb.c >> +++ b/drivers/usb/core/urb.c >> @@ -23,6 +23,7 @@ static void urb_destroy(struct kref *kref) >> >> if (urb->transfer_flags & URB_FREE_BUFFER) >> kfree(urb->transfer_buffer); >> + kfree(urb->hcpriv_mempool); > >What if somebody uses usb_init_urb()? I am not quite sure about the concern here, do you mean somebody create a urb, and then usb_init_urb() here, and never use urb_destroy to release it? That would cause memory leak if urb_destroy is not called......But is this really possible?. > >> kfree(urb); >> } >> @@ -1037,3 +1038,25 @@ int usb_anchor_empty(struct usb_anchor *anchor) >> >> EXPORT_SYMBOL_GPL(usb_anchor_empty); >> >> +/** >> + * urb_hcpriv_mempool_zalloc - alloc memory from mempool for hcpriv >> + * @urb: pointer to URB being used >> + * @size: memory size requested by current host controller >> + * @mem_flags: the type of memory to allocate >> + * >> + * Return: NULL if out of memory, otherwise memory are zeroed >> + */ >> +void *urb_hcpriv_mempool_zalloc(struct urb *urb, size_t size, gfp_t mem_flags) >> +{ >> + if (urb->hcpriv_mempool_size < size) { >> + kfree(urb->hcpriv_mempool); >> + urb->hcpriv_mempool_size = size; >> + urb->hcpriv_mempool = kmalloc(size, mem_flags); > >That could use kzalloc(). The memory would be set to 0 before returning to user, via memset, no matter whether the memory is newly alloced or just reused. I think using kmalloc is ok here. Thanks David > > Regards > Oliver
On 14.05.25 13:51, David Wang wrote: > I am not quite sure about the concern here, do you mean somebody create a urb, > and then usb_init_urb() here, and never use urb_destroy to release it? Yes. > > That would cause memory leak if urb_destroy is not called......But is this really possible?. I think a few drivers under drivers/media do so. Regards Oliver
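The pattern in question looks roughly like the following hypothetical sketch (not taken from any real driver): the urb is embedded in a larger driver structure, set up with usb_init_urb(), and freed together with its container, so urb_destroy() never runs for it.

#include <linux/slab.h>
#include <linux/usb.h>

/* Hypothetical container embedding a urb instead of usb_alloc_urb()ing it. */
struct media_stream_urb {
	struct urb urb;		/* initialized with usb_init_urb() */
	void *frame_buf;
};

static struct media_stream_urb *stream_urb_create(void)
{
	struct media_stream_urb *su = kzalloc(sizeof(*su), GFP_KERNEL);

	if (su)
		usb_init_urb(&su->urb);
	return su;
}

static void stream_urb_destroy(struct media_stream_urb *su)
{
	/*
	 * The container is freed directly; urb_destroy() never runs for
	 * the embedded urb, so anything the urb itself owns (such as a
	 * hcpriv_mempool buffer) would be leaked here.
	 */
	kfree(su->frame_buf);
	kfree(su);
}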
At 2025-05-14 20:03:02, "Oliver Neukum" <oneukum@suse.com> wrote: > > >On 14.05.25 13:51, David Wang wrote: > >> I am not quite sure about the concern here, do you mean somebody create a urb, >> and then usb_init_urb() here, and never use urb_destroy to release it? > >Yes. > >> >> That would cause memory leak if urb_destroy is not called......But is this really possible?. > >I think a few drivers under drivers/media do so. That would cause a real problem here.... I will keep it in mind... (It is really a bad pattern to have only an init function; it should be paired with a corresponding "release" function that undoes its side effects......) Thanks~ David > > Regards > Oliver
At 2025-05-14 20:03:02, "Oliver Neukum" <oneukum@suse.com> wrote: > > >On 14.05.25 13:51, David Wang wrote: > >> I am not quite sure about the concern here, do you mean somebody create a urb, >> and then usb_init_urb() here, and never use urb_destroy to release it? > >Yes. > >> >> That would cause memory leak if urb_destroy is not called......But is this really possible?. > >I think a few drivers under drivers/media do so. I searched through the code: some drivers use usb_free_urb, which invokes urb_destroy; but there are indeed several drivers that use the urb as a struct member, which is not kmalloced directly and should not be kfreed via usb_free_urb..... It would involve lots of changes..... On the bright side, while making that check I noticed something off: in drivers/net/wireless/realtek/rtl8xxxu/core.c 5168 usb_free_urb(&tx_urb->urb); 5868 usb_free_urb(&rx_urb->urb); 5890 usb_free_urb(&rx_urb->urb); 5938 usb_free_urb(&rx_urb->urb); usb_free_urb would kfree the urb pointer, which looks wrong to me here, since rx_urb and tx_urb are defined in drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu.h 1944 struct rtl8xxxu_rx_urb { 1945 struct urb urb; 1946 struct ieee80211_hw *hw; 1947 struct list_head list; 1948 }; 1949 1950 struct rtl8xxxu_tx_urb { 1951 struct urb urb; 1952 struct ieee80211_hw *hw; 1953 struct list_head list; 1954 }; Hi, Jes, would you take a look? I feel usb_free_urb needs a pointer that was allocated directly, but I could be wrong..... Thanks David > > Regards > Oliver
At 2025-05-14 17:34:21, "Oliver Neukum" <oneukum@suse.com> wrote: >On 14.05.25 09:29, Greg KH wrote: > >> No, this isn't necessarily true at all. Allocations are fast, and if we >> free/allocate things quickly, it's even faster. USB is limited by the >> hardware throughput, which is _very_ slow compared to memory accesses of >> the allocator. > >If and only if we do not trigger disk IO. If you really want to give this patch >a good performance testing you'd have to do it under memory pressure. > > Regards > Oliver Hi, I made some tests, using FPS for the webcam and bitrate for the audio mic as measurements. When the system is under no memory pressure, no significant difference can be observed with or without this patch. When the system is under heavy memory pressure, the bitrate drops from ~760.3kbits/s to ~524.3kbits/s, but this patch does not make any significant difference; the bitrate drops are almost the same with or without it. Under heavy memory pressure, my whole system gets slow.... But I think that somewhere between no memory pressure and heavy memory pressure there would be a point where an extra 1k allocations/s kicks off a chain of effects that lands in very bad performance; it is just very hard to pinpoint. Using my webcam produces a memory allocation rate of ~250/s, and my mic ~1k/s. I am imagining a system with several USB webcams/mics connected; there would be x*1k/s allocations if those devices were used at the same time. (I am not sure whether all allocations could be avoided under heavy usage of USB devices, but I think a good part of them can be reused.) I still think this change is beneficial, even without solid evidence yet. (I have sent out another version addressing Oliver's comments about urbs managed by drivers.) Thanks David
diff --git a/drivers/usb/core/urb.c b/drivers/usb/core/urb.c
index 5e52a35486af..51bf25125aeb 100644
--- a/drivers/usb/core/urb.c
+++ b/drivers/usb/core/urb.c
@@ -23,6 +23,7 @@ static void urb_destroy(struct kref *kref)
 
 	if (urb->transfer_flags & URB_FREE_BUFFER)
 		kfree(urb->transfer_buffer);
+	kfree(urb->hcpriv_mempool);
 
 	kfree(urb);
 }
@@ -1037,3 +1038,25 @@ int usb_anchor_empty(struct usb_anchor *anchor)
 
 EXPORT_SYMBOL_GPL(usb_anchor_empty);
 
+/**
+ * urb_hcpriv_mempool_zalloc - alloc memory from mempool for hcpriv
+ * @urb: pointer to URB being used
+ * @size: memory size requested by current host controller
+ * @mem_flags: the type of memory to allocate
+ *
+ * Return: NULL if out of memory, otherwise memory are zeroed
+ */
+void *urb_hcpriv_mempool_zalloc(struct urb *urb, size_t size, gfp_t mem_flags)
+{
+	if (urb->hcpriv_mempool_size < size) {
+		kfree(urb->hcpriv_mempool);
+		urb->hcpriv_mempool_size = size;
+		urb->hcpriv_mempool = kmalloc(size, mem_flags);
+	}
+	if (urb->hcpriv_mempool)
+		memset(urb->hcpriv_mempool, 0, size);
+	else
+		urb->hcpriv_mempool_size = 0;
+	return urb->hcpriv_mempool;
+}
+EXPORT_SYMBOL_GPL(urb_hcpriv_mempool_zalloc);
diff --git a/include/linux/usb.h b/include/linux/usb.h
index b46738701f8d..4400e41271bc 100644
--- a/include/linux/usb.h
+++ b/include/linux/usb.h
@@ -1602,6 +1602,8 @@ struct urb {
 	struct kref kref;		/* reference count of the URB */
 	int unlinked;			/* unlink error code */
 	void *hcpriv;			/* private data for host controller */
+	void *hcpriv_mempool;		/* a single slot of cache for HCD's private data */
+	size_t hcpriv_mempool_size;	/* current size of the memory pool */
 	atomic_t use_count;		/* concurrent submissions counter */
 	atomic_t reject;		/* submissions will fail */
 
@@ -1790,6 +1792,7 @@ extern int usb_wait_anchor_empty_timeout(struct usb_anchor *anchor,
 extern struct urb *usb_get_from_anchor(struct usb_anchor *anchor);
 extern void usb_scuttle_anchored_urbs(struct usb_anchor *anchor);
 extern int usb_anchor_empty(struct usb_anchor *anchor);
+extern void *urb_hcpriv_mempool_zalloc(struct urb *urb, size_t size, gfp_t mem_flags);
 
 #define usb_unblock_urb usb_unpoison_urb
---
Changes:
1. Use caller's mem_flags if a larger memory is needed.
   Thanks to Oliver Neukum <oneukum@suse.com>'s review.
---
URB objects have long lifecycle, an urb can be reused between
enqueue-dequeue loops; The private data needed by some host controller
has very short lifecycle, the memory is alloced when enqueue, and
released when dequeue. For example, on a system with xhci, several
minutes of usage of webcam/keyboard/mouse have memory alloc counts:
    drivers/usb/core/urb.c:75 [usbcore] func:usb_alloc_urb 661
    drivers/usb/host/xhci.c:1555 [xhci_hcd] func:xhci_urb_enqueue 424863
Memory allocation frequency for host-controller private data can reach
~1k/s and above.

High frequent allocations for host-controller private data can be
avoided if urb take over the ownership of memory, the memory then shares
the longer lifecycle with urb objects.

Add a mempool to urb for hcpriv usage, the mempool only holds one block
of memory and grows when larger size is requested.

Signed-off-by: David Wang <00107082@163.com>
---
 drivers/usb/core/urb.c | 23 +++++++++++++++++++++++
 include/linux/usb.h    |  3 +++
 2 files changed, 26 insertions(+)
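For context, a host-controller driver would consume the new helper from its urb_enqueue path roughly as in the sketch below. The structure and function names (demo_hc_priv, demo_hc_urb_enqueue) are placeholders; this is not the actual xhci-hcd conversion from patch 2/2.

#include <linux/usb.h>
#include <linux/usb/hcd.h>

/* Placeholder per-URB private data; the real layout is HCD-specific. */
struct demo_hc_priv {
	int num_tds;
};

static int demo_hc_urb_enqueue(struct usb_hcd *hcd, struct urb *urb,
			       gfp_t mem_flags)
{
	struct demo_hc_priv *priv;

	/*
	 * Reuses urb->hcpriv_mempool when it is already large enough and
	 * grows it otherwise; the returned memory is zeroed either way.
	 */
	priv = urb_hcpriv_mempool_zalloc(urb, sizeof(*priv), mem_flags);
	if (!priv)
		return -ENOMEM;

	urb->hcpriv = priv;
	/* ... queue the transfer on the controller ... */
	return 0;
}

On the giveback path nothing needs to be freed: the buffer stays attached to the urb and is released only in urb_destroy().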