Message ID: 20250414225227.3642618-3-tjmercier@google.com
State: New
Series: Replace CONFIG_DMABUF_SYSFS_STATS with BPF
On Tue, Apr 15, 2025 at 9:43 PM kernel test robot <lkp@intel.com> wrote:
>
> Hi Mercier,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on bpf-next/net]
> [also build test ERROR on bpf-next/master bpf/master linus/master v6.15-rc2 next-20250415]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/T-J-Mercier/dma-buf-Rename-and-expose-debugfs-symbols/20250415-065354
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
> patch link:    https://lore.kernel.org/r/20250414225227.3642618-3-tjmercier%40google.com
> patch subject: [PATCH 2/4] bpf: Add dmabuf iterator
> config: i386-buildonly-randconfig-005-20250416
> compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
> reproduce (this is a W=1 build):
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202504161015.x2XLaha2-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> >> ld.lld: error: undefined symbol: dmabuf_debugfs_list_mutex
> >>> referenced by dmabuf_iter.c:44 (kernel/bpf/dmabuf_iter.c:44)
> >>>               vmlinux.o:(dmabuf_iter_seq_next)
> >>> referenced by dmabuf_iter.c:53 (kernel/bpf/dmabuf_iter.c:53)
> >>>               vmlinux.o:(dmabuf_iter_seq_next)
> >>> referenced by dmabuf_iter.c:26 (kernel/bpf/dmabuf_iter.c:26)
> >>>               vmlinux.o:(dmabuf_iter_seq_start)
> >>> referenced 1 more times
> --
> >> ld.lld: error: undefined symbol: dma_buf_put
> >>> referenced by dmabuf_iter.c:45 (kernel/bpf/dmabuf_iter.c:45)
> >>>               vmlinux.o:(dmabuf_iter_seq_next)
> >>> referenced by dmabuf_iter.c:90 (kernel/bpf/dmabuf_iter.c:90)
> >>>               vmlinux.o:(dmabuf_iter_seq_stop)
> --
> >> ld.lld: error: undefined symbol: dmabuf_debugfs_list
> >>> referenced by list.h:354 (include/linux/list.h:354)
> >>>               vmlinux.o:(dmabuf_iter_seq_next)
> >>> referenced by dmabuf_iter.c:0 (kernel/bpf/dmabuf_iter.c:0)
> >>>               vmlinux.o:(dmabuf_iter_seq_start)
> >>> referenced by list.h:364 (include/linux/list.h:364)
> >>>               vmlinux.o:(dmabuf_iter_seq_start)
>
> --
> 0-DAY CI Kernel Test Service
> https://github.com/intel/lkp-tests/wiki

This is due to CONFIG_DMA_SHARED_BUFFER not being set. Fixed by:

--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -53,7 +53,7 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
 obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
-ifeq ($(CONFIG_DEBUG_FS),y)
+ifeq ($(CONFIG_DMA_SHARED_BUFFER)$(CONFIG_DEBUG_FS),yy)
 obj-$(CONFIG_BPF_SYSCALL) += dmabuf_iter.o
 endif
On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier@google.com> wrote:
[...]
> > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > an overkill to implement a new BPF iterator for it.
>
> Like other BPF iterators such as kmem_cache_iter or task_iter.
> Cgroup_iter iterates trees instead of lists. This is iterating over
> kernel objects just like the docs say, "A BPF iterator is a type of
> BPF program that allows users to iterate over specific types of kernel
> objects". More complicated iteration should not be a requirement here.
>
> > Maybe we simply
> > use debugging tools like crash or drgn for this? The access with
> > these tools will not be protected by the mutex. But from my personal
> > experience, this is not a big issue for user space debugging tools.
>
> drgn is *way* too slow, and even if it weren't, the dependencies for
> running it aren't available. crash needs debug symbols which also
> aren't available on user builds. This is not just for manual
> debugging, it's for reporting memory use in production. Or anything
> else someone might care to extract like attachment info or refcounts.

Could you please share more information about the use cases and
the time constraint here, and why drgn is too slow? Does most of the
delay come from parsing DWARF? This is mostly for my curiosity, because
I have been thinking about using drgn to do some monitoring in
production.

Thanks,
Song
On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier@google.com> wrote:
> [...]
> > >
> > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > an overkill to implement a new BPF iterator for it.
> >
> > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > Cgroup_iter iterates trees instead of lists. This is iterating over
> > kernel objects just like the docs say, "A BPF iterator is a type of
> > BPF program that allows users to iterate over specific types of kernel
> > objects". More complicated iteration should not be a requirement here.
> >
> > > Maybe we simply
> > > use debugging tools like crash or drgn for this? The access with
> > > these tools will not be protected by the mutex. But from my personal
> > > experience, this is not a big issue for user space debugging tools.
> >
> > drgn is *way* too slow, and even if it weren't, the dependencies for
> > running it aren't available. crash needs debug symbols which also
> > aren't available on user builds. This is not just for manual
> > debugging, it's for reporting memory use in production. Or anything
> > else someone might care to extract like attachment info or refcounts.
>
> Could you please share more information about the use cases and
> the time constraint here, and why drgn is too slow? Does most of the
> delay come from parsing DWARF? This is mostly for my curiosity, because
> I have been thinking about using drgn to do some monitoring in
> production.
>
> Thanks,
> Song

These RunCommands have 10 second timeouts for example. It's rare that
I see them get exceeded but it happens occasionally:
https://cs.android.com/android/platform/superproject/main/+/main:frameworks/native/cmds/dumpstate/dumpstate.cpp;drc=98bdc04b7658fde0a99403fc052d1d18e7d48ea6;l=2008

The last time I used drgn (admittedly back in 2023) it took over a
minute to iterate through less than 200 cgroups. I'm not sure what the
root cause of the slowness was, but I'd expect the DWARF processing to
be done up-front once, and the slowness I experienced was not just at
startup. Eventually I switched over to tracefs for that issue, which
we still use for some telemetry.

Other uses are by statsd for telemetry, memory reporting on app kills
or death, and for "dumpsys meminfo".

Thanks,
T.J.
On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 4:40 PM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 4:08 PM Song Liu <song@kernel.org> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 3:51 PM T.J. Mercier <tjmercier@google.com> wrote:
> > > [...]
> > > > >
> > > > > IIUC, the iterator simply traverses elements in a linked list. I feel it is
> > > > > an overkill to implement a new BPF iterator for it.
> > > >
> > > > Like other BPF iterators such as kmem_cache_iter or task_iter.
> > > > Cgroup_iter iterates trees instead of lists. This is iterating over
> > > > kernel objects just like the docs say, "A BPF iterator is a type of
> > > > BPF program that allows users to iterate over specific types of kernel
> > > > objects". More complicated iteration should not be a requirement here.
> > > >
> > > > > Maybe we simply
> > > > > use debugging tools like crash or drgn for this? The access with
> > > > > these tools will not be protected by the mutex. But from my personal
> > > > > experience, this is not a big issue for user space debugging tools.
> > > >
> > > > drgn is *way* too slow, and even if it weren't, the dependencies for
> > > > running it aren't available. crash needs debug symbols which also
> > > > aren't available on user builds. This is not just for manual
> > > > debugging, it's for reporting memory use in production. Or anything
> > > > else someone might care to extract like attachment info or refcounts.
> > >
> > > Could you please share more information about the use cases and
> > > the time constraint here, and why drgn is too slow? Does most of the
> > > delay come from parsing DWARF? This is mostly for my curiosity, because
> > > I have been thinking about using drgn to do some monitoring in
> > > production.
> > >
> > > Thanks,
> > > Song
> >
> > These RunCommands have 10 second timeouts for example. It's rare that
> > I see them get exceeded but it happens occasionally:
> > https://cs.android.com/android/platform/superproject/main/+/main:frameworks/native/cmds/dumpstate/dumpstate.cpp;drc=98bdc04b7658fde0a99403fc052d1d18e7d48ea6;l=2008
>
> Thanks for sharing this information.
>
> > The last time I used drgn (admittedly back in 2023) it took over a
> > minute to iterate through less than 200 cgroups. I'm not sure what the
> > root cause of the slowness was, but I'd expect the DWARF processing to
> > be done up-front once, and the slowness I experienced was not just at
> > startup. Eventually I switched over to tracefs for that issue, which
> > we still use for some telemetry.
>
> I haven't tried drgn on Android. On server side, iterating through 200
> cgroups should be fairly fast (< 5 seconds, where DWARF parsing is
> the most expensive part).
>
> > Other uses are by statsd for telemetry, memory reporting on app kills
> > or death, and for "dumpsys meminfo".
>
> Here is another rookie question, it appears to me there is a file descriptor
> associated with each DMA buffer, can we achieve the same goal with
> a task-file iterator?

That would find almost all of them, but not the kernel-only
allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
If there's a leak, it's likely to show up in kernel_rss because some
driver forgot to release its reference(s).) Also wouldn't that be a
ton more iterations since we'd have to visit every FD to find the
small portion that are dmabufs? I'm not actually sure if buffers that
have been mapped, and then have had their file descriptors closed
would show up in task_struct->files; if not I think that would mean
scanning both files and vmas for each task.
On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song@kernel.org> wrote:
>
> On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song@kernel.org> wrote:
> [...]
> > >
> > > Here is another rookie question, it appears to me there is a file descriptor
> > > associated with each DMA buffer, can we achieve the same goal with
> > > a task-file iterator?
> >
> > That would find almost all of them, but not the kernel-only
> > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > If there's a leak, it's likely to show up in kernel_rss because some
> > driver forgot to release its reference(s).) Also wouldn't that be a
> > ton more iterations since we'd have to visit every FD to find the
> > small portion that are dmabufs? I'm not actually sure if buffers that
> > have been mapped, and then have had their file descriptors closed
> > would show up in task_struct->files; if not I think that would mean
> > scanning both files and vmas for each task.
>
> I don't think scanning all FDs to find a small portion of specific FDs
> is a real issue. We have a tool that scans all FDs in the system and
> only dumps data for perf_event FDs. I think it should be easy to
> prototype a tool by scanning all files and all vmas. If that turns out
> to be very slow, which I highly doubt will be, we can try other
> approaches.

But this will not find *all* the buffers, and that defeats the purpose
of having the iterator.

> OTOH, I am wondering whether we can build a more generic iterator
> for a list of objects. Adding an iterator for each important kernel list
> seems not scalable in the long term.

I think the wide variety of differences in locking for different
objects would make this difficult to do in a generic way.
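For illustration of the alternative being discussed, a task_file-based prototype might look roughly like the sketch below. This is hypothetical and not part of the series: the program and function names are made up, the predicate that recognizes a dmabuf file is left as a stub, and kernel-only buffers would never be visible to it.

// SPDX-License-Identifier: GPL-2.0
/* Hypothetical sketch: walk every open file of every task and report the
 * ones that look like dmabufs. Kernel-only buffers never appear here.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Placeholder predicate: a real tool would need a reliable way to decide
 * whether a struct file belongs to a dmabuf (e.g. by its file_operations),
 * which is the awkward part of this approach.
 */
static int is_dmabuf_file(struct file *file)
{
	return 0;
}

SEC("iter/task_file")
int dump_task_dmabuf_fds(struct bpf_iter__task_file *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct task_struct *task = ctx->task;
	struct file *file = ctx->file;

	if (!task || !file)
		return 0;

	/* Skip everything that is not a dmabuf (stubbed out above). */
	if (!is_dmabuf_file(file))
		return 0;

	BPF_SEQ_PRINTF(seq, "pid=%d fd=%u\n", task->tgid, ctx->fd);
	return 0;
}

Even with a working predicate, such a program has to visit every open file of every task to find the handful that are dmabufs, and it cannot see buffers referenced only by the kernel, which is the gap raised above.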
On Thu, Apr 17, 2025 at 9:05 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Wed, Apr 16, 2025 at 9:56 PM Song Liu <song@kernel.org> wrote:
> >
> > On Wed, Apr 16, 2025 at 7:09 PM T.J. Mercier <tjmercier@google.com> wrote:
> > >
> > > On Wed, Apr 16, 2025 at 6:26 PM Song Liu <song@kernel.org> wrote:
> > [...]
> > > >
> > > > Here is another rookie question, it appears to me there is a file descriptor
> > > > associated with each DMA buffer, can we achieve the same goal with
> > > > a task-file iterator?
> > >
> > > That would find almost all of them, but not the kernel-only
> > > allocations. (kernel_rss in the dmabuf_dump output I attached earlier.
> > > If there's a leak, it's likely to show up in kernel_rss because some
> > > driver forgot to release its reference(s).) Also wouldn't that be a
> > > ton more iterations since we'd have to visit every FD to find the
> > > small portion that are dmabufs? I'm not actually sure if buffers that
> > > have been mapped, and then have had their file descriptors closed
> > > would show up in task_struct->files; if not I think that would mean
> > > scanning both files and vmas for each task.
> >
> > I don't think scanning all FDs to find a small portion of specific FDs
> > is a real issue. We have a tool that scans all FDs in the system and
> > only dumps data for perf_event FDs. I think it should be easy to
> > prototype a tool by scanning all files and all vmas. If that turns out
> > to be very slow, which I highly doubt will be, we can try other
> > approaches.
>
> But this will not find *all* the buffers, and that defeats the purpose
> of having the iterator.

Do you mean this approach cannot get kernel-only allocations? If
that's the case, we probably do need a separate iterator. I am
interested in other folks' thoughts on this.

> > OTOH, I am wondering whether we can build a more generic iterator
> > for a list of objects. Adding an iterator for each important kernel list
> > seems not scalable in the long term.
>
> I think the wide variety of differences in locking for different
> objects would make this difficult to do in a generic way.

Agreed, it is not easy to build a generic solution. But with the help
of BTF, we can probably build something that covers a large number of
use cases.

Thanks,
Song
diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index 139bdececdcf..899ead57d89d 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -284,5 +284,6 @@ extern u32 bpf_cgroup_btf_id[];
 extern u32 bpf_local_storage_map_btf_id[];
 extern u32 btf_bpf_map_id[];
 extern u32 bpf_kmem_cache_btf_id[];
+extern u32 bpf_dmabuf_btf_id[];
 
 #endif
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 70502f038b92..5b30d37ef055 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -53,6 +53,9 @@ obj-$(CONFIG_BPF_SYSCALL) += relo_core.o
 obj-$(CONFIG_BPF_SYSCALL) += btf_iter.o
 obj-$(CONFIG_BPF_SYSCALL) += btf_relocate.o
 obj-$(CONFIG_BPF_SYSCALL) += kmem_cache_iter.o
+ifeq ($(CONFIG_DEBUG_FS),y)
+obj-$(CONFIG_BPF_SYSCALL) += dmabuf_iter.o
+endif
 
 CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
 CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/bpf/dmabuf_iter.c b/kernel/bpf/dmabuf_iter.c
new file mode 100644
index 000000000000..b4b8be1d6aa4
--- /dev/null
+++ b/kernel/bpf/dmabuf_iter.c
@@ -0,0 +1,130 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2025 Google LLC */
+#include <linux/bpf.h>
+#include <linux/btf_ids.h>
+#include <linux/dma-buf.h>
+#include <linux/kernel.h>
+#include <linux/seq_file.h>
+
+BTF_ID_LIST_GLOBAL_SINGLE(bpf_dmabuf_btf_id, struct, dma_buf)
+DEFINE_BPF_ITER_FUNC(dmabuf, struct bpf_iter_meta *meta, struct dma_buf *dmabuf)
+
+static void *dmabuf_iter_seq_start(struct seq_file *seq, loff_t *pos)
+{
+	struct dma_buf *dmabuf, *ret = NULL;
+
+	if (*pos) {
+		*pos = 0;
+		return NULL;
+	}
+	/* Look for the first buffer we can obtain a reference to.
+	 * The list mutex does not protect a dmabuf's refcount, so it can be
+	 * zeroed while we are iterating. Therefore we cannot call get_dma_buf()
+	 * since the caller of this program may not already own a reference to
+	 * the buffer.
+	 */
+	mutex_lock(&dmabuf_debugfs_list_mutex);
+	list_for_each_entry(dmabuf, &dmabuf_debugfs_list, list_node) {
+		if (file_ref_get(&dmabuf->file->f_ref)) {
+			ret = dmabuf;
+			break;
+		}
+	}
+	mutex_unlock(&dmabuf_debugfs_list_mutex);
+
+	return ret;
+}
+
+static void *dmabuf_iter_seq_next(struct seq_file *seq, void *v, loff_t *pos)
+{
+	struct dma_buf *dmabuf = v, *ret = NULL;
+
+	++*pos;
+
+	mutex_lock(&dmabuf_debugfs_list_mutex);
+	dma_buf_put(dmabuf);
+	while (!list_is_last(&dmabuf->list_node, &dmabuf_debugfs_list)) {
+		dmabuf = list_next_entry(dmabuf, list_node);
+		if (file_ref_get(&dmabuf->file->f_ref)) {
+			ret = dmabuf;
+			break;
+		}
+	}
+	mutex_unlock(&dmabuf_debugfs_list_mutex);
+
+	return ret;
+}
+
+struct bpf_iter__dmabuf {
+	__bpf_md_ptr(struct bpf_iter_meta *, meta);
+	__bpf_md_ptr(struct dma_buf *, dmabuf);
+};
+
+static int __dmabuf_seq_show(struct seq_file *seq, void *v, bool in_stop)
+{
+	struct bpf_iter_meta meta = {
+		.seq = seq,
+	};
+	struct bpf_iter__dmabuf ctx = {
+		.meta = &meta,
+		.dmabuf = v,
+	};
+	struct bpf_prog *prog = bpf_iter_get_info(&meta, in_stop);
+
+	if (prog)
+		return bpf_iter_run_prog(prog, &ctx);
+
+	return 0;
+}
+
+static int dmabuf_iter_seq_show(struct seq_file *seq, void *v)
+{
+	return __dmabuf_seq_show(seq, v, false);
+}
+
+static void dmabuf_iter_seq_stop(struct seq_file *seq, void *v)
+{
+	struct dma_buf *dmabuf = v;
+
+	if (dmabuf)
+		dma_buf_put(dmabuf);
+}
+
+static const struct seq_operations dmabuf_iter_seq_ops = {
+	.start = dmabuf_iter_seq_start,
+	.next = dmabuf_iter_seq_next,
+	.stop = dmabuf_iter_seq_stop,
+	.show = dmabuf_iter_seq_show,
+};
+
+static void bpf_iter_dmabuf_show_fdinfo(const struct bpf_iter_aux_info *aux,
+					struct seq_file *seq)
+{
+	seq_puts(seq, "dmabuf iter\n");
+}
+
+static const struct bpf_iter_seq_info dmabuf_iter_seq_info = {
+	.seq_ops = &dmabuf_iter_seq_ops,
+	.init_seq_private = NULL,
+	.fini_seq_private = NULL,
+	.seq_priv_size = 0,
+};
+
+static struct bpf_iter_reg bpf_dmabuf_reg_info = {
+	.target = "dmabuf",
+	.show_fdinfo = bpf_iter_dmabuf_show_fdinfo,
+	.ctx_arg_info_size = 1,
+	.ctx_arg_info = {
+		{ offsetof(struct bpf_iter__dmabuf, dmabuf),
+		  PTR_TO_BTF_ID_OR_NULL },
+	},
+	.seq_info = &dmabuf_iter_seq_info,
+};
+
+static int __init dmabuf_iter_init(void)
+{
+	bpf_dmabuf_reg_info.ctx_arg_info[0].btf_id = bpf_dmabuf_btf_id[0];
+	return bpf_iter_reg_target(&bpf_dmabuf_reg_info);
+}
+
+late_initcall(dmabuf_iter_init);
The dmabuf iterator traverses the list of all DMA buffers. The list is
maintained only when CONFIG_DEBUG_FS is enabled.

DMA buffers are refcounted through their associated struct file. A
reference is taken on each buffer as the list is iterated to ensure each
buffer persists for the duration of the bpf program execution without
holding the list mutex.

Signed-off-by: T.J. Mercier <tjmercier@google.com>
---
 include/linux/btf_ids.h  |   1 +
 kernel/bpf/Makefile      |   3 +
 kernel/bpf/dmabuf_iter.c | 130 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 134 insertions(+)
 create mode 100644 kernel/bpf/dmabuf_iter.c
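For completeness, a minimal consumer of the new target might look like the sketch below. This is not part of the patch: it assumes a vmlinux.h generated from a kernel carrying this series, and the program name is illustrative.

// SPDX-License-Identifier: GPL-2.0
/* Minimal sketch of a BPF program using the iter/dmabuf target: print the
 * size and exporter name of every DMA buffer in the system.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

SEC("iter/dmabuf")
int dmabuf_sizes(struct bpf_iter__dmabuf *ctx)
{
	struct seq_file *seq = ctx->meta->seq;
	struct dma_buf *dmabuf = ctx->dmabuf;

	/* The dmabuf context pointer is PTR_TO_BTF_ID_OR_NULL, so the
	 * verifier requires a NULL check before it is dereferenced.
	 */
	if (!dmabuf)
		return 0;

	BPF_SEQ_PRINTF(seq, "%lu %s\n", dmabuf->size, dmabuf->exp_name);
	return 0;
}

Once loaded, such a program could be pinned and read like any other BPF iterator, e.g. "bpftool iter pin dmabuf_sizes.bpf.o /sys/fs/bpf/dmabufs" followed by "cat /sys/fs/bpf/dmabufs" (paths are examples).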