X-Patchwork-Submitter: Auger Eric
X-Patchwork-Id: 31253
From: Eric Auger
To: eric.auger@st.com, christoffer.dall@linaro.org, qemu-devel@nongnu.org, kim.phillips@freescale.com, a.rigo@virtualopensystems.com
Cc: eric.auger@linaro.org, christophe.barnichon@st.com, kvmarm@lists.cs.columbia.edu, alex.williamson@redhat.com, agraf@suse.de, peter.maydell@linaro.org, stuart.yoder@freescale.com, a.motakis@virtualopensystems.com, patches@linaro.org, Kim Phillips
Subject: [RFC v3 03/10] vfio: add vfio-platform support
Date: Mon, 2 Jun 2014 08:49:27 +0100
Message-Id: <1401695374-4287-4-git-send-email-eric.auger@linaro.org>
In-Reply-To: <1401695374-4287-1-git-send-email-eric.auger@linaro.org>
References: <1401695374-4287-1-git-send-email-eric.auger@linaro.org>

From: Kim Phillips

Functions shared by the PCI and platform device support are moved into
common.c. The common vfio_{get,put}_group() functions take an additional
argument, a pointer to a vfio_reset_handler(), which is passed on to
qemu_register_reset(), but only if it is non-NULL (the platform device
code currently passes NULL as its reset_handler).

For the platform device code, we basically use SysBusDevice instead of
PCIDevice. Since realize() returns void, unlike PCIDevice's initfn,
error codes are moved into the error message text with %m.

Only MMIO access is supported at this time.

The perceived path for future QEMU development is:

- add support for interrupts
- verify and test platform dev unmap path
- test existing PCI path for any regressions
- add support for creating platform devices on the qemu command line
  - currently device address specification is hardcoded for test
    development on Calxeda Midway's fff51000.ethernet device
- reset is not supported and registration of reset functions is
  bypassed for platform devices.
  - there is no standard means of resetting a platform device, unsure
    if it suffices to be handled at device--VFIO binding time

[1] http://www.spinics.net/lists/kvm-arm/msg08195.html

Changes (v2 -> v3):
[work done by Eric Auger]

This new version introduces 2 separate VFIO device objects:
- VFIOPCIDevice
- VFIOPlatformDevice
Both objects share a VFIODevice struct.

A shared VFIORegion struct was also created. It is embedded in the
VFIOBAR struct; VFIOPlatformDevice uses VFIORegion directly.

Introducing those base classes induced quite a lot of small changes in
the PCI code. Theoretically, PCI and platform devices can be supported
simultaneously. The PCI modifications are currently untested.

The VFIODevice is not a QOM object because of the limitation of the
single inheritance model. The VFIODevice struct embeds an ops structure
which is specialized in each VFIO leaf device. This makes it possible to
call device-specific functions from the common parts, hence achieving
better factorization.

Reset is handled that way: a single generic reset handler (in common.c)
is used for both derived classes and calls device-specific methods.

As in the original contribution, only MMIO is supported in this patch
(in mmap mode). IRQ support comes in a subsequent patch.

Signed-off-by: Kim Phillips Signed-off-by: Eric Auger --- hw/vfio/Makefile.objs | 2 + hw/vfio/common.c | 849 ++++++++++++++++++++++++++++ hw/vfio/pci.c | 1316 ++++++++++---------------------------------- hw/vfio/platform.c | 267 +++++++++ hw/vfio/vfio-common.h | 143 +++++ linux-headers/linux/vfio.h | 1 + 6 files changed, 1565 insertions(+), 1013 deletions(-) create mode 100644 hw/vfio/common.c create mode 100644 hw/vfio/platform.c create mode 100644 hw/vfio/vfio-common.h diff --git a/hw/vfio/Makefile.objs b/hw/vfio/Makefile.objs index 31c7dab..c5c76fe 100644 --- a/hw/vfio/Makefile.objs +++ b/hw/vfio/Makefile.objs @@ -1,3 +1,5 @@ ifeq ($(CONFIG_LINUX), y) +obj-$(CONFIG_SOFTMMU) += common.o obj-$(CONFIG_PCI) += pci.o +obj-$(CONFIG_SOFTMMU) += platform.o endif diff --git a/hw/vfio/common.c b/hw/vfio/common.c new file mode 100644 index 0000000..07dc409 --- /dev/null +++ b/hw/vfio/common.c @@ -0,0 +1,849 @@ +/* + * vfio based device assignment support + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Alex Williamson + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. + * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include +#include +#include "sys/mman.h" + +#include "exec/address-spaces.h" +#include "qemu/error-report.h" +#include "sysemu/kvm.h" + +#include "vfio-common.h" + +static QLIST_HEAD(, VFIOContainer) + container_list = QLIST_HEAD_INITIALIZER(container_list); + +QLIST_HEAD(, VFIOGroup) + group_list = QLIST_HEAD_INITIALIZER(group_list); + + +#ifdef CONFIG_KVM +/* + * We have a single VFIO pseudo device per KVM VM. 
Once created it lives + * for the life of the VM. Closing the file descriptor only drops our + * reference to it and the device's reference to kvm. Therefore once + * initialized, this file descriptor is only released on QEMU exit and + * we'll re-use it should another vfio device be attached before then. + */ +static int vfio_kvm_device_fd = -1; +#endif + +/* + * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 + */ +static int vfio_dma_unmap(VFIOContainer *container, + hwaddr iova, ram_addr_t size) +{ + struct vfio_iommu_type1_dma_unmap unmap = { + .argsz = sizeof(unmap), + .flags = 0, + .iova = iova, + .size = size, + }; + + if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { + DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno); + return -errno; + } + + return 0; +} + +static int vfio_dma_map(VFIOContainer *container, hwaddr iova, + ram_addr_t size, void *vaddr, bool readonly) +{ + struct vfio_iommu_type1_dma_map map = { + .argsz = sizeof(map), + .flags = VFIO_DMA_MAP_FLAG_READ, + .vaddr = (__u64)(uintptr_t)vaddr, + .iova = iova, + .size = size, + }; + + if (!readonly) { + map.flags |= VFIO_DMA_MAP_FLAG_WRITE; + } + + /* + * Try the mapping, if it fails with EBUSY, unmap the region and try + * again. This shouldn't be necessary, but we sometimes see it in + * the VGA ROM space. + */ + if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || + (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 && + ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { + return 0; + } + + DPRINTF("VFIO_MAP_DMA: %d\n", -errno); + return -errno; +} + +static bool vfio_listener_skipped_section(MemoryRegionSection *section) +{ + return !memory_region_is_ram(section->mr) || + /* + * Sizing an enabled 64-bit BAR can cause spurious mappings to + * addresses in the upper part of the 64-bit address space. These + * are never accessed by the CPU and beyond the address width of + * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. 
+ */ + section->offset_within_address_space & (1ULL << 63); +} + +static void vfio_listener_region_add(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, + iommu_data.type1.listener); + hwaddr iova, end; + void *vaddr; + int ret; + + assert(!memory_region_is_iommu(section->mr)); + + if (vfio_listener_skipped_section(section)) { + DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != + (section->offset_within_region & ~TARGET_PAGE_MASK))) { + error_report("%s received unaligned region", __func__); + return; + } + + iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); + end = (section->offset_within_address_space + int128_get64(section->size)) & + TARGET_PAGE_MASK; + + if (iova >= end) { + return; + } + + vaddr = memory_region_get_ram_ptr(section->mr) + + section->offset_within_region + + (iova - section->offset_within_address_space); + + DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", + iova, end - 1, vaddr); + + memory_region_ref(section->mr); + ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); + if (ret) { + error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx", %p) = %d (%m)", + container, iova, end - iova, vaddr, ret); + + /* + * On the initfn path, store the first error in the container so we + * can gracefully fail. Runtime, there's not much we can do other + * than throw a hardware error. 
+ */ + if (!container->iommu_data.type1.initialized) { + if (!container->iommu_data.type1.error) { + container->iommu_data.type1.error = ret; + } + } else { + hw_error("vfio: DMA mapping failed, unable to continue"); + } + } +} + +static void vfio_listener_region_del(MemoryListener *listener, + MemoryRegionSection *section) +{ + VFIOContainer *container = container_of(listener, VFIOContainer, + iommu_data.type1.listener); + hwaddr iova, end; + int ret; + + if (vfio_listener_skipped_section(section)) { + DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n", + section->offset_within_address_space, + section->offset_within_address_space + + int128_get64(int128_sub(section->size, int128_one()))); + return; + } + + if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != + (section->offset_within_region & ~TARGET_PAGE_MASK))) { + error_report("%s received unaligned region", __func__); + return; + } + + iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); + end = (section->offset_within_address_space + int128_get64(section->size)) & + TARGET_PAGE_MASK; + + if (iova >= end) { + return; + } + + DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", + iova, end - 1); + + ret = vfio_dma_unmap(container, iova, end - iova); + memory_region_unref(section->mr); + if (ret) { + error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " + "0x%"HWADDR_PRIx") = %d (%m)", + container, iova, end - iova, ret); + } +} + +static MemoryListener vfio_memory_listener = { + .region_add = vfio_listener_region_add, + .region_del = vfio_listener_region_del, +}; + +static void vfio_listener_release(VFIOContainer *container) +{ + memory_listener_unregister(&container->iommu_data.type1.listener); +} + +static void vfio_kvm_device_add_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_ADD, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (!kvm_enabled()) { + return; + } + 
+ if (vfio_kvm_device_fd < 0) { + struct kvm_create_device cd = { + .type = KVM_DEV_TYPE_VFIO, + }; + + if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { + DPRINTF("KVM_CREATE_DEVICE: %m\n"); + return; + } + + vfio_kvm_device_fd = cd.fd; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to add group %d to KVM VFIO device: %m", + group->groupid); + } +#endif +} + +static void vfio_kvm_device_del_group(VFIOGroup *group) +{ +#ifdef CONFIG_KVM + struct kvm_device_attr attr = { + .group = KVM_DEV_VFIO_GROUP, + .attr = KVM_DEV_VFIO_GROUP_DEL, + .addr = (uint64_t)(unsigned long)&group->fd, + }; + + if (vfio_kvm_device_fd < 0) { + return; + } + + if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { + error_report("Failed to remove group %d from KVM VFIO device: %m", + group->groupid); + } +#endif +} + +static int vfio_connect_container(VFIOGroup *group) +{ + VFIOContainer *container; + int ret, fd; + + if (group->container) { + return 0; + } + + QLIST_FOREACH(container, &container_list, next) { + if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + return 0; + } + } + + fd = qemu_open("/dev/vfio/vfio", O_RDWR); + if (fd < 0) { + error_report("vfio: failed to open /dev/vfio/vfio: %m"); + return -errno; + } + + ret = ioctl(fd, VFIO_GET_API_VERSION); + if (ret != VFIO_API_VERSION) { + error_report("vfio: supported vfio version: %d, " + "reported version: %d", VFIO_API_VERSION, ret); + close(fd); + return -EINVAL; + } + + container = g_malloc0(sizeof(*container)); + container->fd = fd; + + if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) { + ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd); + if (ret) { + error_report("vfio: failed to set group container: %m"); + g_free(container); + close(fd); + return -errno; + } + + ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); + if (ret) { + 
error_report("vfio: failed to set iommu for container: %m"); + g_free(container); + close(fd); + return -errno; + } + + container->iommu_data.type1.listener = vfio_memory_listener; + container->iommu_data.release = vfio_listener_release; + + memory_listener_register(&container->iommu_data.type1.listener, + &address_space_memory); + + if (container->iommu_data.type1.error) { + ret = container->iommu_data.type1.error; + vfio_listener_release(container); + g_free(container); + close(fd); + error_report("vfio: memory listener initialization failed" + " for container"); + return ret; + } + + container->iommu_data.type1.initialized = true; + + } else { + error_report("vfio: No available IOMMU models"); + g_free(container); + close(fd); + return -EINVAL; + } + + QLIST_INIT(&container->group_list); + QLIST_INSERT_HEAD(&container_list, container, next); + + group->container = container; + QLIST_INSERT_HEAD(&container->group_list, group, container_next); + + return 0; +} + +static void vfio_disconnect_container(VFIOGroup *group) +{ + VFIOContainer *container = group->container; + + if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { + error_report("vfio: error disconnecting group %d from container", + group->groupid); + } + + QLIST_REMOVE(group, container_next); + group->container = NULL; + + if (QLIST_EMPTY(&container->group_list)) { + if (container->iommu_data.release) { + container->iommu_data.release(container); + } + QLIST_REMOVE(container, next); + DPRINTF("vfio_disconnect_container: close container->fd\n"); + close(container->fd); + g_free(container); + } +} + +VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler) +{ + VFIOGroup *group; + char path[32]; + struct vfio_group_status status = { .argsz = sizeof(status) }; + + QLIST_FOREACH(group, &group_list, next) { + if (group->groupid == groupid) { + return group; + } + } + + group = g_malloc0(sizeof(*group)); + + snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); + group->fd = 
qemu_open(path, O_RDWR); + if (group->fd < 0) { + error_report("vfio: error opening %s: %m", path); + g_free(group); + return NULL; + } + + if (ioctl(group->fd, VFIO_GROUP_GET_STATUS, &status)) { + error_report("vfio: error getting group status: %m"); + close(group->fd); + g_free(group); + return NULL; + } + + if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { + error_report("vfio: error, group %d is not viable, please ensure " + "all devices within the iommu_group are bound to their " + "vfio bus driver.", groupid); + close(group->fd); + g_free(group); + return NULL; + } + + group->groupid = groupid; + QLIST_INIT(&group->device_list); + + if (vfio_connect_container(group)) { + error_report("vfio: failed to setup container for group %d", groupid); + close(group->fd); + g_free(group); + return NULL; + } + + if (QLIST_EMPTY(&group_list) && reset_handler) { + qemu_register_reset(reset_handler, NULL); + } + + QLIST_INSERT_HEAD(&group_list, group, next); + + vfio_kvm_device_add_group(group); + + return group; +} + +void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler) +{ + if (!QLIST_EMPTY(&group->device_list)) { + return; + } + + vfio_kvm_device_del_group(group); + vfio_disconnect_container(group); + QLIST_REMOVE(group, next); + DPRINTF("vfio_put_group: close group->fd\n"); + close(group->fd); + g_free(group); + + if (QLIST_EMPTY(&group_list) && reset_handler) { + qemu_unregister_reset(reset_handler, NULL); + } +} + + +void vfio_unmask_irqindex(VFIODevice *vdev, int index) +{ + struct vfio_irq_set irq_set = { + .argsz = sizeof(irq_set), + .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK, + .index = index, + .start = 0, + .count = 1, + }; + + ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); +} + +void vfio_disable_irqindex(VFIODevice *vdev, int index) +{ + struct vfio_irq_set irq_set = { + .argsz = sizeof(irq_set), + .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER, + .index = index, + .start = 0, + .count = 0, + }; + + 
ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); +} + +#ifdef CONFIG_KVM /* Unused outside of CONFIG_KVM code */ +void vfio_mask_int(VFIODevice *vdev, int index) +{ + struct vfio_irq_set irq_set = { + .argsz = sizeof(irq_set), + .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK, + .index = index, + .start = 0, + .count = 1, + }; + + ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); +} +#endif + +int vfio_mmap_region(Object *vdev, VFIORegion *region, + MemoryRegion *mem, MemoryRegion *submem, + void **map, size_t size, off_t offset, + const char *name) +{ + int ret = 0; + + if (VFIO_ALLOW_MMAP && size && region->flags & VFIO_REGION_INFO_FLAG_MMAP) { + int prot = 0; + + if (region->flags & VFIO_REGION_INFO_FLAG_READ) { + prot |= PROT_READ; + } + + if (region->flags & VFIO_REGION_INFO_FLAG_WRITE) { + prot |= PROT_WRITE; + } + + *map = mmap(NULL, size, prot, MAP_SHARED, + region->fd, region->fd_offset + offset); + if (*map == MAP_FAILED) { + *map = NULL; + ret = -errno; + goto empty_region; + } + + memory_region_init_ram_ptr(submem, vdev, name, size, *map); + } else { +empty_region: + /* Create a zero sized sub-region to make cleanup easy. 
*/ + memory_region_init(submem, vdev, name, 0); + } + + memory_region_add_subregion(mem, offset, submem); + + return ret; +} + +/* + * IO Port/MMIO - Beware of the endians, VFIO is always little endian + */ +void vfio_region_write(void *opaque, hwaddr addr, + uint64_t data, unsigned size) +{ + VFIORegion *region = opaque; + VFIODevice *vdev = region->vdev; + union { + uint8_t byte; + uint16_t word; + uint32_t dword; + uint64_t qword; + } buf; + + switch (size) { + case 1: + buf.byte = data; + break; + case 2: + buf.word = cpu_to_le16(data); + break; + case 4: + buf.dword = cpu_to_le32(data); + break; + default: + hw_error("vfio: unsupported write size, %d bytes", size); + break; + } + + if (pwrite(region->fd, &buf, size, region->fd_offset + addr) != size) { + error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m", + __func__, addr, data, size); + } + +#ifdef DEBUG_VFIO + { + DPRINTF("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64 + ", %d)\n", __func__, vdev->name, + region->nr, addr, data, size); + } +#endif + + /* + * A read or write to a BAR always signals an INTx EOI. This will + * do nothing if not pending (including not in INTx mode). We assume + * that a BAR access is in response to an interrupt and that BAR + * accesses will service the interrupt. Unfortunately, we don't know + * which access will service the interrupt, so we're potentially + * getting quite a few host interrupts per guest interrupt. 
+ */ + vdev->ops->vfio_eoi(vdev); +} + +uint64_t vfio_region_read(void *opaque, + hwaddr addr, unsigned size) +{ + VFIORegion *region = opaque; + VFIODevice *vdev = region->vdev; + union { + uint8_t byte; + uint16_t word; + uint32_t dword; + uint64_t qword; + } buf; + uint64_t data = 0; + + if (pread(region->fd, &buf, size, region->fd_offset + addr) != size) { + error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m", + __func__, addr, size); + return (uint64_t)-1; + } + + switch (size) { + case 1: + data = buf.byte; + break; + case 2: + data = le16_to_cpu(buf.word); + break; + case 4: + data = le32_to_cpu(buf.dword); + break; + default: + hw_error("vfio: unsupported read size, %d bytes", size); + break; + } + +#ifdef DEBUG_VFIO + { + DPRINTF("%s(%s:region%d+0x%"HWADDR_PRIx", %d) = 0x%"PRIx64"\n", + __func__, vdev->name, + region->nr, addr, size, data); + } +#endif + + /* Same as write above */ + vdev->ops->vfio_eoi(vdev); + + return data; +} + + +int vfio_get_base_device(VFIOGroup *group, const char *name, + struct VFIODevice *vdev) +{ + struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; + int ret; + int fd; + + fd = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name); + if (fd < 0) { + error_report("vfio: error getting device %s from group %d: %m", + name, group->groupid); + error_printf("Verify all devices in group %d are bound to the " + "vfio driver and are not already in use\n", + group->groupid); + return fd; + } + + vdev->fd = fd; + vdev->group = group; + QLIST_INSERT_HEAD(&group->device_list, vdev, next); + + /* Sanity check device */ + ret = ioctl(fd, VFIO_DEVICE_GET_INFO, &dev_info); + if (ret) { + error_report("vfio: error getting device info: %m"); + goto error; + } + + DPRINTF("Device %s flags: %u, regions: %u, irqs: %u\n", name, + dev_info.flags, dev_info.num_regions, dev_info.num_irqs); + + /* Check type consistency */ + if (dev_info.flags & VFIO_DEVICE_FLAGS_PCI) { + if (vdev->type != VFIO_DEVICE_TYPE_PCI) { + goto error; + } + } else if 
(dev_info.flags & VFIO_DEVICE_FLAGS_PLATFORM) { + if (vdev->type != VFIO_DEVICE_TYPE_PLATFORM) { + goto error; + } + } else { + goto error; + } + + vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); + + vdev->num_regions = dev_info.num_regions; + vdev->num_irqs = dev_info.num_irqs; + + /* call device specific functions */ + ret = vdev->ops->vfio_check_device(vdev); + if (ret < 0) { + DPRINTF("%s -- Error when checking device\n", __func__); + goto error; + } + ret = vdev->ops->vfio_get_device_regions(vdev); + if (ret < 0) { + DPRINTF("%s -- Error when handling regions\n", __func__); + goto error; + } + vdev->ops->vfio_get_device_interrupts(vdev); + if (ret < 0) { + DPRINTF("%s -- Error when handling interrupts\n", __func__); + goto error; + } + + return 0; + +error: + if (ret) { + vfio_put_base_device(vdev); + } + return ret; + +} + + +void vfio_put_base_device(VFIODevice *vdev) +{ + QLIST_REMOVE(vdev, next); + vdev->group = NULL; + DPRINTF("vfio_put_device: close vdev->fd\n"); + close(vdev->fd); +} + + +int vfio_base_device_init(VFIODevice *vdev, int type) +{ + VFIODevice *tmp; + VFIOGroup *group; + char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name; + ssize_t len; + struct stat st; + int groupid; + int ret; + + /* name must be set prior to the call */ + if (vdev->name == NULL) { + return -errno; + } + /* device specific ops must be set prior to the call */ + if (vdev->ops == NULL) { + return -errno; + } + + /* Check that the host device exists */ + if (type == VFIO_DEVICE_TYPE_PCI) { + snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/", vdev->name); + vdev->type = VFIO_DEVICE_TYPE_PCI; + } else if (type == VFIO_DEVICE_TYPE_PLATFORM) { + snprintf(path, sizeof(path), "/sys/bus/platform/devices/%s/", + vdev->name); + vdev->type = VFIO_DEVICE_TYPE_PLATFORM; + } else { + return -errno; + } + + if (stat(path, &st) < 0) { + error_report("vfio: error: no such host device: %s", path); + return -errno; + } + + strncat(path, "iommu_group", 
sizeof(path) - strlen(path) - 1); + + len = readlink(path, iommu_group_path, sizeof(path)); + if (len <= 0 || len >= sizeof(path)) { + error_report("vfio: error no iommu_group for device"); + return len < 0 ? -errno : ENAMETOOLONG; + } + + iommu_group_path[len] = 0; + group_name = basename(iommu_group_path); + + if (sscanf(group_name, "%d", &groupid) != 1) { + error_report("vfio: error reading %s: %m", path); + return -errno; + } + + DPRINTF("%s(%s) group %d\n", __func__, vdev->name, groupid); + + group = vfio_get_group(groupid, vfio_reset_handler); + if (!group) { + error_report("vfio: failed to get group %d", groupid); + return -ENOENT; + } + + snprintf(path, sizeof(path), "%s", vdev->name); + + QLIST_FOREACH(tmp, &group->device_list, next) { + if (strcmp(tmp->name, vdev->name) == 0) { + error_report("vfio: error: device %s is already attached", path); + vfio_put_group(group, vfio_reset_handler); + return -EBUSY; + } + } + + ret = vfio_get_base_device(group, path, vdev); + if (ret < 0) { + error_report("vfio: failed to get device %s", path); + vfio_put_group(group, vfio_reset_handler); + return ret; + } + + return ret; + +} + +void print_regions(VFIODevice *vdev) +{ + int i; + DPRINTF("Device \"%s\" counts %d region(s):\n", + vdev->name, vdev->num_regions); + + for (i = 0; i < vdev->num_regions; i++) { + DPRINTF("- region %d flags = 0x%lx, size = 0x%lx, " + "fd= %d, offset = 0x%lx\n", + vdev->regions[i]->nr, + (unsigned long)vdev->regions[i]->flags, + (unsigned long)vdev->regions[i]->size, + vdev->regions[i]->fd, + (unsigned long)vdev->regions[i]->fd_offset); + } +} + +void vfio_reset_handler(void *opaque) +{ + VFIOGroup *group; + VFIODevice *vdev; + + QLIST_FOREACH(group, &group_list, next) { + QLIST_FOREACH(vdev, &group->device_list, next) { + vdev->ops->vfio_compute_needs_reset(vdev); + } + } + + QLIST_FOREACH(group, &group_list, next) { + QLIST_FOREACH(vdev, &group->device_list, next) { + if (vdev->needs_reset) { + vdev->ops->vfio_hot_reset_multi(vdev); + } + 
} + } +} diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c index 9cf5b84..a9e4d97 100644 --- a/hw/vfio/pci.c +++ b/hw/vfio/pci.c @@ -1,5 +1,5 @@ /* - * vfio based device assignment support + * vfio based device assignment support - PCI devices * * Copyright Red Hat, Inc. 2012 * @@ -18,48 +18,23 @@ * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) */ -#include #include #include #include -#include -#include -#include -#include "config.h" -#include "exec/address-spaces.h" -#include "exec/memory.h" #include "hw/pci/msi.h" #include "hw/pci/msix.h" -#include "hw/pci/pci.h" -#include "qemu-common.h" #include "qemu/error-report.h" -#include "qemu/event_notifier.h" -#include "qemu/queue.h" #include "qemu/range.h" -#include "sysemu/kvm.h" #include "sysemu/sysemu.h" -/* #define DEBUG_VFIO */ -#ifdef DEBUG_VFIO -#define DPRINTF(fmt, ...) \ - do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0) -#else -#define DPRINTF(fmt, ...) \ - do { } while (0) -#endif +#include "vfio-common.h" -/* Extra debugging, trap acceleration paths for more logging */ -#define VFIO_ALLOW_MMAP 1 -#define VFIO_ALLOW_KVM_INTX 1 -#define VFIO_ALLOW_KVM_MSI 1 -#define VFIO_ALLOW_KVM_MSIX 1 - -struct VFIODevice; +extern QLIST_HEAD(, VFIOGroup) group_list; typedef struct VFIOQuirk { MemoryRegion mem; - struct VFIODevice *vdev; + struct VFIOPCIDevice *vdev; QLIST_ENTRY(VFIOQuirk) next; struct { uint32_t base_offset:TARGET_PAGE_BITS; @@ -81,14 +56,8 @@ typedef struct VFIOQuirk { } VFIOQuirk; typedef struct VFIOBAR { - off_t fd_offset; /* offset of BAR within device fd */ - int fd; /* device fd, allows us to pass VFIOBAR as opaque data */ - MemoryRegion mem; /* slow, read/write access */ - MemoryRegion mmap_mem; /* direct mapped access */ - void *mmap; - size_t size; - uint32_t flags; /* VFIO region flags (rd/wr/mmap) */ - uint8_t nr; /* cache the BAR number for debug */ + VFIORegion region; + bool ioport; bool mem64; QLIST_HEAD(, VFIOQuirk) quirks; @@ -120,7 +89,7 @@ typedef struct VFIOINTx { 
typedef struct VFIOMSIVector { EventNotifier interrupt; /* eventfd triggered on interrupt */ - struct VFIODevice *vdev; /* back pointer to device */ + struct VFIOPCIDevice *vdev; /* back pointer to device */ MSIMessage msg; /* cache the MSI message so we know when it changes */ int virq; /* KVM irqchip route for QEMU bypass */ bool use; @@ -133,27 +102,6 @@ enum { VFIO_INT_MSIX = 3, }; -struct VFIOGroup; - -typedef struct VFIOType1 { - MemoryListener listener; - int error; - bool initialized; -} VFIOType1; - -typedef struct VFIOContainer { - int fd; /* /dev/vfio/vfio, empowered by the attached groups */ - struct { - /* enable abstraction to support various iommu backends */ - union { - VFIOType1 type1; - }; - void (*release)(struct VFIOContainer *); - } iommu_data; - QLIST_HEAD(, VFIOGroup) group_list; - QLIST_ENTRY(VFIOContainer) next; -} VFIOContainer; - /* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */ typedef struct VFIOMSIXInfo { uint8_t table_bar; @@ -165,9 +113,9 @@ typedef struct VFIOMSIXInfo { void *mmap; } VFIOMSIXInfo; -typedef struct VFIODevice { +typedef struct VFIOPCIDevice { + VFIODevice vdev; PCIDevice pdev; - int fd; VFIOINTx intx; unsigned int config_size; uint8_t *emulated_config_bits; /* QEMU emulated bits, little-endian */ @@ -183,31 +131,18 @@ typedef struct VFIODevice { VFIOBAR bars[PCI_NUM_REGIONS - 1]; /* No ROM */ VFIOVGA vga; /* 0xa0000, 0x3b0, 0x3c0 */ PCIHostDeviceAddress host; - QLIST_ENTRY(VFIODevice) next; - struct VFIOGroup *group; EventNotifier err_notifier; uint32_t features; #define VFIO_FEATURE_ENABLE_VGA_BIT 0 #define VFIO_FEATURE_ENABLE_VGA (1 << VFIO_FEATURE_ENABLE_VGA_BIT) int32_t bootindex; uint8_t pm_cap; - bool reset_works; bool has_vga; bool pci_aer; bool has_flr; bool has_pm_reset; - bool needs_reset; bool rom_read_failed; -} VFIODevice; - -typedef struct VFIOGroup { - int fd; - int groupid; - VFIOContainer *container; - QLIST_HEAD(, VFIODevice) device_list; - QLIST_ENTRY(VFIOGroup) next; - 
QLIST_ENTRY(VFIOGroup) container_next; -} VFIOGroup; +} VFIOPCIDevice; typedef struct VFIORomBlacklistEntry { uint16_t vendor_id; @@ -234,75 +169,12 @@ static const VFIORomBlacklistEntry romblacklist[] = { #define MSIX_CAP_LENGTH 12 -static QLIST_HEAD(, VFIOContainer) - container_list = QLIST_HEAD_INITIALIZER(container_list); - -static QLIST_HEAD(, VFIOGroup) - group_list = QLIST_HEAD_INITIALIZER(group_list); - -#ifdef CONFIG_KVM -/* - * We have a single VFIO pseudo device per KVM VM. Once created it lives - * for the life of the VM. Closing the file descriptor only drops our - * reference to it and the device's reference to kvm. Therefore once - * initialized, this file descriptor is only released on QEMU exit and - * we'll re-use it should another vfio device be attached before then. - */ -static int vfio_kvm_device_fd = -1; -#endif - -static void vfio_disable_interrupts(VFIODevice *vdev); +static void vfio_disable_interrupts(VFIOPCIDevice *vdev); static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len); static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, uint32_t val, int len); -static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled); - -/* - * Common VFIO interrupt disable - */ -static void vfio_disable_irqindex(VFIODevice *vdev, int index) -{ - struct vfio_irq_set irq_set = { - .argsz = sizeof(irq_set), - .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER, - .index = index, - .start = 0, - .count = 0, - }; - - ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); -} +static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled); -/* - * INTx - */ -static void vfio_unmask_intx(VFIODevice *vdev) -{ - struct vfio_irq_set irq_set = { - .argsz = sizeof(irq_set), - .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_UNMASK, - .index = VFIO_PCI_INTX_IRQ_INDEX, - .start = 0, - .count = 1, - }; - - ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); -} - -#ifdef CONFIG_KVM /* Unused outside of CONFIG_KVM code */ 
-static void vfio_mask_intx(VFIODevice *vdev) -{ - struct vfio_irq_set irq_set = { - .argsz = sizeof(irq_set), - .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK, - .index = VFIO_PCI_INTX_IRQ_INDEX, - .start = 0, - .count = 1, - }; - - ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set); -} -#endif /* * Disabling BAR mmaping can be slow, but toggling it around INTx can @@ -321,11 +193,11 @@ static void vfio_mask_intx(VFIODevice *vdev) */ static void vfio_intx_mmap_enable(void *opaque) { - VFIODevice *vdev = opaque; + VFIOPCIDevice *vdev = opaque; if (vdev->intx.pending) { timer_mod(vdev->intx.mmap_timer, - qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout); + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout); return; } @@ -334,7 +206,7 @@ static void vfio_intx_mmap_enable(void *opaque) static void vfio_intx_interrupt(void *opaque) { - VFIODevice *vdev = opaque; + VFIOPCIDevice *vdev = opaque; if (!event_notifier_test_and_clear(&vdev->intx.interrupt)) { return; @@ -349,25 +221,27 @@ static void vfio_intx_interrupt(void *opaque) vfio_mmap_set_enabled(vdev, false); if (vdev->intx.mmap_timeout) { timer_mod(vdev->intx.mmap_timer, - qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout); + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + vdev->intx.mmap_timeout); } } -static void vfio_eoi(VFIODevice *vdev) +static void vfio_pci_eoi(VFIODevice *vdev) { - if (!vdev->intx.pending) { + VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev); + + if (!vpcidev->intx.pending) { return; } - DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __func__, vdev->host.domain, - vdev->host.bus, vdev->host.slot, vdev->host.function); + DPRINTF("%s(%04x:%02x:%02x.%x) EOI\n", __func__, vpcidev->host.domain, + vpcidev->host.bus, vpcidev->host.slot, vpcidev->host.function); - vdev->intx.pending = false; - pci_irq_deassert(&vdev->pdev); - vfio_unmask_intx(vdev); + vpcidev->intx.pending = false; + pci_irq_deassert(&vpcidev->pdev); + vfio_unmask_irqindex(vdev, 
VFIO_PCI_INTX_IRQ_INDEX); } -static void vfio_enable_intx_kvm(VFIODevice *vdev) +static void vfio_enable_intx_kvm(VFIOPCIDevice *vdev) { #ifdef CONFIG_KVM struct kvm_irqfd irqfd = { @@ -387,7 +261,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev) /* Get to a known interrupt state */ qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev); - vfio_mask_intx(vdev); + vfio_mask_int(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); vdev->intx.pending = false; pci_irq_deassert(&vdev->pdev); @@ -417,7 +291,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev) *pfd = irqfd.resamplefd; - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); g_free(irq_set); if (ret) { error_report("vfio: Error: Failed to setup INTx unmask fd: %m"); @@ -425,7 +299,7 @@ static void vfio_enable_intx_kvm(VFIODevice *vdev) } /* Let'em rip */ - vfio_unmask_intx(vdev); + vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); vdev->intx.kvm_accel = true; @@ -442,11 +316,11 @@ fail_irqfd: event_notifier_cleanup(&vdev->intx.unmask); fail: qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev); - vfio_unmask_intx(vdev); + vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); #endif } -static void vfio_disable_intx_kvm(VFIODevice *vdev) +static void vfio_disable_intx_kvm(VFIOPCIDevice *vdev) { #ifdef CONFIG_KVM struct kvm_irqfd irqfd = { @@ -463,7 +337,7 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev) * Get to a known state, hardware masked, QEMU ready to accept new * interrupts, QEMU IRQ de-asserted. 
*/ - vfio_mask_intx(vdev); + vfio_mask_int(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); vdev->intx.pending = false; pci_irq_deassert(&vdev->pdev); @@ -481,7 +355,7 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev) vdev->intx.kvm_accel = false; /* If we've missed an event, let it re-fire through QEMU */ - vfio_unmask_intx(vdev); + vfio_unmask_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n", __func__, vdev->host.domain, vdev->host.bus, @@ -491,14 +365,14 @@ static void vfio_disable_intx_kvm(VFIODevice *vdev) static void vfio_update_irq(PCIDevice *pdev) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); PCIINTxRoute route; if (vdev->interrupt != VFIO_INT_INTx) { return; } - route = pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin); + route = pci_device_route_intx_to_irq(pdev, vdev->intx.pin); if (!pci_intx_route_changed(&vdev->intx.route, &route)) { return; /* Nothing changed */ @@ -519,10 +393,10 @@ static void vfio_update_irq(PCIDevice *pdev) vfio_enable_intx_kvm(vdev); /* Re-enable the interrupt in cased we missed an EOI */ - vfio_eoi(vdev); + vfio_pci_eoi(&vdev->vdev); } -static int vfio_enable_intx(VFIODevice *vdev) +static int vfio_enable_intx(VFIOPCIDevice *vdev) { uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1); int ret, argsz; @@ -569,7 +443,7 @@ static int vfio_enable_intx(VFIODevice *vdev) *pfd = event_notifier_get_fd(&vdev->intx.interrupt); qemu_set_fd_handler(*pfd, vfio_intx_interrupt, NULL, vdev); - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); g_free(irq_set); if (ret) { error_report("vfio: Error: Failed to setup INTx fd: %m"); @@ -588,13 +462,13 @@ static int vfio_enable_intx(VFIODevice *vdev) return 0; } -static void vfio_disable_intx(VFIODevice *vdev) +static void vfio_disable_intx(VFIOPCIDevice *vdev) { int fd; 
timer_del(vdev->intx.mmap_timer); vfio_disable_intx_kvm(vdev); - vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX); + vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_INTX_IRQ_INDEX); vdev->intx.pending = false; pci_irq_deassert(&vdev->pdev); vfio_mmap_set_enabled(vdev, true); @@ -615,7 +489,7 @@ static void vfio_disable_intx(VFIODevice *vdev) static void vfio_msi_interrupt(void *opaque) { VFIOMSIVector *vector = opaque; - VFIODevice *vdev = vector->vdev; + VFIOPCIDevice *vdev = vector->vdev; int nr = vector - vdev->msi_vectors; if (!event_notifier_test_and_clear(&vector->interrupt)) { @@ -647,7 +521,7 @@ static void vfio_msi_interrupt(void *opaque) } } -static int vfio_enable_vectors(VFIODevice *vdev, bool msix) +static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix) { struct vfio_irq_set *irq_set; int ret = 0, i, argsz; @@ -672,7 +546,7 @@ static int vfio_enable_vectors(VFIODevice *vdev, bool msix) fds[i] = event_notifier_get_fd(&vdev->msi_vectors[i].interrupt); } - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); g_free(irq_set); @@ -682,7 +556,7 @@ static int vfio_enable_vectors(VFIODevice *vdev, bool msix) static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, MSIMessage *msg, IOHandler *handler) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); VFIOMSIVector *vector; int ret; @@ -723,7 +597,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, * increase them as needed. 
*/ if (vdev->nr_vectors < nr + 1) { - vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX); + vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSIX_IRQ_INDEX); vdev->nr_vectors = nr + 1; ret = vfio_enable_vectors(vdev, true); if (ret) { @@ -747,7 +621,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr, *pfd = event_notifier_get_fd(&vector->interrupt); - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); g_free(irq_set); if (ret) { error_report("vfio: failed to modify vector, %d", ret); @@ -765,7 +639,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev, static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); VFIOMSIVector *vector = &vdev->msi_vectors[nr]; int argsz; struct vfio_irq_set *irq_set; @@ -795,7 +669,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr) *pfd = -1; - ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); g_free(irq_set); @@ -813,7 +687,7 @@ static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr) vector->use = false; } -static void vfio_enable_msix(VFIODevice *vdev) +static void vfio_enable_msix(VFIOPCIDevice *vdev) { vfio_disable_interrupts(vdev); @@ -846,7 +720,7 @@ static void vfio_enable_msix(VFIODevice *vdev) vdev->host.bus, vdev->host.slot, vdev->host.function); } -static void vfio_enable_msi(VFIODevice *vdev) +static void vfio_enable_msi(VFIOPCIDevice *vdev) { int ret, i; @@ -923,7 +797,7 @@ retry: vdev->host.function, vdev->nr_vectors); } -static void vfio_disable_msi_common(VFIODevice *vdev) +static void vfio_disable_msi_common(VFIOPCIDevice *vdev) { g_free(vdev->msi_vectors); vdev->msi_vectors = NULL; @@ -933,7 +807,7 @@ static void vfio_disable_msi_common(VFIODevice *vdev) vfio_enable_intx(vdev); } -static void 
vfio_disable_msix(VFIODevice *vdev) +static void vfio_disable_msix(VFIOPCIDevice *vdev) { int i; @@ -950,7 +824,7 @@ static void vfio_disable_msix(VFIODevice *vdev) } if (vdev->nr_vectors) { - vfio_disable_irqindex(vdev, VFIO_PCI_MSIX_IRQ_INDEX); + vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSIX_IRQ_INDEX); } vfio_disable_msi_common(vdev); @@ -959,11 +833,11 @@ static void vfio_disable_msix(VFIODevice *vdev) vdev->host.bus, vdev->host.slot, vdev->host.function); } -static void vfio_disable_msi(VFIODevice *vdev) +static void vfio_disable_msi(VFIOPCIDevice *vdev) { int i; - vfio_disable_irqindex(vdev, VFIO_PCI_MSI_IRQ_INDEX); + vfio_disable_irqindex(&vdev->vdev, VFIO_PCI_MSI_IRQ_INDEX); for (i = 0; i < vdev->nr_vectors; i++) { VFIOMSIVector *vector = &vdev->msi_vectors[i]; @@ -991,7 +865,7 @@ static void vfio_disable_msi(VFIODevice *vdev) vdev->host.bus, vdev->host.slot, vdev->host.function); } -static void vfio_update_msi(VFIODevice *vdev) +static void vfio_update_msi(VFIOPCIDevice *vdev) { int i; @@ -1018,119 +892,17 @@ static void vfio_update_msi(VFIODevice *vdev) } } -/* - * IO Port/MMIO - Beware of the endians, VFIO is always little endian - */ -static void vfio_bar_write(void *opaque, hwaddr addr, - uint64_t data, unsigned size) -{ - VFIOBAR *bar = opaque; - union { - uint8_t byte; - uint16_t word; - uint32_t dword; - uint64_t qword; - } buf; - - switch (size) { - case 1: - buf.byte = data; - break; - case 2: - buf.word = cpu_to_le16(data); - break; - case 4: - buf.dword = cpu_to_le32(data); - break; - default: - hw_error("vfio: unsupported write size, %d bytes", size); - break; - } - - if (pwrite(bar->fd, &buf, size, bar->fd_offset + addr) != size) { - error_report("%s(,0x%"HWADDR_PRIx", 0x%"PRIx64", %d) failed: %m", - __func__, addr, data, size); - } - -#ifdef DEBUG_VFIO - { - VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar->nr]); - - DPRINTF("%s(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx", 0x%"PRIx64 - ", %d)\n", __func__, vdev->host.domain, 
vdev->host.bus, - vdev->host.slot, vdev->host.function, bar->nr, addr, - data, size); - } -#endif - - /* - * A read or write to a BAR always signals an INTx EOI. This will - * do nothing if not pending (including not in INTx mode). We assume - * that a BAR access is in response to an interrupt and that BAR - * accesses will service the interrupt. Unfortunately, we don't know - * which access will service the interrupt, so we're potentially - * getting quite a few host interrupts per guest interrupt. - */ - vfio_eoi(container_of(bar, VFIODevice, bars[bar->nr])); -} - -static uint64_t vfio_bar_read(void *opaque, - hwaddr addr, unsigned size) -{ - VFIOBAR *bar = opaque; - union { - uint8_t byte; - uint16_t word; - uint32_t dword; - uint64_t qword; - } buf; - uint64_t data = 0; - - if (pread(bar->fd, &buf, size, bar->fd_offset + addr) != size) { - error_report("%s(,0x%"HWADDR_PRIx", %d) failed: %m", - __func__, addr, size); - return (uint64_t)-1; - } - - switch (size) { - case 1: - data = buf.byte; - break; - case 2: - data = le16_to_cpu(buf.word); - break; - case 4: - data = le32_to_cpu(buf.dword); - break; - default: - hw_error("vfio: unsupported read size, %d bytes", size); - break; - } - -#ifdef DEBUG_VFIO - { - VFIODevice *vdev = container_of(bar, VFIODevice, bars[bar->nr]); - - DPRINTF("%s(%04x:%02x:%02x.%x:BAR%d+0x%"HWADDR_PRIx - ", %d) = 0x%"PRIx64"\n", __func__, vdev->host.domain, - vdev->host.bus, vdev->host.slot, vdev->host.function, - bar->nr, addr, size, data); - } -#endif - - /* Same as write above */ - vfio_eoi(container_of(bar, VFIODevice, bars[bar->nr])); - - return data; -} static const MemoryRegionOps vfio_bar_ops = { - .read = vfio_bar_read, - .write = vfio_bar_write, + .read = vfio_region_read, + .write = vfio_region_write, .endianness = DEVICE_LITTLE_ENDIAN, }; -static void vfio_pci_load_rom(VFIODevice *vdev) + +/* PCI ONLY FUNCTIONS */ + +static void vfio_pci_load_rom(VFIOPCIDevice *vdev) { struct vfio_region_info reg_info = { .argsz = 
sizeof(reg_info), @@ -1139,8 +911,9 @@ static void vfio_pci_load_rom(VFIODevice *vdev) uint64_t size; off_t off = 0; size_t bytes; + int fd = vdev->vdev.fd; - if (ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info)) { + if (ioctl(fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info)) { error_report("vfio: Error getting ROM info: %m"); return; } @@ -1170,7 +943,7 @@ static void vfio_pci_load_rom(VFIODevice *vdev) memset(vdev->rom, 0xff, size); while (size) { - bytes = pread(vdev->fd, vdev->rom + off, size, vdev->rom_offset + off); + bytes = pread(fd, vdev->rom + off, size, vdev->rom_offset + off); if (bytes == 0) { break; } else if (bytes > 0) { @@ -1188,7 +961,7 @@ static void vfio_pci_load_rom(VFIODevice *vdev) static uint64_t vfio_rom_read(void *opaque, hwaddr addr, unsigned size) { - VFIODevice *vdev = opaque; + VFIOPCIDevice *vdev = opaque; uint64_t val = ((uint64_t)1 << (size * 8)) - 1; /* Load the ROM lazily when the guest tries to read it */ @@ -1217,7 +990,7 @@ static const MemoryRegionOps vfio_rom_ops = { .endianness = DEVICE_LITTLE_ENDIAN, }; -static bool vfio_blacklist_opt_rom(VFIODevice *vdev) +static bool vfio_blacklist_opt_rom(VFIOPCIDevice *vdev) { PCIDevice *pdev = &vdev->pdev; uint16_t vendor_id, device_id; @@ -1237,12 +1010,13 @@ static bool vfio_blacklist_opt_rom(VFIODevice *vdev) return false; } -static void vfio_pci_size_rom(VFIODevice *vdev) +static void vfio_pci_size_rom(VFIOPCIDevice *vdev) { uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK); off_t offset = vdev->config_offset + PCI_ROM_ADDRESS; DeviceState *dev = DEVICE(vdev); char name[32]; + int fd = vdev->vdev.fd; if (vdev->pdev.romfile || !vdev->pdev.rom_bar) { /* Since pci handles romfile, just print a message and return */ @@ -1261,10 +1035,10 @@ static void vfio_pci_size_rom(VFIODevice *vdev) * Use the same size ROM BAR as the physical device. The contents * will get filled in later when the guest tries to read it. 
*/ - if (pread(vdev->fd, &orig, 4, offset) != 4 || - pwrite(vdev->fd, &size, 4, offset) != 4 || - pread(vdev->fd, &size, 4, offset) != 4 || - pwrite(vdev->fd, &orig, 4, offset) != 4) { + if (pread(fd, &orig, 4, offset) != 4 || + pwrite(fd, &size, 4, offset) != 4 || + pread(fd, &size, 4, offset) != 4 || + pwrite(fd, &orig, 4, offset) != 4) { error_report("%s(%04x:%02x:%02x.%x) failed: %m", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function); @@ -1416,7 +1190,7 @@ static uint64_t vfio_generic_window_quirk_read(void *opaque, hwaddr addr, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; uint64_t data; if (vfio_flags_enabled(quirk->data.flags, quirk->data.read_flags) && @@ -1438,7 +1212,7 @@ static uint64_t vfio_generic_window_quirk_read(void *opaque, vdev->host.bus, vdev->host.slot, vdev->host.function, quirk->data.bar, addr, size, data); } else { - data = vfio_bar_read(&vdev->bars[quirk->data.bar], + data = vfio_region_read(&vdev->bars[quirk->data.bar].region, addr + quirk->data.base_offset, size); } @@ -1449,7 +1223,7 @@ static void vfio_generic_window_quirk_write(void *opaque, hwaddr addr, uint64_t data, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; if (ranges_overlap(addr, size, quirk->data.address_offset, quirk->data.address_size)) { @@ -1489,7 +1263,7 @@ static void vfio_generic_window_quirk_write(void *opaque, hwaddr addr, return; } - vfio_bar_write(&vdev->bars[quirk->data.bar], + vfio_region_write(&vdev->bars[quirk->data.bar].region, addr + quirk->data.base_offset, data, size); } @@ -1503,7 +1277,7 @@ static uint64_t vfio_generic_quirk_read(void *opaque, hwaddr addr, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK; hwaddr offset = quirk->data.address_match & 
~TARGET_PAGE_MASK; uint64_t data; @@ -1523,7 +1297,8 @@ static uint64_t vfio_generic_quirk_read(void *opaque, vdev->host.bus, vdev->host.slot, vdev->host.function, quirk->data.bar, addr + base, size, data); } else { - data = vfio_bar_read(&vdev->bars[quirk->data.bar], addr + base, size); + data = vfio_region_read(&vdev->bars[quirk->data.bar].region, + addr + base, size); } return data; @@ -1533,7 +1308,7 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr, uint64_t data, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK; hwaddr offset = quirk->data.address_match & ~TARGET_PAGE_MASK; @@ -1552,7 +1327,8 @@ static void vfio_generic_quirk_write(void *opaque, hwaddr addr, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, quirk->data.bar, addr + base, data, size); } else { - vfio_bar_write(&vdev->bars[quirk->data.bar], addr + base, data, size); + vfio_region_write(&vdev->bars[quirk->data.bar].region, addr + base, + data, size); } } @@ -1578,7 +1354,7 @@ static uint64_t vfio_ati_3c3_quirk_read(void *opaque, hwaddr addr, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; uint64_t data = vfio_pci_read_config(&vdev->pdev, PCI_BASE_ADDRESS_0 + (4 * 4) + 1, size); @@ -1592,7 +1368,7 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = { .endianness = DEVICE_LITTLE_ENDIAN, }; -static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev) +static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; @@ -1605,7 +1381,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev) * As long as the BAR is >= 256 bytes it will be aligned such that the * lower byte is always zero. Filter out anything else, if it exists. 
*/ - if (!vdev->bars[4].ioport || vdev->bars[4].size < 256) { + if (!vdev->bars[4].ioport || vdev->bars[4].region.size < 256) { return; } @@ -1635,7 +1411,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIODevice *vdev) * that only read-only access is provided, but we drop writes when the window * is enabled to config space nonetheless. */ -static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr) +static void vfio_probe_ati_bar4_window_quirk(VFIOPCIDevice *vdev, int nr) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; @@ -1658,7 +1434,7 @@ static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr) memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_window_quirk, quirk, "vfio-ati-bar4-window-quirk", 8); - memory_region_add_subregion_overlap(&vdev->bars[nr].mem, + memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, quirk->data.base_offset, &quirk->mem, 1); QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next); @@ -1671,7 +1447,7 @@ static void vfio_probe_ati_bar4_window_quirk(VFIODevice *vdev, int nr) /* * Trap the BAR2 MMIO window to config space as well. 
*/ -static void vfio_probe_ati_bar2_4000_quirk(VFIODevice *vdev, int nr) +static void vfio_probe_ati_bar2_4000_quirk(VFIOPCIDevice *vdev, int nr) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; @@ -1692,7 +1468,7 @@ static void vfio_probe_ati_bar2_4000_quirk(VFIODevice *vdev, int nr) memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk, "vfio-ati-bar2-4000-quirk", TARGET_PAGE_ALIGN(quirk->data.address_mask + 1)); - memory_region_add_subregion_overlap(&vdev->bars[nr].mem, + memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, quirk->data.address_match & TARGET_PAGE_MASK, &quirk->mem, 1); @@ -1739,7 +1515,7 @@ static uint64_t vfio_nvidia_3d0_quirk_read(void *opaque, hwaddr addr, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; PCIDevice *pdev = &vdev->pdev; uint64_t data = vfio_vga_read(&vdev->vga.region[QEMU_PCI_VGA_IO_HI], addr + quirk->data.base_offset, size); @@ -1758,7 +1534,7 @@ static void vfio_nvidia_3d0_quirk_write(void *opaque, hwaddr addr, uint64_t data, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; PCIDevice *pdev = &vdev->pdev; switch (quirk->data.flags) { @@ -1805,13 +1581,13 @@ static const MemoryRegionOps vfio_nvidia_3d0_quirk = { .endianness = DEVICE_LITTLE_ENDIAN, }; -static void vfio_vga_probe_nvidia_3d0_quirk(VFIODevice *vdev) +static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; if (pci_get_word(pdev->config + PCI_VENDOR_ID) != PCI_VENDOR_ID_NVIDIA || - !vdev->bars[1].size) { + !vdev->bars[1].region.size) { return; } @@ -1897,7 +1673,7 @@ static const MemoryRegionOps vfio_nvidia_bar5_window_quirk = { .endianness = DEVICE_LITTLE_ENDIAN, }; -static void vfio_probe_nvidia_bar5_window_quirk(VFIODevice *vdev, int nr) +static void vfio_probe_nvidia_bar5_window_quirk(VFIOPCIDevice *vdev, int nr) { PCIDevice *pdev = 
&vdev->pdev; VFIOQuirk *quirk; @@ -1919,7 +1695,8 @@ static void vfio_probe_nvidia_bar5_window_quirk(VFIODevice *vdev, int nr) memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_bar5_window_quirk, quirk, "vfio-nvidia-bar5-window-quirk", 16); - memory_region_add_subregion_overlap(&vdev->bars[nr].mem, 0, &quirk->mem, 1); + memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, 0, + &quirk->mem, 1); QLIST_INSERT_HEAD(&vdev->bars[nr].quirks, quirk, next); @@ -1932,7 +1709,7 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr, uint64_t data, unsigned size) { VFIOQuirk *quirk = opaque; - VFIODevice *vdev = quirk->vdev; + VFIOPCIDevice *vdev = quirk->vdev; PCIDevice *pdev = &vdev->pdev; hwaddr base = quirk->data.address_match & TARGET_PAGE_MASK; @@ -1946,7 +1723,8 @@ static void vfio_nvidia_88000_quirk_write(void *opaque, hwaddr addr, */ if ((pdev->cap_present & QEMU_PCI_CAP_MSI) && vfio_range_contained(addr, size, pdev->msi_cap, PCI_MSI_FLAGS)) { - vfio_bar_write(&vdev->bars[quirk->data.bar], addr + base, data, size); + vfio_region_write(&vdev->bars[quirk->data.bar].region, + addr + base, data, size); } } @@ -1965,7 +1743,7 @@ static const MemoryRegionOps vfio_nvidia_88000_quirk = { * * Here's offset 0x88000... 
*/ -static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr) +static void vfio_probe_nvidia_bar0_88000_quirk(VFIOPCIDevice *vdev, int nr) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; @@ -1985,7 +1763,7 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr) memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_nvidia_88000_quirk, quirk, "vfio-nvidia-bar0-88000-quirk", TARGET_PAGE_ALIGN(quirk->data.address_mask + 1)); - memory_region_add_subregion_overlap(&vdev->bars[nr].mem, + memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, quirk->data.address_match & TARGET_PAGE_MASK, &quirk->mem, 1); @@ -1999,7 +1777,7 @@ static void vfio_probe_nvidia_bar0_88000_quirk(VFIODevice *vdev, int nr) /* * And here's the same for BAR0 offset 0x1800... */ -static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr) +static void vfio_probe_nvidia_bar0_1800_quirk(VFIOPCIDevice *vdev, int nr) { PCIDevice *pdev = &vdev->pdev; VFIOQuirk *quirk; @@ -2011,7 +1789,8 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr) /* Log the chipset ID */ DPRINTF("Nvidia NV%02x\n", - (unsigned int)(vfio_bar_read(&vdev->bars[0], 0, 4) >> 20) & 0xff); + (unsigned int)(vfio_region_read(&vdev->bars[0].region, 0, 4) >> 20) + & 0xff); quirk = g_malloc0(sizeof(*quirk)); quirk->vdev = vdev; @@ -2023,7 +1802,7 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr) memory_region_init_io(&quirk->mem, OBJECT(vdev), &vfio_generic_quirk, quirk, "vfio-nvidia-bar0-1800-quirk", TARGET_PAGE_ALIGN(quirk->data.address_mask + 1)); - memory_region_add_subregion_overlap(&vdev->bars[nr].mem, + memory_region_add_subregion_overlap(&vdev->bars[nr].region.mem, quirk->data.address_match & TARGET_PAGE_MASK, &quirk->mem, 1); @@ -2043,13 +1822,13 @@ static void vfio_probe_nvidia_bar0_1800_quirk(VFIODevice *vdev, int nr) /* * Common quirk probe entry points. 
*/ -static void vfio_vga_quirk_setup(VFIODevice *vdev) +static void vfio_vga_quirk_setup(VFIOPCIDevice *vdev) { vfio_vga_probe_ati_3c3_quirk(vdev); vfio_vga_probe_nvidia_3d0_quirk(vdev); } -static void vfio_vga_quirk_teardown(VFIODevice *vdev) +static void vfio_vga_quirk_teardown(VFIOPCIDevice *vdev) { int i; @@ -2064,7 +1843,7 @@ static void vfio_vga_quirk_teardown(VFIODevice *vdev) } } -static void vfio_bar_quirk_setup(VFIODevice *vdev, int nr) +static void vfio_bar_quirk_setup(VFIOPCIDevice *vdev, int nr) { vfio_probe_ati_bar4_window_quirk(vdev, nr); vfio_probe_ati_bar2_4000_quirk(vdev, nr); @@ -2073,13 +1852,13 @@ static void vfio_bar_quirk_setup(VFIODevice *vdev, int nr) vfio_probe_nvidia_bar0_1800_quirk(vdev, nr); } -static void vfio_bar_quirk_teardown(VFIODevice *vdev, int nr) +static void vfio_bar_quirk_teardown(VFIOPCIDevice *vdev, int nr) { VFIOBAR *bar = &vdev->bars[nr]; while (!QLIST_EMPTY(&bar->quirks)) { VFIOQuirk *quirk = QLIST_FIRST(&bar->quirks); - memory_region_del_subregion(&bar->mem, &quirk->mem); + memory_region_del_subregion(&bar->region.mem, &quirk->mem); memory_region_destroy(&quirk->mem); QLIST_REMOVE(quirk, next); g_free(quirk); @@ -2091,7 +1870,7 @@ static void vfio_bar_quirk_teardown(VFIODevice *vdev, int nr) */ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val; memcpy(&emu_bits, vdev->emulated_config_bits + addr, len); @@ -2104,7 +1883,7 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) if (~emu_bits & (0xffffffffU >> (32 - len * 8))) { ssize_t ret; - ret = pread(vdev->fd, &phys_val, len, vdev->config_offset + addr); + ret = pread(vdev->vdev.fd, &phys_val, len, vdev->config_offset + addr); if (ret != len) { error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x) failed: %m", __func__, vdev->host.domain, 
vdev->host.bus, @@ -2126,15 +1905,15 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len) static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, uint32_t val, int len) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); uint32_t val_le = cpu_to_le32(val); - DPRINTF("%s(%04x:%02x:%02x.%x, @0x%x, 0x%x, len=0x%x)\n", __func__, - vdev->host.domain, vdev->host.bus, vdev->host.slot, - vdev->host.function, addr, val, len); + DPRINTF("%s(%s, @0x%x, 0x%x, len=0x%x)\n", __func__, vdev->vdev.name, + addr, val, len); /* Write everything to VFIO, let it filter out what we can't write */ - if (pwrite(vdev->fd, &val_le, len, vdev->config_offset + addr) != len) { + if (pwrite(vdev->vdev.fd, &val_le, len, + vdev->config_offset + addr) != len) { error_report("%s(%04x:%02x:%02x.%x, 0x%x, 0x%x, 0x%x) failed: %m", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, addr, val, len); @@ -2180,186 +1959,9 @@ static void vfio_pci_write_config(PCIDevice *pdev, uint32_t addr, } /* - * DMA - Mapping and unmapping for the "type1" IOMMU interface used on x86 - */ -static int vfio_dma_unmap(VFIOContainer *container, - hwaddr iova, ram_addr_t size) -{ - struct vfio_iommu_type1_dma_unmap unmap = { - .argsz = sizeof(unmap), - .flags = 0, - .iova = iova, - .size = size, - }; - - if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) { - DPRINTF("VFIO_UNMAP_DMA: %d\n", -errno); - return -errno; - } - - return 0; -} - -static int vfio_dma_map(VFIOContainer *container, hwaddr iova, - ram_addr_t size, void *vaddr, bool readonly) -{ - struct vfio_iommu_type1_dma_map map = { - .argsz = sizeof(map), - .flags = VFIO_DMA_MAP_FLAG_READ, - .vaddr = (__u64)(uintptr_t)vaddr, - .iova = iova, - .size = size, - }; - - if (!readonly) { - map.flags |= VFIO_DMA_MAP_FLAG_WRITE; - } - - /* - * Try the mapping, if it fails with EBUSY, unmap the region and try - * again. 
This shouldn't be necessary, but we sometimes see it in - * the the VGA ROM space. - */ - if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 || - (errno == EBUSY && vfio_dma_unmap(container, iova, size) == 0 && - ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) { - return 0; - } - - DPRINTF("VFIO_MAP_DMA: %d\n", -errno); - return -errno; -} - -static bool vfio_listener_skipped_section(MemoryRegionSection *section) -{ - return !memory_region_is_ram(section->mr) || - /* - * Sizing an enabled 64-bit BAR can cause spurious mappings to - * addresses in the upper part of the 64-bit address space. These - * are never accessed by the CPU and beyond the address width of - * some IOMMU hardware. TODO: VFIO should tell us the IOMMU width. - */ - section->offset_within_address_space & (1ULL << 63); -} - -static void vfio_listener_region_add(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, - iommu_data.type1.listener); - hwaddr iova, end; - void *vaddr; - int ret; - - assert(!memory_region_is_iommu(section->mr)); - - if (vfio_listener_skipped_section(section)) { - DPRINTF("SKIPPING region_add %"HWADDR_PRIx" - %"PRIx64"\n", - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != - (section->offset_within_region & ~TARGET_PAGE_MASK))) { - error_report("%s received unaligned region", __func__); - return; - } - - iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); - end = (section->offset_within_address_space + int128_get64(section->size)) & - TARGET_PAGE_MASK; - - if (iova >= end) { - return; - } - - vaddr = memory_region_get_ram_ptr(section->mr) + - section->offset_within_region + - (iova - section->offset_within_address_space); - - DPRINTF("region_add %"HWADDR_PRIx" - %"HWADDR_PRIx" [%p]\n", - iova, end - 
1, vaddr); - - memory_region_ref(section->mr); - ret = vfio_dma_map(container, iova, end - iova, vaddr, section->readonly); - if (ret) { - error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx", %p) = %d (%m)", - container, iova, end - iova, vaddr, ret); - - /* - * On the initfn path, store the first error in the container so we - * can gracefully fail. Runtime, there's not much we can do other - * than throw a hardware error. - */ - if (!container->iommu_data.type1.initialized) { - if (!container->iommu_data.type1.error) { - container->iommu_data.type1.error = ret; - } - } else { - hw_error("vfio: DMA mapping failed, unable to continue"); - } - } -} - -static void vfio_listener_region_del(MemoryListener *listener, - MemoryRegionSection *section) -{ - VFIOContainer *container = container_of(listener, VFIOContainer, - iommu_data.type1.listener); - hwaddr iova, end; - int ret; - - if (vfio_listener_skipped_section(section)) { - DPRINTF("SKIPPING region_del %"HWADDR_PRIx" - %"PRIx64"\n", - section->offset_within_address_space, - section->offset_within_address_space + - int128_get64(int128_sub(section->size, int128_one()))); - return; - } - - if (unlikely((section->offset_within_address_space & ~TARGET_PAGE_MASK) != - (section->offset_within_region & ~TARGET_PAGE_MASK))) { - error_report("%s received unaligned region", __func__); - return; - } - - iova = TARGET_PAGE_ALIGN(section->offset_within_address_space); - end = (section->offset_within_address_space + int128_get64(section->size)) & - TARGET_PAGE_MASK; - - if (iova >= end) { - return; - } - - DPRINTF("region_del %"HWADDR_PRIx" - %"HWADDR_PRIx"\n", - iova, end - 1); - - ret = vfio_dma_unmap(container, iova, end - iova); - memory_region_unref(section->mr); - if (ret) { - error_report("vfio_dma_unmap(%p, 0x%"HWADDR_PRIx", " - "0x%"HWADDR_PRIx") = %d (%m)", - container, iova, end - iova, ret); - } -} - -static MemoryListener vfio_memory_listener = { - .region_add = vfio_listener_region_add, - 
.region_del = vfio_listener_region_del, -}; - -static void vfio_listener_release(VFIOContainer *container) -{ - memory_listener_unregister(&container->iommu_data.type1.listener); -} - -/* * Interrupt setup */ -static void vfio_disable_interrupts(VFIODevice *vdev) +static void vfio_disable_interrupts(VFIOPCIDevice *vdev) { switch (vdev->interrupt) { case VFIO_INT_INTx: @@ -2374,13 +1976,13 @@ static void vfio_disable_interrupts(VFIODevice *vdev) } } -static int vfio_setup_msi(VFIODevice *vdev, int pos) +static int vfio_setup_msi(VFIOPCIDevice *vdev, int pos) { uint16_t ctrl; bool msi_64bit, msi_maskbit; int ret, entries; - if (pread(vdev->fd, &ctrl, sizeof(ctrl), + if (pread(vdev->vdev.fd, &ctrl, sizeof(ctrl), vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) { return -errno; } @@ -2414,28 +2016,29 @@ static int vfio_setup_msi(VFIODevice *vdev, int pos) * need to first look for where the MSI-X table lives. So we * unfortunately split MSI-X setup across two functions. */ -static int vfio_early_setup_msix(VFIODevice *vdev) +static int vfio_early_setup_msix(VFIOPCIDevice *vdev) { uint8_t pos; uint16_t ctrl; uint32_t table, pba; + int fd = vdev->vdev.fd; pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX); if (!pos) { return 0; } - if (pread(vdev->fd, &ctrl, sizeof(ctrl), + if (pread(fd, &ctrl, sizeof(ctrl), vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) { return -errno; } - if (pread(vdev->fd, &table, sizeof(table), + if (pread(fd, &table, sizeof(table), vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) { return -errno; } - if (pread(vdev->fd, &pba, sizeof(pba), + if (pread(fd, &pba, sizeof(pba), vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) { return -errno; } @@ -2460,14 +2063,14 @@ static int vfio_early_setup_msix(VFIODevice *vdev) return 0; } -static int vfio_setup_msix(VFIODevice *vdev, int pos) +static int vfio_setup_msix(VFIOPCIDevice *vdev, int pos) { int ret; ret = msix_init(&vdev->pdev, vdev->msix->entries, - 
&vdev->bars[vdev->msix->table_bar].mem, + &vdev->bars[vdev->msix->table_bar].region.mem, vdev->msix->table_bar, vdev->msix->table_offset, - &vdev->bars[vdev->msix->pba_bar].mem, + &vdev->bars[vdev->msix->pba_bar].region.mem, vdev->msix->pba_bar, vdev->msix->pba_offset, pos); if (ret < 0) { if (ret == -ENOTSUP) { @@ -2480,102 +2083,64 @@ static int vfio_setup_msix(VFIODevice *vdev, int pos) return 0; } -static void vfio_teardown_msi(VFIODevice *vdev) +static void vfio_teardown_msi(VFIOPCIDevice *vdev) { msi_uninit(&vdev->pdev); if (vdev->msix) { - msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].mem, - &vdev->bars[vdev->msix->pba_bar].mem); + msix_uninit(&vdev->pdev, &vdev->bars[vdev->msix->table_bar].region.mem, + &vdev->bars[vdev->msix->pba_bar].region.mem); } } /* * Resource setup */ -static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled) +static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled) { int i; for (i = 0; i < PCI_ROM_SLOT; i++) { VFIOBAR *bar = &vdev->bars[i]; - if (!bar->size) { + if (!bar->region.size) { continue; } - memory_region_set_enabled(&bar->mmap_mem, enabled); + memory_region_set_enabled(&bar->region.mmap_mem, enabled); if (vdev->msix && vdev->msix->table_bar == i) { memory_region_set_enabled(&vdev->msix->mmap_mem, enabled); } } } -static void vfio_unmap_bar(VFIODevice *vdev, int nr) +static void vfio_unmap_bar(VFIOPCIDevice *vdev, int nr) { VFIOBAR *bar = &vdev->bars[nr]; - if (!bar->size) { + if (!bar->region.size) { return; } vfio_bar_quirk_teardown(vdev, nr); - memory_region_del_subregion(&bar->mem, &bar->mmap_mem); - munmap(bar->mmap, memory_region_size(&bar->mmap_mem)); - memory_region_destroy(&bar->mmap_mem); + memory_region_del_subregion(&bar->region.mem, &bar->region.mmap_mem); + munmap(bar->region.mmap, memory_region_size(&bar->region.mmap_mem)); + memory_region_destroy(&bar->region.mmap_mem); if (vdev->msix && vdev->msix->table_bar == nr) { - memory_region_del_subregion(&bar->mem, 
&vdev->msix->mmap_mem); + memory_region_del_subregion(&bar->region.mem, &vdev->msix->mmap_mem); munmap(vdev->msix->mmap, memory_region_size(&vdev->msix->mmap_mem)); memory_region_destroy(&vdev->msix->mmap_mem); } - memory_region_destroy(&bar->mem); + memory_region_destroy(&bar->region.mem); } -static int vfio_mmap_bar(VFIODevice *vdev, VFIOBAR *bar, - MemoryRegion *mem, MemoryRegion *submem, - void **map, size_t size, off_t offset, - const char *name) -{ - int ret = 0; - - if (VFIO_ALLOW_MMAP && size && bar->flags & VFIO_REGION_INFO_FLAG_MMAP) { - int prot = 0; - - if (bar->flags & VFIO_REGION_INFO_FLAG_READ) { - prot |= PROT_READ; - } - - if (bar->flags & VFIO_REGION_INFO_FLAG_WRITE) { - prot |= PROT_WRITE; - } - - *map = mmap(NULL, size, prot, MAP_SHARED, - bar->fd, bar->fd_offset + offset); - if (*map == MAP_FAILED) { - *map = NULL; - ret = -errno; - goto empty_region; - } - - memory_region_init_ram_ptr(submem, OBJECT(vdev), name, size, *map); - } else { -empty_region: - /* Create a zero sized sub-region to make cleanup easy. 
*/ - memory_region_init(submem, OBJECT(vdev), name, 0); - } - - memory_region_add_subregion(mem, offset, submem); - - return ret; -} - -static void vfio_map_bar(VFIODevice *vdev, int nr) +static void vfio_map_bar(VFIOPCIDevice *vdev, int nr) { VFIOBAR *bar = &vdev->bars[nr]; - unsigned size = bar->size; + unsigned size = bar->region.size; char name[64]; uint32_t pci_bar; uint8_t type; @@ -2591,7 +2156,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr) vdev->host.function, nr); /* Determine what type of BAR this is for registration */ - ret = pread(vdev->fd, &pci_bar, sizeof(pci_bar), + ret = pread(vdev->vdev.fd, &pci_bar, sizeof(pci_bar), vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr)); if (ret != sizeof(pci_bar)) { error_report("vfio: Failed to read BAR %d (%m)", nr); @@ -2605,9 +2170,9 @@ static void vfio_map_bar(VFIODevice *vdev, int nr) ~PCI_BASE_ADDRESS_MEM_MASK); /* A "slow" read/write mapping underlies all BARs */ - memory_region_init_io(&bar->mem, OBJECT(vdev), &vfio_bar_ops, - bar, name, size); - pci_register_bar(&vdev->pdev, nr, type, &bar->mem); + memory_region_init_io(&bar->region.mem, OBJECT(vdev), &vfio_bar_ops, + &bar->region, name, size); + pci_register_bar(&vdev->pdev, nr, type, &bar->region.mem); /* * We can't mmap areas overlapping the MSIX vector table, so we @@ -2618,8 +2183,9 @@ static void vfio_map_bar(VFIODevice *vdev, int nr) } strncat(name, " mmap", sizeof(name) - strlen(name) - 1); - if (vfio_mmap_bar(vdev, bar, &bar->mem, - &bar->mmap_mem, &bar->mmap, size, 0, name)) { + if (vfio_mmap_region(OBJECT(vdev), &bar->region, &bar->region.mem, + &bar->region.mmap_mem, &bar->region.mmap, + size, 0, name)) { error_report("%s unsupported. Performance may be slow", name); } @@ -2629,11 +2195,12 @@ static void vfio_map_bar(VFIODevice *vdev, int nr) start = HOST_PAGE_ALIGN(vdev->msix->table_offset + (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE)); - size = start < bar->size ? bar->size - start : 0; + size = start < bar->region.size ? 
bar->region.size - start : 0; strncat(name, " msix-hi", sizeof(name) - strlen(name) - 1); /* VFIOMSIXInfo contains another MemoryRegion for this mapping */ - if (vfio_mmap_bar(vdev, bar, &bar->mem, &vdev->msix->mmap_mem, - &vdev->msix->mmap, size, start, name)) { + if (vfio_mmap_region(OBJECT(vdev), &bar->region, + &bar->region.mem, &vdev->msix->mmap_mem, + &vdev->msix->mmap, size, start, name)) { error_report("%s unsupported. Performance may be slow", name); } } @@ -2641,7 +2208,7 @@ static void vfio_map_bar(VFIODevice *vdev, int nr) vfio_bar_quirk_setup(vdev, nr); } -static void vfio_map_bars(VFIODevice *vdev) +static void vfio_map_bars(VFIOPCIDevice *vdev) { int i; @@ -2673,7 +2240,7 @@ static void vfio_map_bars(VFIODevice *vdev) } } -static void vfio_unmap_bars(VFIODevice *vdev) +static void vfio_unmap_bars(VFIOPCIDevice *vdev) { int i; @@ -2712,7 +2279,7 @@ static void vfio_set_word_bits(uint8_t *buf, uint16_t val, uint16_t mask) pci_set_word(buf, (pci_get_word(buf) & ~mask) | val); } -static void vfio_add_emulated_word(VFIODevice *vdev, int pos, +static void vfio_add_emulated_word(VFIOPCIDevice *vdev, int pos, uint16_t val, uint16_t mask) { vfio_set_word_bits(vdev->pdev.config + pos, val, mask); @@ -2725,7 +2292,7 @@ static void vfio_set_long_bits(uint8_t *buf, uint32_t val, uint32_t mask) pci_set_long(buf, (pci_get_long(buf) & ~mask) | val); } -static void vfio_add_emulated_long(VFIODevice *vdev, int pos, +static void vfio_add_emulated_long(VFIOPCIDevice *vdev, int pos, uint32_t val, uint32_t mask) { vfio_set_long_bits(vdev->pdev.config + pos, val, mask); @@ -2733,7 +2300,7 @@ static void vfio_add_emulated_long(VFIODevice *vdev, int pos, vfio_set_long_bits(vdev->emulated_config_bits + pos, mask, mask); } -static int vfio_setup_pcie_cap(VFIODevice *vdev, int pos, uint8_t size) +static int vfio_setup_pcie_cap(VFIOPCIDevice *vdev, int pos, uint8_t size) { uint16_t flags; uint8_t type; @@ -2825,7 +2392,7 @@ static int vfio_setup_pcie_cap(VFIODevice *vdev, int 
pos, uint8_t size) return pos; } -static void vfio_check_pcie_flr(VFIODevice *vdev, uint8_t pos) +static void vfio_check_pcie_flr(VFIOPCIDevice *vdev, uint8_t pos) { uint32_t cap = pci_get_long(vdev->pdev.config + pos + PCI_EXP_DEVCAP); @@ -2837,7 +2404,7 @@ static void vfio_check_pcie_flr(VFIODevice *vdev, uint8_t pos) } } -static void vfio_check_pm_reset(VFIODevice *vdev, uint8_t pos) +static void vfio_check_pm_reset(VFIOPCIDevice *vdev, uint8_t pos) { uint16_t csr = pci_get_word(vdev->pdev.config + pos + PCI_PM_CTRL); @@ -2849,7 +2416,7 @@ static void vfio_check_pm_reset(VFIODevice *vdev, uint8_t pos) } } -static void vfio_check_af_flr(VFIODevice *vdev, uint8_t pos) +static void vfio_check_af_flr(VFIOPCIDevice *vdev, uint8_t pos) { uint8_t cap = pci_get_byte(vdev->pdev.config + pos + PCI_AF_CAP); @@ -2861,7 +2428,7 @@ static void vfio_check_af_flr(VFIODevice *vdev, uint8_t pos) } } -static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos) +static int vfio_add_std_cap(VFIOPCIDevice *vdev, uint8_t pos) { PCIDevice *pdev = &vdev->pdev; uint8_t cap_id, next, size; @@ -2936,7 +2503,7 @@ static int vfio_add_std_cap(VFIODevice *vdev, uint8_t pos) return 0; } -static int vfio_add_capabilities(VFIODevice *vdev) +static int vfio_add_capabilities(VFIOPCIDevice *vdev) { PCIDevice *pdev = &vdev->pdev; @@ -2948,7 +2515,7 @@ static int vfio_add_capabilities(VFIODevice *vdev) return vfio_add_std_cap(vdev, pdev->config[PCI_CAPABILITY_LIST]); } -static void vfio_pci_pre_reset(VFIODevice *vdev) +static void vfio_pci_pre_reset(VFIOPCIDevice *vdev) { PCIDevice *pdev = &vdev->pdev; uint16_t cmd; @@ -2985,7 +2552,7 @@ static void vfio_pci_pre_reset(VFIODevice *vdev) vfio_pci_write_config(pdev, PCI_COMMAND, cmd, 2); } -static void vfio_pci_post_reset(VFIODevice *vdev) +static void vfio_pci_post_reset(VFIOPCIDevice *vdev) { vfio_enable_intx(vdev); } @@ -2997,7 +2564,7 @@ static bool vfio_pci_host_match(PCIHostDeviceAddress *host1, host1->slot == host2->slot && host1->function == 
host2->function); } -static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) +static int vfio_pci_hot_reset(VFIOPCIDevice *vdev, bool single) { VFIOGroup *group; struct vfio_pci_hot_reset_info *info; @@ -3006,18 +2573,19 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) int32_t *fds; int ret, i, count; bool multi = false; + int fd = vdev->vdev.fd; DPRINTF("%s(%04x:%02x:%02x.%x) %s\n", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function, single ? "one" : "multi"); vfio_pci_pre_reset(vdev); - vdev->needs_reset = false; + vdev->vdev.needs_reset = false; info = g_malloc0(sizeof(*info)); info->argsz = sizeof(*info); - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info); + ret = ioctl(fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info); if (ret && errno != ENOSPC) { ret = -errno; if (!vdev->has_pm_reset) { @@ -3033,7 +2601,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) info->argsz = sizeof(*info) + (count * sizeof(*devices)); devices = &info->devices[0]; - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info); + ret = ioctl(fd, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO, info); if (ret) { ret = -errno; error_report("vfio: hot reset info failed: %m"); @@ -3048,6 +2616,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) for (i = 0; i < info->count; i++) { PCIHostDeviceAddress host; VFIODevice *tmp; + VFIOPCIDevice *vpcidev; host.domain = devices[i].segment; host.bus = devices[i].bus; @@ -3080,7 +2649,11 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) /* Prep dependent devices for reset and clear our marker. 
*/ QLIST_FOREACH(tmp, &group->device_list, next) { - if (vfio_pci_host_match(&host, &tmp->host)) { + if (tmp->type != VFIO_DEVICE_TYPE_PCI) { + continue; + } + vpcidev = container_of(tmp, VFIOPCIDevice, vdev); + if (vfio_pci_host_match(&host, &vpcidev->host)) { if (single) { DPRINTF("vfio: found another in-use device " "%04x:%02x:%02x.%x\n", host.domain, host.bus, @@ -3088,8 +2661,8 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) ret = -EINVAL; goto out_single; } - vfio_pci_pre_reset(tmp); - tmp->needs_reset = false; + vfio_pci_pre_reset(vpcidev); + vpcidev->vdev.needs_reset = false; multi = true; break; } @@ -3128,7 +2701,7 @@ static int vfio_pci_hot_reset(VFIODevice *vdev, bool single) } /* Bus reset! */ - ret = ioctl(vdev->fd, VFIO_DEVICE_PCI_HOT_RESET, reset); + ret = ioctl(fd, VFIO_DEVICE_PCI_HOT_RESET, reset); g_free(reset); DPRINTF("%04x:%02x:%02x.%x hot reset: %s\n", vdev->host.domain, @@ -3140,6 +2713,7 @@ out: for (i = 0; i < info->count; i++) { PCIHostDeviceAddress host; VFIODevice *tmp; + VFIOPCIDevice *vpcidev; host.domain = devices[i].segment; host.bus = devices[i].bus; @@ -3161,8 +2735,12 @@ out: } QLIST_FOREACH(tmp, &group->device_list, next) { - if (vfio_pci_host_match(&host, &tmp->host)) { - vfio_pci_post_reset(tmp); + if (tmp->type != VFIO_DEVICE_TYPE_PCI) { + continue; + } + vpcidev = container_of(tmp, VFIOPCIDevice, vdev); + if (vfio_pci_host_match(&host, &vpcidev->host)) { + vfio_pci_post_reset(vpcidev); break; } } @@ -3178,7 +2756,7 @@ out_single: * We want to differentiate hot reset of mulitple in-use devices vs hot reset * of a single in-use device. VFIO_DEVICE_RESET will already handle the case * of doing hot resets when there is only a single device per bus. The in-use - * here refers to how many VFIODevices are affected. A hot reset that affects + * here refers to how many VFIOPCIDevices are affected. 
A hot reset that affects * multiple devices, but only a single in-use device, means that we can call * it from our bus ->reset() callback since the extent is effectively a single * device. This allows us to make use of it in the hotplug path. When there @@ -3189,354 +2767,99 @@ out_single: * _one() will only do a hot reset for the one in-use devices case, calling * _multi() will do nothing if a _one() would have been sufficient. */ -static int vfio_pci_hot_reset_one(VFIODevice *vdev) +static int vfio_pci_hot_reset_one(VFIOPCIDevice *vdev) { return vfio_pci_hot_reset(vdev, true); } static int vfio_pci_hot_reset_multi(VFIODevice *vdev) { - return vfio_pci_hot_reset(vdev, false); -} - -static void vfio_pci_reset_handler(void *opaque) -{ - VFIOGroup *group; - VFIODevice *vdev; - - QLIST_FOREACH(group, &group_list, next) { - QLIST_FOREACH(vdev, &group->device_list, next) { - if (!vdev->reset_works || (!vdev->has_flr && vdev->has_pm_reset)) { - vdev->needs_reset = true; - } - } - } - - QLIST_FOREACH(group, &group_list, next) { - QLIST_FOREACH(vdev, &group->device_list, next) { - if (vdev->needs_reset) { - vfio_pci_hot_reset_multi(vdev); - } - } - } -} - -static void vfio_kvm_device_add_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - .attr = KVM_DEV_VFIO_GROUP_ADD, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (!kvm_enabled()) { - return; - } - - if (vfio_kvm_device_fd < 0) { - struct kvm_create_device cd = { - .type = KVM_DEV_TYPE_VFIO, - }; - - if (kvm_vm_ioctl(kvm_state, KVM_CREATE_DEVICE, &cd)) { - DPRINTF("KVM_CREATE_DEVICE: %m\n"); - return; - } - - vfio_kvm_device_fd = cd.fd; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to add group %d to KVM VFIO device: %m", - group->groupid); - } -#endif -} - -static void vfio_kvm_device_del_group(VFIOGroup *group) -{ -#ifdef CONFIG_KVM - struct kvm_device_attr attr = { - .group = KVM_DEV_VFIO_GROUP, - 
.attr = KVM_DEV_VFIO_GROUP_DEL, - .addr = (uint64_t)(unsigned long)&group->fd, - }; - - if (vfio_kvm_device_fd < 0) { - return; - } - - if (ioctl(vfio_kvm_device_fd, KVM_SET_DEVICE_ATTR, &attr)) { - error_report("Failed to remove group %d from KVM VFIO device: %m", - group->groupid); - } -#endif + VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev); + return vfio_pci_hot_reset(vpcidev, false); } -static int vfio_connect_container(VFIOGroup *group) +static bool vfio_pci_compute_needs_reset(VFIODevice *vdev) { - VFIOContainer *container; - int ret, fd; - - if (group->container) { - return 0; - } - - QLIST_FOREACH(container, &container_list, next) { - if (!ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &container->fd)) { - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - return 0; - } - } - - fd = qemu_open("/dev/vfio/vfio", O_RDWR); - if (fd < 0) { - error_report("vfio: failed to open /dev/vfio/vfio: %m"); - return -errno; - } - - ret = ioctl(fd, VFIO_GET_API_VERSION); - if (ret != VFIO_API_VERSION) { - error_report("vfio: supported vfio version: %d, " - "reported version: %d", VFIO_API_VERSION, ret); - close(fd); - return -EINVAL; - } - - container = g_malloc0(sizeof(*container)); - container->fd = fd; - - if (ioctl(fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU)) { - ret = ioctl(group->fd, VFIO_GROUP_SET_CONTAINER, &fd); - if (ret) { - error_report("vfio: failed to set group container: %m"); - g_free(container); - close(fd); - return -errno; - } - - ret = ioctl(fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU); - if (ret) { - error_report("vfio: failed to set iommu for container: %m"); - g_free(container); - close(fd); - return -errno; - } - - container->iommu_data.type1.listener = vfio_memory_listener; - container->iommu_data.release = vfio_listener_release; - - memory_listener_register(&container->iommu_data.type1.listener, - &address_space_memory); - - if (container->iommu_data.type1.error) { - ret = 
container->iommu_data.type1.error; - vfio_listener_release(container); - g_free(container); - close(fd); - error_report("vfio: memory listener initialization failed for container"); - return ret; - } - - container->iommu_data.type1.initialized = true; - - } else { - error_report("vfio: No available IOMMU models"); - g_free(container); - close(fd); - return -EINVAL; - } - - QLIST_INIT(&container->group_list); - QLIST_INSERT_HEAD(&container_list, container, next); - - group->container = container; - QLIST_INSERT_HEAD(&container->group_list, group, container_next); - - return 0; -} - -static void vfio_disconnect_container(VFIOGroup *group) -{ - VFIOContainer *container = group->container; - - if (ioctl(group->fd, VFIO_GROUP_UNSET_CONTAINER, &container->fd)) { - error_report("vfio: error disconnecting group %d from container", - group->groupid); - } - - QLIST_REMOVE(group, container_next); - group->container = NULL; - - if (QLIST_EMPTY(&container->group_list)) { - if (container->iommu_data.release) { - container->iommu_data.release(container); - } - QLIST_REMOVE(container, next); - DPRINTF("vfio_disconnect_container: close container->fd\n"); - close(container->fd); - g_free(container); + VFIOPCIDevice *vpcidev = container_of(vdev, VFIOPCIDevice, vdev); + if (!vdev->reset_works || (!vpcidev->has_flr && vpcidev->has_pm_reset)) { + vdev->needs_reset = true; } + return vdev->needs_reset; } -static VFIOGroup *vfio_get_group(int groupid) +static int vfio_pci_check_device(VFIODevice *vbasedev) { - VFIOGroup *group; - char path[32]; - struct vfio_group_status status = { .argsz = sizeof(status) }; - QLIST_FOREACH(group, &group_list, next) { - if (group->groupid == groupid) { - return group; - } - } - - group = g_malloc0(sizeof(*group)); - - snprintf(path, sizeof(path), "/dev/vfio/%d", groupid); - group->fd = qemu_open(path, O_RDWR); - if (group->fd < 0) { - error_report("vfio: error opening %s: %m", path); - g_free(group); - return NULL; - } - - if (ioctl(group->fd, 
VFIO_GROUP_GET_STATUS, &status)) { - error_report("vfio: error getting group status: %m"); - close(group->fd); - g_free(group); - return NULL; - } - - if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) { - error_report("vfio: error, group %d is not viable, please ensure " - "all devices within the iommu_group are bound to their " - "vfio bus driver.", groupid); - close(group->fd); - g_free(group); - return NULL; - } - - group->groupid = groupid; - QLIST_INIT(&group->device_list); - - if (vfio_connect_container(group)) { - error_report("vfio: failed to setup container for group %d", groupid); - close(group->fd); - g_free(group); - return NULL; + if (vbasedev->num_regions < VFIO_PCI_CONFIG_REGION_INDEX + 1) { + error_report("vfio: unexpected number of io regions %u", + vbasedev->num_regions); + goto error; } - if (QLIST_EMPTY(&group_list)) { - qemu_register_reset(vfio_pci_reset_handler, NULL); + if (vbasedev->num_irqs < VFIO_PCI_MSIX_IRQ_INDEX + 1) { + error_report("vfio: unexpected number of irqs %u", vbasedev->num_irqs); + goto error; } - QLIST_INSERT_HEAD(&group_list, group, next); - - vfio_kvm_device_add_group(group); - - return group; +error: + vfio_put_base_device(vbasedev); + return -errno; } -static void vfio_put_group(VFIOGroup *group) -{ - if (!QLIST_EMPTY(&group->device_list)) { - return; - } - - vfio_kvm_device_del_group(group); - vfio_disconnect_container(group); - QLIST_REMOVE(group, next); - DPRINTF("vfio_put_group: close group->fd\n"); - close(group->fd); - g_free(group); - - if (QLIST_EMPTY(&group_list)) { - qemu_unregister_reset(vfio_pci_reset_handler, NULL); - } -} -static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) +static int vfio_pci_get_device_regions(VFIODevice *vbasedev) { - struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) }; struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) }; - struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) }; - int ret, i; - - ret = ioctl(group->fd, 
VFIO_GROUP_GET_DEVICE_FD, name); - if (ret < 0) { - error_report("vfio: error getting device %s from group %d: %m", - name, group->groupid); - error_printf("Verify all devices in group %d are bound to vfio-pci " - "or pci-stub and not already in use\n", group->groupid); - return ret; - } - - vdev->fd = ret; - vdev->group = group; - QLIST_INSERT_HEAD(&group->device_list, vdev, next); - - /* Sanity check device */ - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_INFO, &dev_info); - if (ret) { - error_report("vfio: error getting device info: %m"); - goto error; - } - - DPRINTF("Device %s flags: %u, regions: %u, irgs: %u\n", name, - dev_info.flags, dev_info.num_regions, dev_info.num_irqs); - - if (!(dev_info.flags & VFIO_DEVICE_FLAGS_PCI)) { - error_report("vfio: Um, this isn't a PCI device"); - goto error; - } - - vdev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET); - - if (dev_info.num_regions < VFIO_PCI_CONFIG_REGION_INDEX + 1) { - error_report("vfio: unexpected number of io regions %u", - dev_info.num_regions); - goto error; + int i, ret; + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vdev); + + vbasedev->regions = g_malloc0(sizeof(VFIORegion *) * + vbasedev->num_regions); + if (!vbasedev->regions) { + error_report("vfio: Error allocating space for %d regions", + vbasedev->num_regions); + ret = -ENOMEM; + goto error; } - if (dev_info.num_irqs < VFIO_PCI_MSIX_IRQ_INDEX + 1) { - error_report("vfio: unexpected number of irqs %u", dev_info.num_irqs); - goto error; + for (i = 0; i < PCI_NUM_REGIONS; i++) { + vbasedev->regions[i] = &vdev->bars[i].region; } for (i = VFIO_PCI_BAR0_REGION_INDEX; i < VFIO_PCI_ROM_REGION_INDEX; i++) { reg_info.index = i; - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); if (ret) { error_report("vfio: Error getting region %d info: %m", i); goto error; } - DPRINTF("Device %s region %d:\n", name, i); + DPRINTF("Device %s region %d:\n", 
vbasedev->name, i); DPRINTF(" size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n", (unsigned long)reg_info.size, (unsigned long)reg_info.offset, (unsigned long)reg_info.flags); - vdev->bars[i].flags = reg_info.flags; - vdev->bars[i].size = reg_info.size; - vdev->bars[i].fd_offset = reg_info.offset; - vdev->bars[i].fd = vdev->fd; - vdev->bars[i].nr = i; + vbasedev->regions[i]->flags = reg_info.flags; + vbasedev->regions[i]->size = reg_info.size; + vbasedev->regions[i]->fd_offset = reg_info.offset; + vbasedev->regions[i]->fd = vbasedev->fd; + vbasedev->regions[i]->nr = i; + vbasedev->regions[i]->vdev = vbasedev; + QLIST_INIT(&vdev->bars[i].quirks); } + reg_info.index = VFIO_PCI_CONFIG_REGION_INDEX; - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, ®_info); if (ret) { error_report("vfio: Error getting config info: %m"); goto error; } - DPRINTF("Device %s config:\n", name); + DPRINTF("Device %s config:\n", vbasedev->name); DPRINTF(" size: 0x%lx, offset: 0x%lx, flags: 0x%lx\n", (unsigned long)reg_info.size, (unsigned long)reg_info.offset, (unsigned long)reg_info.flags); @@ -3548,13 +2871,13 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) vdev->config_offset = reg_info.offset; if ((vdev->features & VFIO_FEATURE_ENABLE_VGA) && - dev_info.num_regions > VFIO_PCI_VGA_REGION_INDEX) { + vbasedev->num_regions > VFIO_PCI_VGA_REGION_INDEX) { struct vfio_region_info vga_info = { .argsz = sizeof(vga_info), .index = VFIO_PCI_VGA_REGION_INDEX, }; - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_REGION_INFO, &vga_info); + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &vga_info); if (ret) { error_report( "vfio: Device does not support requested feature x-vga"); @@ -3571,7 +2894,7 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) } vdev->vga.fd_offset = vga_info.offset; - vdev->vga.fd = vdev->fd; + vdev->vga.fd = vbasedev->fd; 
vdev->vga.region[QEMU_PCI_VGA_MEM].offset = QEMU_PCI_VGA_MEM_BASE; vdev->vga.region[QEMU_PCI_VGA_MEM].nr = QEMU_PCI_VGA_MEM; @@ -3587,9 +2910,26 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) vdev->has_vga = true; } + + return ret; + +error: + if (ret) { + vfio_put_base_device(vbasedev); + } + return ret; + +} + +static int vfio_pci_get_device_interrupts(VFIODevice *vbasedev) +{ + VFIOPCIDevice *vdev = container_of(vbasedev, VFIOPCIDevice, vdev); + int ret; + + struct vfio_irq_info irq_info = { .argsz = sizeof(irq_info) }; irq_info.index = VFIO_PCI_ERR_IRQ_INDEX; - ret = ioctl(vdev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info); if (ret) { /* This can fail for an old kernel or legacy PCI dev */ DPRINTF("VFIO_DEVICE_GET_IRQ_INFO failure: %m\n"); @@ -3597,27 +2937,18 @@ static int vfio_get_device(VFIOGroup *group, const char *name, VFIODevice *vdev) } else if (irq_info.count == 1) { vdev->pci_aer = true; } else { - error_report("vfio: %04x:%02x:%02x.%x " - "Could not enable error recovery for the device", - vdev->host.domain, vdev->host.bus, vdev->host.slot, - vdev->host.function); + error_report("vfio: %s Could not enable error recovery for the device", + vdev->vdev.name); } -error: - if (ret) { - QLIST_REMOVE(vdev, next); - vdev->group = NULL; - close(vdev->fd); - } + return ret; + } -static void vfio_put_device(VFIODevice *vdev) +static void vfio_put_device(VFIOPCIDevice *vdev) { - QLIST_REMOVE(vdev, next); - vdev->group = NULL; - DPRINTF("vfio_put_device: close vdev->fd\n"); - close(vdev->fd); + vfio_put_base_device(&vdev->vdev); if (vdev->msix) { g_free(vdev->msix); vdev->msix = NULL; @@ -3626,7 +2957,7 @@ static void vfio_put_device(VFIODevice *vdev) static void vfio_err_notifier_handler(void *opaque) { - VFIODevice *vdev = opaque; + VFIOPCIDevice *vdev = opaque; if (!event_notifier_test_and_clear(&vdev->err_notifier)) { return; @@ -3641,10 +2972,9 @@ static void 
vfio_err_notifier_handler(void *opaque) * guest to contain the error. */ - error_report("%s(%04x:%02x:%02x.%x) Unrecoverable error detected. " + error_report("%s(%s) Unrecoverable error detected. " "Please collect any data possible and then kill the guest", - __func__, vdev->host.domain, vdev->host.bus, - vdev->host.slot, vdev->host.function); + __func__, vdev->vdev.name); vm_stop(RUN_STATE_IO_ERROR); } @@ -3655,7 +2985,7 @@ static void vfio_err_notifier_handler(void *opaque) * and continue after disabling error recovery support for the * device. */ -static void vfio_register_err_notifier(VFIODevice *vdev) +static void vfio_register_err_notifier(VFIOPCIDevice *vdev) { int ret; int argsz; @@ -3686,7 +3016,7 @@ static void vfio_register_err_notifier(VFIODevice *vdev) *pfd = event_notifier_get_fd(&vdev->err_notifier); qemu_set_fd_handler(*pfd, vfio_err_notifier_handler, NULL, vdev); - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); if (ret) { error_report("vfio: Failed to set up error notification"); qemu_set_fd_handler(*pfd, NULL, NULL, vdev); @@ -3696,7 +3026,7 @@ static void vfio_register_err_notifier(VFIODevice *vdev) g_free(irq_set); } -static void vfio_unregister_err_notifier(VFIODevice *vdev) +static void vfio_unregister_err_notifier(VFIOPCIDevice *vdev) { int argsz; struct vfio_irq_set *irq_set; @@ -3719,7 +3049,7 @@ static void vfio_unregister_err_notifier(VFIODevice *vdev) pfd = (int32_t *)&irq_set->data; *pfd = -1; - ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set); + ret = ioctl(vdev->vdev.fd, VFIO_DEVICE_SET_IRQS, irq_set); if (ret) { error_report("vfio: Failed to de-assign error fd: %m"); } @@ -3729,76 +3059,36 @@ static void vfio_unregister_err_notifier(VFIODevice *vdev) event_notifier_cleanup(&vdev->err_notifier); } + +static VFIODeviceOps vfio_pci_ops = { + .vfio_eoi = vfio_pci_eoi, + .vfio_compute_needs_reset = vfio_pci_compute_needs_reset, + .vfio_hot_reset_multi = 
vfio_pci_hot_reset_multi, + .vfio_check_device = vfio_pci_check_device, + .vfio_get_device_regions = vfio_pci_get_device_regions, + .vfio_get_device_interrupts = vfio_pci_get_device_interrupts, +}; + static int vfio_initfn(PCIDevice *pdev) { - VFIODevice *pvdev, *vdev = DO_UPCAST(VFIODevice, pdev, pdev); - VFIOGroup *group; - char path[PATH_MAX], iommu_group_path[PATH_MAX], *group_name; - ssize_t len; - struct stat st; - int groupid; + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); + VFIODevice *vbasedev = &vdev->vdev; int ret; - /* Check that the host device exists */ - snprintf(path, sizeof(path), - "/sys/bus/pci/devices/%04x:%02x:%02x.%01x/", - vdev->host.domain, vdev->host.bus, vdev->host.slot, - vdev->host.function); - if (stat(path, &st) < 0) { - error_report("vfio: error: no such host device: %s", path); - return -errno; - } - - strncat(path, "iommu_group", sizeof(path) - strlen(path) - 1); - - len = readlink(path, iommu_group_path, sizeof(path)); - if (len <= 0 || len >= sizeof(path)) { - error_report("vfio: error no iommu_group for device"); - return len < 0 ? 
-errno : ENAMETOOLONG; - } - - iommu_group_path[len] = 0; - group_name = basename(iommu_group_path); - - if (sscanf(group_name, "%d", &groupid) != 1) { - error_report("vfio: error reading %s: %m", path); - return -errno; - } - - DPRINTF("%s(%04x:%02x:%02x.%x) group %d\n", __func__, vdev->host.domain, - vdev->host.bus, vdev->host.slot, vdev->host.function, groupid); - - group = vfio_get_group(groupid); - if (!group) { - error_report("vfio: failed to get group %d", groupid); - return -ENOENT; - } - - snprintf(path, sizeof(path), "%04x:%02x:%02x.%01x", + vbasedev->name = malloc(PATH_MAX); + snprintf(vbasedev->name, PATH_MAX, "%04x:%02x:%02x.%01x", vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function); - QLIST_FOREACH(pvdev, &group->device_list, next) { - if (pvdev->host.domain == vdev->host.domain && - pvdev->host.bus == vdev->host.bus && - pvdev->host.slot == vdev->host.slot && - pvdev->host.function == vdev->host.function) { + vbasedev->ops = &vfio_pci_ops; - error_report("vfio: error: device %s is already attached", path); - vfio_put_group(group); - return -EBUSY; - } - } - - ret = vfio_get_device(group, path, vdev); - if (ret) { - error_report("vfio: failed to get device %s", path); - vfio_put_group(group); + ret = vfio_base_device_init(vbasedev, VFIO_DEVICE_TYPE_PCI); + if (ret < 0) { return ret; } /* Get a copy of config space */ - ret = pread(vdev->fd, vdev->pdev.config, + ret = pread(vbasedev->fd, vdev->pdev.config, MIN(pci_config_size(&vdev->pdev), vdev->config_size), vdev->config_offset); if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) { @@ -3879,14 +3169,14 @@ out_teardown: out_put: g_free(vdev->emulated_config_bits); vfio_put_device(vdev); - vfio_put_group(group); + vfio_put_group(vbasedev->group, vfio_reset_handler); return ret; } static void vfio_exitfn(PCIDevice *pdev) { - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); - VFIOGroup *group = vdev->group; + VFIOPCIDevice *vdev = container_of(pdev, 
VFIOPCIDevice, pdev); + VFIOGroup *group = vdev->vdev.group; vfio_unregister_err_notifier(vdev); pci_device_set_intx_routing_notifier(&vdev->pdev, NULL); @@ -3899,21 +3189,22 @@ static void vfio_exitfn(PCIDevice *pdev) g_free(vdev->emulated_config_bits); g_free(vdev->rom); vfio_put_device(vdev); - vfio_put_group(group); + vfio_put_group(group, vfio_reset_handler); } static void vfio_pci_reset(DeviceState *dev) { PCIDevice *pdev = DO_UPCAST(PCIDevice, qdev, dev); - VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev); + VFIOPCIDevice *vdev = container_of(pdev, VFIOPCIDevice, pdev); + int fd = vdev->vdev.fd; DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function); vfio_pci_pre_reset(vdev); - if (vdev->reset_works && (vdev->has_flr || !vdev->has_pm_reset) && - !ioctl(vdev->fd, VFIO_DEVICE_RESET)) { + if (vdev->vdev.reset_works && (vdev->has_flr || !vdev->has_pm_reset) && + !ioctl(vdev->vdev.fd, VFIO_DEVICE_RESET)) { DPRINTF("%04x:%02x:%02x.%x FLR/VFIO_DEVICE_RESET\n", vdev->host.domain, vdev->host.bus, vdev->host.slot, vdev->host.function); goto post_reset; @@ -3925,10 +3216,9 @@ static void vfio_pci_reset(DeviceState *dev) } /* If nothing else works and the device supports PM reset, use it */ - if (vdev->reset_works && vdev->has_pm_reset && - !ioctl(vdev->fd, VFIO_DEVICE_RESET)) { - DPRINTF("%04x:%02x:%02x.%x PCI PM Reset\n", vdev->host.domain, - vdev->host.bus, vdev->host.slot, vdev->host.function); + if (vdev->vdev.reset_works && vdev->has_pm_reset && + !ioctl(fd, VFIO_DEVICE_RESET)) { + DPRINTF("%s PCI PM Reset\n", vdev->vdev.name); goto post_reset; } @@ -3937,16 +3227,16 @@ post_reset: } static Property vfio_pci_dev_properties[] = { - DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host), - DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIODevice, + DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host), + DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice, intx.mmap_timeout, 1100), - 
DEFINE_PROP_BIT("x-vga", VFIODevice, features, + DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features, VFIO_FEATURE_ENABLE_VGA_BIT, false), - DEFINE_PROP_INT32("bootindex", VFIODevice, bootindex, -1), + DEFINE_PROP_INT32("bootindex", VFIOPCIDevice, bootindex, -1), /* * TODO - support passed fds... is this necessary? - * DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name), - * DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name), + * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name), + * DEFINE_PROP_STRING("vfiogroupfd, VFIOPCIDevice, vfiogroupfd_name), */ DEFINE_PROP_END_OF_LIST(), }; @@ -3976,7 +3266,7 @@ static void vfio_pci_dev_class_init(ObjectClass *klass, void *data) static const TypeInfo vfio_pci_dev_info = { .name = "vfio-pci", .parent = TYPE_PCI_DEVICE, - .instance_size = sizeof(VFIODevice), + .instance_size = sizeof(VFIOPCIDevice), .class_init = vfio_pci_dev_class_init, }; diff --git a/hw/vfio/platform.c b/hw/vfio/platform.c new file mode 100644 index 0000000..646aa53 --- /dev/null +++ b/hw/vfio/platform.c @@ -0,0 +1,267 @@ +/* + * vfio based device assignment support - platform devices + * + * Copyright Linaro Limited, 2014 + * + * Authors: + * Kim Phillips + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on vfio based PCI device assignment support: + * Copyright Red Hat, Inc. 
2012 + */ + +#include +#include +#include + +#include "qemu/error-report.h" +#include "qemu/range.h" +#include "sysemu/sysemu.h" +#include "hw/sysbus.h" + +#include "vfio-common.h" + + +typedef struct VFIOPlatformDevice { + SysBusDevice sbdev; + VFIODevice vdev; /* not a QOM object */ +/* interrupts to come later on */ +} VFIOPlatformDevice; + + +static const MemoryRegionOps vfio_region_ops = { + .read = vfio_region_read, + .write = vfio_region_write, + .endianness = DEVICE_NATIVE_ENDIAN, +}; + +/* + * It is mandatory to pass a VFIOPlatformDevice since VFIODevice + * is not an Object and cannot be passed to memory region functions +*/ + +static void vfio_map_region(VFIOPlatformDevice *vdev, int nr) +{ + VFIORegion *region = vdev->vdev.regions[nr]; + unsigned size = region->size; + char name[64]; + + snprintf(name, sizeof(name), "VFIO %s region %d", vdev->vdev.name, nr); + + /* A "slow" read/write mapping underlies all regions */ + memory_region_init_io(&region->mem, OBJECT(vdev), &vfio_region_ops, + region, name, size); + + strncat(name, " mmap", sizeof(name) - strlen(name) - 1); + + if (vfio_mmap_region(OBJECT(vdev), region, &region->mem, + &region->mmap_mem, &region->mmap, size, 0, name)) { + error_report("%s unsupported. 
Performance may be slow", name); + } +} + + +static void vfio_unmap_region(VFIODevice *vdev, int nr) +{ + VFIORegion *region = vdev->regions[nr]; + + if (!region->size) { + return; + } + + memory_region_del_subregion(&region->mem, &region->mmap_mem); + munmap(region->mmap, memory_region_size(&region->mmap_mem)); + memory_region_destroy(&region->mmap_mem); + + memory_region_destroy(&region->mem); +} + +static void vfio_unmap_regions(VFIODevice *vdev) +{ + int i; + for (i = 0; i < vdev->num_regions; i++) { + vfio_unmap_region(vdev, i); + } +} + + +static int vfio_platform_get_device_regions(VFIODevice *vbasedev) +{ + struct vfio_region_info reg_info = { .argsz = sizeof(reg_info) }; + int i, ret = errno; + + vbasedev->regions = g_malloc0(sizeof(VFIORegion *) * vbasedev->num_regions); + + for (i = 0; i < vbasedev->num_regions; i++) { + vbasedev->regions[i] = g_malloc0(sizeof(VFIORegion)); + + reg_info.index = i; + + ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, &reg_info); + if (ret) { + error_report("vfio: Error getting region %d info: %m", i); + goto error; + } + + vbasedev->regions[i]->flags = reg_info.flags; + vbasedev->regions[i]->size = reg_info.size; + vbasedev->regions[i]->fd_offset = reg_info.offset; + vbasedev->regions[i]->fd = vbasedev->fd; + vbasedev->regions[i]->nr = i; + vbasedev->regions[i]->vdev = vbasedev; + } + + print_regions(vbasedev); + + return ret; + +error: + for (i = 0; i < vbasedev->num_regions; i++) { + g_free(vbasedev->regions[i]); + } + g_free(vbasedev->regions); + vfio_put_base_device(vbasedev); + return ret; +} + + +/* not implemented yet */ +static int vfio_platform_check_device(VFIODevice *vdev) +{ + return 0; +} + +/* not implemented yet */ +static bool vfio_platform_compute_needs_reset(VFIODevice *vdev) +{ +return false; +} + +static int vfio_platform_hot_reset_multi(VFIODevice *vdev) +{ +return 0; +} + + +/* not implemented yet */ +static int vfio_platform_get_device_interrupts(VFIODevice *vdev) +{ + return 0; +} + +/* not implemented yet */ 
+static void vfio_platform_eoi(VFIODevice *vdev) +{ +} + +static VFIODeviceOps vfio_platform_ops = { + .vfio_eoi = vfio_platform_eoi, + .vfio_compute_needs_reset = vfio_platform_compute_needs_reset, + .vfio_hot_reset_multi = vfio_platform_hot_reset_multi, + .vfio_check_device = vfio_platform_check_device, + .vfio_get_device_regions = vfio_platform_get_device_regions, + .vfio_get_device_interrupts = vfio_platform_get_device_interrupts, +}; + + +static void vfio_platform_realize(DeviceState *dev, Error **errp) +{ + SysBusDevice *sbdev = SYS_BUS_DEVICE(dev); + VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev); + VFIODevice *vbasedev = &vdev->vdev; + int i, ret; + + vbasedev->ops = &vfio_platform_ops; + + /* TODO: pass device name on command line */ + vbasedev->name = malloc(PATH_MAX); + snprintf(vbasedev->name, PATH_MAX, "%s", "fff51000.ethernet"); + + ret = vfio_base_device_init(vbasedev, VFIO_DEVICE_TYPE_PLATFORM); + if (ret < 0) { + return; + } + + for (i = 0; i < vbasedev->num_regions; i++) { + vfio_map_region(vdev, i); + sysbus_init_mmio(sbdev, &vbasedev->regions[i]->mem); + } +} + +static void vfio_platform_unrealize(DeviceState *dev, Error **errp) +{ + int i; + SysBusDevice *sbdev = SYS_BUS_DEVICE(dev); + VFIOPlatformDevice *vdev = container_of(sbdev, VFIOPlatformDevice, sbdev); + VFIODevice *vbasedev = &vdev->vdev; + VFIOGroup *group = vbasedev->group; + /* + * placeholder for + * vfio_unregister_err_notifier(vdev) + * vfio_disable_interrupts(vdev); + * timer free + * g_free vdev dynamic fields + */ + vfio_unmap_regions(vbasedev); + + for (i = 0; i < vbasedev->num_regions; i++) { + g_free(vbasedev->regions[i]); + } + g_free(vbasedev->regions); + + vfio_put_base_device(vbasedev); + vfio_put_group(group, vfio_reset_handler); + +} + +static const VMStateDescription vfio_platform_vmstate = { + .name = TYPE_VFIO_PLATFORM, + .unmigratable = 1, +}; + +typedef struct VFIOPlatformDeviceClass { + DeviceClass parent_class; + + int 
(*init)(VFIODevice *dev); +} VFIOPlatformDeviceClass; + +#define VFIO_PLATFORM_DEVICE(obj) \ + OBJECT_CHECK(VFIOPlatformDevice, (obj), TYPE_VFIO_PLATFORM) +#define VFIO_PLATFORM_DEVICE_CLASS(klass) \ + OBJECT_CLASS_CHECK(VFIOPlatformDeviceClass, (klass), TYPE_VFIO_PLATFORM) +#define VFIO_PLATFORM_DEVICE_GET_CLASS(obj) \ + OBJECT_GET_CLASS(VFIOPlatformDeviceClass, (obj), TYPE_VFIO_PLATFORM) + + + +static void vfio_platform_dev_class_init(ObjectClass *klass, void *data) +{ + DeviceClass *dc = DEVICE_CLASS(klass); + VFIOPlatformDeviceClass *vdc = VFIO_PLATFORM_DEVICE_CLASS(klass); + + dc->realize = vfio_platform_realize; + dc->unrealize = vfio_platform_unrealize; + dc->vmsd = &vfio_platform_vmstate; + dc->desc = "VFIO-based platform device assignment"; + set_bit(DEVICE_CATEGORY_MISC, dc->categories); + + vdc->init = NULL; +} + +static const TypeInfo vfio_platform_dev_info = { + .name = TYPE_VFIO_PLATFORM, + .parent = TYPE_SYS_BUS_DEVICE, + .instance_size = sizeof(VFIOPlatformDevice), + .class_init = vfio_platform_dev_class_init, + .class_size = sizeof(VFIOPlatformDeviceClass), +}; + +static void register_vfio_platform_dev_type(void) +{ + type_register_static(&vfio_platform_dev_info); +} + +type_init(register_vfio_platform_dev_type) diff --git a/hw/vfio/vfio-common.h b/hw/vfio/vfio-common.h new file mode 100644 index 0000000..2699fba --- /dev/null +++ b/hw/vfio/vfio-common.h @@ -0,0 +1,143 @@ +/* + * common header for vfio based device assignment support + * + * Copyright Red Hat, Inc. 2012 + * + * Authors: + * Alex Williamson + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + * Based on qemu-kvm device-assignment: + * Adapted for KVM by Qumranet. 
+ * Copyright (c) 2007, Neocleus, Alex Novik (alex@neocleus.com) + * Copyright (c) 2007, Neocleus, Guy Zana (guy@neocleus.com) + * Copyright (C) 2008, Qumranet, Amit Shah (amit.shah@qumranet.com) + * Copyright (C) 2008, Red Hat, Amit Shah (amit.shah@redhat.com) + * Copyright (C) 2008, IBM, Muli Ben-Yehuda (muli@il.ibm.com) + */ + +#include "hw/hw.h" + +/*#define DEBUG_VFIO*/ +#ifdef DEBUG_VFIO +#define DPRINTF(fmt, ...) \ + do { fprintf(stderr, "vfio: " fmt, ## __VA_ARGS__); } while (0) +#else +#define DPRINTF(fmt, ...) \ + do { } while (0) +#endif + +/* Extra debugging, trap acceleration paths for more logging */ +#define VFIO_ALLOW_MMAP 1 +#define VFIO_ALLOW_KVM_INTX 1 +#define VFIO_ALLOW_KVM_MSI 1 +#define VFIO_ALLOW_KVM_MSIX 1 + +#define TYPE_VFIO_PLATFORM "vfio-platform" + +enum { + VFIO_DEVICE_TYPE_PCI = 0, + VFIO_DEVICE_TYPE_PLATFORM = 1, +}; + +struct VFIOGroup; +struct VFIODevice; + +typedef struct VFIODeviceOps VFIODeviceOps; + +/* Base Class for a VFIO Region */ + +typedef struct VFIORegion { + struct VFIODevice *vdev; + off_t fd_offset; /* offset of region within device fd */ + int fd; /* device fd, allows us to pass VFIORegion as opaque data */ + MemoryRegion mem; /* slow, read/write access */ + MemoryRegion mmap_mem; /* direct mapped access */ + void *mmap; + size_t size; + uint32_t flags; /* VFIO region flags (rd/wr/mmap) */ + uint8_t nr; /* cache the region number for debug */ +} VFIORegion; + + +/* Base Class for a VFIO device */ + +typedef struct VFIODevice { + QLIST_ENTRY(VFIODevice) next; + struct VFIOGroup *group; + unsigned int num_regions; + VFIORegion **regions; + unsigned int num_irqs; + char *name; + int fd; + int type; + bool reset_works; + bool needs_reset; + VFIODeviceOps *ops; +} VFIODevice; + + +typedef struct VFIOType1 { + MemoryListener listener; + int error; + bool initialized; +} VFIOType1; + +typedef struct VFIOContainer { + int fd; /* /dev/vfio/vfio, empowered by the attached groups */ + struct { + /* enable abstraction to 
support various iommu backends */ + union { + VFIOType1 type1; + }; + void (*release)(struct VFIOContainer *); + } iommu_data; + QLIST_HEAD(, VFIOGroup) group_list; + QLIST_ENTRY(VFIOContainer) next; +} VFIOContainer; + +typedef struct VFIOGroup { + int fd; + int groupid; + VFIOContainer *container; + QLIST_HEAD(, VFIODevice) device_list; + QLIST_ENTRY(VFIOGroup) next; + QLIST_ENTRY(VFIOGroup) container_next; +} VFIOGroup; + + +struct VFIODeviceOps { + bool (*vfio_compute_needs_reset)(VFIODevice *vdev); + int (*vfio_hot_reset_multi)(VFIODevice *vdev); + void (*vfio_eoi)(VFIODevice *vdev); + int (*vfio_check_device)(VFIODevice *vdev); + int (*vfio_get_device_regions)(VFIODevice *vdev); + int (*vfio_get_device_interrupts)(VFIODevice *vdev); +}; + + + +VFIOGroup *vfio_get_group(int groupid, QEMUResetHandler *reset_handler); +void vfio_put_group(VFIOGroup *group, QEMUResetHandler *reset_handler); + +void vfio_reset_handler(void *opaque); + +void vfio_unmask_irqindex(VFIODevice *vdev, int index); +void vfio_disable_irqindex(VFIODevice *vdev, int index); +void vfio_mask_int(VFIODevice *vdev, int index); + +void vfio_region_write(void *opaque, hwaddr addr, uint64_t data, unsigned size); +uint64_t vfio_region_read(void *opaque, hwaddr addr, unsigned size); + +int vfio_get_base_device(VFIOGroup *group, const char *name, + struct VFIODevice *vdev); +void vfio_put_base_device(VFIODevice *vdev); +int vfio_base_device_init(VFIODevice *vdev, int type); +void print_regions(VFIODevice *vdev); + +int vfio_mmap_region(Object *vdev, VFIORegion *region, + MemoryRegion *mem, MemoryRegion *submem, + void **map, size_t size, off_t offset, + const char *name); diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h index 26c218e..ef4815d 100644 --- a/linux-headers/linux/vfio.h +++ b/linux-headers/linux/vfio.h @@ -154,6 +154,7 @@ struct vfio_device_info { __u32 flags; #define VFIO_DEVICE_FLAGS_RESET (1 << 0) /* Device supports reset */ #define VFIO_DEVICE_FLAGS_PCI (1 << 1) /* 
vfio-pci device */ +#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2) /* vfio-platform device */ __u32 num_regions; /* Max region index + 1 */ __u32 num_irqs; /* Max IRQ index + 1 */ };