Message ID | 20210209123845.4856-1-michael.christie@oracle.com
---|---
Series | target: fix cmd plugging and completion
On 09.02.21 13:38, Mike Christie wrote:
>
> +void target_queued_compl_work(struct work_struct *work)
> +{
> +	struct se_cmd_queue *cq = container_of(work, struct se_cmd_queue,
> +					       work);
> +	struct se_cmd *se_cmd, *next_cmd;
> +	struct llist_node *cmd_list;
> +
> +	cmd_list = llist_del_all(&cq->cmd_list);

Probably nit-picking: I'd like to reverse the list before processing, like you did during submission.

> +	llist_for_each_entry_safe(se_cmd, next_cmd, cmd_list, se_cmd_list)
> +		target_complete_cmd_work(se_cmd);
> +}
> +
On 2/9/21 11:01 AM, Bodo Stroesser wrote:
> On 09.02.21 13:38, Mike Christie wrote:
>>
>> +void target_queued_compl_work(struct work_struct *work)
>> +{
>> +	struct se_cmd_queue *cq = container_of(work, struct se_cmd_queue,
>> +					       work);
>> +	struct se_cmd *se_cmd, *next_cmd;
>> +	struct llist_node *cmd_list;
>> +
>> +	cmd_list = llist_del_all(&cq->cmd_list);
>
> Probably nit-picking: I'd like to reverse the list before processing like you did during submission.
>

Yeah, I flip flopped a lot on both. I'll just do it. I didn't notice any perf differences.
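For reference, a minimal sketch of the reordering Bodo asks for, following the quoted patch with only the reversal added; the se_cmd_queue/se_cmd field names and target_complete_cmd_work() come from that patch, and since llist_add() builds a LIFO, the detached list is reversed so commands complete in submission order:

```c
#include <linux/llist.h>
#include <linux/workqueue.h>

void target_queued_compl_work(struct work_struct *work)
{
	struct se_cmd_queue *cq = container_of(work, struct se_cmd_queue, work);
	struct se_cmd *se_cmd, *next_cmd;
	struct llist_node *cmd_list;

	/* Detach everything queued so far in one atomic operation. */
	cmd_list = llist_del_all(&cq->cmd_list);
	/* llist_add() prepends, so reverse to get submission order back. */
	cmd_list = llist_reverse_order(cmd_list);

	llist_for_each_entry_safe(se_cmd, next_cmd, cmd_list, se_cmd_list)
		target_complete_cmd_work(se_cmd);
}
```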
On 2/9/21 10:32 AM, Bodo Stroesser wrote:
> On 09.02.21 13:38, Mike Christie wrote:
>> This patch adds plug/unplug callouts for tcmu, so we can avoid the
>> number of times we switch to userspace. Using this driver with tcm
>> loop is a common config, and depending on the nr_hw_queues
>> (nr_hw_queues=1 performs much better) and fio jobs (lower num jobs
>> around 4) this patch can increase IOPs by only around 5-10% because
>> we hit other issues like the big per tcmu device mutex.
>
>> Signed-off-by: Mike Christie <michael.christie@oracle.com>
>> ---
>>  drivers/target/target_core_user.c | 27 +++++++++++++++++++++++++--
>>  1 file changed, 25 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>> index a5991df23581..a030ca6f0f4c 100644
>> --- a/drivers/target/target_core_user.c
>> +++ b/drivers/target/target_core_user.c
>> @@ -111,6 +111,7 @@ struct tcmu_dev {
>>  	struct kref kref;
>>
>>  	struct se_device se_dev;
>> +	struct se_dev_plug se_plug;
>>
>>  	char *name;
>>  	struct se_hba *hba;
>> @@ -119,6 +120,7 @@ struct tcmu_dev {
>>  #define TCMU_DEV_BIT_BROKEN 1
>>  #define TCMU_DEV_BIT_BLOCKED 2
>>  #define TCMU_DEV_BIT_TMR_NOTIFY 3
>> +#define TCM_DEV_BIT_PLUGGED 4
>>  	unsigned long flags;
>>
>>  	struct uio_info uio_info;
>> @@ -959,6 +961,25 @@ static uint32_t ring_insert_padding(struct tcmu_dev *udev, size_t cmd_size)
>>  	return cmd_head;
>>  }
>>
>> +static void tcmu_unplug_device(struct se_dev_plug *se_plug)
>> +{
>> +	struct se_device *se_dev = se_plug->se_dev;
>> +	struct tcmu_dev *udev = TCMU_DEV(se_dev);
>> +
>> +	uio_event_notify(&udev->uio_info);
>
> Don't we have a race here?
>
> Let's assume that
> - just here the thread is interrupted
> - userspace starts, empties the ring and sleeps again
> - another cpu queues a new CDB in the ring
> In that - of course very rare - condition, userspace will not wake up for the freshly queued CDB.
>
> I think first clearing the bit, then doing the uio_event_notify would work (without need to take the big tcmu mutex).

You're right. Will fix. I have the same issue in iblock, and there I made a mistake where it's per cpu when it should be per task.
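A minimal sketch of the reordered unplug path Bodo suggests, with the plugged bit cleared before the wakeup; the bit and field names follow the quoted patch, and this is not the final posted code:

```c
static void tcmu_unplug_device(struct se_dev_plug *se_plug)
{
	struct se_device *se_dev = se_plug->se_dev;
	struct tcmu_dev *udev = TCMU_DEV(se_dev);

	/* Clear the bit first: any submitter running after this point sees
	 * the device as unplugged and sends its own notification, so a
	 * userspace that just emptied the ring and went to sleep cannot
	 * miss a freshly queued CDB. */
	clear_bit(TCM_DEV_BIT_PLUGGED, &udev->flags);
	uio_event_notify(&udev->uio_info);
}
```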
On Tue, Feb 09, 2021 at 06:38:33AM -0600, Mike Christie wrote:
> The next patch splits target_submit_cmd_map_sgls so the initialization
> and submission part can be called at different times. If the init part
> fails we can reference the t_task_cdb early in some of the logging
> and tracing code. This moves it to transport_init_se_cmd so we don't
> hit NULL pointer crashes.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Feb 09, 2021 at 06:38:37AM -0600, Mike Christie wrote:
> This just has tcm loop use the block layer cmd allocator for se_cmds
> instead of using the tcm_loop_cmd_cache. In future patches when we
> can use the host tags for internal requests like TMFs we can completely
> kill the tcm_loop_cmd_cache.
>
> Signed-off-by: Mike Christie <michael.christie@oracle.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
> Doing a work per cmd can lead to lots of threads being created.
> This patch just replaces the completion work per cmd with a per cpu
> list. Combined with the first patches this allows tcm loop on top of
> initiators like iser to go from around 700K IOPs to 1000K and reduces
> the number of threads that get created when the system is under heavy
> load and hitting the initiator drivers tagging limits.

OTOH it does increase completion latency, which might be the preference
for some workloads. Do we need a tunable here?

> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
> +				  int cpu, struct workqueue_struct *wq)
>  {
> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
> +	queue_work_on(cpu, wq, &q->work);
> +}

Do we need this helper at all? Having it open coded in the two callers
would seem easier to follow to me.
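For context, a sketch of the per-cpu completion queue shape the patch implies: each CPU gets a lockless llist plus a single work item that drains it, so queueing a completion is one llist_add() followed by queue_work_on(). The structure layout here is an assumption based on the quoted hunks:

```c
#include <linux/llist.h>
#include <linux/workqueue.h>

struct se_cmd_queue {
	struct llist_head cmd_list;	/* lockless per-cpu list of pending se_cmds */
	struct work_struct work;	/* one drain work item per CPU */
};

static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
				  int cpu, struct workqueue_struct *wq)
{
	/* llist_add() is safe from any context; queue_work_on() is a no-op
	 * while the per-cpu work is still pending, so completions batch up
	 * instead of creating one work item (and worker) per command. */
	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
	queue_work_on(cpu, wq, &q->work);
}
```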
> 	struct se_device *se_dev = se_cmd->se_dev;
> -	int cpu = se_cmd->cpuid;
> +	int cpu;
> +
> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
> +		cpu = smp_processor_id();
> +	else
> +		cpu = se_cmd->cpuid;
>
> 	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
> 			      target_completion_wq);

Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
indicator, which would remove all branches here.
On 2/10/21 2:42 AM, Christoph Hellwig wrote:
> On Tue, Feb 09, 2021 at 06:38:43AM -0600, Mike Christie wrote:
>> Doing a work per cmd can lead to lots of threads being created.
>> This patch just replaces the completion work per cmd with a per cpu
>> list. Combined with the first patches this allows tcm loop on top of
>> initiators like iser to go from around 700K IOPs to 1000K and reduces
>> the number of threads that get created when the system is under heavy
>> load and hitting the initiator drivers tagging limits.
>
> OTOH it does increase completion latency, which might be the preference
> for some workloads. Do we need a tunable here?

I didn't see an increase with higher perf setups like 100G (really 60
something) iser/srp, and even when using ram disks. I think it's because
either way we run one cmd on each cpu at a time, or if we do kick off cmds
on different threads they still run on the same cpu, so latency didn't
change overall.

1. Without my patch each cmd has a work struct. target_complete_cmd does
queue_work_on, which puts the cmd's work on the worker pool for that cpu.
With faster drivers, workqueue.c does the equivalent of a loop over each
work/cmd: running one work on the cpu, letting it complete, then running
the next. If we get to the point where the workqueue code runs cmds in
different threads on the same cpu, they end up sharing the cpu, so latency
wasn't changing (I didn't debug this very much). I can't tell how much the
rescuer thread comes into this though. Maybe it could come in and run cmds
on a different cpu, and that would help.

2. With the patch, we add the cmd to the device's per cpu list, then do
queue_work_on on the list's per cpu work. queue_work_on then either hits
the pending check or we run like #1, but on a per device basis instead of
per cmd.

However,

1. While I was checking this I noticed that for commands that set
transport_complete_callback, I still need to do the per cmd work, since
they can be slow. I'll fix this in the next posting.

2. For the latency issue, did you want the cmd to be able to run on any
cpu? I think this won't happen now under normal cases, because the
workqueue does not set WQ_UNBOUND. Did we want a tunable for that?

>> +static void target_queue_cmd_work(struct se_cmd_queue *q, struct se_cmd *se_cmd,
>> +				  int cpu, struct workqueue_struct *wq)
>>  {
>> -	struct se_cmd *cmd = container_of(work, struct se_cmd, work);
>> +	llist_add(&se_cmd->se_cmd_list, &q->cmd_list);
>> +	queue_work_on(cpu, wq, &q->work);
>> +}
>
> Do we need this helper at all? Having it open coded in the two callers
> would seem easier to follow to me.

I can do that.
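A hedged sketch of the transport_complete_callback fallback described above: commands with a callback (e.g. COMPARE AND WRITE) can be slow, so they would keep a per-command work item instead of sharing the per-cpu list. The helper name and the use of se_cmd->work / target_complete_ok_work are assumptions for illustration, not the posted code:

```c
static void target_queue_completion(struct se_cmd *se_cmd, int cpu)
{
	struct se_device *se_dev = se_cmd->se_dev;

	if (se_cmd->transport_complete_callback) {
		/* Slow path: give the command its own work item so it
		 * cannot hold up the shared per-cpu completion list. */
		INIT_WORK(&se_cmd->work, target_complete_ok_work);
		queue_work_on(cpu, target_completion_wq, &se_cmd->work);
		return;
	}

	/* Fast path: batch onto the device's per-cpu completion queue. */
	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
			      target_completion_wq);
}
```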
On 2/10/21 2:44 AM, Christoph Hellwig wrote:
>> 	struct se_device *se_dev = se_cmd->se_dev;
>> -	int cpu = se_cmd->cpuid;
>> +	int cpu;
>> +
>> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
>> +		cpu = smp_processor_id();
>> +	else
>> +		cpu = se_cmd->cpuid;
>>
>> 	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
>> 			      target_completion_wq);
>
> Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
> indicator, which would remove all branches here.

We can't right now because the workqueue_struct does not have the
WQ_UNBOUND bit set. __queue_work will then do:

	/* pwq which will be used unless @work is executing elsewhere */
	if (wq->flags & WQ_UNBOUND) {
		if (req_cpu == WORK_CPU_UNBOUND)
			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
	} else {
		if (req_cpu == WORK_CPU_UNBOUND)
			cpu = raw_smp_processor_id();
		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
	}

so even if you pass in WORK_CPU_UNBOUND, it will do raw_smp_processor_id
and add the work to the cpu's worker pool.

I think if I add a new tunable to make the workqueue bound or unbound,
like I mentioned in the other thread, it will do what you want for both
review comments.
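A sketch of the tunable Mike floats: if the completion workqueue were allocated with WQ_UNBOUND, then queueing with WORK_CPU_UNBOUND would let the workqueue layer pick the CPU, which is what Christoph's suggestion needs. The flag combination and init helper are assumptions, not the code that was eventually merged:

```c
#include <linux/workqueue.h>

static struct workqueue_struct *target_completion_wq;

static int target_completion_wq_init(bool unbound)
{
	/* Bound (per-cpu) by default; WQ_UNBOUND makes __queue_work honor
	 * WORK_CPU_UNBOUND instead of falling back to the local CPU. */
	target_completion_wq = alloc_workqueue("target_completion",
					       WQ_MEM_RECLAIM |
					       (unbound ? WQ_UNBOUND : 0), 0);
	return target_completion_wq ? 0 : -ENOMEM;
}
```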
On 2/10/21 12:43 PM, Mike Christie wrote:
> On 2/10/21 2:44 AM, Christoph Hellwig wrote:
>>> 	struct se_device *se_dev = se_cmd->se_dev;
>>> -	int cpu = se_cmd->cpuid;
>>> +	int cpu;
>>> +
>>> +	if (se_cmd->se_cmd_flags & SCF_IGNORE_CPUID_COMPL)
>>> +		cpu = smp_processor_id();
>>> +	else
>>> +		cpu = se_cmd->cpuid;
>>>
>>> 	target_queue_cmd_work(&se_dev->queues[cpu].cq, se_cmd, cpu,
>>> 			      target_completion_wq);
>>
>> Can't we just use se_cmd->cpuid == WORK_CPU_UNBOUND as the no affinity
>> indicator, which would remove all branches here.
>
> We can't right now because the workqueue_struct does
> not have the WQ_UNBOUND bit set. __queue_work will then do:
>
> 	/* pwq which will be used unless @work is executing elsewhere */
> 	if (wq->flags & WQ_UNBOUND) {
> 		if (req_cpu == WORK_CPU_UNBOUND)
> 			cpu = wq_select_unbound_cpu(raw_smp_processor_id());
> 		pwq = unbound_pwq_by_node(wq, cpu_to_node(cpu));
> 	} else {
> 		if (req_cpu == WORK_CPU_UNBOUND)
> 			cpu = raw_smp_processor_id();
> 		pwq = per_cpu_ptr(wq->cpu_pwqs, cpu);
> 	}
>
> so even if you pass in WORK_CPU_UNBOUND, it will do raw_smp_processor_id
> and add the work to the cpu's worker pool.

Nevermind. What I wrote is exactly what you asked for :) I can do that.