Reactor spends too much time polling empty smp queues #986

xemul · 2021-12-23T07:04:34Z

E.g. testing the io_tester with this config:

- data_size: 8192MB
  name: reads
  shard_info:
    parallelism: 16
    reqsize: 1kB
    rps: 251
    shares: 100000
  shards: all
  type: randread

resulits in this top-10 entries in the perf report

   1,38%  io_tester  io_tester                           [.] seastar::smp_message_queue::flush_request_batch                 ◆
   0,44%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_completions                 ▒
   0,28%  io_tester  io_tester                           [.] seastar::smp::poll_queues                                       ▒
   0,27%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_queue<4ul, seastar::smp_mess▒
   0,25%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_incoming                    ▒
   0,14%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_queue<2ul, seastar::smp_mess▒
   0,13%  io_tester  io_tester                           [.] seastar::smp_message_queue::flush_response_batch                ▒
   0,10%  io_tester  io_tester                           [.] seastar::internal::try_reap_events                              ▒
   0,07%  io_tester  [vdso]                              [.] 0x00000000000006c8                                              ▒
   0,06%  io_tester  io_tester                           [.] seastar::reactor::poll_once                                     ▒

Summing up all seastar::smp entries would be

   1,38%  io_tester  io_tester                           [.] seastar::smp_message_queue::flush_request_batch
   0,44%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_completions
   0,28%  io_tester  io_tester                           [.] seastar::smp::poll_queues
   0,27%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_queue<4ul, seastar::smp_message_queue::process_completions(unsigned int)::{lambda(seastar::smp_message_queue::work_item*)#1}>
   0,25%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_incoming
   0,14%  io_tester  io_tester                           [.] seastar::smp_message_queue::process_queue<2ul, seastar::smp_message_queue::process_incoming()::{lambda(seastar::smp_message_queue::work_item*)#1}>
   0,13%  io_tester  io_tester                           [.] seastar::smp_message_queue::flush_response_batch
   0,00%  io_tester  io_tester                           [.] seastar::noncopyable_function<seastar::future<void> (seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >&&)>::direct_vtable_for<seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >::then_wrapped_maybe_erase<true, void, seastar::smp_message_queue::submit_item(unsigned int, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::unique_ptr<seastar::smp_message_queue::work_item, std::default_delete<seastar::smp_message_queue::work_item> >)::{lambda(seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >)#1}>(seastar::smp_message_queue::submit_item(unsigned int, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000l> > >, std::unique_ptr<seastar::smp_message_queue::work_item, std::default_delete<seastar::smp_message_queue::work_item> >)::{lambda(seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >)#1}&&)::{lambda(seastar::future<seastar::semaphore_units<seastar::named_semaphore_exception_factory, seastar::lowres_clock> >&&)#1}>::call
   0,00%  io_tester  io_tester                           [.] seastar::smp_message_queue::lf_queue::maybe_wakeup
   0,00%  io_tester  io_tester                           [.] seastar::smp_message_queue::~smp_message_queue

This seems to be way too much for the component that's not used in this test.

The text was updated successfully, but these errors were encountered:

" There are 2 limitations that iotune collects from the disk -- the bandwidth and the iops limit (both separately for read and write, but this separation is not important here). Next the scheduler makes sure that both of these limits are respected when dispatching IO into the disk. The latter is maintained by summing up the number of requests executing and their lengths and stopping the dispatch once the sum hist the respective limit. This algo assumes that the disk is linear in all dimension, in particular that reads and writes are equal in terms of affecting each other, while this is not so. The recent study showed that disks manage mixed workload differently and linear summing does overwhelm the disk. The solution is based on the observation that throttling reads and writes somehow below the collected maximum throughputs/iops helps keeping the latencies in bounds. This set replaces the current ticket-based capacity management with the rate-limited one. The rate-limited formula is the bw_r / max_bw_r + iops_r / max_iops_r + { same for _w } <= 1.0 To make this happen first patches encapsulate the capacity management inside io-group. Patch #16 adds the rate-limiter itself. The last patch is the test that shows how rate-limited reads compete with unbound writes. First, the read-only workload is run on its own: througput(kbs) iops lat95 queued executed (microseconds) 514536 128634 1110.2 262.1 465.3 the scheduler coefficients are K_bandwidth = 0.231 K_iops = 0.499 K = 0.729 The workload is configured for 50% of the disk read IOPS, and this is what it shows. The K value is 0.5 + the bandwidth tax. Next, the very same read workload is run in parallel with unbount writes. Shares are 100 for write and 2500 for read. througput(kbs) iops lat95 queued executed (microseconds) w: 226349 3536 118357.4 26726.8 397.4 r: 500465 125116 2134.4 1182.7 321.7 the scheduler coefficients are write: K_bandwidth = 0.226 K_iops = 0.020 K = 0.246 read: K_bandwidth = 0.224 K_iops = 0.485 K = 0.709 Comments: 1. K_read / K_write is not 2500 / 100 because reads do not need that much. Changing reads' shares will hit the wake-up latency (see below) 2. read 1.1ms queued time comes from two places: First, the queued request waits ~1 tick until io-queue is polled Second, when the queue tries to submit the request it most likely hits the rate limit and needs to wait at least 1 more tick until some other request completes and replenisher releases its capacity back 3. read lat95 of 2ms is the best out of several runs. More typical one is around 6ms and higher. This latency = queued + executed + wakeup times. In the "best" case the wakeup time is thus 0.7ms, in "more typical" one it's ~4ms without good explanation where it comes from (issue #989) These results are achieved with several in-test tricks that mitigated some problems that seem to come from CPU scheduler and/or reactor loop: a. read workload needs the exact amount of fibers so that CPU scheduler resonates with the io-scheduler shares. Less fibers under-generate requests, more fiers cause much larger wakeup latencies (issue #988) b. read fibers busy-loop, not sleeps between submitting requests. Sleeping makes both -- lower iops and larger latencies. Partially this is connected to the previous issue -- if having few fibers the sleep timer needs to tick more often than the reactor allows (the extreme case on one fiber requires 8us ticking). If having many fibers so that each ticks at some reasonable rate and generating a uniform requests distribution would still make reactor poll too often (125k IOPS -> 8us per request). Reactor cannot do it, one of the reasons is that polling empty smp queues takes ~5% of CPU (issues #986, #987) c. read fibers' busy-loop future is return do_until( [] { return time_is_now(); }, [] { return later(); } ) but the later() is open-coded to create the task in the workload's sched group (existing later() implementation uses default_sched_group). Using default later() affects the latency the bad way (issue #990) Nearest TODO: - Capacity management and shares accounting use differently scaled values (the shares accounting uses ~750 times smaller ones). Keep only one. - Add the metrics to show K's for different classes. This needs previous task to be done first. - Keep the capacity directly on the fair_queue_entry. Now there's the ticket sitting it it and it's converted to capacity on demand. tests: io_tester(dev) manual.rl-sched(dev) unit(dev) " * 'br-rate-limited-scheduling-4' of https://github.com/xemul/seastar: (22 commits) tests: Add rl-iosched manual test fair_queue: Add debug prints with request capacities reactor: Make rate factor configurable io_queue, fair_queue: Remove paces io_queue: Calculate max ticket from fq fair_queue: Linearize accumulators fair_queue: Rate-limiter based scheduling fair_queue: Rename maximum_capacity into shares_capacity fair_queue: Rename bits req_count,bytes_count -> weight,size fair_queue: Make group config creation without constructor fair_queue: Swap and constify max capacity fair_queue: Introduce ticket capacity fair_queue: Add fetch_add helpers fair_queue: Replace group rover type with ticket fair_queue: Make maybe_ahead_of() non-method helper fair_queue: Dont export head value from group fair_queue: Replace pending.orig_tail with head io_queue: Configure queue in terms of blocks io_tester: Indentation fix io_tester: Introduce polling_sleep option ...

xemul self-assigned this Dec 23, 2021

xemul mentioned this issue Dec 23, 2021

Revise reactor polling overhead #987

Open

xemul mentioned this issue Jan 10, 2022

idle cpu consumption >40% #996

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reactor spends too much time polling empty smp queues #986

Reactor spends too much time polling empty smp queues #986

xemul commented Dec 23, 2021