Sched-group shares vs IO-class shares #988

Closed
xemul opened this issue Dec 23, 2021 · 1 comment
xemul commented Dec 23, 2021

Right now scylla (and io-tester too) configures different "workloads" with equal shares set on both the sched group and the IO class. This, however, sometimes results in a hard-to-predict interaction. In particular, two different IO workloads are naturally expected to share disk capacity according to their shares, but the CPU scheduler doesn't always let read and write fibers run at the desired rate.

This came up with the following io-tester script:

- data_size: 8192MB
  name: read_rated
  shard_info:
    parallelism: 16
    reqsize: 4kB
    rps: 251
    shares: 2500
  shards: all
  type: randread
- data_size: 8192MB
  name: write_unbound
  shard_info:
    parallelism: 2
    reqsize: 64kB
    shares: 100
  shards: all
  type: seqwrite

It turns out to be very sensitive to the number of reading fibers.
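
For context, a workload ends up with two independent share knobs that are set separately, which is what makes the interaction hard to predict. A minimal sketch of that setup (create_scheduling_group() is real Seastar API; the io-class registration call is hypothetical, as that API has varied between Seastar versions):

    #include <seastar/core/scheduling.hh>
    #include <seastar/core/future.hh>

    // Sketch: the CPU scheduler and the IO scheduler each get their own
    // shares value for the same logical workload; nothing couples them.
    seastar::future<> setup_read_workload() {
        // CPU shares for the read fibers (real API)
        return seastar::create_scheduling_group("read_rated", 2500).then(
            [] (seastar::scheduling_group sg) {
                // read fibers would be started under sg here; IO shares
                // are configured independently (call name hypothetical,
                // the exact io-class API differs across Seastar versions):
                //   auto pc = register_io_class("read_rated", 2500);
                return seastar::make_ready_future<>();
            });
    }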

xemul self-assigned this Dec 23, 2021
avikivity added a commit that referenced this issue Dec 26, 2021
"
There are 2 limitations that iotune collects from the disk -- the bandwidth
and the iops limit (both separately for read and write, but this separation
is not important here). Next, the scheduler makes sure that both of these
limits are respected when dispatching IO to the disk. The latter is
maintained by summing up the number of requests executing and their lengths,
and stopping the dispatch once the sum hits the respective limit.

This algo assumes that the disk is linear in all dimensions, in particular
that reads and writes are equal in terms of affecting each other, while
this is not so. A recent study showed that disks manage mixed workloads
differently, and linear summing overwhelms the disk.

The solution is based on the observation that throttling reads and writes
somewhat below the collected maximum throughputs/iops helps keep the
latencies in bounds.

This set replaces the current ticket-based capacity management with a
rate-limited one. The rate-limiting formula is

  bw_r / max_bw_r + iops_r / max_iops_r + { same for _w }  <= 1.0
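
In code, the admission test behind this formula is roughly the following (a minimal sketch with hypothetical names, not the actual fair_queue implementation):

    struct rate_limit {
        double max_bw_r, max_iops_r;   // read limits collected by iotune
        double max_bw_w, max_iops_w;   // write limits collected by iotune

        // True if the currently dispatched requests, normalized against
        // the per-dimension maxima, still fit under the shared 1.0 budget.
        bool can_dispatch(double bw_r, double iops_r,
                          double bw_w, double iops_w) const {
            return bw_r / max_bw_r + iops_r / max_iops_r
                 + bw_w / max_bw_w + iops_w / max_iops_w <= 1.0;
        }
    };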

To make this happen, the first patches encapsulate the capacity management
inside io-group. Patch #16 adds the rate-limiter itself. The last patch is
a test that shows how rate-limited reads compete with unbound writes.

First, the read-only workload is run on its own:

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
        514536      128634    1110.2   262.1     465.3

the scheduler coefficients are

     K_bandwidth = 0.231  K_iops = 0.499    K = 0.729

The workload is configured for 50% of the disk read IOPS, and this is what
it shows. The K value is 0.5 + the bandwidth tax.
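
As a back-of-the-envelope sanity check, the coefficients above imply the iotune-collected maxima for this disk:

     max_iops_r ~ 128634 / 0.499 ~ 258k iops
     max_bw_r   ~ 514536 / 0.231 ~ 2.2 GB/s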

Next, the very same read workload is run in parallel with unbound writes.
Shares are 100 for write and 2500 for read.

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
w:       226349       3536  118357.4 26726.8     397.4
r:       500465     125116    2134.4  1182.7     321.7

the scheduler coefficients are

write: K_bandwidth = 0.226 K_iops = 0.020    K = 0.246
read:  K_bandwidth = 0.224 K_iops = 0.485    K = 0.709

Comments:

1. K_read / K_write is not 2500 / 100 because reads do not need that much
   (0.709 / 0.246 ~ 2.9, not 25). Changing reads' shares would hit the
   wake-up latency (see below)

2. the read 1.1ms queued time comes from two places:

   First, the queued request waits ~1 tick until the io-queue is polled.

   Second, when the queue tries to submit the request, it most likely
   hits the rate limit and needs to wait at least 1 more tick until some
   other request completes and the replenisher releases its capacity back.

3. the read lat95 of 2ms is the best out of several runs; a more typical
   one is around 6ms and higher. This latency = queued + executed + wakeup
   times. In the "best" case the wakeup time is thus ~0.7ms
   (2134.4 - 1182.7 - 321.7 ~ 630us); in the "more typical" one it's ~4ms,
   with no good explanation of where it comes from (issue #989)

   These results are achieved with several in-test tricks that mitigate
   some problems that seem to come from the CPU scheduler and/or the
   reactor loop:

   a. the read workload needs the exact amount of fibers so that the CPU
      scheduler resonates with the io-scheduler shares. Fewer fibers
      under-generate requests; more fibers cause much larger wakeup
      latencies (issue #988)

   b. the read fibers busy-loop, rather than sleep, between submitting
      requests. Sleeping causes both lower iops and larger latencies.
      Partially this is connected to the previous issue: with few fibers,
      the sleep timer needs to tick more often than the reactor allows
      (the extreme case of one fiber requires 8us ticking). With many
      fibers, so that each ticks at some reasonable rate while generating
      a uniform request distribution, the reactor would still have to poll
      too often (125k IOPS -> 8us per request). The reactor cannot do it;
      one of the reasons is that polling empty smp queues takes ~5% of
      CPU (issues #986, #987)

   c. the read fibers' busy-loop future is

            return do_until(
                [] { return time_is_now(); },
                [] { return later(); }
            );

      but later() is open-coded to create the task in the workload's
      sched group (the existing later() implementation uses
      default_sched_group). Using the default later() affects the latency
      badly (issue #990); see the sketch after this list.
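
Such an open-coded later() can be sketched as follows, assuming Seastar's make_task()/schedule() helpers (a sketch only; the actual in-test code may differ):

    // Schedule a no-op task in the given scheduling group, so the
    // continuation resumes in the workload's group instead of the
    // default one as the stock later() does.
    seastar::future<> later_in(seastar::scheduling_group sg) {
        seastar::promise<> p;
        auto f = p.get_future();
        seastar::schedule(seastar::make_task(sg, [p = std::move(p)] () mutable {
            p.set_value();
        }));
        return f;
    }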

Nearest TODO:

- Capacity management and shares accounting use differently scaled values
  (the shares accounting uses ~750 times smaller ones). Keep only one.
- Add metrics to show the K's for different classes. This needs the
  previous task to be done first.
- Keep the capacity directly on the fair_queue_entry. Now there's the
  ticket sitting in it, and it's converted to capacity on demand.

tests: io_tester(dev)
       manual.rl-sched(dev)
       unit(dev)
"

* 'br-rate-limited-scheduling-4' of https://github.com/xemul/seastar: (22 commits)
  tests: Add rl-iosched manual test
  fair_queue: Add debug prints with request capacities
  reactor: Make rate factor configurable
  io_queue, fair_queue: Remove paces
  io_queue: Calculate max ticket from fq
  fair_queue: Linearize accumulators
  fair_queue: Rate-limiter based scheduling
  fair_queue: Rename maximum_capacity into shares_capacity
  fair_queue: Rename bits req_count,bytes_count -> weight,size
  fair_queue: Make group config creation without constructor
  fair_queue: Swap and constify max capacity
  fair_queue: Introduce ticket capacity
  fair_queue: Add fetch_add helpers
  fair_queue: Replace group rover type with ticket
  fair_queue: Make maybe_ahead_of() non-method helper
  fair_queue: Dont export head value from group
  fair_queue: Replace pending.orig_tail with head
  io_queue: Configure queue in terms of blocks
  io_tester: Indentation fix
  io_tester: Introduce polling_sleep option
  ...

xemul commented Jul 21, 2023

obsoleted by f94b1bb

xemul closed this as completed Jul 21, 2023