
Added stub seastarfs file implementation #16

Merged: 3 commits merged into zpp_fs from stub-file-impl on Nov 20, 2019
Conversation

rokinsky

see #4

@rokinsky rokinsky added fs ZPP: FS project zpp ZPP: student project labels Nov 17, 2019
@rokinsky rokinsky force-pushed the stub-file-impl branch 3 times, most recently from b0d1b5a to 73fdda4 Compare November 17, 2019 19:34
@rokinsky rokinsky requested review from psarna and varqox November 18, 2019 16:42
@psarna
Owner

psarna commented Nov 19, 2019

Looks good - one technical remark: for the sake of making it easier to merge this branch into master in the future, try to confine your changes to the fs/ directory (although tests can obviously go to /tests). For now it's fine for the temporary file util to reside in fs/ as well.

By the way, we already have a file_desc::temporary utility that does a similar thing, but it doesn't expose a path to the file - optionally you could also try adding this functionality, but again, it's even better to have a separate util in the fs/ directory to make our branch as independent as possible.
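A minimal sketch of such a path-exposing temporary-file util (plain POSIX mkstemp with hypothetical names - not the actual seastar or file_desc::temporary API) might look like:

```cpp
// Hypothetical temporary-file utility that, unlike file_desc::temporary,
// also exposes the generated path. Illustrative only.
#include <stdlib.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

class temporary_file {
    std::string _path;
    int _fd;
public:
    explicit temporary_file(std::string dir)
        : _path(std::move(dir) + "/tmpfile-XXXXXX") {
        // mkstemp replaces the XXXXXX suffix in place and opens the file
        _fd = ::mkstemp(_path.data());
        if (_fd < 0) {
            throw std::runtime_error("mkstemp failed");
        }
    }
    ~temporary_file() {
        ::close(_fd);
        ::unlink(_path.c_str()); // remove the file on destruction
    }
    // the part file_desc::temporary doesn't provide: the path itself
    const std::string& path() const noexcept { return _path; }
    int fd() const noexcept { return _fd; }
};
```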

Also, io_priority_class is an important bit of seastar's I/O API, so it should be passed to the block device as well (I'll make the same comment in #11, where it needs to be introduced first).

@rokinsky
Author

> For now it's fine for the temporary file util to reside in fs/ as well.

We have moved the temporary_file header in #11.

> Also, io_priority_class is an important bit of seastar's I/O API, so it should be passed to the block device as well (I'll make the same comment in #11, where it needs to be introduced first).

Done.

@psarna psarna merged commit e8a660e into zpp_fs Nov 20, 2019
@varqox varqox deleted the stub-file-impl branch December 28, 2019 22:33
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
* Improved metadata

* Added coroutine to the code

* Fixed things

* Added a whole lot of changes I didn't segment

* Remove unnecessary lines

* Removed debug code

* Readded WError flag

* Removed chrono

* Removed unnecessary includes

* Removed debug code

* Fixed formatting

* Added safety semaphore

* Removed unnecessary space
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
This commit consists of many squashed parts all building
the complete kafka producer.

kafka: Implement kafka_producer and demo app

kafka: Implement kafka_connection class

Implement kafka_connection class which provides functionality
to send Kafka protocol messages without manual serialization or
deserialization.

Add support for API versioning: kafka_connection asks the broker for
ApiVersions and serializes/deserializes subsequent messages based on the
broker's supported versions.

kafka: Implement tcp_connection timeout

Added timeout functionality to tcp_connection in connect, read, write.
Updated kafka_connection and kafka_producer with timeout support.

kafka: Implement tcp_connection_exception

Implement tcp_connection_exception which is thrown when connection is
disconnected before reading the response.

kafka: Implement exception translation

In the case of a network or parsing error, send now catches the
exception, translates it into the appropriate error code, and returns a
response message with this code.

Add Connection Manager class

This change adds a connection manager unit to keep track of open connections
and allow fast access by the producer. With that, we can now manage
multiple brokers and read their addresses from metadata requests.
It also introduces a dummy partitioner for further development.

Connection Pool + Multibroker demo

kafka: Implement error codes

Implement classes to support parsing of
error codes and list of error codes.

kafka: Implement retry_helper

Implement retry_helper: a future utility class which allows retries of
futures. Retry count is capped and controlled by constructor argument.
Retries are started with backoff time, using exponential backoff with
jitter strategy.
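The retry strategy described above can be sketched in plain C++ (illustrative names, not seastar's actual retry_helper; "full jitter" here means picking a uniformly random delay up to the capped exponential value):

```cpp
// Sketch of exponential backoff with full jitter, as described above.
// Hypothetical helper, not the real retry_helper class.
#include <algorithm>
#include <chrono>
#include <random>

// Delay before the n-th retry (retry starts at 0):
// base * 2^retry, capped at max_backoff, then jittered uniformly in [0, cap].
std::chrono::milliseconds backoff_with_jitter(unsigned retry,
                                              std::chrono::milliseconds base,
                                              std::chrono::milliseconds max_backoff,
                                              std::mt19937& rng) {
    long long exp = base.count() << std::min(retry, 20u);        // exponential growth
    long long capped = std::min<long long>(exp, max_backoff.count());
    std::uniform_int_distribution<long long> dist(0, capped);    // full jitter
    return std::chrono::milliseconds(dist(rng));
}
```

Jitter spreads out retries from many clients so they do not hammer the broker in lockstep after a shared failure.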

kafka: Implement metadata_manager

Added metadata_manager class.

kafka: Allow multiple in-flight messages

This change allows kafka_connection to be used in a non-sequential way,
allowing multiple messages to be "queued" without waiting for the previous
one to finish.

Semaphores with count=1 are used to implement queuing of sends and
receives, taking advantage of semaphore FIFO functionality. It is
crucial to preserve the ordering, so that receive reads the correct
response.
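The count=1 FIFO semaphore idea can be sketched with a ticket scheme in plain C++ threads (illustrative, not seastar's semaphore implementation): waiters are served strictly in arrival order, which is what keeps each receive matched to the right response.

```cpp
// Sketch of a count=1 semaphore with FIFO wakeup order, using the
// classic ticket scheme. Hypothetical stand-in for seastar's semaphore.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class fifo_semaphore {
    std::mutex _m;
    std::condition_variable _cv;
    uint64_t _next_ticket = 0;   // ticket handed to each arriving waiter
    uint64_t _serving = 0;       // ticket currently allowed to proceed
public:
    void wait() {
        std::unique_lock<std::mutex> lk(_m);
        uint64_t my = _next_ticket++;                    // take a place in line
        _cv.wait(lk, [&] { return _serving == my; });    // strict FIFO order
    }
    void signal() {
        std::lock_guard<std::mutex> lk(_m);
        ++_serving;              // release the next waiter in arrival order
        _cv.notify_all();
    }
};
```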

kafka: Non-sequential use of connection_manager

The change allows connection_manager to be used in a non-sequential manner.

kafka: Implement batching

Implement batching of messages with initial error handling (via retries).
A manual flush is currently needed to send the messages in the producer
(no automatic flushing).

kafka: E2E tests

Implement "framework" for E2E Kafka producer tests and a few basic
producer tests.

Introduces partitioner abstract base class while preserving old behav… (psarna#17)

* Introduces partitioner abstract base class while preserving old behaviour and existing partitioner class. Introduces round robin partitioner as derived class.

* Replaced partitioner* type with std::unique_ptr<partitioner> inside producer.
Replaced atomic uint with normal uint.

* Added hashing in case of non-zero length key.

Zpp kafka metadata enrichment (psarna#16)

* Improved metadata

* Added coroutine to the code

* Fixed things

* Added a whole lot of changes I didn't segment

* Remove unnecessary lines

* Removed debug code

* Readded WError flag

* Removed chrono

* Removed unnecessary includes

* Removed debug code

* Fixed formatting

* Added safety semaphore

* Removed unnecessary space

Add producer properties as a way for the kafka producer (psarna#21)

to be customizable by the user.

This change ties together various producer parameters into
a single properties object, easily customizable by the user.
Among the fields are parameters such as timeouts, backoff
strategies, and partitioners, as well as some that might be
introduced in the future (like acks).
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
Group commit for kafka producer consisting of many
smaller ones squashed together

kafka: Local Kafka deployment for development testing.

Implemented a Terraform configuration to provision a Kafka cluster locally using Docker. The network IP address and the number of Kafka brokers are configurable by the user. I decided to write the Terraform configuration from scratch, as the already available (Docker Compose) solution was incorrect. I chose Terraform as it could also be used in the future to model other test deployments (e.g. AWS).

Kafka TCP connection class

This change adds a basic, low level network connection management
for the Kafka client. Later, this should be encapsulated in a TCP Client class
to work on the Kafka messages abstraction.
So far the tests are also somewhat makeshift - ideally, they should be separated
into both unit and integration tests, with the former not relying on existing
Kafka brokers.

kafka: Primitives, ApiVersions, Metadata parsing

Implement classes for parsing basic Kafka protocol primitives,
as well as ApiVersions and Metadata request/response messages.

kafka: Varint parsing

Implement 32-bit zigzag varint parsing, as defined in the Kafka
protocol specification.
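The decoding described above can be sketched as follows (illustrative names; the scheme is the standard 32-bit zigzag varint: 7 payload bits per byte, high bit as continuation, then an LSB-sign zigzag unmapping):

```cpp
// Sketch of 32-bit zigzag varint decoding as used by the Kafka protocol.
// Function name and error handling are illustrative.
#include <cstddef>
#include <cstdint>
#include <optional>

// Decode one unsigned varint from buf, then undo the zigzag mapping
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...). Returns nullopt on truncation.
std::optional<int32_t> decode_zigzag_varint(const uint8_t* buf, size_t len,
                                            size_t& consumed) {
    uint32_t raw = 0;
    for (size_t i = 0; i < len && i < 5; ++i) {
        raw |= uint32_t(buf[i] & 0x7f) << (7 * i);   // 7 payload bits per byte
        if (!(buf[i] & 0x80)) {                      // high bit clear: last byte
            consumed = i + 1;
            // zigzag decode: the least significant bit holds the sign
            return int32_t(raw >> 1) ^ -int32_t(raw & 1);
        }
    }
    return std::nullopt;   // ran out of bytes mid-varint
}
```

For example, the byte 0x01 decodes to -1 and the pair 0xAC 0x02 decodes to 150.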

kafka: RECORDS type structures and record parsing

Implement structures storing RECORDS type (as defined in Kafka protocol
specification) and parsing of a single record (only version 2).

kafka: Parsing of records batch

Implement parsing of records batch from RECORDS Kafka protocol
type (version 2).

kafka: Parsing of RECORDS type

Implemented parsing of RECORDS type (version 2), as defined in the
Kafka protocol specification.

kafka: Produce request parsing

Implement parsing of Produce request from Kafka protocol.

kafka: Produce response parsing

Implement parsing of Produce response from Kafka protocol.

@StarostaGit StarostaGit mentioned this pull request Mar 20, 2020
psarna pushed a commit that referenced this pull request Jun 8, 2021
…o_with

Fixes failures in debug mode:
```
$ build/debug/tests/unit/closeable_test -l all -t deferred_close_test
WARNING: debug mode. Not for benchmarking or production
random-seed=3064133628
Running 1 test case...
Entering test module "../../tests/unit/closeable_test.cc"
../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test"
../../src/testing/seastar_test.cc(43): info: check true has passed
==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
terminate called after throwing an instance of 'seastar::broken_promise'
  what():  broken promise
==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592)
False positive error reports may follow
For details see google/sanitizers#189
=================================================================
==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
    #0 0x7fbf45f39d0b  (/lib64/libasan.so.6+0xb3d0b)
    #1 0x7fbf45f57d4e  (/lib64/libasan.so.6+0xd1d4e)
    #2 0x7fbf45f3e724  (/lib64/libasan.so.6+0xb8724)
    #3 0x7fbf45eb3e5b  (/lib64/libasan.so.6+0x2de5b)
    #4 0x7fbf45eb51e8  (/lib64/libasan.so.6+0x2f1e8)
    #5 0x7fbf45eb7694  (/lib64/libasan.so.6+0x31694)
    #6 0x7fbf45f39398  (/lib64/libasan.so.6+0xb3398)
    #7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b)
    #8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116
    #9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121
    #10 0x7fbf44bd1810 in _Unwind_Find_FDE (/lib64/libgcc_s.so.1+0x13810)
    #11 0x7fbf44bcd897  (/lib64/libgcc_s.so.1+0xf897)
    #12 0x7fbf44bcea5f  (/lib64/libgcc_s.so.1+0x10a5f)
    #13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8)
    #14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148
    #15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb)
    #16 0x7fbf45e10a21  (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21)
    #17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0)
    #18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094)
    #19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921)
    #20 0x7fbf45e5eae1  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1)
    #21 0x7fbf45e5ed31  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31)
    #22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547)
    #23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618)
    #24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77
    #25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65
    #26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1)
    #27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210406100911.12278-1-bhalevy@scylladb.com>
psarna pushed a commit that referenced this pull request Mar 1, 2022
"
There are two limits that iotune collects from the disk -- the bandwidth
and the IOPS limit (both separately for read and write, but this separation
is not important here). Next, the scheduler makes sure that both of these
limits are respected when dispatching I/O to the disk. The latter is
maintained by summing up the number of executing requests and their lengths,
and stopping the dispatch once the sum hits the respective limit.

This algorithm assumes that the disk is linear in all dimensions, in
particular that reads and writes are equal in terms of affecting each other,
while this is not so. A recent study showed that disks manage mixed
workloads differently, and linear summing can overwhelm the disk.

The solution is based on the observation that throttling reads and writes
somewhat below the collected maximum throughput/IOPS helps keep the
latencies in bounds.

This set replaces the current ticket-based capacity management with a
rate-limited one. The rate-limiting formula is:

  bw_r / max_bw_r + iops_r / max_iops_r + { same for _w }  <= 1.0
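The formula can be sketched as a small admission check (illustrative names and structure, not the actual fair_queue/io_queue code):

```cpp
// Sketch of the rate-limiting formula: each workload's cost is its
// normalized share of the bandwidth and IOPS limits, and dispatch stops
// once the combined sum reaches 1.0. Names are illustrative.

struct disk_limits {
    double max_bw;    // bytes per second, as collected by iotune
    double max_iops;  // requests per second, as collected by iotune
};

// K coefficient of a workload doing `bw` bytes/s and `iops` requests/s
double k_coefficient(double bw, double iops, const disk_limits& d) {
    return bw / d.max_bw + iops / d.max_iops;
}

// Dispatch is allowed while the combined read+write coefficient is <= 1.0
bool can_dispatch(double k_read, double k_write) {
    return k_read + k_write <= 1.0;
}
```

With these definitions, a workload configured for 50% of the disk's IOPS shows K = 0.5 plus its bandwidth share, matching the "0.5 + the bandwidth tax" reading of the numbers below.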

To make this happen first patches encapsulate the capacity management inside
io-group. Patch #16 adds the rate-limiter itself. The last patch is the test
that shows how rate-limited reads compete with unbound writes.

First, the read-only workload is run on its own:

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
        514536      128634    1110.2   262.1     465.3

the scheduler coefficients are

     K_bandwidth = 0.231  K_iops = 0.499    K = 0.729

The workload is configured for 50% of the disk read IOPS, and this is what
it shows. The K value is 0.5 + the bandwidth tax.

Next, the very same read workload is run in parallel with unbound writes.
Shares are 100 for write and 2500 for read.

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
w:       226349       3536  118357.4 26726.8     397.4
r:       500465     125116    2134.4  1182.7     321.7

the scheduler coefficients are

write: K_bandwidth = 0.226 K_iops = 0.020    K = 0.246
read:  K_bandwidth = 0.224 K_iops = 0.485    K = 0.709

Comments:

1. K_read / K_write is not 2500 / 100 because reads do not need that much.
   Changing reads' shares will hit the wake-up latency (see below)

2. read 1.1ms queued time comes from two places:

   First, the queued request waits ~1 tick until io-queue is polled

   Second, when the queue tries to submit the request it most likely
   hits the rate limit and needs to wait at least 1 more tick until some
   other request completes and replenisher releases its capacity back

3. read lat95 of 2ms is the best out of several runs. A more typical one is
   around 6ms and higher. This latency = queued + executed + wakeup times.
   In the "best" case the wakeup time is thus 0.7ms; in the "more typical" one
   it's ~4ms, with no good explanation of where it comes from (issue scylladb#989)

   These results are achieved with several in-test tricks that mitigated
   some problems that seem to come from CPU scheduler and/or reactor loop:

   a. the read workload needs the exact amount of fibers so that the CPU
      scheduler resonates with the io-scheduler shares. Fewer fibers
      under-generate requests; more fibers cause much larger wakeup
      latencies (issue scylladb#988)

   b. read fibers busy-loop rather than sleep between submitting requests.
      Sleeping causes both lower IOPS and larger latencies. This is partially
      connected to the previous issue -- with few fibers, the sleep timer
      needs to tick more often than the reactor allows (the extreme case of
      one fiber requires 8us ticking). With many fibers, so that each ticks
      at some reasonable rate while generating a uniform request
      distribution, the reactor would still have to poll too often (125k
      IOPS -> 8us per request). The reactor cannot do that; one of the
      reasons is that polling empty smp queues takes ~5% of CPU (issues
      scylladb#986, scylladb#987)

   c. the read fibers' busy-loop future is

            return do_until(
                [] { return time_is_now(); },
                [] { return later(); }
            );

      but the later() is open-coded to create the task in the workload's
      sched group (the existing later() implementation uses default_sched_group).
      Using the default later() affects latency badly (issue scylladb#990)

Nearest TODO:

- Capacity management and shares accounting use differently scaled values
  (the shares accounting uses ~750 times smaller ones). Keep only one.
- Add the metrics to show K's for different classes. This needs previous
  task to be done first.
- Keep the capacity directly on the fair_queue_entry. Now there's a
  ticket sitting in it, and it's converted to capacity on demand.

tests: io_tester(dev)
       manual.rl-sched(dev)
       unit(dev)
"

* 'br-rate-limited-scheduling-4' of https://github.com/xemul/seastar: (22 commits)
  tests: Add rl-iosched manual test
  fair_queue: Add debug prints with request capacities
  reactor: Make rate factor configurable
  io_queue, fair_queue: Remove paces
  io_queue: Calculate max ticket from fq
  fair_queue: Linearize accumulators
  fair_queue: Rate-limiter based scheduling
  fair_queue: Rename maximum_capacity into shares_capacity
  fair_queue: Rename bits req_count,bytes_count -> weight,size
  fair_queue: Make group config creation without constructor
  fair_queue: Swap and constify max capacity
  fair_queue: Introduce ticket capacity
  fair_queue: Add fetch_add helpers
  fair_queue: Replace group rover type with ticket
  fair_queue: Make maybe_ahead_of() non-method helper
  fair_queue: Dont export head value from group
  fair_queue: Replace pending.orig_tail with head
  io_queue: Configure queue in terms of blocks
  io_tester: Indentation fix
  io_tester: Introduce polling_sleep option
  ...