
Added stub seastarfs file implementation #16

Merged: 3 commits merged into zpp_fs from stub-file-impl on Nov 20, 2019
Conversation

rokinsky

see #4

@rokinsky rokinsky added fs ZPP: FS project zpp ZPP: student project labels Nov 17, 2019
@rokinsky rokinsky force-pushed the stub-file-impl branch 3 times, most recently from b0d1b5a to 73fdda4 Compare November 17, 2019 19:34
@rokinsky rokinsky requested review from psarna and varqox November 18, 2019 16:42
@psarna
Owner

psarna commented Nov 19, 2019

Looks good - one technical remark: for the sake of making it easier to merge this branch into master in the future, try to confine your changes to the fs/ directory (although tests can obviously go to /tests). For now it's fine for the temporary file util to reside in fs/ as well.

By the way, we already have a file_desc::temporary utility that does a similar thing, but it doesn't expose a path to the file - optionally you could also try adding this functionality, but again, it's even better to have a separate util in the fs/ directory to make our branch as independent as possible.
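A minimal sketch of such a path-exposing temporary-file util (plain POSIX mkstemp with hypothetical names - not the actual seastar or file_desc::temporary API) might look like:

```cpp
// Hypothetical temporary-file utility that, unlike file_desc::temporary,
// also exposes the generated path. Illustrative only.
#include <stdlib.h>
#include <unistd.h>
#include <stdexcept>
#include <string>

class temporary_file {
    std::string _path;
    int _fd;
public:
    explicit temporary_file(std::string dir)
        : _path(std::move(dir) + "/tmpfile-XXXXXX") {
        // mkstemp replaces the XXXXXX suffix in place and opens the file
        _fd = ::mkstemp(_path.data());
        if (_fd < 0) {
            throw std::runtime_error("mkstemp failed");
        }
    }
    ~temporary_file() {
        ::close(_fd);
        ::unlink(_path.c_str()); // remove the file on destruction
    }
    // the part file_desc::temporary doesn't provide: the path itself
    const std::string& path() const noexcept { return _path; }
    int fd() const noexcept { return _fd; }
};
```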

Also, io_priority_class is an important bit of seastar's I/O API, so it should be passed to the block device as well (I'll make the same comment in #11, where it needs to be introduced first).

@rokinsky
Author

> For now it's fine for the temporary file util to reside in fs/ as well.

We have moved the temporary_file header in #11.

> Also, io_priority_class is an important bit of seastar's I/O API, so it should be passed to the block device as well (I'll make the same comment in #11, where it needs to be introduced first).

Done.

@psarna psarna merged commit e8a660e into zpp_fs Nov 20, 2019
@varqox varqox deleted the stub-file-impl branch December 28, 2019 22:33
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
* Improved metadata

* Added coroutine to the code

* Fixed things

* Added a whole lot of changes I didn't segment

* Remove unnecessary lines

* Removed debug code

* Readded WError flag

* Removed chrono

* Removed unnecessary includes

* Removed debug code

* Fixed formatting

* Added safety semaphore

* Removed unnecessary space
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
This commit consists of many squashed parts all building
the complete kafka producer.

kafka: Implement kafka_producer and demo app

kafka: Implement kafka_connection class

Implement kafka_connection class which provides functionality
to send Kafka protocol messages without manual serialization or
deserialization.

Add support for API versioning: kafka_connection asks the broker for
ApiVersions and serializes/deserializes subsequent messages based on the
broker's supported versions.

kafka: Implement tcp_connection timeout

Added timeout functionality to tcp_connection in connect, read, write.
Updated kafka_connection and kafka_producer with timeout support.

kafka: Implement tcp_connection_exception

Implement tcp_connection_exception which is thrown when connection is
disconnected before reading the response.

kafka: Implement exception translation

In the case of a network or parsing error, send now catches the
exception, translates it into the appropriate error code, and returns a
response message with this code.

Add Connection Manager class

This change adds a connection manager unit to keep track of open connections
and allow fast access by the producer. With that, we can now manage
multiple brokers and read their addresses from metadata requests.
It also introduces a dummy partitioner for further development.

Connection Pool + Multibroker demo

kafka: Implement error codes

Implement classes to support parsing of
error codes and list of error codes.

kafka: Implement retry_helper

Implement retry_helper: a future utility class which allows retries of
futures. Retry count is capped and controlled by constructor argument.
Retries are started with backoff time, using exponential backoff with
jitter strategy.
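The retry strategy described above can be sketched in plain C++ (illustrative names, not seastar's actual retry_helper; "full jitter" here means picking a uniformly random delay up to the capped exponential value):

```cpp
// Sketch of exponential backoff with full jitter, as described above.
// Hypothetical helper, not the real retry_helper class.
#include <algorithm>
#include <chrono>
#include <random>

// Delay before the n-th retry (retry starts at 0):
// base * 2^retry, capped at max_backoff, then jittered uniformly in [0, cap].
std::chrono::milliseconds backoff_with_jitter(unsigned retry,
                                              std::chrono::milliseconds base,
                                              std::chrono::milliseconds max_backoff,
                                              std::mt19937& rng) {
    long long exp = base.count() << std::min(retry, 20u);        // exponential growth
    long long capped = std::min<long long>(exp, max_backoff.count());
    std::uniform_int_distribution<long long> dist(0, capped);    // full jitter
    return std::chrono::milliseconds(dist(rng));
}
```

Jitter spreads out retries from many clients so they do not hammer the broker in lockstep after a shared failure.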

kafka: Implement metadata_manager

Added metadata_manager class.

kafka: Allow multiple in-flight messages

This change allows kafka_connection to be used in a non-sequential way,
allowing multiple messages to be "queued" without waiting for the previous
one to finish.

Semaphores with count=1 are used to implement queuing of sends and
receives, taking advantage of semaphore FIFO functionality. It is
crucial to preserve the ordering, so that receive reads the correct
response.
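The count=1 FIFO semaphore idea can be sketched with a ticket scheme in plain C++ threads (illustrative, not seastar's semaphore implementation): waiters are served strictly in arrival order, which is what keeps each receive matched to the right response.

```cpp
// Sketch of a count=1 semaphore with FIFO wakeup order, using the
// classic ticket scheme. Hypothetical stand-in for seastar's semaphore.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class fifo_semaphore {
    std::mutex _m;
    std::condition_variable _cv;
    uint64_t _next_ticket = 0;   // ticket handed to each arriving waiter
    uint64_t _serving = 0;       // ticket currently allowed to proceed
public:
    void wait() {
        std::unique_lock<std::mutex> lk(_m);
        uint64_t my = _next_ticket++;                    // take a place in line
        _cv.wait(lk, [&] { return _serving == my; });    // strict FIFO order
    }
    void signal() {
        std::lock_guard<std::mutex> lk(_m);
        ++_serving;              // release the next waiter in arrival order
        _cv.notify_all();
    }
};
```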

kafka: Non-sequential use of connection_manager

The change allows connection_manager to be used in a non-sequential manner.

kafka: Implement batching

Implement batching of messages with initial error handling (via retries).
A manual flush is currently needed to send the messages in the producer
(no automatic flushing).

kafka: E2E tests

Implement "framework" for E2E Kafka producer tests and a few basic
producer tests.

Introduces partitioner abstract base class while preserving old behav… (psarna#17)

* Introduces partitioner abstract base class while preserving old behaviour and existing partitioner class. Introduces round robin partitioner as derived class.

* Replaced partitioner* type with std::unique_ptr<partitioner> inside producer.
Replaced atomic uint with normal uint.

* Added hashing in case of non-zero length key.

Zpp kafka metadata enrichment (psarna#16)

* Improved metadata

* Added coroutine to the code

* Fixed things

* Added a whole lot of changes I didn't segment

* Remove unnecessary lines

* Removed debug code

* Readded WError flag

* Removed chrono

* Removed unnecessary includes

* Removed debug code

* Fixed formatting

* Added safety semaphore

* Removed unnecessary space

Add producer properties as a way for the kafka producer (psarna#21)

to be customizable by the user.

This change ties together various producer parameters into
a single properties object, easily customizable by the user.
Among the fields are parameters such as timeouts, backoff
strategies, and partitioners, as well as some that might be
introduced in the future (like acks).
StarostaGit pushed a commit to StarostaGit/seastar that referenced this pull request Mar 20, 2020
Group commit for kafka producer consisting of many
smaller ones squashed together

kafka: Local Kafka deployment for development testing.

Implemented a Terraform configuration to provision a Kafka cluster locally using Docker. The network IP address and the number of Kafka brokers are configurable by the user. I decided to write the Terraform configuration from scratch, as the already available (Docker Compose) solution was incorrect. I chose Terraform as it could also be used in the future to model other test deployments (e.g. AWS).

Kafka TCP connection class

This change adds a basic, low level network connection management
for the Kafka client. Later, this should be encapsulated in a TCP Client class
to work on the Kafka messages abstraction.
So far the tests are also somewhat makeshift - ideally, they should be separated
into both unit and integration tests, with the former not relying on existing
Kafka brokers.

kafka: Primitives, ApiVersions, Metadata parsing

Implement classes for parsing basic Kafka protocol primitives,
as well as ApiVersions and Metadata request/response messages.

kafka: Varint parsing

Implement 32-bit zigzag varint parsing, as defined in the Kafka
protocol specification.
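The decoding described above can be sketched as follows (illustrative names; the scheme is the standard 32-bit zigzag varint: 7 payload bits per byte, high bit as continuation, then an LSB-sign zigzag unmapping):

```cpp
// Sketch of 32-bit zigzag varint decoding as used by the Kafka protocol.
// Function name and error handling are illustrative.
#include <cstddef>
#include <cstdint>
#include <optional>

// Decode one unsigned varint from buf, then undo the zigzag mapping
// (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...). Returns nullopt on truncation.
std::optional<int32_t> decode_zigzag_varint(const uint8_t* buf, size_t len,
                                            size_t& consumed) {
    uint32_t raw = 0;
    for (size_t i = 0; i < len && i < 5; ++i) {
        raw |= uint32_t(buf[i] & 0x7f) << (7 * i);   // 7 payload bits per byte
        if (!(buf[i] & 0x80)) {                      // high bit clear: last byte
            consumed = i + 1;
            // zigzag decode: the least significant bit holds the sign
            return int32_t(raw >> 1) ^ -int32_t(raw & 1);
        }
    }
    return std::nullopt;   // ran out of bytes mid-varint
}
```

For example, the byte 0x01 decodes to -1 and the pair 0xAC 0x02 decodes to 150.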

kafka: RECORDS type structures and record parsing

Implement structures storing RECORDS type (as defined in Kafka protocol
specification) and parsing of a single record (only version 2).

kafka: Parsing of records batch

Implement parsing of records batch from RECORDS Kafka protocol
type (version 2).

kafka: Parsing of RECORDS type

Implemented parsing of RECORDS type (version 2), as defined in the
Kafka protocol specification.

kafka: Produce request parsing

Implement parsing of Produce request from Kafka protocol.

kafka: Produce response parsing

Implement parsing of Produce response from Kafka protocol.

@StarostaGit StarostaGit mentioned this pull request Mar 20, 2020
psarna pushed a commit that referenced this pull request Jun 8, 2021
…o_with

Fixes failures in debug mode:
```
$ build/debug/tests/unit/closeable_test -l all -t deferred_close_test
WARNING: debug mode. Not for benchmarking or production
random-seed=3064133628
Running 1 test case...
Entering test module "../../tests/unit/closeable_test.cc"
../../tests/unit/closeable_test.cc(0): Entering test case "deferred_close_test"
../../src/testing/seastar_test.cc(43): info: check true has passed
==9449==WARNING: ASan doesn't fully support makecontext/swapcontext functions and may produce false positives in some cases!
terminate called after throwing an instance of 'seastar::broken_promise'
  what():  broken promise
==9449==WARNING: ASan is ignoring requested __asan_handle_no_return: stack top: 0x7fbf1f49f000; bottom 0x7fbf40971000; size: 0xffffffffdeb2e000 (-558702592)
False positive error reports may follow
For details see google/sanitizers#189
=================================================================
==9449==AddressSanitizer CHECK failed: ../../../../libsanitizer/asan/asan_thread.cpp:356 "((ptr[0] == kCurrentStackFrameMagic)) != (0)" (0x0, 0x0)
    #0 0x7fbf45f39d0b  (/lib64/libasan.so.6+0xb3d0b)
    #1 0x7fbf45f57d4e  (/lib64/libasan.so.6+0xd1d4e)
    #2 0x7fbf45f3e724  (/lib64/libasan.so.6+0xb8724)
    #3 0x7fbf45eb3e5b  (/lib64/libasan.so.6+0x2de5b)
    #4 0x7fbf45eb51e8  (/lib64/libasan.so.6+0x2f1e8)
    #5 0x7fbf45eb7694  (/lib64/libasan.so.6+0x31694)
    #6 0x7fbf45f39398  (/lib64/libasan.so.6+0xb3398)
    #7 0x7fbf45f3a00b in __asan_report_load8 (/lib64/libasan.so.6+0xb400b)
    #8 0xfe6d52 in bool __gnu_cxx::operator!=<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > >(__gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&, __gnu_cxx::__normal_iterator<dl_phdr_info*, std::vector<dl_phdr_info, std::allocator<dl_phdr_info> > > const&) /usr/include/c++/10/bits/stl_iterator.h:1116
    #9 0xfe615c in dl_iterate_phdr ../../src/core/exception_hacks.cc:121
    #10 0x7fbf44bd1810 in _Unwind_Find_FDE (/lib64/libgcc_s.so.1+0x13810)
    #11 0x7fbf44bcd897  (/lib64/libgcc_s.so.1+0xf897)
    #12 0x7fbf44bcea5f  (/lib64/libgcc_s.so.1+0x10a5f)
    #13 0x7fbf44bcefd8 in _Unwind_RaiseException (/lib64/libgcc_s.so.1+0x10fd8)
    #14 0xfe6281 in _Unwind_RaiseException ../../src/core/exception_hacks.cc:148
    #15 0x7fbf457364bb in __cxa_throw (/lib64/libstdc++.so.6+0xaa4bb)
    #16 0x7fbf45e10a21  (/lib64/libboost_unit_test_framework.so.1.73.0+0x1aa21)
    #17 0x7fbf45e20fe0 in boost::execution_monitor::execute(boost::function<int ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2afe0)
    #18 0x7fbf45e21094 in boost::execution_monitor::vexecute(boost::function<void ()> const&) (/lib64/libboost_unit_test_framework.so.1.73.0+0x2b094)
    #19 0x7fbf45e43921 in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned long) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d921)
    #20 0x7fbf45e5eae1  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68ae1)
    #21 0x7fbf45e5ed31  (/lib64/libboost_unit_test_framework.so.1.73.0+0x68d31)
    #22 0x7fbf45e2e547 in boost::unit_test::framework::run(unsigned long, bool) (/lib64/libboost_unit_test_framework.so.1.73.0+0x38547)
    #23 0x7fbf45e43618 in boost::unit_test::unit_test_main(bool (*)(), int, char**) (/lib64/libboost_unit_test_framework.so.1.73.0+0x4d618)
    #24 0x44798d in seastar::testing::entry_point(int, char**) ../../src/testing/entry_point.cc:77
    #25 0x4134b5 in main ../../include/seastar/testing/seastar_test.hh:65
    #26 0x7fbf44a1b1e1 in __libc_start_main (/lib64/libc.so.6+0x281e1)
    #27 0x4133dd in _start (/home/bhalevy/dev/seastar/build/debug/tests/unit/closeable_test+0x4133dd)
```

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210406100911.12278-1-bhalevy@scylladb.com>
psarna pushed a commit that referenced this pull request Mar 1, 2022
"
There are two limits that iotune collects from the disk -- the bandwidth
and the IOPS limit (both separately for read and write, but this separation
is not important here). Next, the scheduler makes sure that both of these
limits are respected when dispatching I/O to the disk. The latter is
maintained by summing up the number of executing requests and their lengths,
and stopping the dispatch once the sum hits the respective limit.

This algorithm assumes that the disk is linear in all dimensions, in
particular that reads and writes are equal in terms of affecting each other,
while this is not so. A recent study showed that disks manage mixed
workloads differently, and linear summing can overwhelm the disk.

The solution is based on the observation that throttling reads and writes
somewhat below the collected maximum throughput/IOPS helps keep the
latencies in bounds.

This set replaces the current ticket-based capacity management with a
rate-limited one. The rate-limiting formula is:

  bw_r / max_bw_r + iops_r / max_iops_r + { same for _w }  <= 1.0
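The formula can be sketched as a small admission check (illustrative names and structure, not the actual fair_queue/io_queue code):

```cpp
// Sketch of the rate-limiting formula: each workload's cost is its
// normalized share of the bandwidth and IOPS limits, and dispatch stops
// once the combined sum reaches 1.0. Names are illustrative.

struct disk_limits {
    double max_bw;    // bytes per second, as collected by iotune
    double max_iops;  // requests per second, as collected by iotune
};

// K coefficient of a workload doing `bw` bytes/s and `iops` requests/s
double k_coefficient(double bw, double iops, const disk_limits& d) {
    return bw / d.max_bw + iops / d.max_iops;
}

// Dispatch is allowed while the combined read+write coefficient is <= 1.0
bool can_dispatch(double k_read, double k_write) {
    return k_read + k_write <= 1.0;
}
```

With these definitions, a workload configured for 50% of the disk's IOPS shows K = 0.5 plus its bandwidth share, matching the "0.5 + the bandwidth tax" reading of the numbers below.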

To make this happen first patches encapsulate the capacity management inside
io-group. Patch #16 adds the rate-limiter itself. The last patch is the test
that shows how rate-limited reads compete with unbound writes.

First, the read-only workload is run on its own:

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
        514536      128634    1110.2   262.1     465.3

the scheduler coefficients are

     K_bandwidth = 0.231  K_iops = 0.499    K = 0.729

The workload is configured for 50% of the disk read IOPS, and this is what
it shows. The K value is 0.5 + the bandwidth tax.

Next, the very same read workload is run in parallel with unbound writes.
Shares are 100 for write and 2500 for read.

     throughput(kbs)  iops     lat95  queued  executed (microseconds)
w:       226349       3536  118357.4 26726.8     397.4
r:       500465     125116    2134.4  1182.7     321.7

the scheduler coefficients are

write: K_bandwidth = 0.226 K_iops = 0.020    K = 0.246
read:  K_bandwidth = 0.224 K_iops = 0.485    K = 0.709

Comments:

1. K_read / K_write is not 2500 / 100 because reads do not need that much.
   Changing reads' shares will hit the wake-up latency (see below)

2. read 1.1ms queued time comes from two places:

   First, the queued request waits ~1 tick until io-queue is polled

   Second, when the queue tries to submit the request it most likely
   hits the rate limit and needs to wait at least 1 more tick until some
   other request completes and replenisher releases its capacity back

3. read lat95 of 2ms is the best out of several runs. A more typical one is
   around 6ms and higher. This latency = queued + executed + wakeup times.
   In the "best" case the wakeup time is thus 0.7ms; in the "more typical" one
   it's ~4ms, with no good explanation of where it comes from (issue scylladb#989)

   These results are achieved with several in-test tricks that mitigated
   some problems that seem to come from CPU scheduler and/or reactor loop:

   a. the read workload needs the exact amount of fibers so that the CPU
      scheduler resonates with the io-scheduler shares. Fewer fibers
      under-generate requests; more fibers cause much larger wakeup
      latencies (issue scylladb#988)

   b. read fibers busy-loop rather than sleep between submitting requests.
      Sleeping causes both lower IOPS and larger latencies. This is partially
      connected to the previous issue -- with few fibers, the sleep timer
      needs to tick more often than the reactor allows (the extreme case of
      one fiber requires 8us ticking). With many fibers, so that each ticks
      at some reasonable rate while generating a uniform request
      distribution, the reactor would still have to poll too often (125k
      IOPS -> 8us per request). The reactor cannot do that; one of the
      reasons is that polling empty smp queues takes ~5% of CPU (issues
      scylladb#986, scylladb#987)

   c. the read fibers' busy-loop future is

            return do_until(
                [] { return time_is_now(); },
                [] { return later(); }
            );

      but the later() is open-coded to create the task in the workload's
      sched group (the existing later() implementation uses default_sched_group).
      Using the default later() affects latency badly (issue scylladb#990)

Nearest TODO:

- Capacity management and shares accounting use differently scaled values
  (the shares accounting uses ~750 times smaller ones). Keep only one.
- Add the metrics to show K's for different classes. This needs previous
  task to be done first.
- Keep the capacity directly on the fair_queue_entry. Now there's a
  ticket sitting in it, and it's converted to capacity on demand.

tests: io_tester(dev)
       manual.rl-sched(dev)
       unit(dev)
"

* 'br-rate-limited-scheduling-4' of https://github.com/xemul/seastar: (22 commits)
  tests: Add rl-iosched manual test
  fair_queue: Add debug prints with request capacities
  reactor: Make rate factor configurable
  io_queue, fair_queue: Remove paces
  io_queue: Calculate max ticket from fq
  fair_queue: Linearize accumulators
  fair_queue: Rate-limiter based scheduling
  fair_queue: Rename maximum_capacity into shares_capacity
  fair_queue: Rename bits req_count,bytes_count -> weight,size
  fair_queue: Make group config creation without constructor
  fair_queue: Swap and constify max capacity
  fair_queue: Introduce ticket capacity
  fair_queue: Add fetch_add helpers
  fair_queue: Replace group rover type with ticket
  fair_queue: Make maybe_ahead_of() non-method helper
  fair_queue: Dont export head value from group
  fair_queue: Replace pending.orig_tail with head
  io_queue: Configure queue in terms of blocks
  io_tester: Indentation fix
  io_tester: Introduce polling_sleep option
  ...