V Concurrency Considerations #3814
I very very much want to see channels in V! They're one of the best ideas in Go; they're both simple and powerful, which is rare. |
@ylluminate See related discussion: #1868 |
@elimisteve |
@ylluminate @cristian-ilies-vasile @elimisteve thank you for summarizing the high level goals/requirements - they're pretty much what I envision 😉. I'll keep an eye on this to see how everything will evolve. |
An amendment:
should read:
|
@dumblob |
https://golang.org/doc/go1.14#runtime |
@cristian-ilies-vasile that's definitely interesting (see specification), but it's still by far not fully preemptible (therefore it's called asynchronously preemptible, because it actually still just inserts safe points, but now starting from go 1.14 they'll be inserted in many more places but still carefully enough to not increase the overhead much - actually they tuned it so that the overhead is in the majority of cases not measurable - though in some cases the overhead seems to be enormous as seen from this benchmark and its equivalent in plain C which doesn't suffer from any such bottlenecks). It's also good to note here that the implementation of "asynchronously preemptible" in go is really tricky and is definitely not universally applicable (hehe) to other languages (e.g. because of the problem on Windows which they kind of "solved" for go semantics and internals). Though I generally like the go design with "safe points" - I deliberately didn't go into details like this when describing the "interleaving between concurrency and parallelism" as outlined in #1868 (comment) because it's a lot of work with "little benefit" (as you see even go lang tackles preemptiveness - and to date still just partially - only now, after many years of development by hundreds of engineers, many of them full-time go devs). |
Not sure if this adds any value to this conversation but ScyllaDB is built on top of this async framework https://github.com/scylladb/seastar Maybe instead of fibers etc. this kind of library solution could be used and added to vlib? |
Regarding structured concurrency, I would suggest reading https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/ first, which explains reasons to use structured concurrency instead of go-style concurrency. |
@pquentin thanks for the pointer - didn't know about the Trio concurrency library (and the "nursery concept"). It's a very good approach to concurrency programming indeed. In the context of V I have these observations:
|
@pquentin Interesting concept, but preventing goroutines from being spawned without permitting execution of the main thread/goroutine is extremely limiting and would prevent a huge percentage of the value gained from having them in the first place. The argument made against goroutines is honestly a severe straw man, as the normal and common way to achieve a nursery-like pattern in Go is to use a WaitGroup, or to pass a "done channel" ( The fact that general solutions exist in the ecosystem is eventually admitted in the middle of the 4th footnote:
That said, the point about stack traces only going back to the point where the goroutine was launched, rather than going all the way back to |
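For readers less familiar with the pattern being referred to, here is a minimal sketch of the manual "done channel" join in V-like code (assuming current V channel syntax; in Go the equivalent would typically use a `sync.WaitGroup`):

```v
fn worker(id int, done chan int) {
	// ... the actual work would happen here ...
	done <- id // signal completion
}

fn main() {
	done := chan int{cap: 4}
	for i in 0 .. 4 {
		go worker(i, done)
	}
	// manual join: the parent has to remember to wait for exactly as many
	// completions as it spawned - this is the boilerplate a nursery removes
	for _ in 0 .. 4 {
		id := <-done
		println('worker ${id} finished')
	}
}
```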
I disagree. The argument against unstructured Go is not a straw man, just like the arguments against "goto" a generation ago weren't straw men. Enforced structure allows new ways of reasoning about your code, thus enabling you to code the Happy Eyeballs algorithm in 40 lines of code instead of 400. This directly translates to fewer bugs. Yes, cool, the ecosystem has solutions, but if it's way easier to not use them (e.g. because they require a heap of boilerplate in every function call, can't handle cancellation, and whatnot) they're ultimately not helpful. In Trio, "waiting for your child tasks" and "leaving the scope of a nursery" is exactly the same thing and requires zero lines of code (just the end of a block, in Python that's a de-indent), which again helps reduce the bug count and reduces the cognitive load on the programmer. @dumblob The In current V (or Go) you'd use a
with
IMHO that second line shouldn't exist. Not for files, not for locks, and not for nurseries. In Go, starting a goroutine is as easy as it gets – but cleaning it up is not and needs to be done manually, esp when error handling is involved. Making coroutine creation slightly more difficult (by requiring a scoped nursery object to attach them to) is a no-brainer when the dividend you get from this is trivial cleanup.
I don't understand that sentence. |
Exactly that's the reason why I'm curious how would we implement the long running "background" tasks using the nursery concept in V as there are currently no means in V guaranteeing the existence of a proper |
Having no guarantees is detrimental to a whole lot of other constructs too. Locks for instance, or not leaving a file handle open when I'm done with using it. So that's not an argument against nurseries, that's an argument for statically-bounded-visibility of objects – with a destructor that is guaranteed to be called exactly once. As V doesn't seem to have that concept, perhaps the first step should be to add it. |
I don't think it's comparable. I argue, that everything from Regarding files, yes, that's undoubtedly a hole in the safety. On the other hand I argue that open files (be it virtual unix files or any other files) are by far not as detrimental as concurrency issues. @medvednikov what about putting file handling also into In summary, implementing a destructor-like concept in V wouldn't IMHO solve anything (there have been several discussions in this issue tracker and elsewhere - feel free to join them as this thread is about something else). Back to the topic. Because I like the nursery concept, I'm asking for ideas on how to implement the "background" long-running tasks with guarantees, but without introducing |
@dumblob |
The scheduler is pluggable, so it's up to the programmer. If you're asking for defaults, then I'd follow the principle that only good defaults determine success. So I'd say we really want to provide basic scaling ("elasticity") by default, so we should provide a scheduler working moderately well for common scenarios (e.g. average desktop PCs, average laptops, average smartphones, average small/medium servers).
I think the default behavior (as mentioned above) should be without any custom spawning function - the scheduler will first check whether the go routine is "eligible" for elasticity (I envision this through the reentrancy flag) and then will look for every channel (the ring buffer) used in the go routine which is also used in another go routine (thus "connecting" them and forming a dataflow programming pattern - imagine e.g. NoFlo or Node-RED) and then mux (multiplex) the channels (inputs & outputs) accordingly.
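A hypothetical illustration of such channel-connected go routines in today's V - the `transform` stage below is the kind of stateless node an elastic scheduler could later replicate behind the same input/output channels (the reentrancy flag itself does not exist yet; all names are made up):

```v
fn produce(out chan int) {
	for i in 0 .. 100 {
		out <- i
	}
}

// a "pure" stage: no state outside its channels, so additional instances could
// in principle be multiplexed onto the same input/output channels
fn transform(inp chan int, out chan int) {
	for _ in 0 .. 100 {
		out <- (<-inp) * 2
	}
}

fn consume(inp chan int) {
	for _ in 0 .. 100 {
		println(<-inp)
	}
}

fn main() {
	a := chan int{cap: 16}
	b := chan int{cap: 16}
	go produce(a)
	go transform(a, b)
	consume(b)
}
```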
By default I wouldn't do anything (except in debug mode I'd log this information if it wasn't logged before in the last minute or so). |
@dumblob consider a situation where, due to a temporary peak in load, the system cannot keep up, so a channel/buffer/queue is filled to the max. The scheduler decides to launch a new worker, which increases the load even further, and so on and so forth => Things come to a grinding halt. Adding a positive feedback loop should not be done without understanding its limits and the behavior of the system in the degraded cases. I personally think that it cannot be done in the general case. I agree that this kind of elastic scaling may be useful for some cases where you know that the feedback will be limited by external factors, but I think the default should be to do nothing, except maybe log that the buffer was found to be full, and not try to add new consumers automatically. |
This is just one of many situations. The default scheduler should be pretty conservative. There are also many other measures the default (i.e. a simple one) scheduler can (shall) take into account - but in this case it's quite easy as the scheduler knows everything about the whole graph (unlike in many other technologies where one doesn't know the whole state) and connections between nodes - imagine you know the load of all buffers thus the min/max flow between any two nodes (preferably a go routine instance acting as pure source(s) and a go routine instance being a pure sink(s)) is easily observable and thus the whole system perfectly tunable as the underlying graph theory and approximation algorithms are out there. My hint to use full buffers as an important information for the scheduler was meant primarily as a trigger for reevaluating the situation, not as the only measure 😉.
That's the sole reason why reentrant go routines exist. Only reentrant go routines will take part in elasticity - any other instance of a go routine (i.e. a non-reentrant instance) will not get any additionally spawned/stopped instances from the scheduler. So it's again 100% in the hands of the programmer and V in that case doesn't handle the general case you seemed to be afraid of 😉. To clarify, I'm not proposing having a cool conservative scheduler supporting elasticity already in V 1.0. The reentrant go routines can be implemented the very same way as non-reentrant ones (i.e. without elasticity) for half a decade and it'll be fine. What I'm proposing though is the semantics, which I find important for now and the upcoming decades and which thus must be part of V 1.0. |
discord quote |
@cristian-ilies-vasile that's basically how I envision that (though not that simplified 😉) with the only difference that you've split my concept of a general scheduler into two parts - a scheduler and a separate monitor (though I'd rather leave them technically together, as both work on the very same data and splitting them will undoubtedly lead to a performance decrease). |
@dumblob |
Understanding Real-World Concurrency Bugs in Go |
Ok, I'm fine with that 😉. One minor thing, though, is the naming - I'm deliberately trying to not use any terminology resembling cooperative multitasking (e.g. "green thread"), because it's highly misleading (for the future as well as for the current state of things) as it implies many guarantees which the full preemption which I advocate for in V cannot offer. @cristian-ilies-vasile could you rename the Green Thread Monitor to something way more generic? Maybe V Thread Monitor (and V Thread Scheduler for the scheduler API) or so?
Yeah, that's a good source of issues with the Go semantics (especially the behavior of Go channels to be precise). I think V can learn from that (though many of their decisions were made for performance reasons) and improve on that - the full preemptiveness I'm strongly suggesting for V should help with some of them too. Btw. for Go this paper is not valid any more after the introduction of non-cooperative goroutine preemption in Go 1.14. |
OK, here are new descriptions: V Light Thread Monitor |
Seems that AI could help with subtle deadlock errors. |
Bounding data races in space and time – part I Bounding data races in space and time – part II |
I did some research regarding Nim's multithreading engine named Weave. Weave is not designed for keeping latency extremely low, it is designed for throughput, which is not automatically a bad thing. I know that V fans are working in the gaming industry - so the main coders should decide what type of concurrency is best suited for the language and its application domains. However, Weave is freaky fast |
Posting in a hurry.
This is not updated - Weave actually has an optional centralized queue to bring in a certain degree of fairness, more than acceptable for games and similar use cases. @mratsim, could this be rephrased on the Picasso/Weave sites/docs/wikis?
You know me already 😉 - I'm definitely a proponent of a fully preemptive solution to make V flawlessly interoperable with anything in the world (all those C/C++/FFI/... libraries). There are "tags" (like |
Maybe we can borrow some of Weave's design decisions (channel "plumbing", work stealing) in the preemptive runtime. |
For Those About To Rock |
I don’t do Discord so I have no access to what is being discussed there.
This is quite onerous in terms of what a type system will need to prove to fulfill this guarantee. Why can’t each green thread behave as if it is a sequential green thread and then the data race safety issue can be handled separately for shared data, such as by making it immutable or implementing some reference capabilities with ownership for the shared data such as for the Pony language?
No shared memory access between threads? If yes, then that will be very limiting. With massively multicore arriving now, and the inevitable cost of sharing memory accesses across a communication bridge between a cluster of cores, there will be a need for limiting sharing of memory to a cluster of cores. There may still be a use case for sharing memory within a cluster?
Golang presumably has a huge investment in this tuning. There are issues such as starvation pressure versus latency. This is very tricky to get right. Should you not be trying to lift much of the implementation from Go instead of reinventing the wheel with your presumably much more meager resources? Google is presumably running Go on thousands of servers so has a significant incentive to invest in the runtime. Here are some links to a collection of the research, thoughts and notes I collected on M:N concurrency: https://www.reddit.com/r/scala/comments/ir8ygb/what_is_missing_in_scala_ecosystem/g57axaz/?context=6
Are you asking what the data race safety model should be? If you are referring to bugs due to multithreaded contention, don’t go there. It’s an insoluble hornet’s nest. Opt for a lockfree model. There will always be lock contention at some higher-level semantic but then that is not your concern.
I need to read this next. I will post what I have for this comment first.
I will need to watch that, thanks. Edit: I found the reference at the ~11:30 mark in the video. Okay so it’s apples-to-oranges in terms of what I am thinking should be lock-free. Dmitry is referring to the scheduler for the M:N user-threads (aka green threads). I am referring to the API that the Vlang programmer interfaces with. I am saying that the Vlang programmers should be prevented or strongly discouraged from using low-level thread synchronization primitives. And instead should employ other lockfree means for (avoiding) contention over shared resources. I elaborated in the other thread on this topic. I also found this: https://stackoverflow.com/questions/13102874/is-gos-buffered-channel-lockless
I will need to read that, thanks. Just pointing out what I have not yet factored into my reasoning above. |
@shelby3 thanks for reaching out to us! No rush here, V parallelism is work in progress so read slowly and carefully both threads (this one and #1868 ) and all referenced resources & links recursively. Then think of it for a few days, come back, clarify or adjust your concerns expressed above and we'll be more than happy to discuss open questions (if there are any 😉) and accept PRs. There should be enough time (look at the frequency with which both threads progress and correlate with commits in master), so don't hurry, we'll wait. Thanks. |
Now I remember I already incorporated that into my analysis in 2018: https://gitlab.com/shelby/Zer0/-/issues/11#spawning-goroutines-posited-to-be-as-harmful-as-goto The structured concurrency is applicable to when you want to explicitly express short-lived, concurrent operations which are considered to be children which should complete before the algorithm in the parent thread continues. Because in this case there are data race issues to contend with as the forked paths may return in any timing order. Whereas, for goroutines which are to be long-lived entities no longer owned or subservient to the thread which spawned them, then structured concurrency does not apply and only appropriate data race safety restrictions on shared data apply. IOW, I suggest we need both |
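A rough sketch of the two usages distinguished above, using only current V primitives (the function names are hypothetical; a real nursery would additionally propagate errors and cancellation):

```v
import time

// structured: the children are joined before the function returns,
// so they cannot outlive their parent
fn sum_of_squares(parts []int) int {
	results := chan int{cap: parts.len}
	for p in parts {
		go fn (p int, out chan int) {
			out <- p * p
		}(p, results)
	}
	mut total := 0
	for _ in 0 .. parts.len {
		v := <-results
		total += v
	}
	return total
}

// unstructured: a long-lived background task that deliberately outlives its spawner
fn start_heartbeat() {
	go fn () {
		for {
			println('still alive')
			time.sleep(time.second)
		}
	}()
}

fn main() {
	println(sum_of_squares([1, 2, 3, 4]))
	start_heartbeat()
	time.sleep(3 * time.second)
}
```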
I read an interesting, down-to-earth paper from 2020 on how to identify rare and hard-to-detect errors & bugs in multi-threading code (locks, unlocks, mutexes, semaphores, atomics, etc.) |
On V's concurrency front we started working on a Proof of Concept based on the Weave project. |
@cristian-ilies-vasile finding and fixing needle-in-a-haystack thread synchronization heisenbugs is not equivalent to a paradigm which can never create those bugs. The latter would be provably 100% free from these bugs at compile-time. I urge you: please do not repeat the design mistake of encouraging the use of thread synchronization primitives. I urge you to only allow access to these in Just as you would not allow pointer arithmetic in safe code, you should not allow in “safe code” a paradigm famous for creating insidious heisenbugs, because it is well known to not be safe at all and to always proliferate insidious bugs. I believe Eric Raymond (re-)coined or first used (in 2014) the term “defect attractor” to describe features that attract defects in software. I understand pointer arithmetic is especially dangerous because it can enable security holes, but I would argue that thread synchronization bugs can also become security exploits. It’s impossible to thoroughly reason about thread synchronization because it is a dynamic synchronization problem and Turing-complete programs can’t be proven to halt (the Halting problem). Thus these unseeable heisenbugs are not equivalent in genre to other kinds of bugs, which can in many cases be reasoned about and prevented (or at least discovered) by human brains and eyeballs or some formal methods, without even running the program. Additionally, as you presumably know, thread synchronization defeats massively multicore (which has arrived already for servers) due to Amdahl’s law (especially in conjunction with amplified contention due to increasing numbers of contending threads). Really you want to embrace immutability; it is the only viable direction going into the future. And Golang does not have any capability for expressing immutability at this time, so this can be a significant advantage for Vlang. Another reason to favor “future-proof” immutability is what I wrote in the companion thread:
I have contemplated other benefits to immutability. P.S. I have not read about Weave yet. EDIT: Eric Raymond wrote which supports my warning:
|
@shelby3 How, specifically? |
Consider that a variable maintaining a value for security is being left in an indeterminate state from a data race. Targeted parallel calls cause races, causing the variable to go undetermined, causing access to be granted when it shouldn't be. The attack may need to be sustained, depending on the likelihood of success. Memory is often reused in strange ways; the variable may not even be part of the race directly. Undefined behaviour is undefined, including leaving the vault unlocked. |
The first 250 simple tasks successfully ran on the R U S H proof of concept (inspired by Andreas Prell's PhD thesis).

```v
import strconv

fn x_sum(a int, b int, output_ch chan string) { // original function
	x := a + b
	output_ch <- ''
	output_ch <- strconv.v_sprintf("x_sum= %04d", x)
	output_ch <- ''
}
```
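For context, a hypothetical way such a task could be driven from plain V today (the channel capacity and the three-message output protocol are simply taken from the snippet above):

```v
fn main() {
	output_ch := chan string{cap: 3}
	go x_sum(20, 22, output_ch)
	// the task sends three messages: an empty line, the formatted sum, another empty line
	for _ in 0 .. 3 {
		println(<-output_ch)
	}
}
```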
@cristian-ilies-vasile could you put all the sources to one gist to make it easier to comment on and discuss it (with the nice side effect of not polluting this thread)? Thanks 😉. |
A recent note on the above-discussed topic of the difference between IO-bound computation and CPU-bound workloads by the author of Weave: https://nim-lang.org/blog/2021/02/26/multithreading-flavors.html |
@dumblob
Achieving 11M IOPS & 66 GB/s IO on a Single ThreadRipper Workstation https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-threadripper-pro-workstation
The new wave of thinking: One thread-per-core architecture |
These platforms (and many others) need to be programmed though in some language which needs to do all the heavy lifting 😉.
Hm, how is this different from what we've discussed so far (i.e. one os-thread per physical core and then millions of fibers/tasks per such os-thread using work stealing with some hacks providing preemptiveness to avoid starvation and some level of fairness)? |
It is a little bit different.
That means once a task is put on a CPU, you cannot suspend it, just wait for it to finish. For a lot of tasks the load on each resource is fairly distributed due to the work stealing algorithm. |
If it's really like this, then this has a serious issue with starvation - imagine accidentally writing a few tasks each with an infinite loop. It's also "not fair" and thus no real-time guarantees (required by games, audio/video calls, etc.) can be made. Weave has some counter measures (it has some "good enough" fairness and if that's not enough, it provides a shared input queue; regarding starvation I think Weave doesn't do anything, but there is an easy solution with But if you're talking about something other than Weave, then this starvation and unfairness is a show stopper for a default concurrency backend in V. |
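For illustration, this is the kind of accidental task meant above - on a purely cooperative runtime it never reaches a yield point and pins its worker thread forever (the names are hypothetical):

```v
fn accidental_busy_loop() {
	mut n := u64(0)
	for {
		// no channel operation, no sleep, no yield - a cooperative scheduler
		// gets no chance to run anything else on this worker
		n++
	}
}

fn main() {
	go accidental_busy_loop()
	println('on a single cooperative worker, this line might never be reached')
}
```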
@dumblob Is it clear how much overhead Weave requires in order to do more sophisticated scheduling? |
Yes - just see the benchmarks in the Weave repo. It's actually basically fully amortized (not kidding!). So the only question is always: which Weave parameters should I choose for this particular set of tasks in order to trade (tail) latency for throughput and vice versa. Weave's defaults are pretty much usable even for games, though, so this parameter choosing should happen only in extreme scenarios (e.g. someone writes her own kernel or an audio/video signal processing server like JACK, etc.). |
@dumblob I put here some non audited figures, just to have a perspective. Weave uses the word worker and I use the word robot for the same piece of code. A - load on each robot
B - tasks distribution spread by robot's state
|
On the way to an ultra low overhead equal-priority hard real time preemptive scheduler

Motivated (and surprised) by https://twitter.com/v_language/status/1388154640768851969 I've decided to finally research and write down the ideas I have had in my head for about a year. TL;DR: It turns out V could have zero overhead full preemption on Linux and Windows. As described in #1868 (comment) V could (should!) leverage combining cooperative scheduling of "go routines" among a dynamic pool (1:1 to the number of currently powered-on CPU cores) of system-level threads ("os threads") with os-triggered preemption (bare metal has a counter/clock interrupt). This os-triggered preemption ("interrupt" - e.g. a signal) is costly though if not necessary (with cooperative scheduling the majority is not necessary). The more real-time the required responses, the costlier it gets, which is typical for low-latency scenarios like audio, video, games, etc. - all of which are coincidentally domains which also utterly seek high-performance languages to leverage the HW up to the last bit and watt. And that's the domain where V shines or wants to shine. Now, when The idea is that user space can write to kernel space (thanks to eBPF) without any interrupt nor any context switch nor any locking nor any other restriction - just write at the pointer address, basically. The eBPF app (to whose kernel space memory the user space app can arbitrarily write) can then be run each time a kernel timer(s) fires (it doesn't matter who has set up this timer - whether some kernel driver or some other user space app or whatever - as we just care about being woken up "frequently enough" - in case of If there'll be a free running counter in the eBPF app's kernel space incremented only by a given user space app thread, and the eBPF app will maintain a "cached" copy of this counter from the last time the eBPF app ran together with a monotonic timestamp of the cached value, then by comparing this timestamp it can decide whether it has already been too long since the user space app thread last incremented the counter, and it can organize an interrupt of the given user space app thread. The user space app thread increments the counter each time cooperative scheduling on that particular CPU (virtual) core happens. Note that no slow atomic operations (reads, increments, etc.) are needed here, as we don't care about the edge cases - in the worst case it'll just fire the interrupt negligibly more frequently. Of course, for thousands of CPU cores there might arise the need to somehow increase the efficiency of the eBPF app - maybe by grouping the timer interrupts (though hrtimers in Linux should support tens and hundreds of thousands of timers without performance degradation), but that's just an implementation detail. Why should this be possible? Because eBPF nowadays supports:
Implementation note: on *nix systems sending an interrupt signal to the given thread seems to be a bit more work than expected (first the signal is sent to the process and the signal handler will then most of the time need to "distribute" the signal to the appropriate thread: signal handlers are per-process but signal masks are per thread). Note, this is a generic solution and I'm certain readers from other projects (incl. Go lang, D lang, Nim, Python, ... and other languages and frameworks) might jump on this and try to implement it as low hanging fruit (hey, an attribution would be nice guys 😉). @mintsuki @UweKrueger @cristian-ilies-vasile thoughts? Additional notes
|
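To make the counter-and-timestamp check described above a bit more concrete, here is a minimal user-space sketch (all names hypothetical). In the real design the `needs_preemption` check would live in an eBPF program woken by a kernel timer, and the reaction would be a signal sent to the stalled thread; here it is just ordinary V code illustrating the logic:

```v
import time

struct YieldWatch {
mut:
	counter    u64       // incremented by the worker at every cooperative yield
	last_seen  u64       // cached copy of the counter from the previous check
	last_stamp time.Time // when the cached copy was taken
}

// called by the worker thread at each cooperative scheduling point;
// no atomics needed - a missed update only delays detection slightly
fn (mut w YieldWatch) on_yield() {
	w.counter++
}

// called periodically (by the eBPF timer hook in the real design)
fn (mut w YieldWatch) needs_preemption(max_gap time.Duration) bool {
	if w.counter != w.last_seen {
		// the thread did yield since the last check - refresh the cache
		w.last_seen = w.counter
		w.last_stamp = time.now()
		return false
	}
	// counter unchanged: preempt only if the thread has been stuck for too long
	return time.since(w.last_stamp) > max_gap
}

fn main() {
	mut w := YieldWatch{
		last_stamp: time.now()
	}
	w.on_yield()
	println(w.needs_preemption(10 * time.millisecond)) // false right after a yield
}
```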
Interesting - just 3 days after I published the above sketch, Microsoft announced support for eBPF directly in Windows - see https://github.com/Microsoft/ebpf-for-windows . And if they'll support some timer hooks (and I'm 100% sure they will), then I think there is no excuse for not writing a scheduler leveraging that sooner or later 😉. |
Transferred from a document posted here in these documents by @cristian-ilies-vasile:
V concurrency high level design
After careful consideration of the proposed concurrency models and suggestions expressed on the discord v-chat channel, the high level design is based on the Go language model (message passing via channels), which in turn is a variation of communicating sequential processes (CSP).
I did read papers on the actor model, but it seems that coders do not use all the primitives provided by the language and resort to threads and queues (see the "Why Do Scala Developers Mix the Actor Model" paper).
Almost all high level requirements are taken from "Proper support for distributed computing, parallelism and concurrency" published on github.
Because there are many words used more or less interchangeably (coroutine, goroutine, fiber, green thread, etc.), I will use the term green thread.
Key points
Open points