Description
Over the course of working through the PR to fix #128 and #129 I've run into a variety of different problems that I'd like to discuss. These problems are unfortunate, because Exq serves a very important role in the community. Many people are coming from a Ruby / Rails background and Exq helps make integrating Elixir with existing Ruby systems much easier. Nonetheless, the problems I present here individually and in aggregate are serious enough to warrant great caution on the part of any who might consider using this library.
I bring these problems up not only because I want Exq to succeed, but because I only have the available time to fix a handful of the most serious ones. A lot of valuable work has gone into this library, particularly around managing the redis association in such a way as to be compatible with Sidekiq. I'm hopeful that this post will bring about awareness of the situation so that others can address what I cannot.
NOTE: Two of these I've already created separate issues for, but I include them here for the sake of completeness.
* indicates that it will be at least partially addressed in benwilson512#1
Severe Issues
- *start_link calls inside GenServer init callback
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L41
- Breaks supervision guarantees about children termination
- Breaks hot code loading
- Job Workers use GenServer.start calls
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L203
- Totally unsupervised
- GenServers include catch all handle_call and handle_cast clauses
- Only handle_info should have a catch all clause
- *Only a single redis connection is shared throughout the entire application, sorta.
- The redis connection is always unsupervised
- Certain GenServers are configured to start_link a redis connection in their init function if they aren't passed one by their own
start_link
.
- Jobs restarting is handled not by the scheduler, nor by the manager, nor even the enqueuer, but rather the stats server
The ambiguities around supervision in general and job supervision in particular mean that I have no idea how this library behaves when I call ``./bin/myapp stop` via exrm for example. While it shuts down is it possible for a job to still get started? What happens to running jobs? Clarity here is paramount as I need to be sure that deploying new code to production will not orphan jobs.
General Issues
These are a hodgepodge of anti-patterns, code smells, and barriers to contribution.
- *No real top level Supervisor
- *Every GenServer has its own one_for_one supervisor for no discernable reason.
- *Hardcoded configuration defaults (not in themselves bad) exist in many disparate locations, often redundantly.
- https://github.com/akira/exq/blob/master/lib/exq/enqueuer/server.ex#L32
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L36
- https://github.com/akira/exq/blob/master/lib/exq/redis/connection.ex#L14
- https://github.com/akira/exq/blob/master/lib/exq/scheduler/scheduler.ex#L30
- These should be consolidated in the already existing Config module.
- Max restarts in supervisors are higher than normal, and in some cases VERY high.
- Configuration not actually passed through to all children
- The
{:ok, sup} = Exq.Enqueuer.start_link([port: 6379])
example from the readme would seem to indicate that individual components of the system can be started with custom configuration. However, the various hardcodedCode.get
calls mean that some parts will still fallback toExq
defaults whereas others will actually use the provided configuration.
- The
- Queue state is largely duplicated between the manager's process state and an :ets table that is only accessed by that same process.
- https://github.com/akira/exq/blob/master/lib/exq/manager/server.ex#L206
- The
:ets
stuff should likely just be removed.
handle_call({:retries}, _from, state)
why is :retries wrapped in a tuple? This pattern is ubiquitous.GenServer.start_link(__MODULE__, [opts], [])
and theninit([opts])
- Also ubiquitous. The needless wrapping and unwrapping of values adds noise and confusion.
- handle_info catch all clauses with
Logger.error("INVALID MESSAGE #{info}")
- Unlike handle_call and handle_cast, handle_info is not constrained to only receive pre-defined messages. It is not necessary to go out of the way to Logger.error them.
- handle_info / terminate / stop clauses included in modules that don't do anything.
use GenServer
provides these basic callbacks already - Every test has
Code.require_file "test_helper.exs", __DIR__
at the top.mix test
runs test_helper by default - api.ex and enqueuer.ex contain a bunch of identical functions that do a GenServer.call or cast to an unspecified pid.
- Tests are very brittle
- The seed is hard coded
- Global state via mix config and process name registration abounds
:timer.sleep
calls abound.
- Documentation: none
- TypeSpec: none
Going Forward
These are the problems I've run into while trying to fix the supervision architecture. I have not yet had time to review in detail job lifecycle management, anything related to redis usage, nor the messaging patterns of the application.
I hope to have time during the next week or so to finish the changes I've started to improve the supervision tree. Notably, this may not include supervising the workers, as that should probably be addressed as part of a general look at job life cycle management. We will see.
Afterward, I believe the most useful next step would be documentation. What does each GenServer do? How does it interact with the other GenServers? What is the lifecycle of a job? Some clarity to these questions would go a long way towards facilitating further contributions that can improve upon the current state of affairs.
Once again I really do appreciate the effort that's been put into Exq, it is a valuable part of the Elixir community.
Thanks!