faktory's Issues

TODO

Core runtime

  • Ensure all DB operations are atomic.
  • Blocking semantics on POP

Tooling

  • RPM/DEB packages
  • Add benchmark/load testing script

Web UI

  • Implement retry UI
  • Implement scheduled UI
  • Implement dead UI
  • Implement busy UI
  • Port over i18n translations
  • Test for each UI page

Launch

  • Pre-launch blog post
  • Announcement (ANN) blog post
  • Homepage
  • Wiki docs

Retry documentation

  1. Does retry apply only to failed jobs, or also to jobs that failed to ACK?
  2. If it applies only to failed jobs, can we set a limit on how many times a job can be re-enqueued?

Add CSRF protection

The Web UI is currently missing CSRF protection because we don't have any notion of a session in which to store the current token value. The hooks to inject an authenticity_token are already in the forms; we just need to implement it.

Job priority

Can we support simple job prioritization? Today the queue key in RocksDB is queueName|index. We could change the key to queueName|priority|index, but that would make it impossible to get the next job as an O(1) operation. We could use a live cursor that walks the RocksDB key tree in real time, but I'm not sure of the performance implications. Other ideas?
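
For concreteness, a minimal sketch (using encoding/binary) of what the priority-aware key encoding could look like; names are hypothetical, the real encoding lives in storage/queue_rocksdb.go:

// encodeKey builds a queueName|priority|index key. With the priority
// byte inverted, RocksDB's lexicographic key order yields highest
// priority first, oldest first within a priority, so fetching the next
// job is still a single seek to the queue's key prefix.
func encodeKey(name string, priority uint8, index uint64) []byte {
	key := make([]byte, 0, len(name)+9)
	key = append(key, name...)
	key = append(key, 0xFF-priority)
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], index)
	return append(key, idx[:]...)
}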

OSX and Homebrew

We could use help building and distributing Faktory via Homebrew. This would make basic usage on OSX much, much easier. Anyone know how to do this?

[FEATURE] Prometheus metrics

I'm keeping this short: I'm pretty sure the project and its users can benefit from exporting Prometheus metrics. Metrics are important in modern application deployments, especially when there are many moving parts.

I would suggest exporting some basic metrics about the application itself, the queues, jobs, and the web server (latency, HTTP status codes, etc.). They are also interesting for scaling Kubernetes deployments: see #19.
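
If it helps the discussion, here is a minimal sketch with the official Go client, github.com/prometheus/client_golang (the metric name and label are made up, not a proposal for the final naming scheme):

package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// queueDepth tracks enqueued jobs per queue.
var queueDepth = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "faktory_queue_depth",
	Help: "Number of jobs currently enqueued, per queue.",
}, []string{"queue"})

func main() {
	prometheus.MustRegister(queueDepth)
	queueDepth.WithLabelValues("default").Set(42)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9090", nil)
}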

If this is desired, we can discuss it in detail and I would love to contribute a basic implementation.

Worker library naming conventions

I noticed the Ruby client is named faktory_worker_ruby, but it has a top-level namespace Faktory. I think I'm going to follow that convention for the Elixir app.

How do you feel about faktory_worker_ex instead of faktory_worker_elixir (and I'm curious why you didn't do faktory_worker_rb)?

Asking the important questions here... ;)

Quiet / stop may cause (buggy) worker implementations to spin

When a worker calls FETCH, if there is no work to be done then the server will wait 2s before returning, which is great because there should usually be a worker waiting to accept a new job and the end to end latency is nice and low.

If a worker has its status set to quiet or stop, then when the worker calls FETCH it will get an immediate empty response. If a worker depends on the 2s FETCH timeout for flow control, it'll spin and send hundreds of FETCH requests per second that will never get any job.

If a worker is responsible for its own flow control, then it'll need to slow the frequency of FETCH requests which will increase the latency of tasks through the system.

I think if a worker calls FETCH on a quiet'd/stop'd connection, the server should wait 1s and then return an ERR response.
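
A rough sketch of that server-side behavior (type and method names are guesses, not the actual internals):

func (s *Server) fetch(client *ClientWorker, queues ...string) (*Job, error) {
	if client.IsQuiet() || client.IsTerminating() {
		// Don't hand out work, but don't return instantly either,
		// so buggy workers relying on FETCH latency can't spin.
		time.Sleep(1 * time.Second)
		return nil, nil // or an ERR reply, per the proposal above
	}
	return s.store.Fetch(queues...)
}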

What are your thoughts? If you think it makes sense I'll create a PR

Design TODOs

Need @sethterpstra's help here:

  • Web UI refresh (webui/static/*, webui/*.ego)
  • Idle/busy icon (webui/static/status.png)
  • print-ready logo assets (SVG, PNG, etc)
  • favicon.ico

EOF keystroke causes panic

I was doing something unrelated and found a bug. Early software, love the project, no judgey mode here. 😄 🎈

./faktory-cli
Faktory 0.5.0
Copyright © 2017 Contributed Systems LLC
[snip]

> ^D
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x4129419]

goroutine 1 [running]:
main.repl(0xc42001a280, 0x18, 0x0, 0x0)
	[snip]/faktory/cmd/repl.go:59 +0x419
main.main()
	[snip]/faktory/cmd/repl.go:42 +0x168

exit and Ctrl-C do not panic. I'll poke around and see if anything is easy to fix, but I don't know this codebase yet.
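
If the cause is what it looks like (an EOF result being dereferenced), the fix in the read loop might be as small as this sketch, assuming a readline-style API that returns io.EOF on Ctrl-D:

line, err := reader.Readline()
if err == io.EOF {
	return nil // treat Ctrl-D like exit instead of dereferencing a nil result
}
if err != nil {
	return err
}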

Clean up protocol documentation

The best description of the protocol for talking to Faktory we currently have is https://github.com/contribsys/faktory/wiki/Worker-Lifecycle, with bits and pieces in https://github.com/contribsys/faktory/wiki/The-Job-Payload. For anything missing, implementors must consult the Go client source. This is suboptimal.

Faktory should provide an in-tree (not Wiki), detailed description of the protocol, with complete examples of interactions (including the exact REDIS RESP formatted response). I recommend looking at the IMAP RFC for inspiration (see, for example, the docs for SELECT in §6.3.1). Implementors should not need to look at the Go client source at all. This also means that we need to sort out some current disagreements between the Wiki docs and the Go source. For example, what exactly does BEAT return? The docs say:

The response can be OK or a JSON hash with further data for the worker:

{"state":"quiet"}

Whereas the Go client assumes it's always a string:

val, err := c.Generic("BEAT " + fmt.Sprintf(`{"wid":"%s"}`, RandomProcessWid))

return readString(c.rdr)

https://github.com/contribsys/faktory_worker_go/blob/05a5c14543ea9c2737e5ec4900f0336c5d2cc819/runner.go#L145-L152
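
Until the docs and client agree, a defensive worker could accept both forms. A sketch using encoding/json (parseBeatReply is hypothetical):

// parseBeatReply handles both documented BEAT replies: the plain
// "OK" string and a JSON hash such as {"state":"quiet"}.
func parseBeatReply(reply string) (string, error) {
	if reply == "OK" {
		return "", nil
	}
	var data struct {
		State string `json:"state"`
	}
	if err := json.Unmarshal([]byte(reply), &data); err != nil {
		return "", err
	}
	return data.State, nil
}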

Ideally, we should also supply a set of test cases with trace responses from the server, and expected messages from the client, but that can be another step down the line.

Add command for checking if job is still in queue

Polling is a relatively common way of checking for job completion, but Faktory does not currently provide a way for applications to do that. To support this, it'd be useful to have an API call that checks whether a given job is still enqueued (i.e., has not been processed yet). Something like:

ENQUEUED {"queue": "<queue>", "jid": "<jid>"}

Which returns one of

+gone
+queued
+dead

queue should default to default if not given. We need the queue to be explicitly named to avoid searching all the queues and sets.
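
A hypothetical client helper for the proposed command (Generic follows the pattern in the existing Go client; none of this is implemented):

func (c *Client) Enqueued(queue, jid string) (string, error) {
	if queue == "" {
		queue = "default"
	}
	payload, err := json.Marshal(map[string]string{"queue": queue, "jid": jid})
	if err != nil {
		return "", err
	}
	// Returns "gone", "queued" or "dead" per the proposal.
	return c.Generic("ENQUEUED " + string(payload))
}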

For further motivation and discussion, see the Gitter thread starting from https://gitter.im/contribsys/faktory?at=5a03413f7081b66876c7a6ae.

Add NOOP command

If a worker has a concurrency of 10, then it may have 10 (or 11) connections open.

If the BEAT is once per 15s per worker (not per connection), is it worth adding a ping/pong command that a thread tied up with a long-running job can send to the server to keep the connection alive?

The worker could send a periodic BEAT, but that updates the DB (I think), which incurs some cost.
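
For illustration, the worker side could run a keepalive loop like this sketch next to a long job (NOOP is the proposed command; Generic mirrors the existing client pattern):

func keepAlive(done <-chan struct{}, c *Client) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			// A cheap ping/pong keeps the connection alive
			// without the DB write a BEAT would incur.
			c.Generic("NOOP")
		}
	}
}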

I’m happy to submit a PR

Race conditions

Running the test suite with -race enabled detects a number of race conditions. Some of them are purely in the test suite, e.g. the count incremented by goroutines in storage/queue_test.go not being protected by a mutex, but some are deeper in the core, e.g. server/tasks.go appending to an internal tasks array in one goroutine and ranging over it in another.

We might want to consider enabling -race on the tests to track these down and to generate a list of places that need fixing.
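
Concretely, running the whole suite under the detector is just:

go test -race ./...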

Max queue size should be configurable

2017/10/30 12:02:34 failed to push task to faktory: ERR default is too large, currently 100001, max size is 100000

Was 100,000 picked for any reason? Is this a limit in RocksDB?

If not, would be great if this was configurable.

Reap connections that never heartbeat

I noticed that if a connection never sends a heartbeat, it doesn't get reaped by the 60s timeout. Would an acceptable fix be to write a heartbeat into the map the moment a connection is opened (clientWorkerFromHello)? Is there a better alternative? I'm willing to open a PR; I'm just wondering what the intended behavior is.
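
Something like this sketch is what I had in mind (field and map names are guesses at the internals):

// Treat a successful HELLO as the first heartbeat so the 60s
// reaper always applies, even if the client never sends BEAT.
func clientWorkerFromHello(data *ClientData) *ClientWorker {
	worker := &ClientWorker{Wid: data.Wid, lastHeartbeat: time.Now()}
	heartbeats[worker.Wid] = worker
	return worker
}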

Docker install

Docker is a newer way of distributing an entire stack of dependencies as a single unit. Anyone want or need a Docker install for Faktory?

Support dequeueing scheduled jobs

If an application schedules a job using at, and then discovers that the job is no longer necessary (likely due to a user action), it would be convenient to have a way of cancelling that job, rather than expending resources on executing it. There are ways to do this out-of-band by having workers consult a "cancelled jobs" table somewhere, but this seems like unfortunate complexity to add to the application when Faktory already has the necessary control structures.

Note that this feature would not let you abort or cancel jobs that are currently running, nor does it give any guarantee that a job will not run. Rather, it is a hint to the system that a given job need not subsequently be given out to workers. We could scope this only to jobs with an at set, but that seems like an unnecessary restriction.

The proposed API call would be something like:

DEQUEUE {"queue": "<queue>", "jid": "<jid>"}

which returns one of

+OK
+gone

queue defaults to default if not given. As with #81, we want to explicitly name the queue to avoid having to search over all queues. See also Gitter discussion starting at https://gitter.im/contribsys/faktory?at=5a02d85286d308b755c10755
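
End to end, usage might look like this sketch (Dequeue is a hypothetical client wrapper for the proposed command):

job := faktory.NewJob("SendReminder", userID)
job.At = time.Now().Add(24 * time.Hour).Format(time.RFC3339Nano)
client.Push(job)

// Later, the user cancels the reminder:
status, err := client.Dequeue("default", job.Jid) // "OK" or "gone"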

Load testing tool

We have a very basic queue push/pop tool in test/load but it would be nice to have broader coverage of more commands, more connections, etc. What makes sense to add?

Support producer connections

Right now, all connections are supposed to send a WID. Oops, I forgot: not all connections are from workers! It will be normal for app servers to push jobs to Faktory and these connections should not be part of the data displayed on the Busy page. Consuming processes == workers == listed on Busy page. Figure out how this will change the HELLO and heartbeat handling.

Lower minimum for reserve_for

Why is there a hard limit of 60 seconds for reserve_for? We have a lot of small jobs (2-5 seconds each) which need to have a lower reserve_for to tolerate workers going down. Is there any chance of lowering the limit later?
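
For context, this is the kind of push we want to make (a sketch with the Go client; ReserveFor maps to the reserve_for field in the job payload):

job := faktory.NewJob("ResizeImage", imageID)
job.ReserveFor = 10 // seconds; rejected today because the minimum is 60
client.Push(job)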

RocksDB tuning

Work is needed to tune RocksDB for our queue usage patterns. Some knobs are already visible in the gorocksdb bindings; others need direct C++ access.

  • Create benchmark and load scripts useful to validate tunings.
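
If the bindings expose what I think they do, a starting point might look like this sketch (the numbers are placeholders for the benchmark scripts to validate, not recommendations):

opts := gorocksdb.NewDefaultOptions()
opts.IncreaseParallelism(runtime.NumCPU())
opts.SetWriteBufferSize(64 * 1024 * 1024)            // larger memtables
opts.OptimizeLevelStyleCompaction(512 * 1024 * 1024) // compaction memory budget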

Refactor CLI

The cmd/repl.go code is a mess; it needs to be refactored and tests added.

Handling poison pills?

A poison pill is a job which causes the worker process to crash. Such a job will be retried every hour (by default) when its reservation times out, forever. How do we catch and prevent this? Should reservation recovery increment Job.Failure.RetryCount and follow the retry logic? Can we detect a poison pill and move it to the Dead set immediately?
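
If the answer to the second question is yes, reservation recovery might reduce to a sketch like this (moveToDead and scheduleRetry are hypothetical helpers):

func recoverReservation(job *Job) {
	if job.Failure == nil {
		job.Failure = &Failure{}
	}
	job.Failure.RetryCount++ // count the reservation timeout as a failure
	if job.Failure.RetryCount >= job.Retry {
		moveToDead(job) // a poison pill eventually lands in the Dead set
		return
	}
	scheduleRetry(job)
}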

Web UI security

  • Enable basic auth when a password is configured
  • Implement CSRF protection

Security and authentication

Security is always tricky, with compromise and tension between usability and safety. My basic policy: development should be easy, production should be safe. More thoughts:

Network safety and Encryption

Development mode should be quickly usable, without needing to read docs or configure options in order to get working. If someone is running Faktory locally, she should be able to install and start it immediately. The default would be to listen on localhost only, with no TLS or passwords required in this mode.

In production, we want to listen on a non-localhost interface. Once ports are exposed to the outside world, we need to get more paranoid and require TLS and authentication.

We can't provide TLS without the user providing a cert for the hostname used to access the API and Web UI. The user will need to provide a TLS cert and CA chain, which I propose reside at a conventional location:

{~/.faktory,/etc/faktory}/public.crt
{~/.faktory,/etc/faktory}/private.key
{~/.faktory,/etc/faktory}/ca-chain.crt

Those paths can be the actual files or symlinks to files elsewhere on disk.

In summary:

  • If listening on localhost, Faktory will use insecure TCP.
  • If those files are found, Faktory will use TLS.
  • If not listening on localhost and no certs, Faktory will exit immediately with an error.
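
In Go terms, the startup decision might reduce to something like this (binding.IsLocalhost and listenInsecure are hypothetical; crypto/tls does the heavy lifting):

cert, err := tls.LoadX509KeyPair(dir+"/public.crt", dir+"/private.key")
if err != nil {
	if binding.IsLocalhost() {
		return listenInsecure(binding)
	}
	log.Fatalf("non-localhost binding requires TLS certs in %s: %v", dir, err)
}
cfg := &tls.Config{Certificates: []tls.Certificate{cert}}
return tls.Listen("tcp", binding.Addr, cfg)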

Alternatives

The user can run Faktory in production using e.g. stunnel or a proxy to expose the insecure localhost ports instead of TLS. Since Faktory is listening on localhost, it doesn't care about or require any security. It's up to the user to expose those ports securely.

Authentication

TLS doesn't solve our whole problem; we still need to limit access. I propose a simple password scheme for all connecting clients when sending the AHOY payload. Clients should send two attributes, pwdhash and salt, where pwdhash is sha256(password+salt). The salt should be, at minimum, a 12-char random string, generated uniquely for each AHOY.

AHOY {"pwdhash":"d40b8917d7aff72a40a677c55992d2edc1b41331ec3b24641f2affa67b8dba09","salt":"aos,dfis33dkvn"}

This is the SHA256 for the password "helloworld" with that salt:

$ echo -n helloworldaos,dfis33dkvn | shasum -a 256
d40b8917d7aff72a40a677c55992d2edc1b41331ec3b24641f2affa67b8dba09 -

Without a valid pwdhash and salt in the AHOY, Faktory will disconnect the client immediately.
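
Client-side, computing the pair is a few lines with the standard library (salt generation elided):

import (
	"crypto/sha256"
	"encoding/hex"
)

func pwdhash(password, salt string) string {
	sum := sha256.Sum256([]byte(password + salt))
	return hex.EncodeToString(sum[:])
}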

Open questions

  • How do we pass the password to Faktory in production? A command line option would expose it in ps output; I believe current best practice suggests an environment variable. Other ideas?

Rethinking TLS/SSL requirements

Right now Faktory requires TLS by default if not listening explicitly on "localhost". This ignores modern development practices like containers with host-only networking, where the service in the container listens on a non-localhost interface but is still effectively local.

Instead, ignore the network binding and rely on the environment flag:

  • -e development means we don't require TLS by default. We'll still check the TLS directory for certs and use them if possible.
  • -e production means we require TLS by default. The user can pass -no-tls; otherwise we exit early if TLS certs aren't found.

How do we ensure that people actively enable the production flag when in production? Should production be the default, forcing people to explicitly enable development?

Clients should use "tcp+tls" as the protocol if they want TLS (tcp+tls://192.168.1.1:7419) or "tcp" for an unencrypted connection. This feels ugly; alternative suggestions are welcome.
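
For what it's worth, parsing falls out naturally with net/url; a sketch:

u, err := url.Parse("tcp+tls://192.168.1.1:7419")
if err != nil {
	return err
}
useTLS := u.Scheme == "tcp+tls" // "tcp" means plaintext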

Is sharing the same port between TLS and non-TLS sockets a bad idea? Should we be using 7419 for plain sockets and 7443 for TLS sockets, to make the security setup more explicit?

Faktory crashes with lots of connections under OSX

We might want to put in tunable soft and hard limits on the number of network connections Faktory allows.

To reproduce:

Terminal 1: `make run`
Terminal 2: 

#  should work fine
go run test/load/main.go 30000 10

# crash!
go run test/load/main.go 30000 500

dial tcp [::1]:7419: socket: too many open files
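
A soft limit could be a counting semaphore around Accept; a sketch (maxConns would be the tunable, handleConnection is hypothetical):

sem := make(chan struct{}, maxConns)
for {
	conn, err := listener.Accept()
	if err != nil {
		continue
	}
	sem <- struct{}{} // blocks new connections once maxConns are in flight
	go func(c net.Conn) {
		defer func() { <-sem }()
		handleConnection(c)
	}(conn)
}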

Worker requirements

I’m working on a Python worker for Faktory and have a couple of questions.

Do all workers need to implement BEAT? What is FLUSH for?

Panic in queue init

Looks like this is what happens when master tries to boot an existing Store without priorities. @andrewstucki Any ideas on backwards compatibility or graceful fallback?

panic: runtime error: index out of range

goroutine 5 [running]:
encoding/binary.binary.bigEndian.Uint64(...)
	/usr/local/Cellar/go/1.9.2/libexec/src/encoding/binary/binary.go:124
github.com/contribsys/faktory/storage.decodeKey(...)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/queue_rocksdb.go:496
github.com/contribsys/faktory/storage.(*rocksQueue).Init(0xc42010d0a0, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/queue_rocksdb.go:213 +0x751
github.com/contribsys/faktory/storage.(*rocksStore).GetQueue(0xc420144000, 0xc4200149c8, 0x7, 0x0, 0x0, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:254 +0x239
github.com/contribsys/faktory/storage.(*rocksStore).init(0xc420144000, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:197 +0x566
github.com/contribsys/faktory/storage.OpenRocks(0x43e23f8, 0x13, 0x0, 0x0, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/rocksdb.go:94 +0x94f
github.com/contribsys/faktory/storage.Open(0x43dc968, 0x7, 0x43e23f8, 0x13, 0x0, 0x0, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/storage/types.go:82 +0x12d
github.com/contribsys/faktory/server.(*Server).Start(0xc42010e090, 0x0, 0x0)
	/Users/mikeperham/src/github.com/contribsys/faktory/server/server.go:113 +0x7d
github.com/contribsys/faktory/webui.bootRuntime.func1(0xc42010e090)
	/Users/mikeperham/src/github.com/contribsys/faktory/webui/web_test.go:87 +0x2b
created by github.com/contribsys/faktory/webui.bootRuntime
	/Users/mikeperham/src/github.com/contribsys/faktory/webui/web_test.go:86 +0x12b

List of improvements over sidekiq

Besides being more language-agnostic, I would love to understand the benefits of this over Sidekiq in more detail. As an extension: how does this differ from queueing systems like Kafka, ZeroMQ, RabbitMQ, etc., and what task-specific sugar is added on top?

Apologies if this is super apparent somewhere already! Looks like a cool project, excited to play around with it.

Kubernetes support?

What should Faktory look like in a world full of Kubernetes? My understanding is that Kubernetes could be very useful in scaling worker processes as queues grow. How can Faktory make this easy?

Rework authentication mechanism

Right now the password authentication is home-grown. I'd suggest we modify the protocol to use SASL PLAIN instead, but I'm open to expert advice here. I've seen memcached add-ons on Heroku use SASL before.
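
For reference, a SASL PLAIN (RFC 4616) initial response is just a NUL-delimited triple, so building it is one line (user and password here are placeholders):

// message = [authzid] NUL authcid NUL passwd
initial := base64.StdEncoding.EncodeToString([]byte("\x00" + user + "\x00" + password))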

Wider Linux distro support for built binary

Plain Go Linux binaries can be used on any Linux distro, but using CGO to pull in RocksDB means we also link in libc and a bunch of other dynamic libraries, making the binaries distro-specific. I'd like to minimize that and statically compile anything we possibly can, in order to increase platform support and minimize bugs due to differing versions between distros.

Investigate static linking as many things as possible. Today the list looks like this:

ubuntu@ubuntu-xenial:/vagrant$ make build
go generate ./...
go build -o faktory cmd/main.go
ubuntu@ubuntu-xenial:/vagrant$ ldd ./faktory
	/lib64/ld-linux-x86-64.so.2 (0x00007f1f65321000)
	linux-vdso.so.1 =>  (0x00007ffd3f1db000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f1f65104000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f1f64a79000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1f63e67000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f1f64d82000)
	libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f1f6485f000)
	libbz2.so.1.0 => /lib/x86_64-linux-gnu/libbz2.so.1.0 (0x00007f1f6464f000)
	libsnappy.so.1 => /usr/lib/x86_64-linux-gnu/libsnappy.so.1 (0x00007f1f64447000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f1f64231000)

I suspect that we can statically compile the bottom five into the distributed binary. I believe CockroachDB did some investigation into this in the Go 1.1 timeframe, but it's not clear to me if the situation has gotten any better.
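
Untested, but the usual approach is to ask the external linker to prefer static archives for those libs (the exact flags and static library availability need verifying per distro):

go build -o faktory \
  -ldflags '-extldflags "-static-libstdc++ -static-libgcc -Wl,-Bstatic -lz -lbz2 -lsnappy -Wl,-Bdynamic"' \
  cmd/main.go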

Faktory and Heroku

It would be awesome to use Faktory on Heroku. I have no idea what's involved in doing this; I'm mainly opening this issue as a placeholder.

Use correct REDIS RESP NIL response

When a FETCH returns no results, it currently responds with:

$0\r\n
\r\n

However, REDIS RESP defines the Null Bulk String specifically "to signal non-existence of a value":

$-1\r\n

Which seems more appropriate.
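
The server-side change is tiny; a sketch of the reply writer:

if job == nil {
	// RESP Null Bulk String: signals that no value exists
	_, err := conn.Write([]byte("$-1\r\n"))
	return err
}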

Remove logrus dependency

AFAIK Logrus is not a huge dependency, and its author has stated he doesn't have time to support the project anymore, so we should fork the basic functionality into the util package. Remember to preserve the copyright notice to comply with the MIT license.

No downtime backups and adminy commands

Right now Faktory needs to be shut down to take a backup with faktory-cli. The debug page provides a live backup button, but that isn't easily scriptable. Faktory will get a bunch of adminy/opsy-type commands as it matures; how do we expose those?

How should someone be able to script a backup? Random thoughts:

  • BACKUP command or family of DB [VERB] commands?
  • A /var/run/faktory.pipe named pipe which allows access to more adminy things without worrying about network security. Inspeqtor uses a pipe to provide introspection. This would provide a separate avenue for commands; do we need this split between worker commands over the network and admin commands over a pipe?
  • Redis throws everything into a big bag of commands with no fine-grained authorization, renaming commands seems to be their solution to turning off commands. We could allow admin commands over the network in development but pipe only in production.

Option to force language of GUI to be english

Hi! First of all, very interesting project 👏

I'm from the Netherlands, and when I opened the web GUI it showed me everything in Dutch. I looked it up, and it seems that it uses the Accept-Language header to automagically select this.

I don't like my technical tools being translated into Dutch; I prefer English.

Would it be possible to make this optional?
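
One possible shape, as a sketch (the override names are hypothetical; none of this exists yet):

func locale(r *http.Request) string {
	// Check an explicit override before the Accept-Language header.
	if l := os.Getenv("FAKTORY_LOCALE"); l != "" {
		return l
	}
	if l := r.URL.Query().Get("locale"); l != "" {
		return l
	}
	return firstAcceptLanguage(r) // current behavior (hypothetical helper)
}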

Prune old backups

Add a CLI prune command to prune everything but the last N backups. See the API in contribsys/gorocksdb.

Windows builds?

@mperham:

Noticed that the idea of building on Windows was raised in Gitter. Thought that I'd take a stab at getting it building with CI on Windows. I'm not by any means even a fan of Windows systems, but figured that maybe it would help get more people contributing.

However, I'm not sure how much you want to actually support Windows--personally, with all the craziness of Windows systems and non-POSIX-compliant stuff... I wouldn't... but, it looks fairly possible given that the only real platform-specific dependency is pretty easy to install with a Microsoft supported package manager. Before I spend too much time on this though, I figured I'd get your thoughts.

On a side note, man, Windows builds of library dependencies are SLOW. It takes like 25 mins just to build all the required dependencies through appveyor. Thank goodness for caching.

Readme Fix: "Taking care of business and working overtime"

Bachman-Turner Overdrive would probably really enjoy the shout-out but only if you get those lyrics perfect:

Chorus:
And I'll be taking care of business (every day)
Taking care of business (every way)
I've been taking care of business (it's all mine)
Taking care of business and working overtime, work out

I think the Readme needs an update.

Test coverage scan

Run coverage reports for the {storage,server,webui} packages. Find any non-trivial blocks of uncovered code. Write a test for that block. Rinse, repeat until no blocks remain.
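
For example, for the storage package:

go test -coverprofile=cover.out ./storage
go tool cover -html=cover.out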

Add travis ?

Adding it should be as simple as running build.sh, I believe. It doesn't need to install the Go distribution, though: if we specify the language in Travis, the image already includes Go.

So we just need to install RocksDB and then run the tests.

Also, adding go vet and golint would be good, though that's a separate task.
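
A minimal sketch of the config (the RocksDB install step is the open question; the apt package name is a guess):

language: go
go:
  - "1.9"
before_install:
  - sudo apt-get update
  - sudo apt-get install -y librocksdb-dev
script:
  - go vet ./...
  - go test ./...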
