Mr. Queue - A distributed worker task queue in Python using Redis & gevent

License: MIT License

mrq's Introduction

MRQ

MRQ is a distributed task queue for Python built on top of MongoDB, Redis and gevent.

Full documentation is available on readthedocs

Why?

MRQ is an opinionated task queue. It aims to be simple and beautiful like RQ while having performance close to Celery's.

MRQ was first developed at Pricing Assistant and its initial feature set matches the needs of worker queues with heterogeneous jobs (IO-bound & CPU-bound, lots of small tasks & a few large ones).

Main Features

Simple code: We originally switched from Celery to RQ because Celery's code was incredibly complex and obscure (Slides). MRQ should be as easy to understand as RQ and even easier to extend.
Great dashboard: Have visibility and control on everything: queued jobs, current jobs, worker status, ...
Per-job logs: Get the log output of each task separately in the dashboard
Gevent worker: IO-bound tasks can be done in parallel in the same UNIX process for maximum throughput
Supervisord integration: CPU-bound tasks can be split across several UNIX processes with a single command-line flag
Job management: You can retry, requeue, cancel jobs from the code or the dashboard.
Performance: Bulk job queueing, easy job profiling
Easy configuration: Every aspect of MRQ is configurable through command-line flags or a configuration file
Job routing: Like Celery, jobs can have default queues, timeout and ttl values.
Builtin scheduler: Schedule tasks by interval or by time of the day
Strategies: Sequential or parallel dequeue order, also a burst mode for one-time or periodic batch jobs.
Subqueues: Simple command-line pattern for dequeuing multiple sub queues, using auto discovery from worker side.
Thorough testing: Edge-cases like worker interrupts, Redis failures, ... are tested inside a Docker container.
Greenlet tracing: See how much time was spent in each greenlet to debug CPU-intensive jobs.
Integrated memory leak debugger: Track down jobs leaking memory and find the leaks with objgraph.

Dashboard Screenshots

Get Started

This 5-minute tutorial will show you how to run your first jobs with MRQ.

Installation

Make sure you have installed the dependencies: Redis and MongoDB
Install MRQ with pip install mrq
Start a mongo server with mongod &
Start a redis server with redis-server &

Write your first task

Create a new directory and write a simple task in a file called tasks.py:

$ mkdir test-mrq && cd test-mrq
$ touch __init__.py
$ vim tasks.py

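from contextlib import closing
from mrq.task import Task
import urllib2


class Fetch(Task):

    def run(self, params):

        # urllib2 responses are not context managers in Python 2,
        # so closing() is used to make sure the connection is released.
        with closing(urllib2.urlopen(params["url"])) as f:
            t = f.read()
            # The task result is the size of the fetched page.
            return len(t)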

Run it synchronously

You can now run it from the command line using mrq-run:

$ mrq-run tasks.Fetch url http://www.google.com

2014-12-18 15:44:37.869029 [DEBUG] mongodb_jobs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:44:37.880115 [DEBUG] mongodb_jobs: ... connected.
2014-12-18 15:44:37.880305 [DEBUG] Starting tasks.Fetch({'url': 'http://www.google.com'})
2014-12-18 15:44:38.158572 [DEBUG] Job None success: 0.278229s total
17655

Run it asynchronously

Let's schedule the same task 3 times with different parameters:

$ mrq-run --queue fetches tasks.Fetch url http://www.google.com &&
  mrq-run --queue fetches tasks.Fetch url http://www.yahoo.com &&
  mrq-run --queue fetches tasks.Fetch url http://www.wordpress.com

2014-12-18 15:49:05.688627 [DEBUG] mongodb_jobs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:49:05.705400 [DEBUG] mongodb_jobs: ... connected.
2014-12-18 15:49:05.729364 [INFO] redis: Connecting to Redis at 127.0.0.1...
5492f771520d1887bfdf4b0f
2014-12-18 15:49:05.957912 [DEBUG] mongodb_jobs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:49:05.967419 [DEBUG] mongodb_jobs: ... connected.
2014-12-18 15:49:05.983925 [INFO] redis: Connecting to Redis at 127.0.0.1...
5492f771520d1887c2d7d2db
2014-12-18 15:49:06.182351 [DEBUG] mongodb_jobs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:49:06.193314 [DEBUG] mongodb_jobs: ... connected.
2014-12-18 15:49:06.209336 [INFO] redis: Connecting to Redis at 127.0.0.1...
5492f772520d1887c5b32881

You can see that instead of executing the tasks and returning their results right away, mrq-run has added them to the queue named fetches and printed their IDs.

Now start MRQ's dashboard with mrq-dashboard & and go check your newly created queue and jobs on localhost:5555

They are ready to be dequeued by a worker. Start one with mrq-worker and follow it on the dashboard as it executes the queued jobs in parallel.

$ mrq-worker fetches

2014-12-18 15:52:57.362209 [INFO] Starting Gevent pool with 10 worker greenlets (+ report, logs, adminhttp)
2014-12-18 15:52:57.388033 [INFO] redis: Connecting to Redis at 127.0.0.1...
2014-12-18 15:52:57.389488 [DEBUG] mongodb_jobs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:52:57.390996 [DEBUG] mongodb_jobs: ... connected.
2014-12-18 15:52:57.391336 [DEBUG] mongodb_logs: Connecting to MongoDB at 127.0.0.1:27017/mrq...
2014-12-18 15:52:57.392430 [DEBUG] mongodb_logs: ... connected.
2014-12-18 15:52:57.523329 [INFO] Fetching 1 jobs from ['fetches']
2014-12-18 15:52:57.567311 [DEBUG] Starting tasks.Fetch({u'url': u'http://www.google.com'})
2014-12-18 15:52:58.670492 [DEBUG] Job 5492f771520d1887bfdf4b0f success: 1.135268s total
2014-12-18 15:52:57.523329 [INFO] Fetching 1 jobs from ['fetches']
2014-12-18 15:52:57.567747 [DEBUG] Starting tasks.Fetch({u'url': u'http://www.yahoo.com'})
2014-12-18 15:53:01.897873 [DEBUG] Job 5492f771520d1887c2d7d2db success: 4.361895s total
2014-12-18 15:52:57.523329 [INFO] Fetching 1 jobs from ['fetches']
2014-12-18 15:52:57.568080 [DEBUG] Starting tasks.Fetch({u'url': u'http://www.wordpress.com'})
2014-12-18 15:53:00.685727 [DEBUG] Job 5492f772520d1887c5b32881 success: 3.149119s total
2014-12-18 15:52:57.523329 [INFO] Fetching 1 jobs from ['fetches']
2014-12-18 15:52:57.523329 [INFO] Fetching 1 jobs from ['fetches']

You can interrupt the worker with Ctrl-C once it is finished.
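
As a rough check of what parallel execution bought here, the sketch below compares the sum of the three per-job durations with the longest single one. It uses only the "success" timings printed in the worker log above, and it assumes the slowest job approximates the wall-clock time of the batch.

```python
# Per-job durations in seconds, copied from the worker log above.
durations = {
    "http://www.google.com": 1.135268,
    "http://www.yahoo.com": 4.361895,
    "http://www.wordpress.com": 3.149119,
}

# Run one after another, the jobs would take roughly the sum of their durations.
sequential = sum(durations.values())

# Run concurrently in greenlets, the batch takes roughly as long as the slowest job.
parallel = max(durations.values())

print("sequential estimate: %.2fs" % sequential)
print("parallel estimate:   %.2fs" % parallel)
```

The gap between the two estimates grows with the number of IO-bound jobs a single worker handles at once.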

Going further

This was a preview of the very basic features of MRQ. What makes it actually useful is that:

You can run multiple workers in parallel. Each worker can also run multiple greenlets in parallel.
Workers can dequeue from multiple queues.
You can queue jobs from your Python code to avoid using mrq-run from the command-line.

These features will be demonstrated in a future example of a simple web crawler.

Full documentation is available on readthedocs

mrq's Issues

Make subpool_map accept size=False, 0, 1 params

Without creating an actual pool

How to set timeout?

I have set TIMEOUT=3600 in mrq-config.py, but it still says the timeout is 300. I don't know how to set the timeout correctly; any help would be appreciated!

Escape task params, traceback, logs & return

Vulnerable to XSS right now

More Mongo/Redis disconnect tests

Test disconnecting at other stages:

long-running queries in tasks & worker
specific spots in the worker code: how to make reproducible?

Push sphinx documentation to readthedocs

Task progress

from mrq.context import progress

... in a task ....
progress(0.42)

Show in dashboard
Where to store? Redis? Worker logs?
What happens with subtasks?

documentation setup

I have some documentation I'd like to add. Are there any plans to set up any documentation other than README.md?

I can add to the readme for now, just thought it would be worth starting a discussion :)

pep8 style

The code style is not very PEP 8-like.
We can use autopep8 to convert the code style according to PEP 8.

Does MongoDB support multiple processes?

Since MRQ uses supervisor to manage its processes, I want to read and write data in MongoDB from within a job. Does MongoDB support multiple processes?
Reading and writing MongoDB in this environment (--gevent 3 --processes 40) produces errors:
CursorNotFound: cursor id '36660188981453623' not valid at server
How can I resolve this?
Any help will be appreciated.

Update dashboard skin to a lighter one

As Eddie said, the dark one is heavy on the day-to-day :)

Log scale for zset ETA graphs

Sparklines make this one a bit complicated... the actual value must be the log one, and then we'd need to exponentiate it again in the tooltip. Set the log factor in the raw_queues config?
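
One way to read the idea above: hand the sparkline a log-transformed value, then invert the transform for the tooltip. The sketch below assumes a natural-log transform via log1p so that a value of 0 stays at 0. The function names are hypothetical and not part of MRQ.

```python
import math


def to_plot_value(value):
    """Value handed to the sparkline: log(1 + value), so 0 maps to 0."""
    return math.log1p(value)


def to_tooltip_value(plotted):
    """Invert the transform so the tooltip shows the real value again."""
    return int(round(math.expm1(plotted)))


values = [0, 9, 99, 99999]
plotted = [to_plot_value(v) for v in values]
restored = [to_tooltip_value(p) for p in plotted]
print(restored == values)
```

A configurable log base, as the issue suggests for raw_queues, would only change the two helper functions.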

Task callbacks

Have a task called with the results of a group of tasks when they finish (success/failed?)

Some jobs are being queued on Mongo on raw queues

Investigate refresh_timed_set from Pricing Assistant

Task never dequeued because of desync between redis and mongo

When the queue view shows different counts for redis and mongo, you can requeue those forgotten jobs by using the queue level "requeue" button. However, in this particular case the task level requeue button does not seem to have any effect.

Full linting

Implement reverse mode for raw queues

zsets could be unqueued from the right too.

Add code coverage to the tests

Must merge the coverage files of all the worker processes that ran.

Remove the second-resolution on timed queues

We have some int(time.time()) in the code, but redis actually accepts float arguments.
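
A small illustration of what the truncation costs: two jobs scheduled within the same second end up with the same integer score, so their relative order is lost, while float scores keep them apart. The timestamps below are made-up values for illustration only.

```python
# Two hypothetical timestamps 0.3 seconds apart, within the same second.
t_first = 1418917737.25
t_second = 1418917737.55

# Truncated to whole seconds, both collapse to the same score.
same_second = int(t_first) == int(t_second)

# Kept as floats, sorting by score still reflects the queueing order.
scores = sorted([t_second, t_first])

print(same_second, scores == [t_first, t_second])
```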

Graphs of the index page are leaking to other pages

More consistent API

If you do mrq-run mrq-test.Fetch '{"url":"xx"}' it will be run in the same python process, but if you do mrq-run --async mrq-test.Fetch it will be queued and return the job ID.

From the code the default is different, send_task() is async by default but can be used with send_task(sync=True).

We need to rename a couple things there to be consistent :)

Also, from mrq.queue import send_task is not the best we can do.

Compact IDs in redis queues

base64 or binary?
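
To put numbers on the two options, the sketch below encodes one of the job IDs printed by mrq-run earlier in this document. It only compares sizes; how MRQ would actually store the result is left open.

```python
import base64
import binascii

# A job ID as printed by mrq-run earlier in this document (24 hex characters).
job_id = "5492f771520d1887bfdf4b0f"

raw = binascii.unhexlify(job_id)  # raw binary form of the ID
b64 = base64.b64encode(raw)       # base64 text form of the same bytes

print(len(job_id), len(raw), len(b64))

# Both encodings are reversible back to the original hex string.
assert binascii.hexlify(raw).decode("ascii") == job_id
assert base64.b64decode(b64) == raw
```

Binary halves the size of the hex string, while base64 saves a third and stays printable.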

How to avoid too many connections to MongoDB when running many processes

0.0.46 requires redis 2.6.0+

Upon upgrading to 0.0.46 from 0.0.36 I realized we need redis 2.6.0.
Not sure how you want to specify dependencies in the README, but thought it should be noted :)

IO monitoring

Would be incredibly useful to have a list of current IO in the Dashboard:

MongoDB/Redis queries
HTTP
?

We'd collect them by monkey patching in a few places (like we already do for Mongo) and store them in worker logs.

Python 3 support

Not an explicit goal ATM.

Will wait for full gevent support first: gevent/gevent#38

Show greenlet traces in Dashboard

They are already saved in Mongo by the monitoring.

Unique jobs

We want some jobs to be only queued once.

Store a hash of the task+params ?
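
A minimal sketch of such a hash, assuming the params are JSON-serializable. The helper name job_fingerprint is hypothetical and not an MRQ function, and SHA-1 is just one possible choice of digest.

```python
import hashlib
import json


def job_fingerprint(task_path, params):
    """Stable hash of a task path plus its params (hypothetical helper)."""
    # sort_keys makes the serialized form independent of dict key order.
    payload = json.dumps([task_path, params], sort_keys=True, separators=(",", ":"))
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


a = job_fingerprint("tasks.Fetch", {"url": "http://www.google.com", "retries": 2})
b = job_fingerprint("tasks.Fetch", {"retries": 2, "url": "http://www.google.com"})
c = job_fingerprint("tasks.Fetch", {"url": "http://www.yahoo.com", "retries": 2})

# Same task and params give the same fingerprint; different params do not.
print(a == b, a == c)
```

A queueing function could then skip a job whose fingerprint is already recorded, for example in a Redis set, where SADD reports whether the member was newly added.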

Rate limiting

Implement a ratelimit() primitive backed by Redis on top of which we could add more useful features
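
The sketch below shows one possible shape for such a primitive: a fixed-window counter. It is an assumption about the design, not MRQ code. To stay self-contained it uses an in-memory stand-in where a real version would use the Redis INCR and EXPIRE commands, and every name in it is hypothetical.

```python
import time


class FakeCounterStore(object):
    """In-memory stand-in for a Redis INCR + EXPIRE pair (illustration only)."""

    def __init__(self):
        self._data = {}  # key -> (count, expires_at)

    def incr_with_ttl(self, key, ttl, now):
        count, expires_at = self._data.get(key, (0, now + ttl))
        if now >= expires_at:
            # The previous window has elapsed: start a fresh one.
            count, expires_at = 0, now + ttl
        count += 1
        self._data[key] = (count, expires_at)
        return count


def ratelimit(store, key, limit, period, now=None):
    """Return True if the call is allowed, False once `limit` is exceeded."""
    now = time.time() if now is None else now
    return store.incr_with_ttl(key, period, now) <= limit


store = FakeCounterStore()
# Three calls within the same 60-second window, with a limit of 2.
results = [ratelimit(store, "fetch:google", limit=2, period=60, now=100.0 + i)
           for i in range(3)]
# One more call after the window has expired.
after_window = ratelimit(store, "fetch:google", limit=2, period=60, now=161.0)
print(results, after_window)
```

Passing `now` explicitly keeps the example deterministic; a worker would normally omit it.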

report mongo + redis server IP addresses on mrq-dashboard

Even if running a single mongo+redis instance backend, could be useful to have the IP addresses of them on the dashboard somewhere.

Move task to another queue

Dashboard button
JobAction task

Could MRQ be deployed on several computers (cluster)?

ETAs broken

Just sometimes?

Show worker IPs in Dashboard

Useful for linking MongoDB queries to running workers.

config: allow non-default Redis db

Currently context.py uses db=0 on the Redis connection.

https://github.com/pricingassistant/mrq/blob/master/mrq/context.py#L92

Perhaps it should be able to support URLs in the form:

redis://127.0.0.1:6379/db

May submit a pull-request for this over coming weekend
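
A minimal sketch of parsing such a URL with the standard library. The helper name parse_redis_url and the fallback defaults are assumptions for illustration, not MRQ's configuration code, and the example uses a numeric database index in place of the "db" placeholder.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def parse_redis_url(url):
    """Split a redis:// URL into host, port and db (hypothetical helper)."""
    parts = urlparse(url)
    db = parts.path.lstrip("/")
    return {
        "host": parts.hostname or "127.0.0.1",
        "port": parts.port or 6379,
        # Fall back to db 0 when the URL has no path component.
        "db": int(db) if db else 0,
    }


print(parse_redis_url("redis://127.0.0.1:6379/2"))
print(parse_redis_url("redis://127.0.0.1:6379"))
```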

Allow all commandline flags for mrq-run

Currently they are only supported for mrq-worker

How to clear jobs from the dashboard or the command line?

Very slow when using MongoDB in an MRQ job's function

@sylvinus
If a job uses MongoDB, it becomes slower because MRQ also uses MongoDB to update the job's status, and the gevent context switches caused by MongoDB are expensive. It becomes a disaster when the job's function opens many MongoDB connections.
Any suggestions?

on_retry(self, exception)
on_exception(self, exception)
on_timeout(self)

and perhaps

on_success(self)?
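
To make the proposal concrete, here is a rough sketch of how a runner might call two of these hooks. Everything in it is hypothetical: HookedTask and run_with_hooks are not MRQ classes or functions, and on_retry and on_timeout are left out because they depend on worker internals not described here.

```python
class HookedTask(object):
    """Hypothetical base class exposing two of the hooks listed above."""

    def run(self, params):
        raise NotImplementedError

    def on_exception(self, exception):
        pass

    def on_success(self):
        pass


def run_with_hooks(task, params):
    """Illustrative runner: call the matching hook, then return or re-raise."""
    try:
        result = task.run(params)
    except Exception as exc:
        task.on_exception(exc)
        raise
    task.on_success()
    return result


class Divide(HookedTask):
    """Example task that records which hooks were called."""

    def __init__(self):
        self.events = []

    def run(self, params):
        return params["a"] / params["b"]

    def on_exception(self, exception):
        self.events.append("exception: %s" % type(exception).__name__)

    def on_success(self):
        self.events.append("success")


task = Divide()
ok_result = run_with_hooks(task, {"a": 4, "b": 2})
try:
    run_with_hooks(task, {"a": 1, "b": 0})
except ZeroDivisionError:
    pass
print(task.events)
```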

fix scheduler's daily time feature

this will allow scheduling tasks at a specific time of the day.

it's already in there but does not work and the test is commented out.