districtdatalabs / baleen
An automated ingestion service for blogs to construct a corpus for NLP research.
License: MIT License
The idea is to store hash of feed XML, and compare new one to previously stored.
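A minimal sketch of that idea, assuming the raw feed XML is available as bytes and that the previous hash is stored somewhere alongside the feed record. The function names here are hypothetical and not part of Baleen:

```python
import hashlib


def feed_signature(xml_bytes):
    """Return a hex digest that identifies the exact bytes of the feed XML."""
    return hashlib.sha256(xml_bytes).hexdigest()


def feed_changed(xml_bytes, stored_signature):
    """True when there is no stored hash yet or the feed XML differs from it."""
    if stored_signature is None:
        return True
    return feed_signature(xml_bytes) != stored_signature
```

Note that any byte-level difference (for example a changing build timestamp inside the XML) would count as a change under this approach.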
Update the Docker image to use Python 3.5, as Baleen now depends solely on Python 3.5.
Create a multiprocess queue model for either threading or multiprocess.
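One possible shape for this, sketched with the standard library's concurrent.futures so the same code can run on threads or processes by swapping the executor class. The ingest_all function and its worker argument are hypothetical names, not existing Baleen code:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def ingest_all(feeds, worker, executor_cls=ThreadPoolExecutor, max_workers=4):
    """Run worker(feed) for every feed and collect results and errors.

    Pass executor_cls=ProcessPoolExecutor to use processes instead of
    threads; in that case worker and the feeds must be picklable.
    Feeds are used as dictionary keys, so they must be hashable.
    """
    results, errors = {}, {}
    with executor_cls(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, feed): feed for feed in feeds}
        for future in as_completed(futures):
            feed = futures[future]
            try:
                results[feed] = future.result()
            except Exception as exc:  # one failing feed should not stop the rest
                errors[feed] = exc
    return results, errors
```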
Create a test harness for testing the database (e.g. something that creates a testing version of the database and then destroys it when complete).
Add Mongo dependency to Travis for testing.
Create a report that lets us know how ingestion is going.
After connecting to the docker image, all I see is the requirements.txt file.
To reproduce:
docker build -t "baleen_app_1" -f Dockerfile-app
docker exec -it baleen_app_1 /bin/bash
Results:
root@05c2ca45b232:/baleen# ls
requirements.txt
There is some weirdness with the dates. They say:
"03:21 PM"
That is 15:21, but it sort of looks like 03:21 AM, which is confusing.
Also, the dates are in UTC time, so either:
If you could also add humanization, that would be really helpful, for example:
That would really help, thanks!
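As a rough sketch of what is being asked for, the following formats a timestamp in unambiguous 24-hour form with an explicit UTC label and produces a humanized "ago" string. It assumes timezone-aware datetimes; the helper names are hypothetical and the thresholds are arbitrary choices:

```python
from datetime import datetime, timezone


def format_utc(dt):
    """Render an aware datetime as 24-hour time with an explicit UTC label."""
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


def humanize(then, now=None):
    """Describe how long ago `then` was, relative to `now` (defaults to now in UTC)."""
    now = now or datetime.now(timezone.utc)
    seconds = int((now - then).total_seconds())
    if seconds < 60:
        return "just now"
    # Use the largest unit that fits at least once.
    for unit, size in (("day", 86400), ("hour", 3600), ("minute", 60)):
        if seconds >= size:
            count = seconds // size
            return "{} {}{} ago".format(count, unit, "" if count == 1 else "s")
    return "just now"
```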
Baleen crashes when Mongo refuses a connection; not sure why that's happening though.
Update the project to be compatible with Python 3.5 so we have the option to use asyncio.
@bbengfort I'm having trouble pip installing feedparser from the requirements.txt. Here is the error I'm getting:
Collecting feedparser==5.2.1 (from -r requirements.txt (line 2))
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79990>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79b10>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79c90>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79e10>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f26add79f90>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /simple/feedparser/
Could not find a version that satisfies the requirement feedparser==5.2.1 (from -r requirements.txt (line 2)) (from versions: )
No matching distribution found for feedparser==5.2.1 (from -r requirements.txt (line 2))
ERROR: Service 'app' failed to build: The command '/bin/sh -c pip install -r requirements.txt' returned a non-zero code: 1
In deployment, going to /status causes the app to crash because the server runs out of memory (even though it has 4.0 GB of memory).
Looks like we're getting an occasional segfault on /status/. Note that this appears to the user as a 404 error from Nginx.
The /var/log/uwsgi/app/baleen.log has the following to say about it:
Thu Apr 7 18:00:38 2016 - !!! uWSGI process 3488 got Segmentation Fault !!!
Thu Apr 7 18:00:38 2016 - *** backtrace of 3488 ***
/usr/bin/uwsgi(uwsgi_backtrace+0x2e) [0x45121e]
/usr/bin/uwsgi(uwsgi_segfault+0x21) [0x4515f1]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f0310177d40]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Malloc+0x248) [0x7f030eff9298]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyCapsule_New+0x28) [0x7f030ef835c8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1a0fe9) [0x7f030efc5fe9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1a2a0b) [0x7f030efc7a0b]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(_PyArg_ParseTuple_SizeT+0x89) [0x7f030ef839c9]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x8d7cd) [0x7f030eeb27cd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4bd4) [0x7f030efb20d4]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8) [0x7f030efb1dd8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x48d8) [0x7f030efb1dd8]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c36d0) [0x7f030efe86d0]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xbb7bd) [0x7f030eee07bd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x7f030efcd577]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xe19a6) [0x7f030ef069a6]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x93912) [0x7f030eeb8912]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x60f12) [0x7f030ee85f12]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x13268f) [0x7f030ef5768f]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x2316) [0x7f030efaf816]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c37a5) [0x7f030efe87a5]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xeb1) [0x7f030efae3b1]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4b59) [0x7f030efb2059]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x80d) [0x7f030efb354d]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1c36d0) [0x7f030efe86d0]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0xbb7bd) [0x7f030eee07bd]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(+0x1347e5) [0x7f030ef597e5]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7f030ef54d43]
/usr/lib/x86_64-linux-gnu/libpython2.7.so.1.0(PyEval_CallObjectWithKeywords+0x47) [0x7f030efcd577]
/usr/lib/uwsgi/plugins/python_plugin.so(python_call+0x11) [0x7f030f3994f1]
/usr/lib/uwsgi/plugins/python_plugin.so(uwsgi_request_wsgi+0x127) [0x7f030f39b847]
/usr/bin/uwsgi(wsgi_req_recv+0xa1) [0x413f31]
/usr/bin/uwsgi(simple_loop_run+0xc4) [0x44d5d4]
/usr/bin/uwsgi(uwsgi_ignition+0x17b) [0x45180b]
/usr/bin/uwsgi(uwsgi_worker_run+0x26d) [0x4523ad]
/usr/bin/uwsgi(uwsgi_start+0x15e3) [0x453b23]
/usr/bin/uwsgi(main+0xfb5) [0x413595]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f0310162ec5]
/usr/bin/uwsgi() [0x413649]
*** end of backtrace ***
Thu Apr 7 18:00:44 2016 - DAMN ! worker 2 (pid: 3488) died, killed by signal 11 :( trying respawn ...
Thu Apr 7 18:00:44 2016 - Respawned uWSGI worker 2 (new pid: 3708)
Create ingestion log with start/stop and feeds ingested/errored/posts ingested/errored information.
Add support for the following export commandline options
I found the seed file, feedly.opml, in /tests/fixtures/
According to the install instructions, we should move it to /fixtures/
Implement the export utility to generate a corpus.
Add mkdocs to the repository.
And also the version module; and tag this as the current version.
Fetch the URL from the feed directly, don't rely on RSS for the full text.
Export a citation.bib and a license to go along with the corpus for the Baleen corpus reader.
Create component architecture for Mongo that is more decoupled.
Create the Baleen daemon service that uses scheduling to run in the background.
Add the www package as a submodule of baleen.
Use bleach to sanitize the post HTML to ensure there are no harmful scripts.
Either on Export or for Mongo storage.
What?!
This method was originally written to wrap HTML snippets to look like a real web page. Now we have the ability to fetch complete web pages from RSS feeds. However, in some use cases, such as when the RSS feed fails to download a web page, the old wrapping behavior will still be necessary.
Requirements:
htmlize() should either return complete webpages
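A minimal sketch of the behavior described above, assuming the check for "complete web page" is simply the presence of an html root tag. The signature and the title parameter are hypothetical and may not match Baleen's actual htmlize():

```python
import html
import re

# Assumption: content with an <html> tag is already a complete page.
_HTML_ROOT = re.compile(r"<html[\s>]", re.IGNORECASE)

_TEMPLATE = (
    "<!DOCTYPE html>\n"
    "<html>\n"
    '<head><meta charset="utf-8"><title>{title}</title></head>\n'
    "<body>\n{body}\n</body>\n"
    "</html>\n"
)


def htmlize(content, title=""):
    """Return complete pages unchanged; wrap bare snippets in a page shell."""
    if _HTML_ROOT.search(content):
        return content
    return _TEMPLATE.format(title=html.escape(title), body=content)
```

Because an already wrapped page passes through unchanged, calling the function twice on the same content is harmless.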
@bahadasx - just to make this more readable on a mobile phone, I'm going to toss in some Bootstrap if that's ok.
Get the readme going and add the Baleen architecture diagram.
The status screen that is currently running got a bit wonky by accident:
I think this was just caused by us writing updates at the same time; I made some changes and I'm sure you did too. So fixes:
It would be really helpful to get some "at a glance" times into the application, particularly for duration (finished - started).
baleen.models.Job
that computes the number of seconds between started and finished (unless not finished, then between now and started).
Iconography from Font Awesome would also help make things stand out!
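The duration rule described above could look roughly like this. It is written as a standalone function rather than a method on baleen.models.Job, and it assumes started and finished are timezone-aware datetimes; the name duration_seconds is hypothetical:

```python
from datetime import datetime, timezone


def duration_seconds(started, finished=None, now=None):
    """Seconds between started and finished, or between started and now
    when the job has not finished yet.

    Assumption: all datetimes are timezone-aware; mixing naive and aware
    values would raise a TypeError on subtraction.
    """
    end = finished or now or datetime.now(timezone.utc)
    return (end - started).total_seconds()
```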
The version number was removed from the header of the status page to make the format consistent with the rest of the site. The footer seems like the best place to add it back in unless we add an "About" page at some point.
@bahadasx - I'm shortening the URLs to make it easier to type on my phone, e.g. job_status --> status. Hope that's ok!
Add commis for the command line interface.
The logging is awesome! But let's go ahead and refactor it and the mongolog into our mixin based logger that we now more routinely implement.
Also add logging configuration to confire.
Move the utilities into their own package, including:
Just add the app run command to the commis.
Get the correct issues thing going.
I did a basic job of adding bootstrap styles to both pages in the app, but I didn't really touch the status page. You can use bootstrap components and things will be a lot cleaner, for example:
to better lay out the page. I know you might not do Bootstrap, but it's really easy and it goes a long way to making things look great without having to be a designer.
Use the YAML configuration for Flask as well.
The method:
baleen.models.Feed.count_posts
is too slow on the deployment server. It seems that:
Post.objects(feed=self).count()
is going through the entire collection and filtering, which is bad.
Need to figure out a better way to do this.
Index?
Update Quickstart documentation as we discover gaps at PyCon sprints.
During the NLP workshop, we had a good idea - why not create some NLP processing tools in Baleen? In particular, if we create a BaleenCorpusReader
class that extends or provides a similar API to the nltk.corpus.CategorizedPlaintextCorpusReader
- then we could use NLTK style analytics directly on top of the MongoDB.
Add the readability mechanism to get the good text from the HTML dump (or for insertion into mongo).
Add a timeout so that if a post or feed is having trouble being downloaded, we skip it and carry on.
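A sketch of the skip-on-timeout behavior using only the standard library's urllib, which is an assumption made for illustration and may not be the HTTP client Baleen uses. The fetch_or_skip name and the opener parameter (included so the function can be exercised without a network) are hypothetical:

```python
import socket
import urllib.error
import urllib.request


def fetch_or_skip(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the response body, or None if the download fails or times out.

    Returning None lets the caller skip this post or feed and carry on.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout, TimeoutError):
        return None
```

A caller would typically log the skipped URL and move on to the next item rather than retrying immediately.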
Get this project protected with continuous integration
Need a few things