barkingowl's Introduction

BarkingOwl

Join the chat at https://gitter.im/thequbit/BarkingOwl

BarkingOwl is a scalable web crawler intended to be used to find specific document types such as PDFs.

Not a hard-core hacker? Check out the web front-end tool for barkingowl here

#### Background and Description ####

BarkingOwl came out of a need presented at a Hacks and Hackers Rochester (#hhroc) meet-up in Syracuse, NY. A journalist expressed his need for a tool that would assist him in looking for keywords within PDFs posted to town websites, such as meeting minutes.

#### Objective ####

I wanted to make the code for this project as reusable as possible, as I knew it had several parallels to other work I had been doing and wanted to do in the future. The solution was an architecture that would allow for significant scalability and extensibility.

#### How to get started ####

BarkingOwl is on PyPI, so it can be installed using pip:

> pip install barkingowl

To use BarkingOwl you will need to install RabbitMQ. Information on how to install RabbitMQ can be found here: http://www.rabbitmq.com/download.html
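For example, on Debian/Ubuntu systems RabbitMQ is typically available as a distribution package (package name assumed; see the download page above for other platforms):

```shell
# Debian/Ubuntu example; consult the RabbitMQ download page for other platforms
sudo apt-get install rabbitmq-server

# check that the broker is running
sudo service rabbitmq-server status
```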

#### Documentation ####

Check out the wiki!

barkingowl's People

Contributors

citruspi, gitter-badger, msabramo, ralphbean, thequbit


barkingowl's Issues

Scraper appears to lose connection randomly

I think this is an unwrapped error that pika's start_consuming() is catching internally. It happens when launching barkingowl-scraper.py, and may also happen when importing the PyPI package. We need to wrap all the things in try/except and find out what is going on.
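As a sketch of the wrapping idea (the names here are placeholders, not BarkingOwl's actual code), the consume loop can be isolated so exceptions are reported instead of disappearing inside the library:

```python
# Hypothetical sketch: `start_consuming` stands in for pika's
# channel.start_consuming(); any exception it raises is surfaced
# to `on_error` instead of being silently swallowed.
def run_consumer(start_consuming, on_error):
    try:
        start_consuming()
        return True
    except Exception as exc:  # e.g. pika.exceptions.ConnectionClosed
        on_error(exc)
        return False

# Simulate a consume loop that drops its connection.
def flaky_consume():
    raise ConnectionError('connection lost')

errors = []
ok = run_consumer(flaky_consume, errors.append)
```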

Include what the error was with the url within the bad_urls list

It would be great to record why a link was marked as 'bad' within the _data['bad_urls'] list. This would make each entry in the list a dict of 'url' and 'error' rather than just a string.

There is at least one location, possibly two, where _data['bad_urls'] is used. We will need to iterate through it with some different code rather than just using "url in bad_urls".
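A minimal sketch of the change (field names are illustrative):

```python
# Each bad-URL entry becomes a dict instead of a bare string.
bad_urls = []

def mark_bad(url, error):
    bad_urls.append({'url': url, 'error': error})

def is_bad(url):
    # replaces the old `url in bad_urls` membership test
    return any(entry['url'] == url for entry in bad_urls)

mark_bad('http://example.com/missing.pdf', 'HTTP 404')
```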

Include title of page document was found on

It would be helpful to include the title/description of the page that the document was found on, as sometimes there is little to no link text and the document name is rather cryptic.

Phrase detect

Right now, within runscapper.py, we aren't looking at any of the user phrases. We need to pull all the phrases for that URL and perform the test.
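The test itself could be as simple as a case-insensitive substring check (a sketch; the real matching rules may differ):

```python
# Return the user phrases that appear in the extracted document text.
def matching_phrases(doc_text, phrases):
    text = doc_text.lower()
    return [p for p in phrases if p.lower() in text]

hits = matching_phrases('Minutes of the Zoning Board meeting',
                        ['zoning board', 'budget'])
```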

Add exception case for "mailto:" URLs

BarkingOwl currently labels 'mailto:' links as 'bad links'. Probably a good idea to not include this in the 'bad link' category as it really isn't a bad link, just not followable.
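One way to sketch this is to treat non-followable schemes as a separate category (the extra schemes listed are assumptions, not part of the original issue):

```python
from urllib.parse import urlparse

# Schemes that are valid but not crawlable; 'tel' and 'javascript'
# are illustrative additions alongside 'mailto'.
NON_FOLLOWABLE = {'mailto', 'tel', 'javascript'}

def classify_link(url):
    if urlparse(url).scheme in NON_FOLLOWABLE:
        return 'skipped'
    return 'followable'
```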

doctext encoding issue

Saw an error with Docs.add() where 'latin-1' could not be encoded when pushing to the database ...

Follow <embed> tags as <a> tags

Noticed that it may be useful to follow <embed> tags the way <a> tags are followed. For <a> tags we follow the href attribute; with <embed> tags we follow the src attribute.

This should be included as a possible boolean value passed into the barking owl scraper when initialized.
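A sketch with the stdlib HTML parser (BarkingOwl's actual parsing code may differ) showing how the proposed boolean could gate the behavior:

```python
from html.parser import HTMLParser

# Collect <a href> targets and, when enabled, <embed src> targets too.
class LinkCollector(HTMLParser):
    def __init__(self, follow_embeds=False):
        super().__init__()
        self.follow_embeds = follow_embeds
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'embed' and self.follow_embeds and 'src' in attrs:
            self.links.append(attrs['src'])

p = LinkCollector(follow_embeds=True)
p.feed('<a href="/a.pdf"></a><embed src="/b.pdf">')
```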

Neither the document converter nor the document processor is part of the message bus.

Right now neither the document converter nor the document processor is part of the message bus. They don't really need to be, other than to receive a 'global shutdown' message. It would also be nice to ping them for real-time statistics about their progress.

The document converter and the document processor should be included on the message bus. This should be done by some kind of message bus class to abstract the RabbitMQ (pika) communications.

Broadcast to check if UUID is already in use

Every element within the BarkingOwl universe requires a unique ID. The default value in each element's init() function is str(uuid.uuid4()). It would be nice to check that a caller-supplied ID isn't already in use when the default is not used. At minimum, some kind of debug output so the user knows what's going on.
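A local sketch of the check (a real implementation would broadcast over the bus and wait for replies; the in-process registry here is just illustrative):

```python
import uuid

registered_ids = set()

def register_id(uid=None):
    # Mirrors the init() default of str(uuid.uuid4())
    uid = uid or str(uuid.uuid4())
    if uid in registered_ids:
        raise ValueError('ID already in use: {0}'.format(uid))
    registered_ids.add(uid)
    return uid

first = register_id('scraper-1')
```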

Error check against a bad date format in CreationDate

I am falling through to the general try/except right now when the date isn't formatted correctly. I know this should never happen, but somehow the town of Gates, NY wrote "2013100" into all of their CreationDate fields ...
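A sketch of a tolerant parser (the formats are assumptions; real PDF CreationDate values often carry a "D:" prefix and timezone suffix):

```python
from datetime import datetime

def parse_creation_date(raw):
    # Try known formats; return None for garbage like "2013100"
    # instead of raising into a catch-all except.
    for fmt in ('%Y%m%d%H%M%S', '%Y%m%d'):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None
```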

Handle relative links better

I believe that relative links are not being handled correctly. Additionally, if a URL goes above the root, it ends up in the bad URL list. If it is above the root, we should just clamp it to the root.
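For reference, the stdlib's urljoin already resolves relative links per RFC 3986 and clamps excess ".." segments at the root, which may be usable here:

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/page.html'

# Normal relative link resolution
resolved = urljoin(base, 'doc.pdf')

# A link that climbs above the root is clamped to the root
clamped = urljoin(base, '../../../../doc.pdf')
```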

Create a scraper control tool

Preferably within barkingowl-status.py (although that may change to barkingowl-control.py) as part of the flask app to control all of the scrapers.

Validate url_data within dispatcher

PR #39 provided more feedback for an invalid url_data payload to the scraper; the same thing needs to be implemented for the dispatcher. This should be abstracted into a utils.py file that everyone can use. I am sure there will be other functions/classes that will go in there as well.
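A sketch of what a shared validator in utils.py might look like (the required keys are placeholders, not the actual payload schema):

```python
# Hypothetical required fields for a url_data payload.
REQUIRED_URL_DATA_KEYS = ('targeturl', 'maxlinklevel', 'doctype')

def validate_url_data(url_data):
    # Returns (ok, missing_keys) so callers can report useful feedback.
    missing = [k for k in REQUIRED_URL_DATA_KEYS if k not in url_data]
    return (len(missing) == 0, missing)

ok, missing = validate_url_data({'targeturl': 'http://example.com'})
```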

Handle spaces in URL names

Need to handle characters within URLs that are not quoted (we can't just throw the whole URL into url.quote() because the http[s]:// prefix can't be quoted).
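One sketch: split the URL first so only the path is quoted, leaving the scheme and host untouched:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    # Quote the path only; "://" and the host are left alone.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc,
                       quote(parts.path), parts.query, parts.fragment))
```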

Add end time to scraps table

It would be nice to see how long websites are taking to scrape. We should be collecting both start and end times for scraping runs.

Add hacking instructions

good to have a section in your readme that's like:

"so, you have no idea what this is. here's step by step instructions of how to run it so you can see what it is"

Allow for time masking

Allow certain times of the day to be masked off to prevent scraping. This will be used to avoid scraping during 'normal business hours'.
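A minimal sketch of a same-day mask (an overnight window spanning midnight would need an extra case):

```python
from datetime import time

def is_masked(now, start=time(9, 0), end=time(17, 0)):
    # True when `now` falls inside the do-not-scrape window.
    return start <= now < end
```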
