thequbit / barkingowl

scalable web scraper framework for finding documents on websites.

License: GNU General Public License v3.0
We are not validating the URL and properties inputs being passed to the Flask app (barkingowl-controller.py). This needs to be done.
Even though the scraper finishes, the busy flag isn't always being set to false. This may be due to the scraper actually exiting due to an error.
Saw an error with Docs.add() where the 'latin-1' codec couldn't encode a character when pushing to the database ...
It would be nice to see how long websites are taking to scrape. We should be collecting both start and end times for scraping runs.
It would be nice to look at a log file of the running scrapers. Perhaps Logstash??
Right now you have to supply a doctype ... it would be nice if a wildcard (anything) was supported.
Need to handle chars within URLs that are not quoted (we can't just throw the whole URL into urllib's quote() because the http[s]:// prefix can't be quoted).
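Something like this could work, splitting the URL first so only the path and query get quoted (Python 3 urllib.parse names; the helper name is made up):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def quote_url_parts(url):
    # Split first so the scheme ("http://") and host are left
    # untouched, then quote only the path/query/fragment.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path),
        quote(parts.query, safe='=&'),
        quote(parts.fragment),
    ))
```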
It would be good to have a section in your readme that's like:
"so, you have no idea what this is. here's step by step instructions of how to run it so you can see what it is"
Right now, within runscraper.py, we aren't looking at any of the user phrases. We need to pull all the phrases for that URL and perform the test.
Every element within the BarkingOwl universe requires a unique ID. The default value in each element's init() function is str(uuid.uuid4()). It would be nice to check that the ID isn't already in use when the default is not used. If nothing else, some kind of debug output so the user knows what's going on.
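A quick sketch of the kind of check init() could do (the process-wide registry here is hypothetical):

```python
import uuid

_known_uids = set()  # hypothetical registry of IDs already in use

def register_uid(uid=None):
    # Default behavior matches init(): generate a fresh uuid4 string.
    if uid is None:
        uid = str(uuid.uuid4())
    elif uid in _known_uids:
        # Debug output so the user knows what's going on.
        print("WARNING: uid '{}' is already in use!".format(uid))
    _known_uids.add(uid)
    return uid
```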
Set length and validity limits on the form inputs on the webpage.
It would be nice to be able to delete URLs from the dispatcher database.
Preferably this would live in barkingowl-status.py (although that may change to barkingowl-control.py) as part of the Flask app that controls all of the scrapers.
Yeah, I need to do this so 1. I'm not being a jerk, and 2. I don't have to download as much stuff :D.
Right now neither the document converter nor the document processor is part of the message bus. They really don't need to be, other than to receive a 'global shutdown' message. It would also be nice to ping them for real-time statistics about their progress.
The document converter and the document processor should be included on the message bus. This should be done via some kind of message bus class that abstracts the RabbitMQ (pika) communications.
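A minimal sketch of what that class could look like, assuming pika 1.x and a made-up 'barkingowl' fanout exchange:

```python
import json
import pika

class MessageBus(object):
    def __init__(self, host='localhost', exchange='barkingowl'):
        self._connection = pika.BlockingConnection(
            pika.ConnectionParameters(host=host))
        self._channel = self._connection.channel()
        self._channel.exchange_declare(
            exchange=exchange, exchange_type='fanout')
        self._exchange = exchange

    def send(self, message):
        # Broadcast a dict to every listener on the bus.
        self._channel.basic_publish(
            exchange=self._exchange,
            routing_key='',
            body=json.dumps(message))

    def listen(self, callback):
        # Bind a throwaway queue and block, handing each message to
        # callback(dict); a 'global shutdown' check could live there.
        result = self._channel.queue_declare(queue='', exclusive=True)
        self._channel.queue_bind(
            exchange=self._exchange, queue=result.method.queue)

        def _on_message(channel, method, properties, body):
            callback(json.loads(body))

        self._channel.basic_consume(
            queue=result.method.queue,
            on_message_callback=_on_message,
            auto_ack=True)
        self._channel.start_consuming()
```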
Allow certain times of the day to be masked off to prevent scraping. This will be used to avoid scraping during 'normal business hours'.
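A tiny sketch of the mask check (the masked_hours argument is a made-up name):

```python
from datetime import datetime

def scraping_allowed(masked_hours, now=None):
    # masked_hours is a set of hours (0-23) during which scraping is
    # blocked, e.g. set(range(9, 17)) for 'normal business hours'.
    now = now or datetime.now()
    return now.hour not in masked_hours
```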
Right now exiting happens with a 'raise Exception()' ... it would be nice to find a cleaner way to exit the thread.
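One common pattern is a threading.Event that run() polls; a rough sketch:

```python
import threading
import time

class ScraperThread(threading.Thread):
    def __init__(self):
        super(ScraperThread, self).__init__()
        self._stop_event = threading.Event()

    def stop(self):
        # Ask the thread to exit instead of raising Exception().
        self._stop_event.set()

    def run(self):
        while not self._stop_event.is_set():
            # ... do one unit of scraping work here ...
            time.sleep(0.1)
```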
It would be great to record why a link was marked as 'bad' within the _data['bad_urls'] list. This would make each entry in the list a dict of 'url' and 'error' rather than just a string.
There is at least one location, I think two, where _data['bad_urls'] is used. We will need to iterate through it with some different code rather than just using "url in bad_urls".
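With dict entries, the membership test could become something like:

```python
def is_bad_url(url, bad_urls):
    # With entries like {'url': ..., 'error': ...} instead of bare
    # strings, a plain "url in bad_urls" no longer matches;
    # compare against the 'url' key instead.
    return any(entry['url'] == url for entry in bad_urls)
```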
It would be nice to return all documents that are not of a specified type, such as all documents that are not html.
BarkingOwl currently labels 'mailto:' links as 'bad links'. Probably a good idea to not include these in the 'bad link' category, as a mailto: link really isn't bad, just not followable.
It would be nice to enable and disable URLs so they don't have to be deleted and then added again.
It would be nice to dispatch work via 0mq rather than launch threads from a single python script. This would make the scraper significantly more scalable.
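A rough sketch of PUSH/PULL dispatch with pyzmq (the port and payload fields are made up; the two halves would normally live in separate processes):

```python
import zmq

context = zmq.Context()

# Dispatcher side: push URL jobs to any connected scraper worker.
sender = context.socket(zmq.PUSH)
sender.bind('tcp://*:5557')

# Scraper worker side (normally a separate process): pull jobs.
receiver = context.socket(zmq.PULL)
receiver.connect('tcp://localhost:5557')

sender.send_json({'target_url': 'http://example.com'})
job = receiver.recv_json()
print(job)
```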
It would be great if these files could be run as daemons, so they could be "set-and-forget". This would also provide example code for others implementing BarkingOwl who want to do the same for their systems.
The scraper currently assumes that each URL being passed in is the root URL (i.e. it sees "http://google.com" and checks to make sure all links stay under that root URL). So if you pass in "http://google.com/mydocs/" nothing will really work (or not the way you expect it to).
Adding this feature would also allow us to look for links to "http://ecode360.org/" on "http://henreitta.org/".
I am falling through to the general try/except right now when the data isn't formatted correctly. I know this should never happen, but somehow the town of Gates, NY wrote "2013100" into all of their CreationDate fields ...
Noticed that it may be useful to follow tags by their src attribute the same way we already follow tags by their href attribute.
This should be included as an optional boolean value passed into the BarkingOwl scraper when it is initialized.
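A sketch of what the link extraction could look like with BeautifulSoup (the follow_src flag name is an assumption, standing in for the proposed init boolean):

```python
from bs4 import BeautifulSoup

def extract_links(html, follow_src=False):
    soup = BeautifulSoup(html, 'html.parser')
    # Existing behavior: follow every tag that carries an href.
    links = [tag['href'] for tag in soup.find_all(href=True)]
    if follow_src:
        # Proposed behavior: also follow tags that carry a src.
        links.extend(tag['src'] for tag in soup.find_all(src=True))
    return links
```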
I'm not sure if this is a scraper issue or a dispatcher issue, but once a scraper completes, it does not get issued the next URL.
To prevent thrashing, have a sleep between downloading URLs. This should default to zero, but be configurable.
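A sketch of the throttle, using plain urllib:

```python
import time
import urllib.request

def fetch_all(urls, delay=0.0):
    # delay defaults to zero; set it to e.g. 1.0 to wait a second
    # between downloads and avoid thrashing the target site.
    pages = []
    for url in urls:
        pages.append(urllib.request.urlopen(url).read())
        if delay > 0:
            time.sleep(delay)
    return pages
```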
I think this is an unwrapped error that start_consuming() is catching within pika. This happens when launching barkingowl-scraper.py. It may also happen when importing the PyPI package. Need to wrap everything in try/except and find out what is going on.
PR #39 provided more feedback for an invalid url_data payload for the scraper; the same thing needs to be implemented for the dispatcher. This should be abstracted into a utils.py file that everyone can use. I am sure there will be other functions/classes that will be put in there as well.
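A sketch of what the shared check might look like in utils.py (the required field names are guesses at the payload shape, not pulled from the actual code):

```python
def check_url_data(url_data):
    # Raise early with a useful message instead of failing deep
    # inside the scraper or dispatcher.
    required = ('target_url', 'doc_types', 'title')
    missing = [field for field in required if field not in url_data]
    if missing:
        raise ValueError(
            'url_data payload missing fields: {}'.format(missing))
```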
If the file type comes back as null, we should really try again ... since there should be SOMETHING we can type it as.
I believe that relative links are not being handled correctly. Additionally, if a URL goes above the root, it ends up marked as a bad URL. If it is above the root, we should just drive it back to the root.
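A sketch of resolving relative links and clamping above-root paths (the helper name is mine):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve_link(base_url, link):
    # urljoin handles ordinary relative links; any leftover '..'
    # segments that would climb above the root are collapsed so the
    # URL is driven back to the root instead of marked as bad.
    parts = urlsplit(urljoin(base_url, link))
    segments = []
    for segment in parts.path.lstrip('/').split('/'):
        if segment == '..':
            if segments:
                segments.pop()
        elif segment != '.':
            segments.append(segment)
    return urlunsplit((parts.scheme, parts.netloc,
                       '/' + '/'.join(segments),
                       parts.query, parts.fragment))
```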
If the scraper fails, there really is no feedback to the dispatcher that it failed, so a URL could go without being scraped. Perhaps some additional handshaking should be added.
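Building on the message bus sketch above, the failure handshake could be as simple as this (field names are assumptions):

```python
def report_failure(bus, scraper_uid, target_url, error):
    # Publish a failure notice so the dispatcher can re-queue
    # the URL for another scraper.
    bus.send({
        'command': 'scraper_failed',
        'scraper_uid': scraper_uid,
        'url': target_url,
        'error': str(error),
    })
```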
It would be helpful to include the title/description of the page the document was found on, as sometimes there is little to no link text and the document name is rather cryptic.