barkingowl's Introduction

BarkingOwl

Join the chat at https://gitter.im/thequbit/BarkingOwl

BarkingOwl is a scalable web crawler intended to be used to find specific document types such as PDFs.

Not a hard-core hacker? Check out the web front-end tool for barkingowl here

#### Background and Description ####

BarkingOwl came out of a need presented at a Hacks and Hackers Rochester (#hhroc) meet-up in Syracuse, NY. A journalist expressed his need for a tool that would assist him in looking for keywords within PDFs posted to town websites, such as meeting minutes.

#### Objective ####

I wanted to make the code for this project as reusable as possible, as I knew it had several parallels to other work I had been doing and wanted to do in the future. The solution was an architecture that would allow for significant scalability and extensibility.

#### How to get started ####

BarkingOwl is on PyPI, so it can be installed using pip:

> pip install barkingowl

To use BarkingOwl you will need to install RabbitMQ. Information on how to install RabbitMQ can be found here: http://www.rabbitmq.com/download.html
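For example, on Debian/Ubuntu systems RabbitMQ is typically available as a distribution package (package name assumed; see the download page above for other platforms):

```shell
# Debian/Ubuntu example; consult the RabbitMQ download page for other platforms
sudo apt-get install rabbitmq-server

# check that the broker is running
sudo service rabbitmq-server status
```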

#### Documentation ####

Check out the wiki!

barkingowl's People

Contributors

citruspi, gitter-badger, msabramo, ralphbean, thequbit


barkingowl's Issues

Scraper appears to lose connection randomly

I think this is an unwrapped error that pika's start_consuming() is catching internally. It happens when launching barkingowl-scraper.py, and may also happen when importing the PyPI package. We need to wrap all the things in try/except and find out what is going on.
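As a sketch of the wrapping idea (the names here are placeholders, not BarkingOwl's actual code), the consume loop can be isolated so exceptions are reported instead of disappearing inside the library:

```python
# Hypothetical sketch: `start_consuming` stands in for pika's
# channel.start_consuming(); any exception it raises is surfaced
# to `on_error` instead of being silently swallowed.
def run_consumer(start_consuming, on_error):
    try:
        start_consuming()
        return True
    except Exception as exc:  # e.g. pika.exceptions.ConnectionClosed
        on_error(exc)
        return False

# Simulate a consume loop that drops its connection.
def flaky_consume():
    raise ConnectionError('connection lost')

errors = []
ok = run_consumer(flaky_consume, errors.append)
```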

Include what the error was with the url within the bad_urls list

It would be great to record why a link was marked as 'bad' within the _data['bad_urls'] list. This would make each entry in the list a dict of 'url' and 'error' rather than just a string.

There is at least one location, possibly two, where _data['bad_urls'] is used. We will need to iterate through it with some different code rather than just using "url in bad_urls".
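A minimal sketch of the change (field names are illustrative):

```python
# Each bad-URL entry becomes a dict instead of a bare string.
bad_urls = []

def mark_bad(url, error):
    bad_urls.append({'url': url, 'error': error})

def is_bad(url):
    # replaces the old `url in bad_urls` membership test
    return any(entry['url'] == url for entry in bad_urls)

mark_bad('http://example.com/missing.pdf', 'HTTP 404')
```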

Include title of page document was found on

It would be helpful to include the title/description of the page that the document was found on, as sometimes there is little to no link text and the document name is rather cryptic.

Phrase detect

Right now, within runscapper.py, we aren't looking at any of the user phrases. We need to pull all the phrases for that URL and perform the test.
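The test itself could be as simple as a case-insensitive substring check (a sketch; the real matching rules may differ):

```python
# Return the user phrases that appear in the extracted document text.
def matching_phrases(doc_text, phrases):
    text = doc_text.lower()
    return [p for p in phrases if p.lower() in text]

hits = matching_phrases('Minutes of the Zoning Board meeting',
                        ['zoning board', 'budget'])
```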

Add exception case for "mailto:" URLs

BarkingOwl currently labels 'mailto:' links as 'bad links'. Probably a good idea to not include this in the 'bad link' category as it really isn't a bad link, just not followable.
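One way to sketch this is to treat non-followable schemes as a separate category (the extra schemes listed are assumptions, not part of the original issue):

```python
from urllib.parse import urlparse

# Schemes that are valid but not crawlable; 'tel' and 'javascript'
# are illustrative additions alongside 'mailto'.
NON_FOLLOWABLE = {'mailto', 'tel', 'javascript'}

def classify_link(url):
    if urlparse(url).scheme in NON_FOLLOWABLE:
        return 'skipped'
    return 'followable'
```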

doctext encoding issue

Saw an error with Docs.add() where 'latin-1' could not be encoded when pushing to the database ...

Follow <embed> tags as <a> tags

Noticed that it may be useful to follow <embed> tags the way <a> tags are followed. For <a> tags we follow the href attribute; with <embed> tags we follow the src attribute.

This should be included as a possible boolean value passed into the barking owl scraper when initialized.
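A sketch with the stdlib HTML parser (BarkingOwl's actual parsing code may differ) showing how the proposed boolean could gate the behavior:

```python
from html.parser import HTMLParser

# Collect <a href> targets and, when enabled, <embed src> targets too.
class LinkCollector(HTMLParser):
    def __init__(self, follow_embeds=False):
        super().__init__()
        self.follow_embeds = follow_embeds
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'href' in attrs:
            self.links.append(attrs['href'])
        elif tag == 'embed' and self.follow_embeds and 'src' in attrs:
            self.links.append(attrs['src'])

p = LinkCollector(follow_embeds=True)
p.feed('<a href="/a.pdf"></a><embed src="/b.pdf">')
```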

Neither the document converter nor the document processor is part of the message bus.

Right now neither the document converter nor the document processor is part of the message bus. They don't really need to be, other than to receive a 'global shutdown' message. It would also be nice to ping them for real-time statistics about their progress.

The document converter and the document processor should be included on the message bus. This should be done by some kind of message bus class to abstract the RabbitMQ (pika) communications.

Broadcast to check if UUID is already in use

Every element within the BarkingOwl universe requires a unique ID. The default value in each element's init() function is str(uuid.uuid4()). It would be nice to check that a caller-supplied ID isn't already in use when the default is not used. At minimum, some kind of debug output so the user knows what's going on.
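A local sketch of the check (a real implementation would broadcast over the bus and wait for replies; the in-process registry here is just illustrative):

```python
import uuid

registered_ids = set()

def register_id(uid=None):
    # Mirrors the init() default of str(uuid.uuid4())
    uid = uid or str(uuid.uuid4())
    if uid in registered_ids:
        raise ValueError('ID already in use: {0}'.format(uid))
    registered_ids.add(uid)
    return uid

first = register_id('scraper-1')
```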

Error check against a bad date format in CreationDate

I am falling through to the general try/except right now when the date isn't formatted correctly. I know this should never happen, but somehow the town of Gates, NY wrote "2013100" into all of their CreationDate fields ...
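A sketch of a tolerant parser (the formats are assumptions; real PDF CreationDate values often carry a "D:" prefix and timezone suffix):

```python
from datetime import datetime

def parse_creation_date(raw):
    # Try known formats; return None for garbage like "2013100"
    # instead of raising into a catch-all except.
    for fmt in ('%Y%m%d%H%M%S', '%Y%m%d'):
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    return None
```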

Handle relative links better

I believe that relative links are not being handled correctly. Additionally, if a URL goes above the root, it ends up in the bad URL list. If it is above the root, we should just clamp it to the root.
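For reference, the stdlib's urljoin already resolves relative links per RFC 3986 and clamps excess ".." segments at the root, which may be usable here:

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/page.html'

# Normal relative link resolution
resolved = urljoin(base, 'doc.pdf')

# A link that climbs above the root is clamped to the root
clamped = urljoin(base, '../../../../doc.pdf')
```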

Create a scraper control tool

Preferably within barkingowl-status.py (although that may change to barkingowl-control.py) as part of the flask app to control all of the scrapers.

Validate url_data within dispatcher

PR #39 provided more feedback for an invalid url_data payload to the scraper; the same thing needs to be implemented for the dispatcher. This should be abstracted into a utils.py file that everyone can use. I am sure there will be other functions/classes that will go in there as well.
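A sketch of what a shared validator in utils.py might look like (the required keys are placeholders, not the actual payload schema):

```python
# Hypothetical required fields for a url_data payload.
REQUIRED_URL_DATA_KEYS = ('targeturl', 'maxlinklevel', 'doctype')

def validate_url_data(url_data):
    # Returns (ok, missing_keys) so callers can report useful feedback.
    missing = [k for k in REQUIRED_URL_DATA_KEYS if k not in url_data]
    return (len(missing) == 0, missing)

ok, missing = validate_url_data({'targeturl': 'http://example.com'})
```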

Handle spaces in URL names

Need to handle characters within URLs that are not quoted (we can't just throw the whole URL into url.quote() because the http[s]:// prefix can't be quoted).
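One sketch: split the URL first so only the path is quoted, leaving the scheme and host untouched:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    # Quote the path only; "://" and the host are left alone.
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc,
                       quote(parts.path), parts.query, parts.fragment))
```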

Add end time to scraps table

It would be nice to see how long websites are taking to scrape. We should be collecting both start and end times for scraping runs.

Add hacking instructions

good to have a section in your readme that's like:

"so, you have no idea what this is. here's step by step instructions of how to run it so you can see what it is"

Allow for time masking

Allow certain times of the day to be masked off to prevent scraping. This will be used to avoid scraping during 'normal business hours'.
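A minimal sketch of a same-day mask (an overnight window spanning midnight would need an extra case):

```python
from datetime import time

def is_masked(now, start=time(9, 0), end=time(17, 0)):
    # True when `now` falls inside the do-not-scrape window.
    return start <= now < end
```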
