thequbit / barkingowl

scalable web scraper framework for finding documents on websites.

License: GNU General Public License v3.0
We are not validating the URL and properties inputs being passed to the Flask app (barkingowl-controller.py). This needs to be done.
Even though the scraper finishes, the busy flag isn't always being set to false. This may be due to the scraper actually exiting due to an error.
Saw an error with Docs.add() where the 'latin-1' codec couldn't encode a character when pushing to the database ...
It would be nice to see how long websites are taking to scrape. We should be collecting both start and end times for scraping runs.
It would be nice to look at a log file of the running scrapers. Perhaps Logstash??
Right now you have to supply a doctype ... it would be nice if a wildcard (anything) was supported.
Need to handle chars within URLs that are not quoted (we can't just throw the whole URL into urllib's quote() because the http[s]:// prefix can't be quoted).
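Something like this could work, splitting the URL first so only the path and query get quoted (Python 3 urllib.parse names; the helper name is made up):

```python
from urllib.parse import urlsplit, urlunsplit, quote

def quote_url_parts(url):
    # Split first so the scheme ("http://") and host are left
    # untouched, then quote only the path/query/fragment.
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path),
        quote(parts.query, safe='=&'),
        quote(parts.fragment),
    ))
```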
It would be good to have a section in your readme that's like:
"so, you have no idea what this is. here's step by step instructions of how to run it so you can see what it is"
Right now, within runscraper.py, we aren't looking at any of the user phrases. We need to pull all the phrases for that URL and perform the test.
Every element within the BarkingOwl universe requires a unique ID. The default value in each element's init() function is str(uuid.uuid4()). It would be nice to check that the ID isn't already in use when the default is not used. If nothing else, some kind of debug output so the user knows what's going on.
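A quick sketch of the kind of check init() could do (the process-wide registry here is hypothetical):

```python
import uuid

_known_uids = set()  # hypothetical registry of IDs already in use

def register_uid(uid=None):
    # Default behavior matches init(): generate a fresh uuid4 string.
    if uid is None:
        uid = str(uuid.uuid4())
    elif uid in _known_uids:
        # Debug output so the user knows what's going on.
        print("WARNING: uid '{}' is already in use!".format(uid))
    _known_uids.add(uid)
    return uid
```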
Set length and validity limits on the form inputs on the webpage.
It would be nice to be able to delete URLs from the dispatcher database.
Preferably this would live in barkingowl-status.py (although that may change to barkingowl-control.py) as part of the Flask app that controls all of the scrapers.
Yeah, I need to do this so 1. I'm not being a jerk, and 2. I don't have to download as much stuff :D.
Right now neither the document converter nor the document processor is part of the message bus. They really don't need to be, other than to receive a 'global shutdown' message. It would also be nice to ping them for real-time statistics about their progress.
The document converter and the document processor should be included on the message bus. This should be done via some kind of message bus class that abstracts the RabbitMQ (pika) communications.
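A minimal sketch of what that class could look like, assuming pika 1.x and a made-up 'barkingowl' fanout exchange:

```python
import json
import pika

class MessageBus(object):
    def __init__(self, host='localhost', exchange='barkingowl'):
        self._connection = pika.BlockingConnection(
            pika.ConnectionParameters(host=host))
        self._channel = self._connection.channel()
        self._channel.exchange_declare(
            exchange=exchange, exchange_type='fanout')
        self._exchange = exchange

    def send(self, message):
        # Broadcast a dict to every listener on the bus.
        self._channel.basic_publish(
            exchange=self._exchange,
            routing_key='',
            body=json.dumps(message))

    def listen(self, callback):
        # Bind a throwaway queue and block, handing each message to
        # callback(dict); a 'global shutdown' check could live there.
        result = self._channel.queue_declare(queue='', exclusive=True)
        self._channel.queue_bind(
            exchange=self._exchange, queue=result.method.queue)

        def _on_message(channel, method, properties, body):
            callback(json.loads(body))

        self._channel.basic_consume(
            queue=result.method.queue,
            on_message_callback=_on_message,
            auto_ack=True)
        self._channel.start_consuming()
```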
Allow certain times of the day to be masked off to prevent scraping. This will be used to avoid scraping during 'normal business hours'.
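A tiny sketch of the mask check (the masked_hours argument is a made-up name):

```python
from datetime import datetime

def scraping_allowed(masked_hours, now=None):
    # masked_hours is a set of hours (0-23) during which scraping is
    # blocked, e.g. set(range(9, 17)) for 'normal business hours'.
    now = now or datetime.now()
    return now.hour not in masked_hours
```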
Right now exiting happens with a 'raise Exception()' ... it would be nice to find a cleaner way to exit the thread.
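One common pattern is a threading.Event that run() polls; a rough sketch:

```python
import threading
import time

class ScraperThread(threading.Thread):
    def __init__(self):
        super(ScraperThread, self).__init__()
        self._stop_event = threading.Event()

    def stop(self):
        # Ask the thread to exit instead of raising Exception().
        self._stop_event.set()

    def run(self):
        while not self._stop_event.is_set():
            # ... do one unit of scraping work here ...
            time.sleep(0.1)
```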
It would be great to record why a link was marked as 'bad' within the _data['bad_urls'] list. This would make each entry in the list a dict of 'url' and 'error' rather than just a string.
There is at least one location, I think two, where _data['bad_urls'] is used. We will need to iterate through it with some different code rather than just using "url in bad_urls".
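With dict entries, the membership test could become something like:

```python
def is_bad_url(url, bad_urls):
    # With entries like {'url': ..., 'error': ...} instead of bare
    # strings, a plain "url in bad_urls" no longer matches;
    # compare against the 'url' key instead.
    return any(entry['url'] == url for entry in bad_urls)
```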
It would be nice to return all documents that are not of a specified type, such as all documents that are not html.
BarkingOwl currently labels 'mailto:' links as 'bad links'. Probably a good idea to not include these in the 'bad link' category, as a mailto: link really isn't bad, just not followable.
It would be nice to enable and disable URLs so they don't have to be deleted and then added again.
It would be nice to dispatch work via 0mq rather than launch threads from a single python script. This would make the scraper significantly more scalable.
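A rough sketch of PUSH/PULL dispatch with pyzmq (the port and payload fields are made up; the two halves would normally live in separate processes):

```python
import zmq

context = zmq.Context()

# Dispatcher side: push URL jobs to any connected scraper worker.
sender = context.socket(zmq.PUSH)
sender.bind('tcp://*:5557')

# Scraper worker side (normally a separate process): pull jobs.
receiver = context.socket(zmq.PULL)
receiver.connect('tcp://localhost:5557')

sender.send_json({'target_url': 'http://example.com'})
job = receiver.recv_json()
print(job)
```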
It would be great if these files could be run as daemons, so they could be "set-and-forget". This would also provide example code for others implementing BarkingOwl who want to do the same for their systems.
The scraper currently assumes that each URL being passed in is the root URL (i.e. it sees "http://google.com" and checks to make sure all links stay under that root URL). So if you pass in "http://google.com/mydocs/" nothing will really work (or not the way you expect it to).
Adding this feature would also allow us to look for links to "http://ecode360.org/" on "http://henreitta.org/".
I am falling through to the general try/except right now when the data isn't formatted correctly. I know this should never happen, but somehow the town of Gates, NY wrote "2013100" into all of their CreationDate fields ...
Noticed that it may be useful to follow tags by their src attribute the same way we already follow tags by their href attribute.
This should be included as an optional boolean value passed into the BarkingOwl scraper when it is initialized.
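A sketch of what the link extraction could look like with BeautifulSoup (the follow_src flag name is an assumption, standing in for the proposed init boolean):

```python
from bs4 import BeautifulSoup

def extract_links(html, follow_src=False):
    soup = BeautifulSoup(html, 'html.parser')
    # Existing behavior: follow every tag that carries an href.
    links = [tag['href'] for tag in soup.find_all(href=True)]
    if follow_src:
        # Proposed behavior: also follow tags that carry a src.
        links.extend(tag['src'] for tag in soup.find_all(src=True))
    return links
```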
I'm not sure if this is a scraper issue or a dispatcher issue, but once a scraper completes, it does not get issued the next URL.
To prevent thrashing, have a sleep between downloading URLs. This should default to zero, but be configurable.
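A sketch of the throttle, using plain urllib:

```python
import time
import urllib.request

def fetch_all(urls, delay=0.0):
    # delay defaults to zero; set it to e.g. 1.0 to wait a second
    # between downloads and avoid thrashing the target site.
    pages = []
    for url in urls:
        pages.append(urllib.request.urlopen(url).read())
        if delay > 0:
            time.sleep(delay)
    return pages
```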
I think this is an unwrapped error that start_consuming() is catching within pika. This happens when launching barkingowl-scraper.py. It may also happen when importing the PyPI package. Need to wrap everything in try/except and find out what is going on.
PR #39 provided more feedback for an invalid url_data payload for the scraper; the same thing needs to be implemented for the dispatcher. This should be abstracted into a utils.py file that everyone can use. I am sure there will be other functions/classes that will be put in there as well.
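A sketch of what the shared check might look like in utils.py (the required field names are guesses at the payload shape, not pulled from the actual code):

```python
def check_url_data(url_data):
    # Raise early with a useful message instead of failing deep
    # inside the scraper or dispatcher.
    required = ('target_url', 'doc_types', 'title')
    missing = [field for field in required if field not in url_data]
    if missing:
        raise ValueError(
            'url_data payload missing fields: {}'.format(missing))
```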
If the file type comes back as null, we should really try again ... since there should be SOMETHING we can type it as.
I believe that relative links are not being handled correctly. Additionally, if a URL goes above the root, it ends up marked as a bad URL. If it is above the root, we should just drive it back to the root.
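A sketch of resolving relative links and clamping above-root paths (the helper name is mine):

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def resolve_link(base_url, link):
    # urljoin handles ordinary relative links; any leftover '..'
    # segments that would climb above the root are collapsed so the
    # URL is driven back to the root instead of marked as bad.
    parts = urlsplit(urljoin(base_url, link))
    segments = []
    for segment in parts.path.lstrip('/').split('/'):
        if segment == '..':
            if segments:
                segments.pop()
        elif segment != '.':
            segments.append(segment)
    return urlunsplit((parts.scheme, parts.netloc,
                       '/' + '/'.join(segments),
                       parts.query, parts.fragment))
```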
If the scraper fails, there really is no feedback to the dispatcher that it failed, so a URL could go without being scraped. Perhaps some additional handshaking should be added.
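Building on the message bus sketch above, the failure handshake could be as simple as this (field names are assumptions):

```python
def report_failure(bus, scraper_uid, target_url, error):
    # Publish a failure notice so the dispatcher can re-queue
    # the URL for another scraper.
    bus.send({
        'command': 'scraper_failed',
        'scraper_uid': scraper_uid,
        'url': target_url,
        'error': str(error),
    })
```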
It would be helpful to include the title/description of the page the document was found on, as sometimes there is little to no link text and the document name is rather cryptic.