Comments (5)
Accept pr.
from gain.
@gaojiuli, I see we cannot install this project from PyPI because that version is flawed. Can I expect this project to accept PRs, or should I continue with my own fork?
from gain.
For people reading this message, here is my fork of the project
Changes
- Bug fixes
- Improved default URL extraction
- Proxy can now be a generator, so every fetch can yield a new proxy
- Implemented aiocache, with Redis as the default backend (but you can choose another), so you can cache every single request. You can define which URLs shouldn't be cached with cache_disabled_urls, in case you are scraping a feed or blog daily and are only interested in the new URLs. Just set cache_enabled=True and run the Docker instance that I have mentioned in the README.md file.
- There is now a spider test flag so you can test a single parsed page to improve your CSS or figure out issues with your code. Together with the cache feature this makes for a very quick and reliable trial-and-error cycle. Increase the number of tests to be run with max_requests.
- With the limit_requests flag you can adjust the max_requests value to actually limit the maximum number of external requests.
- Replaced the PyQuery code with the lxml library and made it really easy to extract data with CSS selectors. It's written with backwards compatibility in mind, but if you use jQuery you can still use the PyQuery code by changing your Css class in Item to the Pyq class.
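As a rough illustration of the proxy change listed above (not the fork's actual code), a proxy generator can be as simple as a round-robin over a list. The function names and proxy addresses below are hypothetical and only show the idea that each fetch pulls the next proxy from the generator.

```python
from itertools import cycle


def proxy_generator(proxies):
    """Yield the given proxies in round-robin order, forever."""
    yield from cycle(proxies)


# Hypothetical proxy addresses, for illustration only.
proxies = proxy_generator(["http://10.0.0.1:8080", "http://10.0.0.2:8080"])


def next_proxy():
    """Return the proxy to use for the next fetch."""
    return next(proxies)
```

Each call to next_proxy() hands out the following address and wraps around once the list is exhausted.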
Extraction
With the new Css class the following can be done:
- Get the dict of an element
- Extract all elements into a list
- HTML tables to dict
- Control the index
- Iterate through texts
- Request text and text content
In case you need to clean up data or extract things like a phone number or email address, the following manipulate options are available:
- clean_string (will be renamed to clean in the future, so that it cleans lists, dicts and strings and gives much more control)
- extract_email
- extract_phone
- extract_website
- to_date_iso
The order in which you supply the manipulate options is the order of execution, so you can combine these manipulations.
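To make the ordering concrete, here is a minimal sketch of how chained manipulations behave. The clean_string and extract_email functions below are simplified stand-ins written for this example, not the fork's implementations, and apply_manipulations is a hypothetical helper.

```python
import re


def clean_string(value):
    """Stand-in: collapse runs of whitespace and strip both ends."""
    return " ".join(value.split())


def extract_email(value):
    """Stand-in: return the first email-like substring, or None."""
    match = re.search(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", value)
    return match.group(0) if match else None


def apply_manipulations(value, manipulations):
    """Apply each manipulation in the order given, stopping on None."""
    for manipulate in manipulations:
        if value is None:
            break
        value = manipulate(value)
    return value


raw = "  Contact:\n  info@example.com   for details "
result = apply_manipulations(raw, [clean_string, extract_email])
```

Here the string is cleaned first and the email address is extracted from the cleaned result, so result ends up as info@example.com.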
And many more will be added in the future. I have written tests for all features so take a look at this file if you are interested.
With the current version on my dev branch you can:
- First focus on parsing pages with the right XPath or regex (go for this one, since it's very reliable in my version)
- Set cache_enabled to True and go through all pages to cache them
- Use the test feature to start writing your extraction code
- Once done, re-run your code on the Redis cache and push it to any data store such as CSV, Postgres, etc.
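The caching step in this workflow can be sketched roughly as follows. This is an assumption-laden illustration: a plain dict stands in for Redis/aiocache, and CachedFetcher and fake_fetch are hypothetical names. Only cache_enabled and cache_disabled_urls come from the description above.

```python
import asyncio


class CachedFetcher:
    """Hypothetical wrapper that caches fetched pages by URL."""

    def __init__(self, fetch, cache_enabled=True, cache_disabled_urls=()):
        self.fetch = fetch
        self.cache_enabled = cache_enabled
        self.cache_disabled_urls = set(cache_disabled_urls)
        self.cache = {}  # in-memory stand-in for the Redis backend

    async def get(self, url):
        use_cache = self.cache_enabled and url not in self.cache_disabled_urls
        if use_cache and url in self.cache:
            return self.cache[url]
        body = await self.fetch(url)
        if use_cache:
            self.cache[url] = body
        return body


calls = []  # records every URL that triggers a real fetch


async def fake_fetch(url):
    """Pretend network call used only for this example."""
    calls.append(url)
    return f"<html>{url}</html>"


async def main():
    fetcher = CachedFetcher(
        fake_fetch, cache_disabled_urls=["https://example.com/feed"]
    )
    await fetcher.get("https://example.com/post/1")
    await fetcher.get("https://example.com/post/1")  # served from cache
    await fetcher.get("https://example.com/feed")
    await fetcher.get("https://example.com/feed")  # fetched again
    return calls


fetched = asyncio.run(main())
```

The post page is fetched once and then served from the cache, while the feed URL is fetched on every request because it is listed in cache_disabled_urls.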
I hope @gaojiuli is interested in the way I moved forward with this project, so we can merge our code once I am satisfied with a production version. I kept the philosophy of creating a scraper for everyone, and with that in mind I changed the way we extract data.
from gain.
@gaojiuli, great news, happy that I can share my code.
Before the PR, I have to do the following:
- Update the clean code
- Ensure users can get back their URL
- Go through all the code once more to make sure we can publish to PyPI
For the item.py I have a question regarding this code:
    if hasattr(self, 'save_url'):
        self.url = getattr(self, 'save_url')
    else:
        self.url = "file:///tmp.data"
Are you using this code, or is it junk code that can be removed?
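If the code stays, the same behavior can be written with the default argument of getattr. The Item class below is a minimal hypothetical stand-in for the one in item.py, and SavedItem and its save_url value are made up for the example.

```python
class Item:
    """Minimal stand-in for the project's Item class."""

    def __init__(self):
        # Falls back to the default when no save_url attribute is defined.
        self.url = getattr(self, "save_url", "file:///tmp.data")


class SavedItem(Item):
    # Hypothetical subclass that defines its own save_url.
    save_url = "file:///items.data"
```

With this, Item().url is the fallback path and SavedItem().url is the class-level save_url.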
from gain.
- Just open a pull request.
- I will review your code.
- I will release a new version after I merge your pull request.
(Any kind of optimization is welcome.)
from gain.