redco / goose-parser
Universal scraping tool, which allows you to extract data using multiple environments
Home Page: https://andrew.red/posts/goose-parser-the-beginning
License: MIT License
PhantomEnv should accept a list of proxies
and know when to switch between them.
It should remember which proxies were used, along with the last time and URL each one was used for.
Probably, it should also allow setting a switching strategy:
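The switching strategies themselves are not listed in the issue. As a minimal sketch of the request above, assuming a plain round-robin strategy, the hypothetical ProxyRotator class below (not part of goose-parser) holds a proxy list, hands out the next proxy on request, and records the last time and URL each proxy was used for.

```javascript
// Hypothetical helper, not part of goose-parser: a round-robin proxy rotator
// that remembers when and for which URL each proxy was last used.
class ProxyRotator {
  constructor(proxies) {
    if (!Array.isArray(proxies) || proxies.length === 0) {
      throw new Error('At least one proxy is required');
    }
    this.proxies = proxies;
    this.index = -1;
    // Maps proxy -> { lastUsedAt: number, lastUrl: string }
    this.usage = new Map();
  }

  // Pick the next proxy in round-robin order and record its usage.
  next(url, now = Date.now()) {
    this.index = (this.index + 1) % this.proxies.length;
    const proxy = this.proxies[this.index];
    this.usage.set(proxy, { lastUsedAt: now, lastUrl: url });
    return proxy;
  }

  // Return the recorded usage for a proxy, or null if it was never used.
  lastUsage(proxy) {
    return this.usage.get(proxy) || null;
  }
}
```

Another strategy (for example, switching only after a failure) could reuse the same usage records and replace only the selection logic in next().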
Now that we have plenty of docs, it's time to add a table of contents at the top of Readme.md
Ability to parse a URL in a simple rule
Add a Selenium environment to simplify the testing process
Before parsing, we need to check whether the row needs to be parsed at all.
For example, we have a list of previously parsed ids; if the current row's _id is in the list, continue without parsing it.
Important: we need to parse _id
first, before parsing all the other fields.
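The check described above can be sketched as follows. This is only an illustration under assumptions: parseNewRows, parseId and parseRest are hypothetical names, not goose-parser's API. The identifier is extracted first, and the rest of the row is parsed only when that identifier has not been seen before.

```javascript
// Hypothetical sketch, not goose-parser's API: parse _id first and skip rows
// whose _id was already parsed on a previous run.
function parseNewRows(rows, knownIds, parseId, parseRest) {
  const seen = new Set(knownIds);
  const results = [];
  for (const row of rows) {
    const id = parseId(row); // cheap step: only the identifier
    if (seen.has(id)) {
      continue; // already parsed earlier, skip the expensive part
    }
    seen.add(id); // also guards against duplicates within the same run
    results.push(Object.assign({ _id: id }, parseRest(row)));
  }
  return results;
}
```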
Add to docs:
Add ability to paginate via simple pages
This kind is very close to scroll pagination, but instead of scrolling you click a block to load a new page (which extends the current page list).
We need to think of a way to build a custom pagination event that can cover any pagination case.
A project with a configured build for the browser environment and a working basic usage example. Users will be able to clone this repo and start hacking!
Sometimes we need to click or perform some custom actions before parsing a row from the grid.
Example of specifying the event:
{event: 'parse.pre', scope: 'div.expl-open-ticket-button', action: 'click'}
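One way such an event description could be consumed is sketched below. The runEvents function is hypothetical and only illustrates the idea; it assumes each row exposes querySelector the way a DOM element does, and it supports only the click action from the example.

```javascript
// Hypothetical sketch: run events of a given name (e.g. 'parse.pre') against
// a row before it is parsed. `row` is assumed to expose querySelector.
function runEvents(eventName, events, row) {
  let executed = 0;
  for (const e of events) {
    if (e.event !== eventName) continue;
    const target = row.querySelector(e.scope);
    if (!target) continue; // nothing to act on in this row
    if (e.action === 'click') {
      target.click();
      executed += 1;
    } else {
      throw new Error('Unsupported action: ' + e.action);
    }
  }
  return executed; // number of actions actually performed
}
```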
Currently, on a PhantomError we get only the error, even though it can happen after several rows have already been parsed.
We need to return the results collected so far together with the error.
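A minimal sketch of that behavior, assuming a synchronous per-row parser: the hypothetical PartialResultsError below (not an existing goose-parser class) wraps the original error and carries the rows that were parsed before the failure.

```javascript
// Hypothetical sketch: attach the rows parsed so far to the error that
// interrupted parsing, so the caller does not lose them.
class PartialResultsError extends Error {
  constructor(cause, results) {
    super(cause.message);
    this.name = 'PartialResultsError';
    this.cause = cause; // the original error
    this.results = results; // rows parsed before the failure
  }
}

function parseRows(rows, parseRow) {
  const results = [];
  for (const row of rows) {
    try {
      results.push(parseRow(row));
    } catch (err) {
      throw new PartialResultsError(err, results);
    }
  }
  return results;
}
```

A caller can then catch the error and still read err.results.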
Needed:
Example:
We have parsed data:
We need to get:
It should:
Implement:
Add ability to paginate via ajax pages
Add docker files to use goose-parser from envs:
Add and configure coveralls.io
Ability to parse scope attributes, for example a row's deep link.
Add ability to add custom pagination method
Add tests
Ability to parse or set a row identifier depending on the row scope.
This identifier lets us tell that a row is unique among the others.
Needed:
Example:
We have parsed data:
We need to get:
Return new parsing scope from action
import {
  PhantomEnvironment,
  Parser
} from 'goose-parser';
const env = new PhantomEnvironment({
  url: 'http://www.gooseplanet.ru/'
});
TypeError: _gooseParser.PhantomEnvironment is not a constructor
I looked at the imported entities and both of them are undefined.
If I write:
import Parser from 'goose-parser'
it returns [Function: Parser].
But where can I find PhantomEnvironment?
Sometimes we need to specify the data type in the parsed result.
For now this is the "type" field. But we also use the "type" keyword to determine the type of actions and transforms. So we need to rename "type" to "dataType" wherever it means a data type.
For example we need to do two actions:
Prepare basic documentation about
Add documentation
Add SlimerEnvironment
https://slimerjs.org
When page-load pagination happens, the application gets errors about Sizzle (because a promise tries to check data with it).
Rules and actions can extend any other rule or action defined in the system, with the ability to override particular properties.
This makes it easy to maintain a large number of similar rules.
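As a rough illustration of the idea, the sketch below resolves a rule against a registry of named rules. The extends key, the registry shape and the resolveRule function are all assumptions made for this example; the issue does not specify the actual syntax.

```javascript
// Hypothetical sketch: resolve a rule that extends another named rule.
// The `extends` key and the registry shape are assumptions for illustration.
function resolveRule(name, registry, chain = []) {
  if (chain.includes(name)) {
    throw new Error('Circular extends: ' + chain.concat(name).join(' -> '));
  }
  const rule = registry[name];
  if (!rule) {
    throw new Error('Unknown rule: ' + name);
  }
  // Separate the parent reference from the rule's own properties.
  const { extends: parentName, ...own } = rule;
  if (!parentName) {
    return own;
  }
  const parent = resolveRule(parentName, registry, chain.concat(name));
  return { ...parent, ...own }; // own properties override inherited ones
}
```

The merge is shallow on purpose: a child rule replaces a parent property wholesale, which keeps overrides easy to reason about.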
Needed:
Add excludes for resources loaded on the page: attach to the event that allows cancelling the load.
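In PhantomJS, the page's onResourceRequested callback receives the request data and a networkRequest object whose abort() method cancels the load. A minimal sketch of an exclude filter built on that callback shape follows; createResourceFilter and the exclude patterns are hypothetical, and how the handler would be wired into goose-parser is left open.

```javascript
// Hypothetical sketch: build a handler with the same signature as PhantomJS's
// page.onResourceRequested(requestData, networkRequest) callback.
// `excludes` is a list of non-global RegExp patterns matched against the URL.
function createResourceFilter(excludes) {
  return function onResourceRequested(requestData, networkRequest) {
    const blocked = excludes.some((pattern) => pattern.test(requestData.url));
    if (blocked) {
      networkRequest.abort(); // cancel loading of the excluded resource
    }
    return blocked; // returned only to make the sketch easy to test
  };
}
```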
Needed:
Add ability to execute custom actions before parsing starts.
For example we need to
This functionality will completely replace actions with the once flag
We have stores now and can set data in one action and get it inside another. useActionsResult
is used to get the result of previous actions; it is a vestige and should be removed
Move the current test system to the new, more efficient approach (take a look at the new tests in #76)
So, we have:
old tests here: tests/phantom_parser_test.js
new tests here: tests/phantom/lib/
To run the tests, just call npm test
When you move a test, remove it from tests/phantom_parser_test.js
At the end, remove the file tests/phantom_parser_test.js and its html page.
Ability to go deeper into a row's information page and parse it there.
For example, we have a list of rows, but each of them has an information page where the row should be parsed.
Add to docs: