Comments (10)
Hello @Dugnist
Which version of goose-parser are you using?
We've started splitting goose into separate blocks (the parser, the environments, and others) that can be used independently.
Also, since version 0.5, goose-parser has changed:
- moved to ES6
- removed usage of vow (replaced with native Promises)
- environments were moved out of the main repo: goose-phantom-environment, goose-chrome-environment, goose-jsdom-environment
Here is a usage example for the latest version of goose-parser, 0.5.0-alpha.3:
package.json:
{
  "dependencies": {
    "goose-parser": "^0.5.0-alpha.3",
    "goose-phantom-environment": "^1.0.12"
  }
}
Usage:
const Parser = require('goose-parser');
const PhantomEnvironment = require('goose-phantom-environment');
const env = new PhantomEnvironment({
  url: 'http://www.gooseplanet.ru/',
});
const parser = new Parser({ environment: env });
(async function () {
  try {
    const results = await parser.parse(
      require('./rules/rules'),
    );
  } catch (e) {
    console.log(e.message, e.stack);
  }
})();
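The example above loads its configuration from ./rules/rules, which this thread does not show. As a minimal sketch, assuming that module exports the same kind of object that is passed to parser.parse() later in this thread (an actions array plus a rules tree), it might look like the following. The file contents, selectors, and field names are placeholders for illustration, not taken from the goose-parser documentation:

```javascript
// rules/rules.js (hypothetical file contents; selectors are placeholders)
const rules = {
  // Actions run before parsing; here we wait for the list items to appear.
  actions: [
    {
      type: 'wait',
      timeout: 10 * 1000, // same value as the examples in this thread
      scope: '.item',
      parentScope: 'body',
    },
  ],
  // Field definitions collected for each element matching `scope`.
  rules: {
    scope: '.item',
    collection: [[
      { name: 'url', scope: 'a', attr: 'href' },
      { name: 'text', scope: 'a' },
    ]],
  },
};

module.exports = rules;
```

Keeping the rules in their own module lets the same parser script be reused against different sites by swapping the required file.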
You can also consider using version 0.2.* of goose, which matches the original documentation. We're working hard on 0.5 to bring all the amazing features soon, so you will be able to use it as well.
from goose-parser.
Let me know if you have any other issues
@maZahaca
I switched to goose-jsdom-environment because the PhantomEnvironment install crashed.
const Parser = require('goose-parser');
const JsDOMEnvironment = require('goose-jsdom-environment');
const env = new JsDOMEnvironment({
  url: 'http://www.google.com',
});
const parser = new Parser({ environment: env });
(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.container',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();
and it throws this error:
ReferenceError: arguments is not defined
"I connected goose-jsdom-environment because PhantomEnvironment install was crushed"
Could you please provide:
- the operating system you use
- the error that occurred
- the versions (package.json) you tried
- your code, if there is any
The current issue with JSDom is that this environment does not support dynamic JavaScript, so wait, click, or any other interactions with the page won't work.
For dynamic JS you need to use either goose-phantom-environment or goose-chrome-environment.
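To make that rule concrete, here is a minimal sketch of choosing an environment package from a parse config. The helper names are hypothetical and not part of goose-parser, and the list of "dynamic" action types contains only the two named above, so treat it as illustrative rather than complete:

```javascript
// Action types that need a real browser engine. 'wait' and 'click' are the
// ones named in this thread; the list is illustrative, not exhaustive.
const DYNAMIC_ACTION_TYPES = new Set(['wait', 'click']);

// Returns true if any action in the config requires page interaction.
function needsDynamicEnvironment(config) {
  const actions = (config && config.actions) || [];
  return actions.some((action) => DYNAMIC_ACTION_TYPES.has(action.type));
}

// Hypothetical helper: picks an environment package name for a config.
function pickEnvironmentPackage(config) {
  return needsDynamicEnvironment(config)
    ? 'goose-chrome-environment'
    : 'goose-jsdom-environment';
}

// A config with a wait action needs a browser-backed environment.
console.log(pickEnvironmentPackage({ actions: [{ type: 'wait', scope: '.container' }] }));
// A config with only static rules can stay on JSDom.
console.log(pickEnvironmentPackage({ rules: { scope: 'h1' } }));
```

The first call prints goose-chrome-environment and the second prints goose-jsdom-environment.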
@maZahaca OK, I changed jsdom to goose-chrome-environment.
const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');
const env = new ChromeEnvironment({
  url: 'https://www.google.com',
});
const parser = new Parser({ environment: env });
(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '#ctr-p',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();
It also throws errors:
tion id: 959): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 960): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 961): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 962): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 963): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 964): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 965): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 966): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 967): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 968): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 969): ReferenceError: arguments is not defined
Timeout for wait with arguments: body #ctr-p
@maZahaca I also changed the URL to 'https://habrahabr.ru' and got this error:
(node:7544) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: msg.match is not a function
(node:7544) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
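These warnings mean some promise was rejected without a handler, so the rejection never reached the try/catch in the script. While narrowing this down, one option is Node's process-level 'unhandledRejection' event, which can log the full stack instead of the one-line warning. This is only a debugging sketch and does not fix the underlying error; the describeRejection helper is a hypothetical name introduced here:

```javascript
// Formats a rejection reason for logging; handles non-Error values too.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || `${reason.name}: ${reason.message}`;
  }
  return `Non-error rejection: ${String(reason)}`;
}

// Log every unhandled rejection with its stack instead of the default warning.
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', describeRejection(reason));
});
```

Registering the handler at the top of the script, before the parser is created, means the stack trace should show where inside the dependency the rejection originated.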
@Dugnist please provide your package.json and the OS you're running on, and I will try to reproduce these issues.
These are weird bugs, because we currently use these parsers in production.
Also, let's stick to one website when testing.
Please tell me what example data you want to scrape from this website.
@maZahaca I'm using Linux, Ubuntu 16.04 LTS.
{
  "dependencies": {
    "goose-chrome-environment": "^1.0.2",
    "goose-parser": "^0.5.0-alpha.3",
    "phantomjs-prebuilt": "^2.1.16"
  },
  "devDependencies": {}
}
I want to get the whole HTML page after its JavaScript has executed (for example, when the target site uses a framework like React.js) and save the result as an HTML file together with the required assets.
@Dugnist This parsing tool (goose-parser) lets you save only JSON results, not HTML and assets.
However, we're planning to add the ability to save assets and the whole HTML in the future.
Here is an example of using goose-parser with goose-chrome-environment to fetch JSON results:
const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');
const env = new ChromeEnvironment({
  url: 'https://www.google.com/search?newwindow=1&ei=mzDCWoPkOI-RmwWaoLzYCg&q=goose-parser&oq=goose-parser&gs_l=psy-ab.3..0i30k1.1186908.1189012.0.1189621.12.12.0.0.0.0.154.877.9j2.11.0....0...1c.1.64.psy-ab..1.11.876...0j0i131k1j0i131i67k1j0i67k1j0i10k1j0i19k1j0i30i19k1j0i10i30i19k1j0i13i30k1j0i8i30k1.0.lU1cumFem2s&gws_rd=cr&dcr=0&fg=1',
});
const parser = new Parser({ environment: env });
(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.srg>.g',
          parentScope: 'body'
        }
      ],
      rules: {
        scope: '.srg>.g',
        collection: [[
          {
            name: 'url',
            scope: 'h3.r>a',
            attr: 'href',
          },
          {
            name: 'text',
            scope: 'h3.r>a',
          }
        ]]
      }
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();
And the results will be:
[
  {
    url: 'https://www.npmjs.com/package/goose-parser',
    text: 'goose-parser - npm'
  },
  {
    url: 'https://github.com/advancedlogic/GoOse/blob/master/parser.go',
    text: 'GoOse/parser.go at master · advancedlogic/GoOse · GitHub'
  },
  {
    url: 'https://habrahabr.ru/post/271425/',
    text: 'Как парсить интернет по-гусиному / Хабрахабр'
  },
  {
    url: 'https://pypi.python.org/pypi/goose-extractor/',
    text: 'goose-extractor 1.0.25 : Python Package Index'
  },
  {
    url: 'https://toster.ru/q/337511',
    text: 'Как добавлять комментарии в Instagram без api? — Toster.ru'
  },
  {
    url: 'https://www.youtube.com/watch?v=BEbAhwyQeOM',
    text: 'Continued Work on Goose\'s Parser - YouTube'
  },
  {
    url: 'https://godoc.org/github.com/advancedlogic/GoOse',
    text: 'goose - GoDoc'
  },
  {
    url: 'http://blog.reddikh.com/goose-parser/',
    text: 'Goose parser |'
  },
  {
    url: 'https://www.kth.se/social/upload/538599b1f27654141f4cc333/Master',
    text: 'Development of a library to generate and parse IEC 61850-90-5 ... - KTH'
  },
  {
    url: 'http://nullege.com/codes/search/goose.parsers.Parser',
    text: 'goose.parsers.Parser - Nullege Python Samples'
  }
]
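Since the parser returns plain JSON-serializable data, persisting it is left to the caller. A minimal sketch using only Node's built-in fs, os, and path modules is shown below; the saveResults helper and the output file name are assumptions made for illustration, not part of goose-parser:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Writes parse results to a file as pretty-printed JSON and returns the path.
function saveResults(results, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(results, null, 2), 'utf8');
  return filePath;
}

// Sample data copied from the first entry of the output above.
const results = [
  { url: 'https://www.npmjs.com/package/goose-parser', text: 'goose-parser - npm' },
];

// The temp directory is used here only to keep the sketch side-effect free.
const outFile = saveResults(results, path.join(os.tmpdir(), 'goose-results.json'));
console.log(`Saved ${results.length} result(s) to ${outFile}`);
```

In a real script, the sample array would be replaced by the value returned from parser.parse().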