Comments (10)

maZahaca commented on June 8, 2024

Hello @Dugnist

Which version of goose-parser are you using?

We've started separating goose, the environments and other blocks so that they can be used independently, and this affects goose-parser from version 0.5 onward.

Here is a usage example for the latest version of goose-parser, 0.5.0-alpha.3:
package.json:

{
  "dependencies": {
    "goose-parser": "^0.5.0-alpha.3",
    "goose-phantom-environment": "^1.0.12"
  }
}

Usage:

const Parser = require('goose-parser');
const PhantomEnvironment = require('goose-phantom-environment');

const env = new PhantomEnvironment({
  url: 'http://www.gooseplanet.ru/',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse(
      require('./rules/rules'),
    );
    // Print the parsed results so the example has visible output
    console.log(results);
  } catch (e) {
    console.log(e.message, e.stack);
  }
})();

You can also consider using version 0.2.* of goose, which matches the original documentation. We're working hard on 0.5 to bring all the amazing features soon, so you can use that as well.

from goose-parser.

maZahaca commented on June 8, 2024

Let me know if you have any other issues.

Dugnist commented on June 8, 2024

@maZahaca
I switched to goose-jsdom-environment because the PhantomEnvironment installation crashed.

const Parser = require('goose-parser');
const JsDOMEnvironment = require('goose-jsdom-environment');

const env = new JsDOMEnvironment({
  url: 'http://www.google.com',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.container',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

and it throws this error:

ReferenceError: arguments is not defined

maZahaca commented on June 8, 2024

> I switched to goose-jsdom-environment because the PhantomEnvironment installation crashed.

Could you please provide:

  • the operating system you use
  • the error that occurred
  • the package versions (package.json) you tried
  • your code, if there is any

The current issue with JSDom is that this environment does not support dynamic JavaScript, so any wait, click or other interaction with the page won't work.
For dynamic JS you need to use one of goose-phantom-environment or goose-chrome-environment.
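
The rule above could be sketched roughly as follows. This is only an illustration: `pickEnvironmentPackage` is a hypothetical helper, not part of goose-parser, and it assumes that plain extraction without actions works under JSDom, which this thread implies but does not demonstrate. The package names are the ones mentioned in this thread.

```javascript
// Hypothetical helper, not part of goose-parser: picks an environment
// package name based on whether the parse config needs page interaction.
function pickEnvironmentPackage(config) {
  const hasActions = Array.isArray(config.actions) && config.actions.length > 0;
  // JSDom cannot run dynamic JavaScript, so any wait/click action
  // needs a browser-backed environment (assumption: Chrome is available).
  return hasActions ? 'goose-chrome-environment' : 'goose-jsdom-environment';
}

// A config with a wait action needs a browser-backed environment.
console.log(pickEnvironmentPackage({ actions: [{ type: 'wait', scope: '.container' }] }));
// A config with rules only can stay on the static environment.
console.log(pickEnvironmentPackage({ rules: { scope: 'h1' } }));
```

The first call prints goose-chrome-environment and the second prints goose-jsdom-environment.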

Dugnist commented on June 8, 2024

@maZahaca OK, I changed jsdom to goose-chrome-environment.

const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

const env = new ChromeEnvironment({
  url: 'https://www.google.com',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '#ctr-p',
          parentScope: 'body'
        }
      ]
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

It also throws errors:

tion id: 959): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 960): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 961): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 962): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 963): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 964): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 965): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 966): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 967): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 968): ReferenceError: arguments is not defined
(node:5371) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 969): ReferenceError: arguments is not defined
Timeout for wait with arguments: body #ctr-p

Dugnist commented on June 8, 2024

@maZahaca I also changed the URL to 'https://habrahabr.ru' and got this error:

(node:7544) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1): TypeError: msg.match is not a function
(node:7544) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
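
Warnings like these show only a one-line message. A minimal sketch of how the full stack could be surfaced, using Node's standard `unhandledRejection` process event; `describeRejection` is a hypothetical helper name and nothing here is specific to goose-parser:

```javascript
// Hypothetical helper: turns any rejection reason into a printable string.
// Rejection reasons are not always Error objects, so fall back to String().
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || `${reason.name}: ${reason.message}`;
  }
  return `Non-error rejection: ${String(reason)}`;
}

// Log the full stack of every unhandled rejection instead of only the
// one-line warning that Node prints by default.
process.on('unhandledRejection', (reason) => {
  console.error(describeRejection(reason));
});
```

With this handler registered before the parser runs, the stack trace should point at the place where `msg.match` is called.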

maZahaca commented on June 8, 2024

@Dugnist please provide your package.json and the OS you're running, and I will try to reproduce these issues.

These are weird bugs, because we currently use these parsers in production.

maZahaca commented on June 8, 2024

Also, let's stick to one website while testing.
Please tell me what example data you want to scrape from this website.

Dugnist commented on June 8, 2024

@maZahaca I'm using Ubuntu Linux 16.04 LTS.

{
  "dependencies": {
    "goose-chrome-environment": "^1.0.2",
    "goose-parser": "^0.5.0-alpha.3",
    "phantomjs-prebuilt": "^2.1.16"
  },
  "devDependencies": {}
}

I want to get the whole HTML page after its JavaScript has executed (in case the target site uses a framework like React.js) and save the result to an HTML file along with the required assets.

maZahaca commented on June 8, 2024

@Dugnist This parsing tool (goose-parser) lets you save only JSON results, not HTML and assets.
However, we're planning to add the ability to save assets and the whole HTML in the future.

Here is an example of using goose-parser with goose-chrome-environment to fetch JSON results:

const Parser = require('goose-parser');
const ChromeEnvironment = require('goose-chrome-environment');

const env = new ChromeEnvironment({
  url: 'https://www.google.com/search?newwindow=1&ei=mzDCWoPkOI-RmwWaoLzYCg&q=goose-parser&oq=goose-parser&gs_l=psy-ab.3..0i30k1.1186908.1189012.0.1189621.12.12.0.0.0.0.154.877.9j2.11.0....0...1c.1.64.psy-ab..1.11.876...0j0i131k1j0i131i67k1j0i67k1j0i10k1j0i19k1j0i30i19k1j0i10i30i19k1j0i13i30k1j0i8i30k1.0.lU1cumFem2s&gws_rd=cr&dcr=0&fg=1',
});

const parser = new Parser({ environment: env });

(async function () {
  try {
    const results = await parser.parse({
      actions: [
        {
          type: 'wait',
          timeout: 10 * 1000,
          scope: '.srg>.g',
          parentScope: 'body'
        }
      ],
      rules: {
        scope: '.srg>.g',
        collection: [[
          {
            name: 'url',
            scope: 'h3.r>a',
            attr: 'href',
          },
          {
            name: 'text',
            scope: 'h3.r>a',
          }
        ]]
      }
    });
    console.log(results);
  } catch (e) {
    console.log(e.message);
  }
})();

And results will be:

[
  {
    url: 'https://www.npmjs.com/package/goose-parser',
    text: 'goose-parser - npm'
  },
  {
    url: 'https://github.com/advancedlogic/GoOse/blob/master/parser.go',
    text: 'GoOse/parser.go at master · advancedlogic/GoOse · GitHub'
  },
  {
    url: 'https://habrahabr.ru/post/271425/',
    text: 'Как парсить интернет по-гусиному / Хабрахабр'
  },
  {
    url: 'https://pypi.python.org/pypi/goose-extractor/',
    text: 'goose-extractor 1.0.25 : Python Package Index'
  },
  {
    url: 'https://toster.ru/q/337511',
    text: 'Как добавлять комментарии в Instagram без api? — Toster.ru'
  },
  {
    url: 'https://www.youtube.com/watch?v=BEbAhwyQeOM',
    text: 'Continued Work on Goose\'s Parser - YouTube'
  },
  {
    url: 'https://godoc.org/github.com/advancedlogic/GoOse',
    text: 'goose - GoDoc'
  },
  {
    url: 'http://blog.reddikh.com/goose-parser/',
    text: 'Goose parser |'
  },
  {
    url: 'https://www.kth.se/social/upload/538599b1f27654141f4cc333/Master',
    text: 'Development of a library to generate and parse IEC 61850-90-5 ... - KTH'
  },
  {
    url: 'http://nullege.com/codes/search/goose.parsers.Parser',
    text: 'goose.parsers.Parser - Nullege Python Samples'
  }
]
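
Since the parser returns plain JSON, writing it to disk is left to the caller. A minimal sketch using Node's built-in `fs` module; `saveResults` and the file name goose-results.json are hypothetical and not part of goose-parser:

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes parse results to disk as pretty-printed JSON
// and returns the path it wrote to.
function saveResults(results, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(results, null, 2), 'utf8');
  return filePath;
}

// Usage with a sample shaped like the results above.
const sample = [
  { url: 'https://www.npmjs.com/package/goose-parser', text: 'goose-parser - npm' },
];
const savedTo = saveResults(sample, path.join(os.tmpdir(), 'goose-results.json'));
console.log(`Saved ${sample.length} result(s) to ${savedTo}`);
```

In the example above, `saveResults(results, ...)` could be called right after `parser.parse(...)` resolves.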
