
crawlerr's Introduction

crawlerr


crawlerr is a simple yet powerful web crawler for Node.js, based on Promises. It lets you crawl only those URLs that match your wildcards, and it uses a Bloom filter to cache visited pages. A browser-like feeling.


  • Simple: our crawler is simple to use;
  • Elegant: provides a verbose, Express-like API;
  • MIT Licensed: free for personal and commercial use;
  • Server-side DOM: we use JSDOM to make you feel like in your browser;
  • Configurable pool size, retries, rate limit and more;

Installation

$ npm install crawlerr

Usage

crawlerr(base [, options])

You can find several examples in the examples/ directory. Here are some of the most important ones:

Example 1: Requesting title from a page

const crawlerr = require("crawlerr");

const spider = crawlerr("http://google.com/");

spider.get("/")
  .then(({ req, res, uri }) => console.log(res.document.title))
  .catch(error => console.log(error));

Example 2: Scanning a website for specific links

const crawlerr = require("crawlerr");

const spider = crawlerr("http://blog.npmjs.org/");

spider.when("/post/[digit:id]/[all:slug]", ({ req, res, uri }) => {
  const post = req.param("id");
  const slug = req.param("slug").split("?")[0];

  console.log(`Found post with id: ${post} (${slug})`);
});

Example 3: Server side DOM

const crawlerr = require("crawlerr");

const spider = crawlerr("http://example.com/");

spider.get("/").then(({ req, res, uri }) => {
  const document = res.document;
  const elementA = document.getElementById("someElement");
  const elementB = document.querySelector(".anotherForm");

  console.log(elementA.innerHTML);
});

Example 4: Setting cookies

const crawlerr = require("crawlerr");

const url = "http://example.com/";
const spider = crawlerr(url);

spider.request.setCookie(spider.request.cookie("foobar=…"), url);
spider.request.setCookie(spider.request.cookie("session=…"), url);

spider.get("/profile").then(({ req, res, uri }) => {
  //… spider.request.getCookieString(url);
  //… spider.request.setCookies(url);
});

API

crawlerr(base [, options])

Creates a new Crawlerr instance for a specific website with custom options. All routes will be resolved to base.

Option       Default   Description
concurrent   10        How many requests can be run simultaneously
interval     250       How often new requests should be sent (in ms)
             null      See request defaults for more information
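The concurrent and interval options throttle how requests are dispatched (crawlerr delegates this to the queue-promise package). The sketch below is purely illustrative, not the actual implementation: it runs tasks in batches of `concurrent` and pauses `interval` milliseconds between batches.

```javascript
// Illustrative sketch of `concurrent`/`interval` throttling.
// Not crawlerr's real queue — just the idea behind those options.
async function throttle(tasks, { concurrent = 10, interval = 250 } = {}) {
  const results = [];
  for (let i = 0; i < tasks.length; i += concurrent) {
    // Start at most `concurrent` tasks at once:
    const batch = tasks.slice(i, i + concurrent).map(task => task());
    results.push(...(await Promise.all(batch)));
    // Wait `interval` ms before dispatching the next batch:
    if (i + concurrent < tasks.length) {
      await new Promise(resolve => setTimeout(resolve, interval));
    }
  }
  return results;
}

// Six tasks with `concurrent: 2` run in three batches:
const tasks = [1, 2, 3, 4, 5, 6].map(n => () => Promise.resolve(n * 10));
throttle(tasks, { concurrent: 2, interval: 10 }).then(out => {
  console.log(out); // resolves with [10, 20, 30, 40, 50, 60]
});
```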

public .get(url)

Requests url. Returns a Promise which resolves with { req, res, uri }.

Example:

spider
  .get("/")
  .then(({ req, res, uri }) => { /* … */ });

public .when(pattern)

Searches the entire website for URLs that match the specified pattern. pattern can include named wildcards, which can then be retrieved from the request via req.param.

Example:

spider
  .when("/users/[digit:userId]/repos/[digit:repoId]", ({ req, res, uri }) => { /* … */ });
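The [digit:name] and [all:name] wildcards map naturally onto regular-expression capture groups. Here is a hypothetical sketch of how such a pattern could be compiled to a regex; the compile function below is an illustration, not crawlerr's actual matcher.

```javascript
// Compile a wildcard pattern such as "/post/[digit:id]/[all:slug]"
// into a matcher that extracts named parameters (illustrative only).
function compile(pattern) {
  const names = [];
  const source = pattern.replace(/\[(digit|all):(\w+)\]/g, (_, kind, name) => {
    names.push(name);
    // [digit:x] matches digits only; [all:x] matches anything:
    return kind === "digit" ? "(\\d+)" : "(.+)";
  });
  const regex = new RegExp(`^${source}$`);
  return url => {
    const match = url.match(regex);
    if (!match) return null;
    return Object.fromEntries(names.map((name, i) => [name, match[i + 1]]));
  };
}

const match = compile("/post/[digit:id]/[all:slug]");
console.log(match("/post/42/hello-world")); // { id: "42", slug: "hello-world" }
console.log(match("/about"));               // null
```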

public .on(event, callback)

Executes a callback for a given event. For more information about which events are emitted, refer to queue-promise.

Example:

spider.on("error", error => { /* … */ });
spider.on("resolve", () => { /* … */ });

public .start()/.stop()

Starts/stops the crawler.

Example:

spider.start();
spider.stop();

public .request

A configured request object, used by retry-request when crawling webpages. Extends request.jar(). Can be configured through options when initializing a new crawler instance. See the crawler options and the request documentation for more information.

Example:

const url = "https://example.com";
const spider = crawlerr(url);
const request = spider.request;

request.post(`${url}/login`, (err, res, body) => {
  request.setCookie(request.cookie("session=…"), url);
  // Next requests will include this cookie

  spider.get("/profile").then(({ req, res, uri }) => { /* … */ });
  spider.get("/settings").then(({ req, res, uri }) => { /* … */ });
});

Request

Extends the default Node.js http.IncomingMessage.

public get(header)

Returns the value of an HTTP header. The Referrer header field is special-cased: Referrer and Referer are interchangeable.

Example:

req.get("Content-Type"); // => "text/plain"
req.get("content-type"); // => "text/plain"

public is(...types)

Checks whether the incoming request's "Content-Type" header field contains the given MIME type. Based on type-is.

Example:

// Returns true with "Content-Type: text/html; charset=utf-8"
req.is("html");
req.is("text/html");
req.is("text/*");

public param(name [, default])

Returns the value of param name when present, or default otherwise. The lookup order is:

  • route placeholders, e.g. user/[all:username];
  • body params, e.g. id=12, {"id":12};
  • query string params, e.g. ?id=12;

Example:

// .when("/users/[all:username]/[digit:someID]")
req.param("username");  // /users/foobar/123456 => foobar
req.param("someID");    // /users/foobar/123456 => 123456
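The lookup order above can be sketched as a plain function. This is illustrative only: the placeholders, body, and query field names are hypothetical, not crawlerr's internals.

```javascript
// Sketch of param resolution: route placeholders, then body,
// then query string, then the supplied default (illustrative).
function param(req, name, defaultValue) {
  if (req.placeholders && name in req.placeholders) return req.placeholders[name];
  if (req.body && name in req.body) return req.body[name];
  if (req.query && name in req.query) return req.query[name];
  return defaultValue;
}

const req = {
  placeholders: { username: "foobar" },
  body: { id: 12 },
  query: { page: "2" },
};

param(req, "username");      // "foobar" (route placeholder)
param(req, "id");            // 12 (body)
param(req, "page");          // "2" (query string)
param(req, "missing", null); // null (default)
```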

Response

public jsdom

Returns the JSDOM object.


public window

Returns the DOM window for the response content. Based on JSDOM.


public document

Returns the DOM document for the response content. Based on JSDOM.

Example:

res.document.getElementById();
res.document.getElementsByTagName();
// …

Tests

npm test

crawlerr's People

Contributors

bartozzz, dragnucs, greenkeeper[bot]


crawlerr's Issues

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on all branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because we are using your CI build statuses to figure out when to notify you about breaking changes.

Since we did not receive a CI status on the greenkeeper/initial branch, we assume that you still need to configure it.

If you have already set up a CI for this repository, you might need to check your configuration. Make sure it will run on all new branches. If you don’t want it to run on every branch, you can whitelist branches starting with greenkeeper/.

We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

Once you have installed CI on this repository, you'll need to re-trigger Greenkeeper's initial Pull Request. To do this, please delete the greenkeeper/initial branch in this repository, and then remove and re-add this repository to the Greenkeeper integration's whitelist on GitHub. You'll find this list on your repo or organization's settings page, under Installed GitHub Apps.

Scanning websites stops after the first match

When trying to scan my website or the npmjs blog using the example as-is, crawlerr stops finding matches after the first one. Also, params are set to undefined. Note that the crawler does not stop: it still prints success messages and is able to access the desired pages; it just can't match them.

'use strict'

const crawler = require('crawlerr')
const spider = crawler("https://touha.me/")

spider
    .when('/post/[all:slug]')
    .then(({ req, res, uri }) => {
        const slug = req.param('slug')
        console.log(`Found post ${slug}`)
    })

spider.on("error", error => {
  console.log(`[Error] ${error}`);
})

spider.on("request", url => {
  console.log(`[Success] ${url}`);
});

spider.start();

Output sample

Found post undefined
[Success] https://touha.me/#
[Success] https://touha.me/
[Success] https://touha.me/about/
[Success] https://touha.me/contact/
[Success] https://touha.me/cv/
[Success] https://touha.me/projets/
[Success] https://touha.me/selfhosting/
[Success] https://touha.me/index.xml
[Success] https://touha.me/page/2/
[Success] https://touha.me/post/meta-federated-social-network.en/
[Success] https://touha.me/tags/federation/
[Success] https://touha.me/tags/telecomunication/
[Success] https://touha.me/post/2-click-social-media-buttons-ou-le-partage-social-ethique-pratique/
[Success] https://touha.me/tags/pratique/
[Success] https://touha.me/tags/vie-priv%C3%A9e/
[Success] https://touha.me/tags/wordpress/
[Success] https://touha.me/post/les-clients-twitter-libre-natifs-gnu-linux/
[Success] https://touha.me/tags/twitter/
[Success] https://touha.me/post/rooter-samsung-galaxy-tab-3-avec-heimdall-sm-t210-sm-t210r/
[Success] https://touha.me/tags/android/
[Success] https://touha.me/tags/heimdall/
[Success] https://touha.me/tags/root/
[Success] https://touha.me/post/tablette-samsung-galaxy-tab-3-nouvelle-experomentation/
[Success] https://touha.me/tags/foss/
[Success] https://touha.me/tags/nonfree/
[Success] https://touha.me/tags/tablette/
[Success] https://touha.me/post/python-utiliser-gtksourceview-avec-fichier-glade/
[Success] https://touha.me/tags/glade/
[Success] https://touha.me/tags/gtk3/

Are all these babel plugins required?

Getting error below if I attempt to run the code from examples/get_title.js in my own file:

module.js:549
    throw err;
    ^

Error: Cannot find module 'babel-runtime/helpers/typeof'
    at Function.Module._resolveFilename (module.js:547:15)
    at Function.Module._load (module.js:474:25)
    at Module.require (module.js:596:17)
    at require (internal/module.js:11:18)
    at Object.<anonymous> (/home/username/crawler-test/node_modules/queue-promise/dist/index.js:7:16)
    at Module._compile (module.js:652:30)
    at Object.Module._extensions..js (module.js:663:10)
    at Module.load (module.js:565:32)
    at tryModuleLoad (module.js:505:12)
    at Function.Module._load (module.js:497:3)

I do not have any of the babel modules queue-promise is trying to load, and none of them are listed as dependencies (only as devDependencies) in either crawlerr's or queue-promise's package.json. Am I doing something wrong, or was this a build issue you didn't catch?

My test file, run with node ./index.js:

"use strict";

const crawler = require("crawlerr");
const spider = crawler("http://google.com/");

spider
  .get("/")
  .then(({ req, res, uri }) => {
    console.log(`Title from ${uri}:`, res.document.title);
  })
  .catch(error => {
    console.log(error);
  });

An in-range update of eslint-plugin-prettier is breaking the build 🚨

The devDependency eslint-plugin-prettier was updated from 3.1.1 to 3.1.2.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

eslint-plugin-prettier is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

An in-range update of queue-promise is breaking the build 🚨

The dependency queue-promise was updated from 1.3.2 to 1.3.3.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

queue-promise is a direct dependency of this project, and it is very likely causing it to break. If other packages depend on yours, this update is probably also breaking those in turn.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).

Release Notes for v1.3.3

Changes:

  • Update dependencies;
  • Minify code when bundling;
Commits

The new version differs by 12 commits.

  • e8dd70e Update .travis.yml
  • 1667cf5 Bump version 1.3.3
  • 337e4db Minify code on build
  • a68f56f Merge pull request #67 from Bartozzz/greenkeeper/flow-bin-0.113.0
  • 8bbb13d chore(package): update lockfile package-lock.json
  • 066f4e0 chore(package): update flow-bin to version 0.113.0
  • a94455a Merge pull request #66 from Bartozzz/greenkeeper/flow-bin-0.112.0
  • 27826d0 chore(package): update lockfile package-lock.json
  • f1c2fe7 chore(package): update flow-bin to version 0.112.0
  • ae0c6a7 Merge branch 'master' into development
  • 6300fd6 Merge pull request #40 from Bartozzz/development
  • e753198 Merge pull request #17 from Bartozzz/development

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴
