
scrape-it's Introduction

scrape-it

A Node.js scraper for humans.


Sponsored with ❤️ by:

Serpapi.com is a platform that lets you scrape Google and other search engines through a fast, easy, and complete API.


Capsolver.com is an AI-powered service that specializes in solving various types of captchas automatically. It supports captchas such as reCAPTCHA V2, reCAPTCHA V3, hCaptcha, FunCaptcha, DataDome, AWS Captcha, Geetest, and Cloudflare Captcha / Challenge 5s, Imperva / Incapsula, among others. For developers, Capsolver offers API integration options detailed in their documentation, facilitating the integration of captcha solving into applications. They also provide browser extensions for Chrome and Firefox, making it easy to use their service directly within a browser. Different pricing packages are available to accommodate varying needs, ensuring flexibility for users.


☁️ Installation

# Using npm
npm install --save scrape-it

# Using yarn
yarn add scrape-it

💡 ProTip: You can install the CLI version of this module by running npm install --global scrape-it-cli (or yarn global add scrape-it-cli).

FAQ

Here are some frequently asked questions and their answers.

1. How do I parse AJAX pages?

scrape-it uses a simple request module under the hood, so it cannot execute client-side JavaScript and you cannot directly parse AJAX pages with it. In general, you will run into these scenarios:

  1. The AJAX response is in JSON format. In this case, you can make the request directly, without needing a scraping library.
  2. The AJAX response gives you HTML back. Instead of requesting the main website (e.g. example.com), pass scrape-it the AJAX URL (e.g. example.com/api/that-endpoint) and you will be able to parse the response, as in the sketch below.
  3. The AJAX request is so complicated that you don't want to reverse-engineer it. In this case, use a headless browser (e.g. Google Chrome, Electron, PhantomJS) to load the content and then use the .scrapeHTML method from scrape-it once you have the HTML loaded on the page.
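For scenario 2, a minimal sketch (the endpoint is the placeholder from above and the selector is hypothetical):

const scrapeIt = require("scrape-it")

// Request the AJAX endpoint directly instead of the main page
scrapeIt("https://example.com/api/that-endpoint", {
    items: {
        listItem: ".item"
    }
}).then(({ data }) => {
    console.log(data.items)
})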

2. Crawling

There is no fancy way to crawl pages with scrape-it. For simple scenarios, you can parse the list of URLs from the initial page and then, using Promises, parse each page, as in the sketch below. Also, you can use a different crawler to download the website and then use the .scrapeHTML method to scrape the local files.
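A minimal sketch of that approach, assuming a hypothetical listing page whose article links match a.article-link:

const scrapeIt = require("scrape-it")

const BASE = "https://example.com"

// Step 1: collect the article URLs from the listing page
scrapeIt(BASE, {
    links: {
        listItem: "a.article-link",
        data: {
            url: { attr: "href" }
        }
    }
}).then(({ data }) =>
    // Step 2: scrape every linked page in parallel
    Promise.all(data.links.map(({ url }) =>
        scrapeIt(new URL(url, BASE).href, { title: "h1" })
    ))
).then(pages => {
    pages.forEach(({ data }) => console.log(data.title))
})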

3. Local files

Use the .scrapeHTML method to parse HTML read from local files using fs.readFile, as in the sketch below.
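A minimal sketch, assuming a local ./page.html and that cheerio (which scrape-it uses internally) is installed separately:

const fs = require("fs")
const cheerio = require("cheerio")
const scrapeIt = require("scrape-it")

fs.readFile("./page.html", "utf8", (err, html) => {
    if (err) { throw err }

    // .scrapeHTML takes a cheerio instance and the usual options object
    const data = scrapeIt.scrapeHTML(cheerio.load(html), {
        title: ".header h1"
    })
    console.log(data)
})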

📋 Example

const scrapeIt = require("scrape-it")

// Promise interface
scrapeIt("https://ionicabizau.net", {
    title: ".header h1"
  , desc: ".header h2"
  , avatar: {
        selector: ".header img"
      , attr: "src"
    }
}).then(({ data, status }) => {
    console.log(`Status Code: ${status}`)
    console.log(data)
});


// Async-Await
(async () => {
    const { data } = await scrapeIt("https://ionicabizau.net", {
        // Fetch the articles
        articles: {
            listItem: ".article"
          , data: {

                // Get the article date and convert it into a Date object
                createdAt: {
                    selector: ".date"
                  , convert: x => new Date(x)
                }

                // Get the title
              , title: "a.article-title"

                // Nested list
              , tags: {
                    listItem: ".tags > span"
                }

                // Get the content
              , content: {
                    selector: ".article-content"
                  , how: "html"
                }

                // Get attribute value of root listItem by omitting the selector
              , classes: {
                    attr: "class"
                }
            }
        }

        // Fetch the blog pages
      , pages: {
            listItem: "li.page"
          , name: "pages"
          , data: {
                title: "a"
              , url: {
                    selector: "a"
                  , attr: "href"
                }
            }
        }

        // Fetch some other data from the page
      , title: ".header h1"
      , desc: ".header h2"
      , avatar: {
            selector: ".header img"
          , attr: "src"
        }
    })
    console.log(data)
    // { articles:
    //    [ { createdAt: Mon Mar 14 2016 00:00:00 GMT+0200 (EET),
    //        title: 'Pi Day, Raspberry Pi and Command Line',
    //        tags: [Object],
    //        content: '<p>Everyone knows (or should know)...a" alt=""></p>\n',
    //        classes: [Object] },
    //      { createdAt: Thu Feb 18 2016 00:00:00 GMT+0200 (EET),
    //        title: 'How I ported Memory Blocks to modern web',
    //        tags: [Object],
    //        content: '<p>Playing computer games is a lot of fun. ...',
    //        classes: [Object] },
    //      { createdAt: Mon Nov 02 2015 00:00:00 GMT+0200 (EET),
    //        title: 'How to convert JSON to Markdown using json2md',
    //        tags: [Object],
    //        content: '<p>I love and ...',
    //        classes: [Object] } ],
    //   pages:
    //    [ { title: 'Blog', url: '/' },
    //      { title: 'About', url: '/about' },
    //      { title: 'FAQ', url: '/faq' },
    //      { title: 'Training', url: '/training' },
    //      { title: 'Contact', url: '/contact' } ],
    //   title: 'Ionică Bizău',
    //   desc: 'Web Developer,  Linux geek and  Musician',
    //   avatar: '/images/logo.png' }
})()

❓ Get Help

There are a few ways to get help:

  1. Please post questions on Stack Overflow. You can open issues with questions, as long as you add a link to your Stack Overflow question.
  2. For bug reports and feature requests, open issues. 🐛
  3. For direct and quick help, you can use Codementor. 🚀

📝 Documentation

scrapeIt(url, opts, cb)

A scraping module for humans.

Params

  • String|Object url: The page url or request options.
  • Object opts: The options passed to scrapeHTML method.
  • Function cb: The callback function.

Return

  • Promise A promise object resolving with:
    • data (Object): The scraped data.
    • $ (Function): The Cheerio function. This may be handy to do some other manipulation on the DOM, if needed.
    • response (Object): The response object.
    • body (String): The raw body as a string.

scrapeIt.scrapeHTML($, opts)

Scrapes the data in the provided element.

For the format of the selector, please refer to the Selectors section of the Cheerio documentation.

Params

  • Cheerio $: The input element.

  • Object opts: An object containing the scraping information. If you want to scrape a list, you have to use the listItem selector:

    • listItem (String): The list item selector.
    • data (Object): The fields to include in the list objects:
      • <fieldName> (Object|String): The selector or an object containing:
        • selector (String): The selector.
        • convert (Function): An optional function to change the value.
        • how (Function|String): A function or function name to access the value.
        • attr (String): If provided, the value will be taken based on the attribute name.
        • trim (Boolean): If false, the value will not be trimmed (default: true).
        • closest (String): If provided, selects the first ancestor of the element matching the given selector.
        • eq (Number): If provided, it will select the nth element.
        • texteq (Number): If provided, it will select the nth direct text child. Deep text child selection is not possible yet. Overwrites the how key.
        • listItem (Object): An object, keeping the recursive schema of the listItem object. This can be used to create nested lists.

    Example:

    {
       articles: {
           listItem: ".article"
         , data: {
               createdAt: {
                   selector: ".date"
                 , convert: x => new Date(x)
               }
             , title: "a.article-title"
             , tags: {
                   listItem: ".tags > span"
               }
             , content: {
                   selector: ".article-content"
                 , how: "html"
               }
             , traverseOtherNode: {
                   selector: ".upperNode"
                 , closest: "div"
                 , convert: x => x.length
               }
           }
       }
    }

    If you want to collect specific data from the page, just use the same schema used for the data field.

    Example:

    {
         title: ".header h1"
       , desc: ".header h2"
       , avatar: {
             selector: ".header img"
           , attr: "src"
         }
    }

Return

  • Object The scraped data.

😋 How to contribute

Have an idea? Found a bug? See how to contribute.

💖 Support my projects

I open-source almost everything I can, and I try to reply to everyone needing help using these projects. Obviously, this takes time. You can integrate and use these projects in your applications for free! You can even change the source code and redistribute (even resell it).

However, if you get some profit from this or just want to encourage me to continue creating stuff, there are few ways you can do it:

  • Starring and sharing the projects you like 🚀

  • Buy me a book—I love books! I will remember you after years if you buy me one. 😁 📖

  • PayPal—You can make one-time donations via PayPal. I'll probably buy a tea. 🍵

  • Support me on Patreon—Set up a recurring monthly donation and you will get interesting news about what I'm doing (things that I don't share with everyone).

  • Bitcoin—You can send me bitcoins at this address: 1P9BRsmazNQcuyTxEqveUsnf5CERdq35V6

Thanks! ❤️

💫 Where is this library used?

If you are using this library in one of your projects, add it in this list. ✨

  • 3abn
  • @ben-wormald/bandcamp-scraper
  • @bogochunas/package-shopify-crawler
  • @lukekarrys/ebp
  • @thetrg/gibson
  • @tryghost/mg-webscraper
  • @web-master/node-web-scraper
  • airport-cluj
  • apixpress
  • bandcamp-scraper
  • beervana-scraper
  • bible-scraper
  • blankningsregistret
  • blockchain-notifier
  • brave-search-scraper
  • camaleon
  • carirs
  • cevo-lookup
  • cnn-market
  • codementor
  • codinglove-scraper
  • covidau
  • degusta-scrapper
  • dncli
  • egg-crawler
  • fa.js
  • flamescraper
  • fmgo-marketdata
  • gatsby-source-bandcamp
  • growapi
  • helyesiras
  • jishon
  • jobs-fetcher
  • leximaven
  • macoolka-net-scrape
  • macoolka-network
  • mersul-microbuzelor
  • mersul-trenurilor
  • mit-ocw-scraper
  • mix-dl
  • node-red-contrib-getdata-website
  • node-red-contrib-scrape-it
  • nurlresolver
  • paklek-cli
  • parn
  • picarto-lib
  • rayko-tools
  • rs-api
  • sahibinden
  • sahibindenServer
  • salesforcerelease-parser
  • scrape-it-cli
  • scrape-vinmonopolet
  • scrapos-worker
  • sgdq-collector
  • simple-ai-alpha
  • spon-market
  • startpage-quick-search
  • steam-workshop-scraper
  • trump-cabinet-picks
  • u-pull-it-ne-parts-finder
  • ubersetzung
  • ui-studentsearch
  • university-news-notifier
  • uniwue-lernplaetze-scraper
  • vandalen.rhyme.js
  • wikitools
  • yu-ncov-scrape-dxy

📜 License

MIT © Ionică Bizău


scrape-it's Issues

Attributes option does not work with 'data' attributes

Is there a way to pull data attributes such as data-url?

I've tried the following (example):

{
     title: ".header h1"
   , desc: ".header h2"
   , avatar: {
         selector: ".header img"
       , attr: "data-url"
     }
}

It appears that cheerio has a separate method for handling data attributes:
$elm.data('url')

If this is needed, I can create a PR.
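A possible workaround in the meantime, untested: since how accepts a function that receives the matched cheerio element, .data() should already be reachable without a new option:

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    avatar: {
        selector: ".header img",
        // `how` is called with the cheerio element, so cheerio's
        // .data() method can be used directly
        how: $elm => $elm.data("url")
    }
}).then(({ data }) => console.log(data))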

Single page applications

Is there anything special I need to do to wait for dynamic elements to be ready? I want to use it for a site that uses Angular where everything is pulled in via AJAX.

Lambda?

How difficult would it be to use it in AWS Lambda?

Charset

Is there a property to set the charset of the request?

[Question] list item content

if I have a simple list of items

<p> text 1 </p>
<p> text 2 </p>
<p> text 3 </p>

how do I extract the content?

//Probably wrong
{
  listItem: "p",
  name: "paragraphs",
  data: {
    text: "*"
  }
}
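A hedged sketch of what should work instead: per the nested-list example in the README (the tags field), a listItem without a data object yields an array of the items' text:

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    // Each <p> becomes one string in the resulting array
    paragraphs: {
        listItem: "p"
    }
}).then(({ data }) => {
    console.log(data.paragraphs) // e.g. [ "text 1", "text 2", "text 3" ]
})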

Cheerio issue with Webpack

I get the following issue when I try to run this with webpack.

Full stack trace:


ERROR in ./~/scrape-it/~/cheerio-req/~/cheerio/index.js
Module not found: Error: Cannot resolve 'file' or 'directory' ./package in /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio
resolve file
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package doesn't exist
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.webpack.js doesn't exist
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.web.js doesn't exist
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.js doesn't exist
resolve directory
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package doesn't exist (directory default file)
  /Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package/package.json doesn't exist (directory description file)
[/Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package]
[/Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.webpack.js]
[/Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.web.js]
[/Users/Spicycurryman/Desktop/Main/styell/node_modules/scrape-it/node_modules/cheerio-req/node_modules/cheerio/package.js]
 @ ./~/scrape-it/~/cheerio-req/~/cheerio/index.js 11:18-38

I've tried updating Cheerio inside the scrape-it module itself and outside, but no luck.

It exists.

EDIT:

Here's my webpack.config.js file


var webpack = require('webpack');
var path = require('path');
var fs = require('fs');
var request = require('request');
var CaseSensitivePathsPlugin = require('case-sensitive-paths-webpack-plugin');


module.exports = {
    entry: "./main.js",
    output: {
        path: __dirname,
        filename: "bundle.js"
    },
    module: {
    loaders: [
      { test: /\.json$/, loader: 'json-loader' }
        ]
    },
    resolve: {
        extensions: ['', '.webpack.js', '.web.js', '.js']
    },
    node: {
        console: 'empty',
        fs: 'empty',
        net: 'empty',
        tls: 'empty'
    }
}

How can this be resolved? Why is this occurring?

Problem with closest

Thanks for creating such an awesome library! It was really easy to get up and running, and I'm enjoying it quite a bit.

I'm trying to write a simple function that scrapes a StackOverflow profile. Here's what I have so far:

function fetchUser(id) {
  return scrapeIt(`https://stackoverflow.com/users/${ id }`, {
    name: ".user-card-name",
    location: {
      selector: ".locationIcon",
      closest: "li"
    }
  });
}

I tried running this function against my own profile:

fetchUser(262125).then(console.log);

Here's what I got:

{ 
  name: 'LandonSchropp',
  location: '' 
}

This is what the DOM looks like on that page:

<ul class="list-unstyled">
  <li>
    <svg role="icon" class="svg-icon iconLocation" width="18" height="18" viewBox="0 0 18 18">
      <path d="..."></path>
    </svg>
    Seattle, WA
  </li>
  ...
</ul>

Is this a bug? Shouldn't the closest li to iconLocation contain the text Seattle, WA?

Thanks!

Question about parentNode

Wondering if I use 'listItem', would I be able to traverse up the DOM tree to display info about the parentNode?

SyntaxError: Unexpected token =>

I'm using the exact code in the example; however, I'm getting the following error:

}).then(page => {
	             ^^
SyntaxError: Unexpected token =>

I have the following code:

var express = require('express');
var router = express.Router();
var request = require('../node_modules/request');
const scrapeIt = require('../node_modules/scrape-it');


router.get('/', function(req, res, next) {
	// Promise interface
	scrapeIt("http://ionicabizau.net", {
	    title: ".header h1"
	  , desc: ".header h2"
	  , avatar: {
	        selector: ".header img"
	      , attr: "src"
	    }
	}).then(page => {
	    console.log(page);
	});

	// Callback interface
	scrapeIt("http://ionicabizau.net", {
	    // Fetch the articles
	    articles: {
	        listItem: ".article"
	      , data: {

	            // Get the article date and convert it into a Date object
	            createdAt: {
	                selector: ".date"
	              , convert: x => new Date(x)
	            }

	            // Get the title
	          , title: "a.article-title"

	            // Nested list
	          , tags: {
	                listItem: ".tags > span"
	            }

	            // Get the content
	          , content: {
	                selector: ".article-content"
	              , how: "html"
	            }
	        }
	    }

	    // Fetch the blog pages
	  , pages: {
	        listItem: "li.page"
	      , name: "pages"
	      , data: {
	            title: "a"
	          , url: {
	                selector: "a"
	              , attr: "href"
	            }
	        }
	    }

	    // Fetch some other data from the page
	  , title: ".header h1"
	  , desc: ".header h2"
	  , avatar: {
	        selector: ".header img"
	      , attr: "src"
	    }
	}, (err, page) => {
	    console.log(err || page);
	});
});

module.exports = router;

Why is this occurring? And how can this be resolved?

Code in README Instructions doesn't work, throws error.

After writing

npm install
node Scraper.js

It throws an error like this:
/Users/SKYLIFE/node_modules/scrape-it/lib/index.js:128
throw new Err("There is no element selected for the '<option.name>' field. Please provide a selector, list item or use nested object structure.", {
^

Any solutions? Or am I missing something?
Thanks!

Tests

...because yeah... 😁

How do you get top level text only?

How do I get a particular piece of text from a tag with nested tags that also contain text?

For example:

<div class="subtitle">
  <div class="when">Wed., Sep. 27,  8 p.m. - Ongoing | <a href="#schedule_calendar" data-
  toggle="modal">More dates</a></div>
  700 N. Calvert St. 
  <br /> 
  <a href="/venues/centerstage-baltimore">Center Stage</a> 
  <br /> 
  <div>$22-$79 | 410-332-0033</div>
</div>

I want 700 N. Calvert St. but instead I get Wed., Sep. 27, 8 p.m. - Ongoing | More dates700 N. Calvert St. Center Stage $22-$79 | 410-332-0033
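The texteq option documented above selects the nth direct text child, which is exactly this case; a sketch (the index is an assumption and may need adjusting depending on how whitespace-only text nodes are counted):

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    address: {
        selector: ".subtitle",
        texteq: 0   // the first direct text child: "700 N. Calvert St."
    }
}).then(({ data }) => console.log(data.address))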

Build with babel

As scrape-it is written in ES6 it can't be used with Node < v4.
A lot of people still use Node 0.10 or 0.12.

It would be nice to build the project using Babel.

Add support for crawling

It would be nice to add support for crawling pages and following links. I suggest:

  • Concurrency support
  • Request delay

Is there support for scraping single elements in table?

Hi,

I have this simple code that fails every time.

var player = {
    url: "https://fr.wikipedia.org/wiki/Kossi_Agassa",
    data: {
        club: {
            selector: "#mw-content-text > table.wikitable.alternance2.centre > tbody > tr:nth-child(3) > td:nth-child(2) > span > b > a"
        }
    }
};

scrape(player.url, player.data)
    .then(console.log)
    .catch(console.error);

The output is always { club: '' } when it really should be { club: "FC Metz" }

Selecting table rows as listItem data seems to work, but selecting single elements from a table never does. I even tried with "#mw-content-text > table.infobox_v2 > tbody > tr:nth-child(4) > td > a". However, selecting single elements outside of tables works like a charm.

What am I doing wrong?

Object input

I'm not really sure why I actually implemented this in this way: passing an array of objects.

Since I designed and implemented nested lists (lists in lists in lists, etc.), this can be taken to another level by passing an object (not an array) as the input:

scrapeIt("http://ionicabizau.net", {
    articles: {
        listItem: ".article"
      , data: {
            createdAt: {
                selector: ".date"
              , convert: x => new Date(x)
            }
          , title: "a.article-title"
          , tags: {
                selector: ".tags"
              , convert: x => x.split("|").map(c => c.trim()).slice(1)
            }
          , content: {
                selector: ".article-content"
              , how: "html"
            }
        }
    }
  , title: ".header h1"
  , desc: ".header h2"
  , avatar: {
       selector: ".header img"
     , attr: "src"
  }
}, (err, page) => {
    console.log(err || page);
});

I'm not sure if this works already or not, but anyway, this simplifies the code a bit.

Doesn't work with gzip results

Not sure if this bug belongs in scrape-it, cheerio-req, or tinyreq.

A page I'm scraping is returning a result with 'content-encoding: gzip', and scrape-it isn't returning any results.

I traced function calls and callbacks into the tinyreq library, and it looks like it is passing the raw, gzipped body into cheerio.load().

I'm not sure the best way to handle this, my node-fu is not very strong. I tried overriding cheerio-req.request with a function that would unzip the body if the content-encoding is gzip, but I can't seem to get it working.

Any help would be appreciated.

Scraping special characters

Whenever I scrape sites with special characters, like accents, I get output like this:
Percusi�n
M�quina de

How to make HTML parsing case-insensitive?

How can I use scrape-it to search through the site elements in a case insensitive manner? There are some attribute values that I need to scrape and the capitalization of the values varies across pages on the site.
Thanks.

Maximum number in a list

Hi,

When I run the code below, it works for a certain number of items, then it outputs
"... 472 more items ] }"

Is there a configuration to limit the output that can be changed?

thanks,

const scrapeIt = require("scrape-it");

scrapeIt("https://avia.nl/tankstations/", {
    stations: {
        listItem: ".search-result-inner-left",
        data: {
            name: {
                selector: "h3"
            },
            address1: {
                selector: ".first"
            },
            address2: {
                selector: ".first ~ span"
            }
        }
    }
}).then(page => {
    console.log(page);
});
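The "... 472 more items" truncation comes from Node's console formatting, not from scrape-it; printing with util.inspect and an explicit maxArrayLength shows everything:

const util = require("util")
const scrapeIt = require("scrape-it")

scrapeIt("https://avia.nl/tankstations/", {
    stations: {
        listItem: ".search-result-inner-left",
        data: {
            name: { selector: "h3" }
        }
    }
}).then(page => {
    // maxArrayLength: null disables the "... N more items" truncation
    console.log(util.inspect(page, { depth: null, maxArrayLength: null }))
})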

Scrape multiple things a from a single HTML node with custom conversion

Let's say we have the HTML:

<table>
  <tr>
    <td>21.11.2017<br>22.11.2017</td>
    <td>Event A</td>
  </tr>
 <tr>
    <td>25.11.2017<br>27.11.2017</td>
    <td>Event B</td>
  </tr>
</table>

I would like to parse every row's first cell as two entries in my object: startDate and endDate.
Currently, I'm doing this redundantly:

function scrapeDate(first, str) {
  /* Redundant code section */
  let myData = str.split(' '); // or regex
  /* ... */
  return first ? myData[0] : myData[1];
}

scrapeIt(URI, {
  events: {
    listItem: "table tr",
    data: {
      startDate: {
        selector: "td",
        eq: 0,
        convert: scrapeDate.bind(null, true)
      },
      endDate: {
        selector: "td",
        eq: 0,
        convert: scrapeDate.bind(null, false)
      }
    }
  }
});

Just opening this here, maybe I get an idea on how a flexible but clean API could look like.

[Improvement] Multiple requests?

I'm trying to get something like promise.all.

Here is my situation, I want to scrape multiple websites at once. Let's say I want Google and Twitter.

const res = [];

scrapeIt("https://google.com", {
    
}, (err, page) => {
    res.push(page)
});

scrapeIt("https://twitter.com", {
    
}, (err, page) => {
    res.push(page)
});

//When all of the above done, do the cb handle for the result.
 
console.log(res);

Because these calls are asynchronous, the code above won't work. So I wonder if you could give me a walkthrough for this issue?
Also, let's say I want to get the title from Google but a tweet's content from Twitter, meaning each URL needs a different callback. Is there any way to deal with this kind of issue?

Thank you.
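A hedged sketch using the Promise interface with Promise.all (the selectors are placeholders; with recent versions the promise resolves with a { data, ... } object, as in the README example above):

const scrapeIt = require("scrape-it")

Promise.all([
    scrapeIt("https://google.com", { title: "title" }),
    scrapeIt("https://twitter.com", { tweet: ".tweet-text" })
]).then(([google, twitter]) => {
    // Runs only after both scrapes have finished; each URL keeps its own shape
    console.log(google.data.title)
    console.log(twitter.data.tweet)
})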

Scrape in a nested ul li list ?

I am sorry but I am in trouble; I hope somebody can help me. I have this page I want to scrape:

<ul>
<li><a>1</a></li>
<li><a>2</a></li>
<li><a>3</a></li>
<li><a>4</a></li>
</ul>

<ul>
<li><a>1</a></li>
<li><a>2</a></li>
<li><a>3</a></li>
<li><a>4</a></li>
</ul>

I need the ul li a content of the second ul element, not the first one, so I did:

{ selector: 'ul', eq: 2 }

but it doesn't seem to work. Can you please help me? See the possible fix sketched below.

Thanks
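A possible fix, assuming eq follows cheerio's zero-indexed .eq(): the second ul would be eq: 1, not eq: 2. A sketch:

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    secondList: {
        selector: "ul",
        eq: 1,        // zero-indexed: 0 is the first <ul>, 1 is the second
        how: "text"
    }
}).then(({ data }) => console.log(data.secondList))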

Scrape bunch of urls Array?

Hi, I have a bunch of URLs in an array that I want to scrape. Can you tell me if this is the correct way to achieve it:

for (i in urls) {
    scrapeIt(i, {
        selector: '#gradients',
        how: 'html'
    }).then((data) => {
        console.log('DONE with url', ${i}, data);
    });
}

I think it works, but it takes too much time. I don't know why or how to debug it, so I am asking you. Thanks for your patience!

TypeError: Cannot read property 'text' of undefined

When trying out your example from the blog I get this error:

$ node index.js
/Users/jparnow/Documents/Unicorns/node_modules/scrape-it/lib/index.js:155
                    let value = typpy(cOpt.how, Function) ? cOpt.how($elm) : $elm[cOpt.how]();
                                                                                 ^

TypeError: Cannot read property 'text' of undefined
    at iterateObj (/Users/jparnow/Documents/Unicorns/node_modules/scrape-it/lib/index.js:155:82)
    at iterateObject (/Users/jparnow/Documents/Unicorns/node_modules/iterate-object/lib/index.js:25:17)
    at handleDataObj (/Users/jparnow/Documents/Unicorns/node_modules/scrape-it/lib/index.js:119:13)
    at Function.scrapeIt.scrapeHTML (/Users/jparnow/Documents/Unicorns/node_modules/scrape-it/lib/index.js:172:12)
    at req (/Users/jparnow/Documents/Unicorns/node_modules/scrape-it/lib/index.js:28:27)
    at Tinyreq (/Users/jparnow/Documents/Unicorns/node_modules/cheerio-req/lib/index.js:21:9)
    at opt_callback (/Users/jparnow/Documents/Unicorns/node_modules/tinyreq/lib/index.js:57:9)
    at IncomingMessage.res.on.on.on (/Users/jparnow/Documents/Unicorns/node_modules/tinyreq/lib/index.js:89:13)

I’m using Node v6.0.0

It would be great if this could be fixed.

Support for nested json

What a great lib! I moved nearly 40 different parsers to your library in less than 2 hours, with a huge decrease in LOC. :)

scrapeIt(url, {
    provider: ".powered-by"
    , status: {
        description: ".page-status span.status"
    }

}).then(page => {
    console.log(page);
});

I am getting

                   let value = typpy(cOpt.how, Function) ? cOpt.how($elm) : $elm[cOpt.how]();
                                                                                 ^

TypeError: Cannot read property 'text' of undefined

ECONNREFUSED

Hi,

I see this issue has come up before, but I am getting it now, always on subsequent requests when scraping URLs found in already-scraped pages.

Error:

{ Error: connect ECONNREFUSED 127.0.0.1:443
    at Object.exports._errnoException (util.js:1016:11)
    at exports._exceptionWithHostPort (util.js:1039:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1138:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 443 }

Gist of full code and package.lock-file: https://gist.github.com/gnimmelf/dd6bd5c8f518eb967ca93cec588a0b12

Any ideas?

Thank you!

Error: connect ECONNREFUSED 127.0.0.1:443

I used the scrape-it module before and it didn't give me any errors.

I thought the error was in my repo, but I got it after downloading this repo as well.

I ran npm install for the dependencies and tried to run the example index.js with node index.js.

The error is the following:

Gabriel at iMac-de-Gabriel in ~/scrape-it/example on master*
$ node index
{ Error: connect ECONNREFUSED 127.0.0.1:443
    at Object.exports._errnoException (util.js:1026:11)
    at exports._exceptionWithHostPort (util.js:1049:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1081:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 443 }

I have node 6.5.0 and npm 3.10.3

ECONNREFUSED

I've been using this module for a couple of weeks now and recently I get this error when using it :

{ Error: connect ECONNREFUSED 127.0.0.1:443
    at Object.exports._errnoException (util.js:953:11)
    at exports._exceptionWithHostPort (util.js:976:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 443 }

I understand the connection-refused bit, which is totally expected btw because nothing listens on 127.0.0.1:443... but why is it trying to reach that address when I'm requesting a site on the internet?

Here is a small snippet that I use to isolate the issue from other bits and pieces of my app:

const scrapeIt   = require('scrape-it');
scrapeIt("http://korben.info/", {
    posts: {
        listItem: ".post",
        data: {
            title: {
                selector: '.post-title'
            }
        }
    }
}, (err, result) => {
    if(err){
        console.log("ERROR fecthing site");
        console.log(err);
    }else{
        console.log(result);
    }
});

I know it's likely something to do with my setup, but I'm struggling to see where it could come from... especially since I haven't touched anything over the weekend...

Multiple selection

Hi, I couldn't find in the documentation whether there is any option to collect data from all elements that match one particular selector.

E.g. (collect all links from one particular site):

url: {
        selector: 'a',
        attr: "href"
    }

Is there an option like that?
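A sketch based on the root-attribute pattern from the README example ("Get attribute value of root listItem by omitting the selector"): treat each matching element as a list item and read its attribute:

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    links: {
        listItem: "a",
        data: {
            // No selector here: the attribute is read from the <a> itself
            url: { attr: "href" }
        }
    }
}).then(({ data }) => {
    console.log(data.links) // e.g. [ { url: "/" }, { url: "/about" }, ... ]
})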

Avoid getting banned?

Hey, this tool is awesome. I was going to scrape a large amount of data, but I have some doubts... I usually write jQuery or JS scripts to run in the browser, because curl requests are often banned, especially by Cloudflare-protected websites.

Do you have any advice about using this tool to collect large amounts of data (3000-5000 pages)?

How would you handle authentication ?

Hi,
I need to scrape a webpage on a site that is member-only. How would I use a cookie or authenticate myself against this website (is it even possible using this lib)?

Selecting a text node

Hi guys,

Quick question:
Is it possible here to target the text content of the second node? I'm having a great deal of trouble getting it.

Ideas?

<div id='foo'>
  <a>baz</a>
  "someText"
  <p>bar</p>
</div>

I expect to get "someText".

Running the example doesn't work

Hi there,

The example found in the readme does not work when running it.
Here is the output when running it. Perhaps I'm being dumb but this is the result of running node app.js

throw new Err("There is no element selected for the '<option.name>' field. Please provide a selector, list item or use nested object structure.", {
                ^

Error: There is no element selected for the 'data' field. Please provide a selector, list item or use nested object structure.
    at Error (native)
    at new Err (/Users/josh/Projects/daystext/node_modules/err/lib/index.js:57:81)
    at /Users/josh/Projects/daystext/node_modules/scrape-it/lib/index.js:128:23
    at iterateObject (/Users/josh/Projects/daystext/node_modules/iterate-object/lib/index.js:25:17)
    at handleDataObj (/Users/josh/Projects/daystext/node_modules/scrape-it/lib/index.js:122:9)
    at Function.scrapeIt.scrapeHTML (/Users/josh/Projects/daystext/node_modules/scrape-it/lib/index.js:177:12)
    at /Users/josh/Projects/daystext/node_modules/scrape-it/lib/index.js:29:27
    at /Users/josh/Projects/daystext/node_modules/cheerio-req/lib/index.js:24:9
    at opt_callback (/Users/josh/Projects/daystext/node_modules/tinyreq/lib/index.js:57:9)
    at IncomingMessage.<anonymous> (/Users/josh/Projects/daystext/node_modules/tinyreq/lib/index.js:89:13)

Set user agent

I came across a site I'm trying to scrape which does not return results unless the User-Agent header is set. It's probably good practice anyway to set it to something like "scrape-it X.Y.Z" to self-identify. Alternatively, my problem would be solved if it were possible to pass options through to the underlying request which I could imagine being useful for other purposes too.
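Per the docs above, the first parameter can be a request-options object instead of a plain URL, so a custom header can likely be passed through; a sketch, assuming the underlying request honors a headers object:

const scrapeIt = require("scrape-it")

scrapeIt({
    url: "https://example.com",
    headers: {
        "User-Agent": "scrape-it"
    }
}, {
    title: "h1"
}).then(({ data }) => console.log(data))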

Getting data from different tags into one object

Hi

Apologies for the messy code below, but I'm not sure how to structure it. I want to create an object from the data below, but each entry after the '.FHeader' row should have the 'FHeader' contents included in a date key:

        "date": "Thursday 14 Dec 2017",
        "time": "",
        "court": "",
        "team1Url": "",
        "team2Url": ""
    },
    {
        "date": "",
        "time": "18:00",
        "court": "STORM",
        "team1Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=86353",
        "team2Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=83964"
    },
    {
        "date": "",
        "time": "18:00",
        "court": "THUNDER",
        "team1Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=82083",
        "team2Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=84036"
    },
    {
        "date": "",
        "time": "19:10",
        "court": "THUNDER",
        "team1Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=91315",
        "team2Url": "TeamProfile.aspx?VenueId=21&LeagueId=190&SeasonId=4807&DivisionId=32923&TeamId=83964"
    }

So with the above, I'd like the date to be present in each object, and then have it change as the scraper continues to the next table with a new date and other <tr> elements.

<tbody>
  <tr class="FHeader"><td colspan="5">Thursday 14 Dec 2017</td></tr>
  <tr class="FRow FBand">
    <td class="FDate">16:00</td>
    <td class="FPlayingArea">Willowmoore<br></td>
    <td class="FHomeTeam"><a href="TeamProfile.aspx?VenueId=6&amp;LeagueId=81&amp;SeasonId=4761&amp;DivisionId=32688&amp;TeamId=51886">Rock Stars</a></td>
    <td class="FScore"><div><nobr data-fixture-id="1475077">vs</nobr></div></td>
    <td class="FAwayTeam"><a href="TeamProfile.aspx?VenueId=6&amp;LeagueId=81&amp;SeasonId=4761&amp;DivisionId=32688&amp;TeamId=43939">The Dogs Of War</a></td>
  </tr>
  <tr class="FRow">
    <td class="FDate">18:20</td>
    <td class="FPlayingArea">Willowmoore<br></td>
    <td class="FHomeTeam"><a href="TeamProfile.aspx?VenueId=6&amp;LeagueId=81&amp;SeasonId=4761&amp;DivisionId=32688&amp;TeamId=43939">The Dogs Of War</a></td>
    <td class="FScore"><div><nobr data-fixture-id="1486224"></nobr></div></td>
    <td class="FAwayTeam"></td>
  </tr>
  <tr class="FRow FBand">
    <td class="FDate">19:40</td>
    <td class="FPlayingArea">Springbok<br></td>
    <td class="FHomeTeam"><a href="TeamProfile.aspx?VenueId=6&amp;LeagueId=81&amp;SeasonId=4761&amp;DivisionId=32688&amp;TeamId=91352">Silly Sloggers 2.0</a></td>
    <td class="FScore"><div><nobr data-fixture-id="1475078">vs</nobr></div></td>
    <td class="FAwayTeam"><a href="TeamProfile.aspx?VenueId=6&amp;LeagueId=81&amp;SeasonId=4761&amp;DivisionId=32688&amp;TeamId=1286">Kwagga's</a></td>
  </tr>
</tbody>

If something is unclear, please let me know.

[Question] Getting attribute from listItem

I want to scrape a table with a bunch of cells, some of which have links in them. By using the options below I can get an array containing the text of the links.

{
  links: {
    listItem: "table tr td a"
  }
}		

However, I need to get the href attribute of the links. How would I go about doing that?

Missing timeout option

Hi,

I've tried to use this library to scrape about 40 websites asynchronously. I do this by using the Promise object returned by scrapeIt, and then doing something like this:

Promise.all(promises).then( function(result){

The problem here is that one of the websites I scrape can be down or slow at unspecified times (I have no control over it). The problem with the library is that scrapeIt never seems to time out (I tried it for a few minutes, but it won't return and run the Promise.all ... code).

Any suggestions on how I can make it time out (while still using the library's promises)? Did I miss an option? One possible workaround is sketched below.
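One library-agnostic workaround: race each scrape promise against a timer so Promise.all always settles. A sketch (URLs and selector are placeholders):

const scrapeIt = require("scrape-it")

const withTimeout = (promise, ms) =>
    Promise.race([
        promise,
        new Promise((resolve, reject) =>
            setTimeout(() => reject(new Error("scrape timed out")), ms))
    ])

const urls = ["https://example.com" /* , ... */]

Promise.all(
    urls.map(url =>
        withTimeout(scrapeIt(url, { title: "h1" }), 10000)
            // Catch per-URL failures so one timeout doesn't reject the whole batch
            .catch(err => ({ error: err.message })))
).then(results => console.log(results))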

How to remove Specific content from given selector?

Hi,

Using scrape-it, how can I remove specific content from a given selector's content? In the example below I want to scrape all the elements under the .selector element except the first ul (List1, List2, List3, List4). One possible approach is sketched below the example.

E.G.

<div class="selector">
    <ul>
        <li>List1</li>
        <li>List2</li>
        <li>List3</li>
        <li>List4</li>
    </ul>
    <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. </p>
    <p>Sed consequat ligula sed accumsan pharetra. Duis non erat nibh aliquet.</p>
    <h2>Sub Heading</h2>
    <ul>
        <li>Below_List1</li>
        <li>Below_List2</li>
        <li>Below_List3</li>
        <li>Below_List4</li>
    </ul>
</div>
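One possible approach, untested: since how accepts a function that receives the cheerio element, the unwanted node can be removed from a clone before taking the text:

const scrapeIt = require("scrape-it")

scrapeIt("https://example.com", {
    content: {
        selector: ".selector",
        // Work on a clone so the original document is left untouched
        how: $elm => {
            const $copy = $elm.clone()
            $copy.find("ul").first().remove()
            return $copy.text().trim()
        }
    }
}).then(({ data }) => console.log(data.content))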

List of links

Hi

Can you possibly help me please? I'm trying to scrape a list of links not wrapped in a ul or li. There is an attr and a name I want to store in one object, and then the objects in a single array.

<div class="Content">

<a href="/ActionController/LeagueList?VenueId=6">Benoni</a>
<br>
<a href="/ActionController/LeagueList?VenueId=38">Bloemfontein</a>
<br>
...

I want to store the information like so:

[{name: 'Benoni', url: 'https://...'}, ...]

I'm currently being forced to make 2 calls to grab the name and url separately; any advice would be great, thanks. A single-request sketch follows below.
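A sketch that gets both values in a single request by scraping the same anchors twice and zipping the results (the page URL and absolute-URL base are assumptions):

const scrapeIt = require("scrape-it")

const BASE = "https://example.com"

scrapeIt(BASE, {
    // A plain listItem yields the link texts...
    names: { listItem: ".Content a" },
    // ...and the same listItem with a root `attr` yields the hrefs
    links: {
        listItem: ".Content a",
        data: {
            url: { attr: "href" }
        }
    }
}).then(({ data }) => {
    const venues = data.names.map((name, i) => ({
        name,
        url: new URL(data.links[i].url, BASE).href
    }))
    console.log(venues) // e.g. [ { name: 'Benoni', url: 'https://example.com/ActionController/LeagueList?VenueId=6' }, ... ]
})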

feature: cli

Is there any interest in adding a CLI for this project?
BTW: I'd be happy to help.

What I'd love to use:

scrape-it <url> <definition>

Also, piping would be great

curl ... | scrape-it - <definition>

By definition I mean *.js file with

module.exports = {
  articles: {
    listItem: ...
  }
};
