This script scrapes the body text from a list of URLs and outputs the text from each URL in a separate .txt file.
It was built on OS X 10.10. It uses bash and Ruby with the nokogiri library.
To use it, open a terminal in the directory and run the following command:
./scrape.sh urls.txt
This reads every URL in urls.txt and creates a new text file for each one in a subdirectory called texts. Before running the script, make sure all your URLs are in urls.txt, with exactly one URL per line (e.g. http://example.com/article) and no blank lines.
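The overall flow can be sketched as follows. This is a hypothetical Ruby rendering of the loop (the actual script is bash plus Ruby, and scrape_all is an illustrative name, not part of the script); the page fetcher is passed in as a block so the sketch stays self-contained:

```ruby
require 'fileutils'

# Hypothetical sketch of the script's outer loop: read one URL per line
# from list_path, fetch each page via the supplied block, and write the
# result to out_dir/1.txt, out_dir/2.txt, and so on.
def scrape_all(list_path, out_dir = 'texts', &fetch)
  FileUtils.mkdir_p(out_dir)
  File.readlines(list_path, chomp: true).each_with_index do |url, i|
    next if url.strip.empty?  # urls.txt should have no blank lines, but be safe
    File.write(File.join(out_dir, "#{i + 1}.txt"), fetch.call(url))
  end
end
```

In the real script the block's job is done by fetching the page and extracting its text; here any callable that maps a URL to a string will do.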
This script first grabs what is inside a web page's <title> tags, then extracts the text enclosed in the HTML tags matched by the XPath //body/p. This works for roughly 90% of web pages, but some pages don't put their article text inside <p> tags. (For example, important text sometimes lives inside <ul> or <li> tags.)
The text files generated by the script are named 1.txt, 2.txt, 3.txt, and so on. A nice improvement would be to use the article title, or part of it, as each text file's name.
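That improvement might start with a title-to-filename helper along these lines (slug_for is a hypothetical helper, not part of the script):

```ruby
# Hypothetical helper: turn an article title into a filesystem-safe name.
# Returns nil for titles with no usable characters, so the caller can
# fall back to the numbered-file scheme in that case.
def slug_for(title, max_len = 50)
  slug = title.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
  slug.empty? ? nil : slug[0, max_len]
end
```

Two pages with the same title would still collide, so the caller would also need to deduplicate, for example by appending a counter to repeated slugs.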