This script scrapes the body text from a list of URLs and outputs the text from each URL in a separate .txt file.
It was built on OS X 10.10. It uses bash and Ruby with the nokogiri library.
To use it, open a terminal in the directory and run the following command:
./scrape.sh urls.txt
This reads every URL in urls.txt and creates a new text file for each one in a subdirectory called texts. Before running the script, make sure all your URLs are in urls.txt, with exactly one URL per line (e.g. http://example.com/article) and no blank lines.
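The overall flow can be sketched as follows. This is a hypothetical Ruby rendering of the loop (the actual script is bash plus Ruby, and scrape_all is an illustrative name, not part of the script); the page fetcher is passed in as a block so the sketch stays self-contained:

```ruby
require 'fileutils'

# Hypothetical sketch of the script's outer loop: read one URL per line
# from list_path, fetch each page via the supplied block, and write the
# result to out_dir/1.txt, out_dir/2.txt, and so on.
def scrape_all(list_path, out_dir = 'texts', &fetch)
  FileUtils.mkdir_p(out_dir)
  File.readlines(list_path, chomp: true).each_with_index do |url, i|
    next if url.strip.empty?  # urls.txt should have no blank lines, but be safe
    File.write(File.join(out_dir, "#{i + 1}.txt"), fetch.call(url))
  end
end
```

In the real script the block's job is done by fetching the page and extracting its text; here any callable that maps a URL to a string will do.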
This script first grabs what is inside a web page's <title> tags, then extracts the text enclosed in the HTML tags matched by the XPath //body/p. This works for roughly 90% of web pages, but some pages don't put their article text inside <p> tags. (For example, important text sometimes lives inside <ul> or <li> tags.)
The text files generated by the script are named 1.txt, 2.txt, 3.txt, and so on. A nice improvement would be to use the article title, or part of it, as each text file's name.
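That improvement might start with a title-to-filename helper along these lines (slug_for is a hypothetical helper, not part of the script):

```ruby
# Hypothetical helper: turn an article title into a filesystem-safe name.
# Returns nil for titles with no usable characters, so the caller can
# fall back to the numbered-file scheme in that case.
def slug_for(title, max_len = 50)
  slug = title.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
  slug.empty? ? nil : slug[0, max_len]
end
```

Two pages with the same title would still collide, so the caller would also need to deduplicate, for example by appending a counter to repeated slugs.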