
minet's Introduction


Minet

minet is a webmining command line tool & library for python (>= 3.7) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud etc.

It adopts a very simple approach to various webmining problems by letting you perform a wide array of tasks from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet also exposes its high-level programmatic interface as a python library, so you remain free to use its utilities wherever they better suit your use cases.
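For instance, here is a minimal sketch of resolving a batch of urls programmatically. It assumes that multithreaded_resolve can be imported from the top-level minet package and accepts the threads, throttle and max_redirects keyword arguments, as in the user script quoted in one of the issues reproduced further down this page; check the Python library documentation for the exact API.

# Minimal sketch, not necessarily the canonical API: multithreaded_resolve
# and its keyword arguments are assumed from the usage shown in the issue
# traceback further down this page.
from minet import multithreaded_resolve

urls = [
    "https://www.lemonde.fr",
    "https://medialab.sciencespo.fr",
]

# Each yielded result is expected to carry the original url, a potential
# error and the stack of redirections that was followed.
for result in multithreaded_resolve(urls, threads=10, throttle=0.2, max_redirects=5):
    print(result)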

minet is developed by research engineers at Sciences Po's médialab and is the consolidation of more than a decade of webmining practices targeted at social sciences.

As such, it has been designed to be:

  1. low-tech, as it requires minimal resources such as memory, CPUs or hard drive space and should be able to work on any low-cost PC.
  2. fault-tolerant, as it is able to recover when the network is bad and retry HTTP calls when suitable. What's more, most minet commands can be resumed if aborted and are designed to run for a long time (think days or months) without leaking memory.
  3. unix-compliant, as it can be piped easily and knows how to work with the usual streams.

Shortcuts: Command line documentation, Python library documentation.


How to cite?

minet is published on Zenodo under DOI 10.5281/zenodo.4564399.

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, Amélie Pellé, Laura Miguel, César Pichon, & Kelly Christensen. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Whirlwind tour

# Downloading a large amount of urls as fast as possible
minet fetch url -i urls.csv > report.csv

# Extracting raw text from the downloaded HTML files
minet extract -i report.csv -I downloaded > extracted.csv

# Scraping the urls found in the downloaded HTML files
minet scrape urls -i report.csv -I downloaded > scraped_urls.csv

# Parsing & normalizing the scraped urls
minet url-parse scraped_url -i scraped_urls.csv > parsed_urls.csv

# Scraping data from Twitter
minet twitter scrape tweets "from:medialab_ScPo" > tweets.csv

# Printing a command's help
minet twitter scrape -h

# Searching videos on YouTube
minet youtube search -k "MY-YT-API-KEY" "médialab" > videos.csv

Summary

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, with Facebook- and YouTube-specific stuff)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search:
  • Scrape (without requiring special access, often just a user account):
  • Grab & dump cookies from your browser
  • Dump Hyphe data

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages.
  • URL-related heuristics utilities such as extraction, normalization and matching (see the sketch after this list).
  • Data collection from various APIs such as CrowdTangle.
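As a sketch of what those URL heuristics look like from Python, the snippet below relies on ural, the médialab URL-handling library that minet builds upon; the functions shown (urls_from_text, normalize_url) are assumptions about ural's API rather than something documented on this page.

# Illustrative sketch using ural, the URL utility library minet relies on.
# Assumption: ural exposes urls_from_text and normalize_url as shown here.
from ural import normalize_url, urls_from_text

text = "Read https://www.lemonde.fr/article.html?utm_source=twitter#comments please"

for url in urls_from_text(text):
    # normalize_url is expected to strip tracking params, fragments, etc.
    print(url, "->", normalize_url(url))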

Installation

minet can be installed as a standalone CLI tool (currently only on macOS >= 10.14, Ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of an HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On Ubuntu & similar you might need to install curl and unzip before running the installation script if you don't already have them:

sudo apt-get install curl unzip

Otherwise, minet can be installed directly as a python CLI tool and library using pip:

pip install minet

Finally, if you want to install the standalone binaries yourself (even for Windows), you can find them attached to each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

minet's People

Contributors

16arpi, ameliepelle, bmaz, boogheta, camillechanial, d3scmps, davidlibeau, elanhermi, farjasju, fyunusa, heloisethero, jacomyma, kat-kel, kianmeng, mdamien, miguellaura, paubre, paulgirard, yomguithereal


minet's Issues

weird idna encoding error when resolving some urls

(note: It might be more appropriate to move this issue to quenouille)

When trying to resolve the url http://www.outremersbeyou.com/talent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/, we end up with the surprising stacktrace below.

(Which is weird, as this url is indeed a redirection, but to a normally encoded, albeit malformed, url: Location: http://www.outremers360.comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/)

I guess that, because of the missing slash after the domain, it considers "comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines" to be the TLD and, since there are dashes inside, tries to interpret it as punycode...

Traceback (most recent call last):
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/encodings/idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "bin/complete_links_resolving_v2.py", line 100, in <module>
    resolve()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "bin/complete_links_resolving_v2.py", line 57, in resolve
    for res in multithreaded_resolve(urls_to_clear, threads=50, throttle=0.5, max_redirects=15):
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 353, in output
    raise e.with_traceback(trace)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 303, in worker
    result = func(data)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/fetch.py", line 237, in worker
    error, stack = resolve(http, url, max=max_redirects, **kwargs)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 187, in resolve
    headers_only=True
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 167, in request
    redirect=redirect
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 68, in request
    **urlopen_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 89, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/poolmanager.py", line 326, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 355, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 183, in connect
    conn = self._new_conn()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
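The error can be reproduced in isolation. A minimal sketch, assuming (consistently with the stacktrace above) that the idna codec rejects the host because its last label exceeds 63 characters:

# The redirection target, with the slash missing after the domain, yields
# this host once parsed:
bad_host = "www.outremers360.comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines"

# Python's idna codec rejects any label longer than 63 characters, hence
# the "label too long" UnicodeError seen above.
bad_host.encode("idna")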

Output file error

Not giving a -o output.csv file will result in a TypeError: expected str, bytes or os.PathLike object, not NoneType on line 142 of fetch.py -> output_file = open(namespace.output, 'w')

I had other issues but you broke paris.demosphere.net so I can't do any further testing.
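A possible fix, sketched below under the assumption that falling back to standard output is the desired behavior when no -o flag is given (the helper name is hypothetical):

import sys

def open_output_file(path):
    # Hypothetical helper illustrating the fix: only open a file when an
    # output path was actually provided, otherwise fall back to stdout so
    # the command remains pipeable.
    if path is None:
        return sys.stdout
    return open(path, 'w')

# e.g. instead of the line quoted above:
# output_file = open_output_file(namespace.output)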

pain to install thoughts

  • From my experience on Linux (python 3.5), python3-dev seems to be required by dragnet's install process. Not 100% sure though.

  • dragnet does not install its pip dependencies; while waiting for them to accept the PR, adding those deps to minet's requirements would help many users.

  • Finally, dragnet being a large piece to install, could it be made optional? Something like: the dragnet option catches the specific missing-module error and displays the command to install it, plus a pointer to the documentation (sketched below)?
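A sketch of that optional-dependency idea; the error message and install command below are illustrative assumptions, not minet's actual behavior:

# Import dragnet lazily and surface a helpful message if it is missing.
# The wording and the install command are only illustrative.
try:
    from dragnet import extract_content
except ImportError:
    raise ImportError(
        "The extraction feature requires the optional dragnet dependency. "
        "Install it with `pip install dragnet` and see the documentation "
        "for details."
    )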

Add adaptive throttling strategies

Using concepts such as exponential backoff. It might be a bit fun/tricky to implement in a multithreaded fashion while keeping memory consumption limited.
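As a starting point, here is a minimal single-threaded sketch of the exponential backoff idea; the function and parameter names are illustrative and do not belong to minet's actual API:

import random
import time

def backoff_delay(consecutive_failures, base=1.0, cap=60.0):
    # Wait exponentially longer after each consecutive failure, with a
    # random jitter and a hard cap on the delay.
    delay = min(cap, base * (2 ** consecutive_failures))
    return delay * random.uniform(0.5, 1.0)

def call_with_backoff(do_request, max_retries=5):
    # Retry a callable, sleeping with exponential backoff between attempts.
    failures = 0
    while True:
        try:
            return do_request()
        except IOError:
            failures += 1
            if failures > max_retries:
                raise
            time.sleep(backoff_delay(failures))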
