
minet's Introduction


Minet

minet is a webmining command line tool & library for python (>= 3.7) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud etc.

It adopts a very simple approach to various webmining problems by letting you perform a wide array of tasks from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet also exposes its high-level programmatic interface as a python library, so you remain free to use its utilities wherever they better suit your use cases.
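For instance, here is a minimal sketch of resolving a batch of urls programmatically. It assumes that multithreaded_resolve can be imported from the top-level minet package and accepts the threads, throttle and max_redirects keyword arguments, as in the user script quoted in one of the issues reproduced further down this page; check the Python library documentation for the exact API.

# Minimal sketch, not necessarily the canonical API: multithreaded_resolve
# and its keyword arguments are assumed from the usage shown in the issue
# traceback further down this page.
from minet import multithreaded_resolve

urls = [
    "https://www.lemonde.fr",
    "https://medialab.sciencespo.fr",
]

# Each yielded result is expected to carry the original url, a potential
# error and the stack of redirections that was followed.
for result in multithreaded_resolve(urls, threads=10, throttle=0.2, max_redirects=5):
    print(result)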

minet is developed by research engineers at Sciences Po's médialab and is the consolidation of more than a decade of webmining practices targeted at social sciences.

As such, it has been designed to be:

  1. low-tech, as it requires minimal resources such as memory, CPUs or hard drive space and should be able to work on any low-cost PC.
  2. fault-tolerant, as it is able to recover when the network is bad and retry HTTP calls when suitable. What's more, most minet commands can be resumed if aborted and are designed to run for a long time (think days or months) without leaking memory.
  3. unix-compliant, as it can be piped easily and knows how to work with the usual streams.

Shortcuts: Command line documentation, Python library documentation.


How to cite?

minet is published on Zenodo under DOI 10.5281/zenodo.4564399.

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, Amélie Pellé, Laura Miguel, César Pichon, & Kelly Christensen. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Whirlwind tour

# Downloading a large amount of urls as fast as possible
minet fetch url -i urls.csv > report.csv

# Extracting raw text from the downloaded HTML files
minet extract -i report.csv -I downloaded > extracted.csv

# Scraping the urls found in the downloaded HTML files
minet scrape urls -i report.csv -I downloaded > scraped_urls.csv

# Parsing & normalizing the scraped urls
minet url-parse scraped_url -i scraped_urls.csv > parsed_urls.csv

# Scraping data from Twitter
minet twitter scrape tweets "from:medialab_ScPo" > tweets.csv

# Printing a command's help
minet twitter scrape -h

# Searching videos on YouTube
minet youtube search -k "MY-YT-API-KEY" "médialab" > videos.csv

Summary

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, with Facebook- and YouTube-specific stuff)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search:
  • Scrape (without requiring special access, often just a user account):
  • Grab & dump cookies from your browser
  • Dump Hyphe data

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages.
  • URL-related heuristics utilities such as extraction, normalization and matching (see the sketch after this list).
  • Data collection from various APIs such as CrowdTangle.
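As a sketch of what those URL heuristics look like from Python, the snippet below relies on ural, the médialab URL-handling library that minet builds upon; the functions shown (urls_from_text, normalize_url) are assumptions about ural's API rather than something documented on this page.

# Illustrative sketch using ural, the URL utility library minet relies on.
# Assumption: ural exposes urls_from_text and normalize_url as shown here.
from ural import normalize_url, urls_from_text

text = "Read https://www.lemonde.fr/article.html?utm_source=twitter#comments please"

for url in urls_from_text(text):
    # normalize_url is expected to strip tracking params, fragments, etc.
    print(url, "->", normalize_url(url))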

Installation

minet can be installed as a standalone CLI tool (currently only on macOS >= 10.14, Ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of an HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On Ubuntu & similar you might need to install curl and unzip before running the installation script if you don't already have them:

sudo apt-get install curl unzip

Otherwise, minet can be installed directly as a python CLI tool and library using pip:

pip install minet

Finally, if you want to install the standalone binaries yourself (even for Windows), you can find them attached to each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

minet's People

Contributors

16arpi, ameliepelle, bmaz, boogheta, camillechanial, d3scmps, davidlibeau, elanhermi, farjasju, fyunusa, heloisethero, jacomyma, kat-kel, kianmeng, mdamien, miguellaura, paubre, paulgirard, yomguithereal


minet's Issues

weird idna encoding error when resolving some urls

(note: It might be more appropriate to move this issue to quenouille)

When trying to resolve the url http://www.outremersbeyou.com/talent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/, we end up with the surprising stacktrace below.

(Which is weird, as this url is indeed a redirection, but to a normally encoded, albeit malformed, url: Location: http://www.outremers360.comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines/)

I guess that, because of the missing slash after the domain, it considers "comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines" to be the TLD and, since there are dashes inside, tries to interpret it as punycode...

Traceback (most recent call last):
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/encodings/idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long
 
The above exception was the direct cause of the following exception:
 
Traceback (most recent call last):
  File "bin/complete_links_resolving_v2.py", line 100, in <module>
    resolve()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "bin/complete_links_resolving_v2.py", line 57, in resolve
    for res in multithreaded_resolve(urls_to_clear, threads=50, throttle=0.5, max_redirects=15):
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 353, in output
    raise e.with_traceback(trace)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/quenouille/imap.py", line 303, in worker
    result = func(data)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/fetch.py", line 237, in worker
    error, stack = resolve(http, url, max=max_redirects, **kwargs)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 187, in resolve
    headers_only=True
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/minet/utils.py", line 167, in request
    redirect=redirect
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 68, in request
    **urlopen_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/request.py", line 89, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/poolmanager.py", line 326, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 603, in urlopen
    chunked=chunked)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connectionpool.py", line 355, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1254, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1300, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1249, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 1036, in _send_output
    self.send(msg)
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/http/client.py", line 974, in send
    self.connect()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 183, in connect
    conn = self._new_conn()
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/home/boo/.pyenv/versions/quenouille/lib/python3.6/site-packages/urllib3/util/connection.py", line 57, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/home/boo/.pyenv/versions/3.6.9/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
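The error can be reproduced in isolation. A minimal sketch, assuming (consistently with the stacktrace above) that the idna codec rejects the host because its last label exceeds 63 characters:

# The redirection target, with the slash missing after the domain, yields
# this host once parsed:
bad_host = "www.outremers360.comtalent-de-la-semaine-la-designer-comorienne-aisha-wadaane-je-suis-fiere-de-mes-origines"

# Python's idna codec rejects any label longer than 63 characters, hence
# the "label too long" UnicodeError seen above.
bad_host.encode("idna")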

Output file error

Not giving a -o output.csv file will result in a TypeError: expected str, bytes or os.PathLike object, not NoneType on line 142 of fetch.py -> output_file = open(namespace.output, 'w')

I had other issues but you broke paris.demosphere.net so I can't do any further testing.
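A possible fix, sketched below under the assumption that falling back to standard output is the desired behavior when no -o flag is given (the helper name is hypothetical):

import sys

def open_output_file(path):
    # Hypothetical helper illustrating the fix: only open a file when an
    # output path was actually provided, otherwise fall back to stdout so
    # the command remains pipeable.
    if path is None:
        return sys.stdout
    return open(path, 'w')

# e.g. instead of the line quoted above:
# output_file = open_output_file(namespace.output)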

pain to install thoughts

  • From my experience on Linux (python 3.5), python3-dev seems to be required by dragnet's install process. Not 100% sure though.

  • dragnet does not install its pip dependencies; while waiting for them to accept the PR, adding those deps to minet's requirements would help many users.

  • Finally, dragnet being a large piece to install, could it be made optional? Something like: the dragnet option catches the specific missing-module error and displays the command to install it, plus a pointer to the documentation (sketched below)?
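A sketch of that optional-dependency idea; the error message and install command below are illustrative assumptions, not minet's actual behavior:

# Import dragnet lazily and surface a helpful message if it is missing.
# The wording and the install command are only illustrative.
try:
    from dragnet import extract_content
except ImportError:
    raise ImportError(
        "The extraction feature requires the optional dragnet dependency. "
        "Install it with `pip install dragnet` and see the documentation "
        "for details."
    )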

Add adaptive throttling strategies

Using concepts such as exponential backoff. It might be a bit fun/tricky to implement in a multithreaded fashion while keeping memory consumption limited.
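As a starting point, here is a minimal single-threaded sketch of the exponential backoff idea; the function and parameter names are illustrative and do not belong to minet's actual API:

import random
import time

def backoff_delay(consecutive_failures, base=1.0, cap=60.0):
    # Wait exponentially longer after each consecutive failure, with a
    # random jitter and a hard cap on the delay.
    delay = min(cap, base * (2 ** consecutive_failures))
    return delay * random.uniform(0.5, 1.0)

def call_with_backoff(do_request, max_retries=5):
    # Retry a callable, sleeping with exponential backoff between attempts.
    failures = 0
    while True:
        try:
            return do_request()
        except IOError:
            failures += 1
            if failures > max_retries:
                raise
            time.sleep(backoff_delay(failures))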
