wtf_wikipedia's Introduction

wikipedia markup parser
by Spencer Kelly and contributors

wtf_wikipedia turns wikipedia's markup language into JSON,
so getting data from wikipedia is easier.

🏠 Try to have a good time. 🛀

seriously,
this is among the most-curious data formats you can find.
(then we buried our human-record in it)

Consider:

wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.

It will try its best, and fail in reasonable ways.

building your own parser is never a good idea
but this library aims to be a straight-forward way to get data out of wikipedia
... so don't be mad at me, be mad at this.

well ok then,

npm install wtf_wikipedia

var wtf = require('wtf_wikipedia');

wtf.fetch('Whistling').then(doc => {

  doc.categories();
  //['Oral communication', 'Vocal music', 'Vocal skills']

  doc.sections('As communication').text();
  // 'A traditional whistled language named Silbo Gomero..'

  doc.images(0).thumb();
  // 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'

  doc.sections('See Also').links().map(link => link.page)
  //['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});

on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  //(follows redirect)
  wtf.fetch('On a Friday', 'en', function(err, doc) {
    var val = doc.infobox(0).get('current_members');
    val.links().map(link => link.page);
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  });
</script>

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parses infoboxes into a formatted key-value object
  • Handles recursive templates and links, like [[.. [[...]] ]]
  • Resolves plaintext and links per sentence
  • Parses and formats internal links
  • Creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolves {{CURRENTMONTH}} and {{CONVERT ..}} type templates
  • Parses images, headings, and categories
  • Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • Parses citation metadata
  • Eliminates xml, latex, css, and table-sorting cruft
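
To make the DMS bullet concrete, here is a rough sketch of that kind of conversion. `dmsToDecimal` is a made-up helper for illustration, not part of the library's API, and it assumes the exact degrees°minutes'seconds"hemisphere shape shown above:

```javascript
// Hypothetical helper, not wtf_wikipedia's own code.
// Converts a DMS string such as 59°12'7.7"N into decimal degrees.
function dmsToDecimal(dms) {
  const match = /^(\d+)°(\d+)'([\d.]+)"([NSEW])$/.exec(dms.trim());
  if (!match) {
    return null; // not in the expected shape
  }
  const [, deg, min, sec, hemisphere] = match;
  const value = Number(deg) + Number(min) / 60 + Number(sec) / 3600;
  // south and west are negative, by convention
  return hemisphere === 'S' || hemisphere === 'W' ? -value : value;
}

console.log(dmsToDecimal(`59°12'7.7"N`).toFixed(4)); // 59.2021
```

Real-world coordinate templates come in many more shapes than this, which is a good part of why the library exists.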

But what about...

Parsoid:

Wikimedia's Parsoid javascript parser is the official wikiscript parser, and is pretty cool. It reliably turns wikiscript into HTML, but not valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data this way (say, sentences or infobox values), is still a complex + weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.

Full data-dumps:

wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole wikipedia dump on a laptop in a couple hours. It's definitely the way to go, instead of fetching many pages off the api.

API

const wtf = require('wtf_wikipedia')
//parse a page
var doc = wtf(wikiText, [options])

//fetch & parse a page - wtf.fetch(title, [lang_or_wikiid], [options], [callback])
(async () => {
  var doc = await wtf.fetch('Toronto');
  console.log(doc.text())
})();

//(callback format works too)
wtf.fetch(64646, 'en', (err, doc) => {
  console.log(doc.categories());
});

//get a random german page
wtf.random('de').then(doc => {
  console.log(doc.text())
});

Main parts:

Document            - the whole thing
  - Category
  - Coordinate

  Section           - page headings ( ==these== )
    - Infobox       - a main, key-value template
    - Table         -
    - Reference     - citations, all-forms
    - Template      - any other structured-data

    Paragraph       - content separated by two newlines
      - Image       -
      - List        - a series of bullet-points

      Sentence      - contains links, formatting, dates

For the most-part, these classes do the looping-around for you, so that Document.links() will go through every section, paragraph, and sentence, to get their links.
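
A minimal sketch of that looping-around idea, assuming stripped-down stand-in classes. The names mirror the outline above, but these are hypothetical and the library's real internals may look quite different:

```javascript
// Hypothetical, simplified classes (not the library's actual source).
class Sentence {
  constructor(links) {
    this._links = links; // e.g. [{ page: 'Hand flute' }]
  }
  links() {
    return this._links;
  }
}

class Paragraph {
  constructor(sentences) {
    this._sentences = sentences;
  }
  links() {
    // gather links from every sentence in this paragraph
    return this._sentences.flatMap(sentence => sentence.links());
  }
}

class Section {
  constructor(paragraphs) {
    this._paragraphs = paragraphs;
  }
  links() {
    return this._paragraphs.flatMap(paragraph => paragraph.links());
  }
}

class Document {
  constructor(sections) {
    this._sections = sections;
  }
  links() {
    // each level just delegates to the level below it
    return this._sections.flatMap(section => section.links());
  }
}

const doc = new Document([
  new Section([new Paragraph([new Sentence([{ page: 'Slide whistle' }]), new Sentence([])])]),
  new Section([new Paragraph([new Sentence([{ page: 'Hand flute' }, { page: 'Bird vocalization' }])])])
]);
console.log(doc.links().map(link => link.page));
// [ 'Slide whistle', 'Hand flute', 'Bird vocalization' ]
```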

Broadly speaking, you can ask for the data you'd like:

  • .sections()       -   ==these things==
  • .sentences()
  • .paragraphs()
  • .links()
  • .tables()
  • .lists()
  • .images()
  • .templates()     -  {{these|things}}
  • .categories()
  • .citations()     -   <ref>these guys</ref>
  • .infoboxes()
  • .coordinates()

or output things in various formats:

outputs:

  • .json()   -     handy, workable data
  • .text()   -     reader-focused plaintext
  • .html()
  • .markdown()
  • .latex()   -     (ftw)

fancy-times:

  • .isRedirect()     -   boolean
  • .isDisambiguation()     -   boolean
  • .title()       -      guess the title of this page
  • .redirectsTo()     -   {page:'China', anchor:'#History'}
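
For a feel of where a `{page, anchor}` result like that can come from, here is a sketch that pulls the target out of redirect wikitext. `parseRedirect` is a hypothetical helper, and it assumes only the English `#REDIRECT [[Target#Anchor]]` form, while real redirects have per-language keywords too:

```javascript
// Hypothetical helper, not the library's implementation.
// Assumes the English '#REDIRECT [[Page#Anchor]]' form only.
function parseRedirect(wikitext) {
  const pattern = /^\s*#redirect\s*\[\[([^\]|#]+)(#[^\]|]*)?(?:\|[^\]]*)?\]\]/i;
  const match = pattern.exec(wikitext);
  if (!match) {
    return null; // not a redirect page
  }
  const result = { page: match[1].trim() };
  if (match[2]) {
    result.anchor = match[2]; // keeps the leading '#'
  }
  return result;
}

console.log(parseRedirect('#REDIRECT [[China#History]]'));
// { page: 'China', anchor: '#History' }
```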

Examples

wtf(wikiText)

flip your wikimedia markup into a Document object

import wtf from 'wtf_wikipedia'
wtf(`==In Popular Culture==
* harry potter's wand
* the simpsons fence`);
// Document {text(), html(), lists()...}

wtf.fetch(title, [lang_or_wikiid], [options], [callback])

retrieves raw contents of a mediawiki article from the wikipedia action API.

This method supports the errback callback form, or returns a Promise if one is missing.
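
If the two calling styles are unfamiliar, this is roughly what an errback-or-Promise function looks like. Everything here is a stand-in: `loadPage` fakes the network request and `fetchPage` is a hypothetical wrapper, not the library's own code:

```javascript
// Hypothetical stand-in for a network request; resolves with fake wikitext.
function loadPage(title) {
  if (typeof title !== 'string' || title === '') {
    return Promise.reject(new Error('missing title'));
  }
  return Promise.resolve(`'''${title}''' is a page.`);
}

// Returns a Promise when no callback is given,
// otherwise reports through callback(err, result).
function fetchPage(title, callback) {
  const promise = loadPage(title);
  if (typeof callback !== 'function') {
    return promise;
  }
  promise.then(
    text => callback(null, text),
    err => callback(err, null)
  );
  return undefined;
}

// Promise form
fetchPage('Toronto').then(text => console.log(text));

// errback form
fetchPage('Toronto', (err, text) => {
  if (err) {
    return console.error(err.message);
  }
  console.log(text);
});
```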

to call non-english wikipedia apis, add its language-name as the second parameter

wtf.fetch('Toronto', 'de', function(err, doc) {
  doc.text();
  //Toronto ist mit 2,6 Millionen Einwohnern..
});

you may also pass the wikipedia page id as parameter instead of the page title:

wtf.fetch(64646, 'de').then(console.log).catch(console.log)

the fetch method follows redirects.

the optional-callback pattern is the same for wtf.random()

wtf.random(lang, options, callback)
wtf.random(lang, options).then(doc=>doc.infobox())

wtf.category(title, [lang_or_wikiid], [options], [callback])

retrieves all pages and sub-categories belonging to a given category:

let result = await wtf.category('Category:Politicians_from_Paris');
//{
//  pages: [{title: 'Paul Bacon', pageid: 1266127 }, ...],
//  categories: [ {title: 'Category:Mayors of Paris' } ]
//}

//this format works too
wtf.category('National Basketball Association teams', 'en', (err, result)=>{
  //
});

doc.text()

returns only nice plain-text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).text();
//"Boston's baseball field has a 37ft wall."

Section traversal:

wtf(page).sections(1).children()
wtf(page).sections('see also').remove()
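
One way to think about parent and child sections is by heading level: `===this===` sits under the nearest `==that==` before it. The sketch below works on that assumption with two hypothetical helpers, `parseHeadings` and `childrenOf`. It only returns direct children, which may or may not match what the library's `children()` gives you:

```javascript
// Hypothetical helpers, not the library's implementation.
// Collects headings and their level (the number of '=' signs).
function parseHeadings(wikitext) {
  const headings = [];
  const pattern = /^(={2,6})\s*(.+?)\s*\1\s*$/gm;
  let match;
  while ((match = pattern.exec(wikitext)) !== null) {
    headings.push({ title: match[2], level: match[1].length });
  }
  return headings;
}

// Direct children: the headings one level deeper,
// up until the next heading at the parent's level or shallower.
function childrenOf(headings, index) {
  const parent = headings[index];
  const children = [];
  for (let i = index + 1; i < headings.length; i++) {
    if (headings[i].level <= parent.level) {
      break; // left the parent's subtree
    }
    if (headings[i].level === parent.level + 1) {
      children.push(headings[i]);
    }
  }
  return children;
}

const headings = parseHeadings(
  '==History==\n===Early===\nsome text\n===Modern===\n====Recent====\n==See also==\n'
);
console.log(childrenOf(headings, 0).map(h => h.title));
// [ 'Early', 'Modern' ]
```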

Sentence data:

const s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()
s.dates() //structured date templates

Images

const img = wtf(page).images(0)
img.url()     // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format()  // jpg, png, ..
img.exists()  // HEAD req to see if the file is alive
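
If you ever need a thumbnail url without the library, one option is MediaWiki's Special:Redirect/file page, which accepts a width parameter. This is a sketch under that assumption. `thumbnailUrl` is a hypothetical helper, the commons.wikimedia.org host is an assumption, and the library may well build its urls a different way:

```javascript
// Hypothetical helper, not the library's implementation.
// Builds a thumbnail url through MediaWiki's Special:Redirect/file page.
function thumbnailUrl(fileTitle, width = 300) {
  const name = fileTitle
    .replace(/^(File|Image):/i, '') // drop the namespace prefix
    .trim()
    .replace(/ /g, '_'); // wiki titles use underscores for spaces
  const host = 'https://commons.wikimedia.org'; // assumed host
  return `${host}/wiki/Special:Redirect/file/${encodeURIComponent(name)}?width=${width}`;
}

console.log(thumbnailUrl('File:Duveneck Whistling Boy.jpg'));
// https://commons.wikimedia.org/wiki/Special:Redirect/file/Duveneck_Whistling_Boy.jpg?width=300
```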

CLI

if you're scripting this from the shell, or from another language, install with a -g, and then run:

$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

  • pass an Api-User-Agent, so they can easily identify and throttle bad scripts
  • bundle multiple pages into one request, as an array
  • run it serially, or at least, slowly.

wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
  'Api-User-Agent': '[email protected]'
}).then((docList) => {
  let allLinks = docList.map(doc => doc.links());
  console.log(allLinks);
});
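
The 'serially, or at least, slowly' part can be sketched like this. `fetchSerially` and `sleep` are hypothetical helpers, and the batch size and delay are arbitrary numbers rather than anything the wikipedia api prescribes:

```javascript
// Hypothetical helpers for paced, one-at-a-time batches.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// fetchBatch is any function that takes an array of titles
// and returns a Promise of an array of results.
async function fetchSerially(titles, fetchBatch, batchSize = 2, delayMs = 500) {
  const results = [];
  for (let i = 0; i < titles.length; i += batchSize) {
    const batch = titles.slice(i, i + batchSize);
    // wait for this batch to finish before starting the next one
    results.push(...(await fetchBatch(batch)));
    if (i + batchSize < titles.length) {
      await sleep(delayMs); // pause between requests
    }
  }
  return results;
}
```

With the library, `fetchBatch` could be something like `batch => wtf.fetch(batch, 'en', options)`, using the array form shown above.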

Contributing

Join in! - projects like these are only done with many-hands, and we try to be friendly and easy. PRs always welcome.

Some Big Wins:

  1. Supporting more templates - This is actually kinda fun.
  2. Adding more tests - you won't believe how helpful this is.
  3. Make a cool thing. Holler it at spencer.

if it's a big change, make an issue and talk-it-over first.

Otherwise, go nuts!

See also:

Thank you to the cross-fetch library.

MIT
