wikipedia markup parser

by Spencer Kelly and contributors

wtf_wikipedia turns wikipedia's markup language into JSON,

so getting data from wikipedia is easier.

🏠 Try to have a good time. 🛀

seriously,

this is among the most-curious data formats you can find.

(then we buried our human-record in it)

Consider:

the egyptian hieroglyphics syntax
'Birth_date_and_age' vs 'Birth-date_and_age'.
the partial-implementation of inline-css,
deep recursion of similar-syntax templates,
the unexplained hashing scheme for image paths,
the custom encoding of whitespace and punctuation,
right-to-left values in left-to-right templates.
as of Nov-2018, there are 634,755 templates in wikipedia

wtf_wikipedia supports many recursive shenanigans, deprecated and obscure template variants, and illicit 'wiki-esque' shorthands.

It will try its best, and fail in reasonable ways.

building your own parser is never a good idea

but this library aims to be a straightforward way to get data out of wikipedia

... so don't be mad at me,

be mad at this.

Demo • Tutorial • Api

well ok then,

npm install wtf_wikipedia

var wtf = require('wtf_wikipedia');

wtf.fetch('Whistling').then(doc => {

  doc.categories();
  //['Oral communication', 'Vocal music', 'Vocal skills']

  doc.sections('As communication').text();
  // 'A traditional whistled language named Silbo Gomero..'

  doc.images(0).thumb();
  // 'https://upload.wikimedia.org..../300px-Duveneck_Whistling_Boy.jpg'

  doc.sections('See Also').links().map(link => link.page)
  //['Slide whistle', 'Hand flute', 'Bird vocalization'...]
});

on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  //(follows redirect)
  wtf.fetch('On a Friday', 'en', function(err, doc) {
    var val = doc.infobox(0).get('current_members');
    val.links().map(link => link.page);
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  });
</script>

What it does:

Detects and parses redirects and disambiguation pages
Parse infoboxes into a formatted key-value object
Handles recursive templates and links- like [[.. [[...]] ]]
Per-sentence plaintext and link resolution
Parse and format internal links
creates image thumbnail urls from File:XYZ.png filenames
Properly resolve {{CURRENTMONTH}} and {{CONVERT ..}} type templates
Parse images, headings, and categories
converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
parses citation metadata
Eliminate xml, latex, css, and table-sorting cruft
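
To make the geo-coordinate item above concrete, here is a rough sketch of a DMS-to-decimal conversion. It is not the library's own code; `dmsToDecimal` is a made-up helper name, and it only handles the simple `degrees°minutes'seconds"direction` shape.

```javascript
// Toy converter: degrees/minutes/seconds string -> signed decimal degrees.
// Hypothetical helper, not part of the wtf_wikipedia API.
function dmsToDecimal(dms) {
  const match = dms.match(
    /(\d+(?:\.\d+)?)°\s*(?:(\d+(?:\.\d+)?)['′]\s*)?(?:(\d+(?:\.\d+)?)["″]\s*)?([NSEW])/i
  );
  if (!match) {
    return null;
  }
  const degrees = parseFloat(match[1]);
  const minutes = parseFloat(match[2] || '0');
  const seconds = parseFloat(match[3] || '0');
  const decimal = degrees + minutes / 60 + seconds / 3600;
  // south and west are negative by convention
  return /[SW]/i.test(match[4]) ? -decimal : decimal;
}

console.log(dmsToDecimal(`59°12'7.7"N`)); // roughly 59.2021
```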

But what about...

Parsoid:

Wikimedia's Parsoid javascript parser is the official wikiscript parser, and is pretty cool. It reliably turns wikiscript into HTML, but not valid XML.

To use it for data-mining, you'll need to:

parsoid(wikiText) -> [headless/pretend-DOM] -> screen-scraping

which is fine,

but getting structured data this way (say, sentences or infobox values), is still a complex + weird process. Arguably, you're not any closer than you were with wikitext. This library has lovingly ❤️ borrowed a lot of code and data from the parsoid project, and thanks its contributors.

Full data-dumps:

wtf_wikipedia was built to work with dumpster-dive, which lets you parse a whole wikipedia dump on a laptop in a couple hours. It's definitely the way to go, instead of fetching many pages off the api.

API

const wtf = require('wtf_wikipedia')
//parse a page
var doc = wtf(wikiText, [options])

//fetch & parse a page - wtf.fetch(title, [lang_or_wikiid], [options], [callback])
(async () => {
  var doc = await wtf.fetch('Toronto');
  console.log(doc.text())
})();

//(callback format works too)
wtf.fetch(64646, 'en', (err, doc) => {
  console.log(doc.categories());
});

//get a random german page
wtf.random('de').then(doc => {
  console.log(doc.text())
});

Full API

Main parts:

Document            - the whole thing
  - Category
  - Coordinate

  Section           - page headings ( ==these== )
    - Infobox       - a main, key-value template
    - Table         -
    - Reference     - citations, all-forms
    - Template      - any other structured-data

    Paragraph       - content separated by two newlines
      - Image       -
      - List        - a series of bullet-points

      Sentence      - contains links, formatting, dates

For the most part, these classes do the looping-around for you, so that Document.links() will go through every section, paragraph, and sentence to collect their links.
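
If that delegation is hard to picture, here is a tiny sketch of the idea. The `Toy*` classes are stand-ins invented for this example and say nothing about how the real classes are written inside the library.

```javascript
// Simplified stand-ins, to show the "parents ask their children" idea only.
class ToySentence {
  constructor(links) {
    this._links = links;
  }
  links() {
    return this._links;
  }
}

// A container concatenates whatever its children return.
class ToyContainer {
  constructor(children) {
    this.children = children;
  }
  links() {
    return this.children.flatMap(child => child.links());
  }
}

class ToyParagraph extends ToyContainer {}
class ToySection extends ToyContainer {}
class ToyDocument extends ToyContainer {}

const toyDoc = new ToyDocument([
  new ToySection([
    new ToyParagraph([new ToySentence(['Toronto']), new ToySentence(['Ontario', 'Canada'])])
  ]),
  new ToySection([new ToyParagraph([new ToySentence(['Lake Ontario'])])])
]);

console.log(toyDoc.links());
// [ 'Toronto', 'Ontario', 'Canada', 'Lake Ontario' ]
```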

Broadly speaking, you can ask for the data you'd like:

.sections() - ==these things==
.sentences()
.paragraphs()
.links()
.tables()
.lists()
.images()
.templates() - {{these|things}}
.categories()
.citations() - <ref>these guys</ref>
.infoboxes()
.coordinates()

or output things in various formats:

outputs:

.json() - handy, workable data
.text() - reader-focused plaintext
.html()
.markdown()
.latex() - (ftw)

fancy-times:

.isRedirect() - boolean
.isDisambiguation() - boolean
.title() - guess the title of this page
.redirectsTo() - {page:'China', anchor:'#History'}
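
For a feel of what redirect detection involves, here is a minimal sketch that produces the same `{page, anchor}` shape as above. `parseRedirect` is a hypothetical helper, and it only knows the english `#REDIRECT` keyword, while other wikis use localized ones.

```javascript
// Toy redirect parser: returns {page, anchor} or null.
// Hypothetical helper; ignores localized redirect keywords.
function parseRedirect(wikiText) {
  const match = wikiText.match(/^\s*#redirect\s*:?\s*\[\[([^\]|#]+)(#[^\]|]*)?/i);
  if (!match) {
    return null;
  }
  const result = { page: match[1].trim() };
  if (match[2]) {
    result.anchor = match[2];
  }
  return result;
}

console.log(parseRedirect('#REDIRECT [[China#History]]'));
// { page: 'China', anchor: '#History' }
console.log(parseRedirect('Just a normal article.'));
// null
```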

Examples

wtf(wikiText)

flip your wikimedia markup into a Document object

import wtf from 'wtf_wikipedia'
wtf(`==In Popular Culture==
* harry potter's wand
* the simpsons fence`);
// Document {text(), html(), lists()...}

wtf.fetch(title, [lang_or_wikiid], [options], [callback])

retrieves raw contents of a mediawiki article from the wikipedia action API.

This method supports the errback callback form, or returns a Promise if one is missing.

to call non-english wikipedia apis, add the language code as the second parameter

wtf.fetch('Toronto', 'de', function(err, doc) {
  doc.text();
  //Toronto ist mit 2,6 Millionen Einwohnern..
});

you may also pass the wikipedia page id as parameter instead of the page title:

wtf.fetch(64646, 'de').then(console.log).catch(console.log)

the fetch method follows redirects.

the optional-callback pattern is the same for wtf.random()

wtf.random(lang, options, callback)
wtf.random(lang, options).then(doc=>doc.infobox())
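
The optional-callback pattern itself can be sketched in a few lines. This is only an illustration of the shape: `fetchPage` and `lookup` are invented names, and `lookup` fakes the network request.

```javascript
// Hypothetical wrapper showing "callback if given, Promise otherwise".
// lookup() stands in for the real network request.
function lookup(title) {
  return Promise.resolve({ title: title });
}

function fetchPage(title, callback) {
  const promise = lookup(title);
  if (typeof callback === 'function') {
    // errback form: first argument is the error, second is the result
    promise.then(doc => callback(null, doc), err => callback(err, null));
    return undefined;
  }
  return promise;
}

fetchPage('Toronto', (err, doc) => console.log('callback:', doc.title));
fetchPage('Toronto').then(doc => console.log('promise:', doc.title));
```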

wtf.category(title, [lang_or_wikiid], [options], [callback])

retrieves all pages and sub-categories belonging to a given category:

let result = await wtf.category('Category:Politicians_from_Paris');
//{
//  pages: [{title: 'Paul Bacon', pageid: 1266127 }, ...],
//  categories: [ {title: 'Category:Mayors of Paris' } ]
//}

//this format works too
wtf.category('National Basketball Association teams', 'en', (err, result)=>{
  //
});

doc.text()

returns only nice plain-text of the article

var wiki =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>";
var text = wtf(wiki).text();
//"Boston's baseball field has a 37ft wall."
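
To see why this is harder than it looks, here is a deliberately naive sketch that only flattens links. `flattenLinks` is a made-up helper; it ignores nesting, templates like {{convert}}, and <ref> tags, all of which the real parser has to deal with.

```javascript
// Toy link-flattener: keeps the label of [[Page|label]], or the page name of [[Page]].
// Hypothetical helper, not how the library does it.
function flattenLinks(wikiText) {
  return wikiText.replace(
    /\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]/g,
    (whole, page, label) => label || page
  );
}

console.log(flattenLinks("[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] is in [[Boston]]."));
// "Boston's baseball field is in Boston."
```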

Section traversal:

wtf(page).sections(1).children()
wtf(page).sections('see also').remove()

Sentence data:

s = wtf(page).sentences(4)
s.links()
s.bolds()
s.italics()
s.dates() //structured date templates

Images

img = wtf(page).images(0)
img.url()     // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format()  // jpg, png, ..
img.exists()  // HEAD req to see if the file is alive
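
For the curious, the image-path hashing mentioned earlier looks roughly like this on Wikimedia's upload servers, as far as I understand it. This is a sketch under assumptions: `imageUrls` is a hypothetical helper, it assumes the file lives on Commons, and it skips special cases (svg thumbnails, for one, get an extra .png suffix). The library may build its urls differently.

```javascript
const crypto = require('crypto');

// Sketch of the upload.wikimedia.org path layout, assuming a file hosted on Commons.
// Hypothetical helper, not part of the wtf_wikipedia API.
function imageUrls(fileName, width = 300) {
  // strip the namespace, use underscores for spaces, upper-case the first letter
  let name = fileName.replace(/^(File|Image):/i, '').trim().replace(/ /g, '_');
  name = name.charAt(0).toUpperCase() + name.slice(1);
  // the directory comes from the MD5 hex digest of the normalized name
  const hash = crypto.createHash('md5').update(name, 'utf8').digest('hex');
  const dir = hash.slice(0, 1) + '/' + hash.slice(0, 2);
  const base = 'https://upload.wikimedia.org/wikipedia/commons';
  const encoded = encodeURIComponent(name);
  return {
    url: `${base}/${dir}/${encoded}`,
    thumb: `${base}/thumb/${dir}/${encoded}/${width}px-${encoded}`
  };
}

console.log(imageUrls('File:Duveneck Whistling Boy.jpg').thumb);
```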

CLI

if you're scripting this from the shell, or from another language, install it globally (npm install -g wtf_wikipedia), and then run:

$ wtf_wikipedia George Clooney --plaintext
# George Timothy Clooney (born May 6, 1961) is an American actor ...

$ wtf_wikipedia Toronto Blue Jays --json
# {text:[...], infobox:{}, categories:[...], images:[] }

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily:

pass an Api-User-Agent, so they can easily throttle bad scripts
bundle multiple pages into one request as an array
run it serially, or at least, slowly.

wtf.fetch(['Royal Cinema', 'Aldous Huxley'], 'en', {
  'Api-User-Agent': '[email protected]'
}).then((docList) => {
  let allLinks = docList.map(doc => doc.links());
  console.log(allLinks);
});
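
The second and third recommendations can be combined into one loop. Here is a rough sketch: `chunk`, `pause`, and `fetchPolitely` are invented helper names, and the batch size of 5 and one-second delay are arbitrary assumptions, not official limits.

```javascript
// Hypothetical helpers: split titles into batches, then request one batch at a time.
function chunk(list, size) {
  const batches = [];
  for (let i = 0; i < list.length; i += size) {
    batches.push(list.slice(i, i + size));
  }
  return batches;
}

const pause = ms => new Promise(resolve => setTimeout(resolve, ms));

// fetchBatch is supplied by the caller, e.g. batch => wtf.fetch(batch, 'en', options)
async function fetchPolitely(titles, fetchBatch, batchSize = 5, delayMs = 1000) {
  const docs = [];
  for (const batch of chunk(titles, batchSize)) {
    const results = await fetchBatch(batch); // only one request in flight
    docs.push(...results);
    await pause(delayMs); // breathe between requests
  }
  return docs;
}
```

Passing the fetcher in as an argument keeps the loop easy to test with a fake, and lets you swap in whatever options you already use.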

Contributing

Join in! - projects like these are only done with many-hands, and we try to be friendly and easy. PRs always welcome.

Some Big Wins:

Supporting more templates - This is actually kinda fun.
Adding more tests - you won't believe how helpful this is.
Make a cool thing. Holler it at spencer.

if it's a big change, make an issue and talk-it-over first.

Otherwise, go nuts!
