
wtf_wikipedia's Introduction

wtf_wikipedia
parse data from wikipedia
npm install wtf_wikipedia

it is very, very hard. we're not joking.
why do we always do this?
we put our information where we can't take it out.

import wtf from 'wtf_wikipedia'

let doc = await wtf.fetch('Toronto Raptors')
let coach = doc.infobox().get('coach')
coach.text() //'Darko Rajaković'

.text()

get clean plaintext:

let str = `[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall. <ref>Field of our Fathers: By Richard Johnson</ref>`
wtf(str).text()
// "Boston's baseball field has a 37ft wall."
let doc = await wtf.fetch('Glastonbury', 'en')
doc.sentences()[0].text()
// 'Glastonbury is a town and civil parish in Somerset, England, situated at a dry point ...'

.json()

get all the data from a page:

let doc = await wtf.fetch('Whistling')

doc.json()
// { categories: ['Oral communication', 'Vocal skills'], sections: [{ title: 'Techniques' }], ...}

the default .json() output is really verbose, but you can cherry-pick data by poking around like this:

// get just the links:
doc.links().map((link) => link.json())
//[{ page: 'Theatrical superstitions', text: 'supersitions' }]

// just the images:
doc.images()[0].json()
// { file: 'Image:Duveneck Whistling Boy.jpg', url: 'https://commons.wiki...' }

// json for a particular section:
doc.section('see also').links()[0].json()
// { page: 'Slide Whistle' }

run it on the client-side:

<script src="https://unpkg.com/wtf_wikipedia"></script>
<script>
  wtf.fetch('Radiohead', { 'Api-User-Agent': 'Name your script here' }, function (err, doc) {
    let members = doc.infobox().get('current members')
    members.links().map((l) => l.page())
    //['Thom Yorke', 'Jonny Greenwood', 'Colin Greenwood'...]
  })
</script>

or the server-side:

import wtf from 'wtf_wikipedia'
// or,
const wtf = require('wtf_wikipedia')

full wikipedia dumps

With this library, in conjunction with dumpster-dive, you can parse the whole English Wikipedia in an afternoon.

npm install -g dumpster-dive
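
As a rough sketch (the exact options live in the dumpster-dive docs; the file path and 'db' name here are illustrative), a run over a dump looks something like:

const dumpster = require('dumpster-dive')

// parse every article in the dump into a database named 'enwiki'
dumpster({ file: './enwiki-latest-pages-articles.xml', db: 'enwiki' }, () => {
  console.log('done!')
})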

Ok first, 🛀

Wikitext is no small thing.

Consider:

this library supports many recursive shenanigans, deprecated and obscure template variants, and illicit wiki-shorthands.

What it does:

  • Detects and parses redirects and disambiguation pages
  • Parses infoboxes into a formatted key-value object
  • Handles recursive templates and links, like [[.. [[...]] ]]
  • Resolves per-sentence plaintext and links
  • Parses and formats internal links
  • Creates image thumbnail urls from File:XYZ.png filenames
  • Properly resolves dynamic templates like {{CURRENTMONTH}} and {{CONVERT ..}}
  • Parses images, headings, and categories
  • Converts 'DMS-formatted' (59°12'7.7"N) geo-coordinates to lat/lng
  • Parses and combines citation and reference metadata
  • Eliminates xml, latex, css, and table-sorting cruft

What it doesn't do:

  • external 'transcluded' page data
  • AST output
  • smart (or 'pretty') formatting of html in infoboxes or galleries
  • maintain perfect page order
  • per-sentence references (it uses a per-section 'references' element instead)
  • maintain template or infobox css styling
  • large tables that span different sections

It is built to be as flexible as possible. In all cases, it tries to fail in considerate ways.

How about html scraping..?

Wikimedia's official parser turns wikitext ➔ HTML.

if you prefer that screen-scraping workflow, you can pluck parts of a page out of the HTML.

that's cool!

getting structured data this way is still a complex, weird process. Manually spelunking the html is sometimes just as tricky and error-prone as scanning the wikitext itself.

The contributors to this library have come to that conclusion, as many others have.

This library is grateful to the Parsoid contributors.

okay,

flip your wikitext into a Doc object

import wtf from 'wtf_wikipedia'

let txt = `
==Wood in Popular Culture==
* Harry Potter's wand
* The Simpson's fence
`
wtf(txt)
// Document {text(), json(), lists()...}

doc.links()

let txt = `Whistling is featured in a number of television shows, such as [[Lassie (1954 TV series)|''Lassie'']], and the title theme for ''[[The X-Files]]''.`
wtf(txt)
  .links()
  .map((l) => l.page())
// [ 'Lassie (1954 TV series)',  'The X-Files' ]

doc.text()

returns nice plain-text of the article

let txt =
  "[[Greater_Boston|Boston]]'s [[Fenway_Park|baseball field]] has a {{convert|37|ft}} wall.<ref>{{cite web|blah}}</ref>"
wtf(txt).text()
//"Boston's baseball field has a 37ft wall."

doc.sections()

a section is a heading ('==Like This==') and the content beneath it

wtf(page).sections()[1].children() //traverse nested sections
wtf(page).section('see also').remove() //delete one

doc.sentences()

let s = wtf(page).sentences()[4]
s.links()
s.bolds()
s.italics()
s.text()
s.wikitext()

doc.categories()

let doc = await wtf.fetch('Whistling')
doc.categories()
//['Oral communication', 'Vocal music', 'Vocal skills']

doc.images()

let img = wtf(page).images()[0]
img.url() // the full-size wikimedia-hosted url
img.thumbnail() // 300px, by default
img.format() // jpg, png, ..

Fetch

You can grab and parse articles from any wiki api. This includes any language, any wiki-project, and most 3rd-party wikis.

// 3rd-party wiki
let doc = await wtf.fetch('https://muppet.fandom.com/wiki/Miss_Piggy')

// wikipedia français
doc = await wtf.fetch('Tony Hawk', 'fr')
doc.sentences()[0].text() // 'Tony Hawk est un skateboarder professionnel et un acteur ...'

// accept an array, or wikimedia pageIDs
let docs = await wtf.fetch(['Whistling', 2983], { follow_redirects: false })

// article from german wikivoyage
wtf.fetch('Toronto', { lang: 'de', wiki: 'wikivoyage' }).then((doc) => {
  console.log(doc.sentences()[0].text()) // 'Toronto ist die Hauptstadt der Provinz Ontario'
})

you may also pass a wikipedia page id as the parameter, instead of the page title:

let doc = await wtf.fetch(64646, 'de')

the fetch method follows redirects.
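
to keep the redirect page itself, pass follow_redirects: false and use the redirect methods listed in the Full API below. A small sketch, assuming 'UK' still redirects to 'United Kingdom':

let redir = await wtf.fetch('UK', { follow_redirects: false })
redir.isRedirect() // true
redir.redirectTo() // the 'United Kingdom' page (exact shape may vary)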

API plugin

wtf.getCategoryPages(title, [options])

retrieves all pages and sub-categories belonging to a given category:

wtf.extend(require('wtf-plugin-api'))
let result = await wtf.getCategoryPages('Category:Politicians_from_Paris')
/*
[
  {"pageid":52502362,"ns":0,"title":"William Abitbol"},
  {"pageid":50101413,"ns":0,"title":"Marie-Joseph Charles des Acres de L'Aigle"},
  ...
  {"pageid":62721979,"ns":14,"title":"Category:Councillors of Paris"},
  {"pageid":856891,"ns":14,"title":"Category:Mayors of Paris"}
]
*/

wtf.random([options])

fetches a random wikipedia article, from a given language or domain

wtf.extend(require('wtf-plugin-api'))
wtf.random().then((doc) => {
  console.log(doc.title(), doc.categories())
  //'Whistling'  ['Oral communication', 'Vocal skills']
})

see wtf-plugin-api

Tutorials

Plugins

these add all sorts of new functionality:

wtf.extend(require('wtf-plugin-classify'))
(await wtf.fetch('Toronto Raptors')).classify()
// 'Organization/SportsTeam'

wtf.extend(require('wtf-plugin-summary'))
(await wtf.fetch('Pulp Fiction')).summary()
// 'a 1994 American crime film'

wtf.extend(require('wtf-plugin-person'))
(await wtf.fetch('David Bowie')).birthDate()
// {year:1947, date:8, month:1}

wtf.extend(require('wtf-plugin-i18n'))
(await wtf.fetch('Ziggy Stardust', 'fr')).infobox().json()
// {nom:{text:"Ziggy Stardust"}, oeuvre:{text:"The Rise and Fall of Ziggy Stardust"}}

Plugin
  • classify - person/place/thing
  • summary - short description text
  • person - birth/death information
  • api - fetch more data from the API
  • i18n - improves multilingual template coverage
  • wtf-mlb - fetch baseball data
  • wtf-nhl - fetch hockey data
  • nsfw - flag sexual/graphic/adult articles
  • image - additional methods for .images()
  • html - output html
  • wikitext - output wikitext
  • markdown - output markdown
  • latex - output latex

Good practice:

The wikipedia api is pretty welcoming, though it recommends three things if you're going to hit it heavily -

  • pass an Api-User-Agent header, so they can easily identify (and throttle) misbehaving scripts
  • bundle multiple pages into one request, as an array (say, groups of 5?)
  • run requests serially, or at least, slowly.
wtf
  .fetch(['Royal Cinema', 'Aldous Huxley'], {
    lang: 'en',
    'Api-User-Agent': '[email protected]',
  })
  .then((docList) => {
    let links = docList.map((doc) => doc.links())
    console.log(links)
  })

Full API

  • .title() - get/set the title of the page from the first-sentence
  • .pageID() - get/set the wikimedia id of the page, if we have it.
  • .wikidata() - get/set the wikidata id of the page, if we have it.
  • .domain() - get/set the domain of the wiki we're on, if we have it.
  • .url() - (try to) generate the url for the current article
  • .lang() - get/set the current language (used for url method)
  • .namespace() - get/set the wikimedia namespace of the page, if we have it
  • .isRedirect() - if the page is just a redirect to another page
  • .redirectTo() - the page this redirects to
  • .isDisambiguation() - is this a placeholder page to direct you to one-of-many possible pages
  • .isStub() - if the page is flagged as incomplete
  • .categories() - return all categories of the document
  • .sections() - return a list of the Document's sections
  • .paragraphs() - return a list of Paragraphs, in all sections
  • .sentences() - return a list of all sentences in the document
  • .images() - return all images found in the document
  • .links() - return a list of all links, in all parts of the document
  • .lists() - sections in a page where each line begins with a bullet point
  • .tables() - return a list of all structured tables in the document
  • .templates() - any type of structured-data element, typically wrapped like {{this}}
  • .infoboxes() - a specific type of template that appears on the top-right of the page
  • .references() - return a list of 'citations' in the document
  • .coordinates() - geo-locations that appear on the page
  • .text() - plaintext, human-readable output for the page
  • .json() - a 'stringifyable' output of the page's main data
  • .wikitext() - original wiki markup
  • .description() - get/set the page's short description, if we have one.
  • .pageImage() - get/set the page's representative image, if we have one.
  • .revisionID() - get/set the latest edit id of the page, if we have it.
  • .timestamp() - get/set the time of the most recent edit of the page, if we have it.
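
a quick sketch stringing a few of these together (outputs are illustrative):

let doc = await wtf.fetch('Whistling')
doc.title() // 'Whistling'
doc.isRedirect() // false
doc.sections().length // how many headings the page has
doc.url() // 'https://en.wikipedia.org/wiki/Whistling' (roughly)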

Section

  • .title() - the name of the section, between ==these tags==
  • .index() - which number section is this, in the whole document.
  • .indentation() - how many steps deep into the table of contents it is
  • .sentences() - return a list of sentences in this section
  • .paragraphs() - return a list of paragraphs in this section
  • .links() - list of all links, in all paragraphs and templates
  • .tables() - list of all html tables
  • .templates() - list of all templates in this section
  • .infoboxes() - list of all infoboxes found in this section
  • .coordinates() - list of all coordinate templates found in this section
  • .lists() - list of all lists in this section
  • .interwiki() - any links to other language wikis
  • .images() - return a list of any images in this section
  • .references() - return a list of 'citations' in this section
  • .remove() - remove the current section from the document
  • .nextSibling() - a section following this one, under the current parent: eg. 1920s → 1930s
  • .lastSibling() - a section before this one, under the current parent: eg. 1930s → 1920s
  • .children() - any sections more specific than this one: eg. History → [PreHistory, 1920s, 1930s]
  • .parent() - the section, broader than this one: eg. 1920s → History
  • .text() - readable plaintext for this section
  • .json() - return all section data
  • .wikitext() - original wiki markup

Paragraph

  • .sentences() - return a list of sentence objects in this paragraph
  • .references() - any citations, or references in all sentences
  • .lists() - any lists found in this paragraph
  • .images() - any images found in this paragraph
  • .links() - list of all links in all sentences
  • .interwiki() - any links to other language wikis
  • .text() - generate readable plaintext for this paragraph
  • .json() - generate some generic data for this paragraph in JSON format
  • .wikitext() - original wiki markup

Sentence

  • .links() - list of all links
  • .bolds() - list of all bold texts
  • .italics() - list of all italic formatted text
  • .text() - generate readable plaintext
  • .json() - return all sentence data
  • .wikitext() - original wiki markup

Image

  • .url() - return url to full size image
  • .thumbnail() - return url to thumbnail (pass size to customize)
  • .links() - any links from the caption (if present)
  • .format() - get file format (e.g. jpg)
  • .text() - does nothing
  • .json() - return some generic metadata for this image
  • .wikitext() - original wiki markup

Template

  • .text() - does this template generate any readable plaintext?
  • .json() - get all the data for this template
  • .wikitext() - original wiki markup

Infobox

  • .links() - any internal or external links in this infobox
  • .keyValue() - generate simple key:value strings from this infobox
  • .image() - grab the main image from this infobox
  • .get() - lookup properties from their key
  • .template() - which infobox, eg 'Infobox Person'
  • .text() - generate readable plaintext for this infobox
  • .json() - generate some generic 'stringifyable' data for this infobox
  • .wikitext() - original wiki markup

List

  • .lines() - get an array of each member of the list
  • .links() - get all links mentioned in this list
  • .text() - generate readable plaintext for this list
  • .json() - generate some generic easily-parsable data for this list
  • .wikitext() - original wiki markup
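
a sketch, reusing the 'Wood in Popular Culture' snippet from earlier (output is illustrative):

let list = wtf(txt).lists()[0]
list.lines().map((line) => line.text())
// ["Harry Potter's wand", "The Simpson's fence"]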

Reference

  • .title() - generate human-facing text for this reference
  • .links() - get any links mentioned in this reference
  • .text() - returns nothing
  • .json() - generate some generic metadata data for this reference
  • .wikitext() - original wiki markup

Table

  • .links() - get any links mentioned in this table
  • .keyValue() - generate a simple list of key:value objects for this table
  • .text() - returns nothing
  • .json() - generate some useful metadata data for this table
  • .wikitext() - original wiki markup
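
a sketch of .keyValue(), assuming a simple table whose header row supplies the keys (output is illustrative):

let table = wtf(page).tables()[0]
table.keyValue()
// [ { Rank: '1', City: 'Chongqing' }, { Rank: '2', City: 'Shanghai' }, ... ]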

Configuration

Adding new methods:

you can add new methods to any class of the library, with wtf.extend()

wtf.extend((models) => {
  // throw this method in there...
  models.Doc.prototype.isPerson = function () {
    return this.categories().find((cat) => cat.match(/people/))
  }
})

(await wtf.fetch('Stephen Harper')).isPerson()

Adding new templates:

does your wiki use a {{foo}} template? Add a custom parser for it:

wtf.extend((models, templates) => {
  // create a custom parser function
  templates.foo = (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return 'new-text'
  }

  // array-syntax allows easy-labeling of parameters
  templates.foo = ['a', 'b', 'c']

  // number-syntax for returning by param # '{{name|zero|one|two}}'
  templates.baz = 0

  // replace the template with a string '{{asterisk}}' -> '*'
  templates.asterisk = '*'
})
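
as a sketch, the array-syntax above would label a template's parameters roughly like this (output is illustrative):

wtf('{{foo|hello|world|again}}').templates()[0].json()
// { template: 'foo', a: 'hello', b: 'world', c: 'again' } (roughly)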

by default, if there's no parser for a template, it will just be ignored and generate an empty string. However, it's possible to configure a fallback parser function to handle these templates:

wtf('some {{weird_template}} here', {
  templateFallbackFn: (tmpl, list, parse) => {
    let obj = parse(tmpl) //or do a custom regex
    list.push(obj)
    return '[unsupported template]' // or return null to ignore this template
  },
})

you can determine which templates are understood to be 'infoboxes' with the 3rd parameter:

wtf.extend((models, templates, infoboxes) => {
  Object.assign(infoboxes, { person: true, place: true, thing: true })
})

Notes:

3rd-party wikis

by default, any installed MediaWiki application provides a public API. This means that most wikis have an open api, even if they don't realize it. Some wikis may turn this feature off.

It can usually be found by visiting http://mywiki.com/api.php

to fetch pages from a 3rd-party wiki:

wtf.fetch('Kermit', { domain: 'muppet.fandom.com' }).then((doc) => {
  console.log(doc.text())
})

some wikis change the path of their API from ./api.php to somewhere else. If your api lives at a different path, you can set it like so:

wtf.fetch('2016-06-04_-_J.Fernandes_@_FIL,_Lisbon', { domain: 'www.mixesdb.com', path: 'db/api.php' }).then((doc) => {
  console.log(doc.template('player').json())
})

for image-urls to work properly, the wiki should also have Special:Redirect enabled. Some wikis (like wikia) have intentionally disabled this.
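
when Special:Redirect is available, image urls come out roughly like this (the exact url shape is illustrative):

wtf(page).images()[0].url()
// 'https://en.wikipedia.org/wiki/Special:Redirect/file/Duveneck_Whistling_Boy.jpg'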

i18n and multi-language:

wikitext is (amazingly) used across all languages, wikis, and even in right-to-left languages. This parser actually does an okay job at it too.

Wikipedia i18n language information for Redirects, Infoboxes, Categories, and Images is included in the library, with pretty-decent coverage.

To improve coverage of i18n templates, use wtf-plugin-i18n

Please make a PR if you see something missing for your language.

Builds:

this library ships separate client-side and server-side builds, to preserve filesize.

the browser version uses fetch() and the server version uses require('https').

Performance:

It is not the fastest parser, and is very unlikely to beat a single-pass parser in C or Java.

Using dumpster-dive, this library can parse the full English Wikipedia in around 4 hours on a macbook.

That's about 100 pages/second, per thread.

See also:

Alternative javascript parsers:

and many more!

MIT

wtf_wikipedia's People

Contributors

amantel, cldellow, d1231, dagingaa, derhuerst, dmh-cs, ephdot, frankier, guidorr, j-rausch, jan-jan, joliss, marketingpip, mikeda37, moustachedelait, mx781, natan, pdziok, ramusus, rg3h, robertcuadra, spencermountain, suntala, tomasz13nocon, tstibbs, waldenn, wvanderp, yanosh-igor, yash-singh1, yuryshulaev


wtf_wikipedia's Issues

Troubles with fr.wikipedia.org

Hi,

I'm having trouble with the French wiki, which always returns {} with your example.

wtf_wikipedia.from_api("Toronto", "en", function(markup){
  var obj = wtf_wikipedia.parse(markup);
  console.log(obj.infobox);
})
-> works perfectly

wtf_wikipedia.from_api("Toronto", "fr", function(markup){
  var obj = wtf_wikipedia.parse(markup);
  console.log(obj.infobox);
})
-> returns {}

Any ideas? Thanks for your work btw!

Too many apostrophes removed along with formatting.

Thanks for the library, it's very useful!

When I run it on the Italian wikipedia, I find text like

L''''armonium''' o '''armonio''' (in francese, ''harmonium'')

where the first apostrophe is part of the text and the other three are formatting.

The parser has to check that the apostrophes are balanced; the change in the pull request does that.

wiktionary

This is simply a question, before doing my own tests: is this library supposed to also parse the wiktionary entries?
I didn't find any confirmation that the format is the same, besides the general messy appearance.

wtf_wikipedia removes () that are useful

Hi, I'm back with another issue :D

I use your API with this page : https://fr.wikipedia.org/wiki/J%C3%A9r%C3%B4me_Cahuzac

but the infobox.image.text is wrong after the parsing. It is { text: 'Jérôme Cahuzac cropped.jpg', links: undefined } whereas it should be Jérôme Cahuzac (2012) cropped.jpg according to the markup.

After some investigation, I think the reason is this line: line = line.replace(/\([^a-z]{0,8}\)/, ""); (/src/parse/parse_line.js:38), but I don't know if this line is important (can I remove it?).

Thanks a lot and have a nice day!

reformat sections as ordered array

right now article headings are being organized as an object, but
it appears we're losing their ordering this way.
Given this is important, it probably oughta be rendered as an array.
¯\_(ツ)_/¯

extracting Infobox data from Turkish wikipages

Hi,

I'm trying to extract information using your script; however, I haven't been able to. You can see the raw data of Toronto written in Turkish below.

http://en.wikipedia.org/w/index.php?action=raw&title=toronto
http://tr.wikipedia.org/w/index.php?action=raw&title=toronto

I'm trying to list matching definitions:
disambig words: "anlam ayrımı"
infobox: "bilgi kutusu" (in Toronto case, "yerleşim bilgi kutusu" which means "Infobox settlement" )
category:"kategori"
redirect: "yönlendirme"

No success! Could you please help?

Thanks..

Intro section ignored in RU articles

Hi, are there known problems with non-latin characters? When parsing the "extract" field of this API result, the first section (depth 0) has an empty array of sentences.

Interestingly, if I prepend a non-cyrillic word with a space, like "test ", the first paragraph is correctly parsed, but all others are still missing. Without a fix, I suppose I'll have to prepend a 'latin' word to every paragraph and later remove it, but perhaps I'm also doing something else wrong...

support es6 package.json entries

I'm using your library with an angular-cli generated project. It works flawlessly locally in dev mode, but when trying to generate a production build I get this:

ERROR in vendor.167aea1ea9373484afbe.bundle.js from UglifyJs
Unexpected token: name (url) [vendor.167aea1ea9373484afbe.bundle.js:15769,6]

and it starts to work again after removing even the most basic form of import, i.e. import 'wtf_wikipedia';

If you want to reproduce the problem, here's the code: https://bitbucket.org/spartanPAGE/notes-assistant/src/e1705786b5452ec25609c35846fa2ed543b5c15e

upload to npm?

your version is currently 3.1.0. the latest on npm is 2.0. can you bump it please?

Thanks

Infobox parsing failure

Hi,

I'm using the library and I found an issue with the infobox.

There are some pages where not all infobox data is parsed, for example the following page: https://en.wikipedia.org/wiki/Boudewijn_Zenden
Many keys are missing, for example: club1, birthday

I managed to find the reason for the issue: for some keys (denoted by '|') there is no leading \n, and some values contain '|' or '=', so the library doesn't parse them.

I fixed it using a regex and iterating over all possible keys. If you want, I could make a pull request or give you my code.

Thanks

Cross Compile with wtf_wikipedia.js to other Document formats

Parsing wiki markup generates a syntax tree. Is there a recommended way to create an output format other than plain text? I want to use document conversion via wtf_wikipedia.js with the generated syntax tree, e.g. to create LaTeX and other output formats, similar to Pandoc (https://pandoc.org/try/), e.g. converting "===My Header===" into the LaTeX syntax "\subsection{My Header}". Thank you very much for developing wtf_wikipedia.js and sharing the code.

.plaintext() cuts text

Hi, straight to the point: .plaintext("dom, kon. XIX w.") returns "dom, kon." for me.

Problem parsing sortable lists

thanks for this awesome parser. helped me a lot!

i ran into a problem parsing sortable lists:
eg.
https://en.wikipedia.org/wiki/List_of_largest_cities
i can access the "meta information" but not the content itself.

wtf.from_api("List_of_largest_cities", "en", function(markup) { var data = wtf.parse(markup); console.log(data.sections[5].lists); });

not sure whether to file this as an issue or a feature request.
cheers and thanks for all your work

<onlyinclude> is not parsed away

A part of this article is enclosed in <onlyinclude>...</onlyinclude>, but the parse function keeps these tags as-is in the sentences.

I should probably add that, prior to parsing, I remove the disambiguation tag ({{Begriffsklärung}}) because I want the options' descriptions, not just a list of links. When I try using the disambiguation parsing, those links are missing the first one, perhaps because of the <onlyinclude> that precedes the first "*".

regex overcapturing content

The following regex in word_tempates() seems to be capturing more than it should be:
wiki = wiki.replace(/\{\{convert\|([0-9]*?)\|([^\|]*).*?\}\}/gi, "$1 $2");

When it runs on the infobox at the end of this, it gets to the line that starts with the "campus" attribute, where it grabs the {{convert...}} section at the end of that line, but it also grabs the entire next line. The end result is this snippet:

| campus = Urban (small city);<br/>1970 acre}}

| athletics = [[NCAA Division I]] – [[Southeastern Conference|SEC]]

Later on in processing, this causes you to skip the remainder of the infobox after "campus" because of the blank line. It looks like you should have a '?' inside the second capture group, making it ([^\|]*?), and then maybe remove the '.*?' after it, unless you were trying to exclude any extra characters after a second '|' until the '}}'.

{{Infobox university
| name = The University of Alabama
| image_name = BamaSeal.png
| image_size = 150px
| established = 1831
| type = [[Flagship university|Flagship]]<br />[[State university system|Public university]]<br />[[Sea-grant]]<br />[[Space-grant]]
| endowment = $667,980,131<ref name="ReferenceA">http://colleges.usnews.rankingsandreviews.com/best-colleges/university-of-alabama-1051</ref><ref name="colleges.usnews.rankingsandreviews.com">{{cite web|url=http://colleges.usnews.rankingsandreviews.com/best-colleges/university-of-alabama-1051 |title=University of Alabama|work=rankingsandreviews.com}}</ref>
| president = [[Stuart R. Bell]]
| faculty = 1,175
| students = 37,098 (Fall 2015)<ref name=CommonDataSet>{{cite web|url=http://oira.ua.edu/d/content/reports/common-data-set|title=Common Data Set - OIRA|work=ua.edu}}</ref>
| postgrad = 5,140 (Fall 2015)<ref name=CommonDataSet/>
| undergrad = 31,958 (Fall 2015)<ref name=CommonDataSet/>
| city = [[Tuscaloosa, Alabama|Tuscaloosa]]
| state = [[Alabama]]
| country = U.S.
| campus = Urban (small city);<br/>{{convert|1970|acre}}
| coor = {{coord|33.209438|N|87.541493|W|source:dewiki_region:US-AL_type:landmark|display=inline,title}}
| athletics = [[NCAA Division I]] – [[Southeastern Conference|SEC]]
|free_label = Sports Motto
| free = [[Roll Tide]]
| colors = Crimson & White<ref>{{cite web|url=http://visualid.ua.edu/download/UA-BrandingStandards-Aug172015.pdf|title=The University of Alabama Branding Standards 2015–2016 |work=ua.edu}}</ref><br />{{color box|#9E1B32}}&nbsp;{{color box|#FFFFFF}}
| nickname = [[Alabama Crimson Tide]]
| mascot = [[Big Al (mascot)|Big Al]]
| affiliations = {{unbulleted list|[[University of Alabama System]]|[[Oak Ridge Associated Universities|ORAU]]|[[Universities Research Association|URA]]|[[Association of Public and Land-Grant Universities|APLU]]}}
| website = {{url|www.ua.edu}}
| logo = [[File:University of Alabama (logo).png|250px]]
}}
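
One way to express that suggestion (a sketch, not the shipped fix) is to also exclude '}' from the second capture group, so the match can never run past the end of the template:

// sketch: '[^|}]' stops the capture at either a pipe or a closing brace
wiki = wiki.replace(/\{\{convert\|([0-9]*?)\|([^|}]*)[^}]*?\}\}/gi, "$1 $2");
// '{{convert|1970|acre}}' -> '1970 acre', without swallowing the next line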

Error: connect ECONNREFUSED 127.0.0.1:80

Hi, I tried to use wtf_wikipedia, but another person and I both have this issue:

PS C:\Users\Jormungand\Workspace\unicorngame> wikipedia Toronto Blue Jays
{ Error: connect ECONNREFUSED 127.0.0.1:80
    at Object.exports._errnoException (util.js:1012:11)
    at exports._exceptionWithHostPort (util.js:1035:20)
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1080:14)
  code: 'ECONNREFUSED',
  errno: 'ECONNREFUSED',
  syscall: 'connect',
  address: '127.0.0.1',
  port: 80,
  response: undefined }
{
  "type": "page",
  "text": {},
  "categories": [],
  "images": [],
  "infobox": {},
  "infobox_template": "",
  "tables": [],
  "translations": {}
}

Also, I have npm 4.0.1 and node 6.3.1, tested on Windows and Mac OS.
If anyone has an idea, it would be a great help to me.

Promises in fetch

A month ago I started creating my own wiki parser, but as you know it's a big pain in the ass. So I decided to collaborate on another library; yesterday I found this one, and I think it is a great start. My idea is to help you with this, but I think we have different code-styles, and before I start sending you PRs I want to tell you my thoughts:

  • Why do we need more than one method? Now we have plaintext, from_api and parse. IMHO this library should be a single stateless method:
import wikipediaWTF from `wikipedia-wtf`;

wikipedia(sentence {string}, options {language...})
  • Why don't we use Promises? Better if we use something like:
wikipedia.from(sentence, options)
  .then(value => ...)
  .catch(error => ...)
  • If it's a promise we can use the new standard async/await:
const knowledge = await wikipedia(sentence, options);
  • IMHO a better approach is to auto-parse the markup into a JavaScript Object
  • Now we have infobox and infobox_template; why don't we create a single attribute?
  • And the last one (for now): why do we use grunt? Maybe we should move to babel + mocha via NPM tasks (zero dependencies)

I think this library is a great start, but we have to create a consistent DSL for a better experience using it.

best,

javi

{{coords}} template

There is <text xml:space="preserve">{{pagebanner|Moscow Banner.jpg|disambig=yes}} in the XML, but I cannot find it anywhere after running the script.

Query in another language

Hello.

Can I get the results in a different language?

Great work, actually the best wikipedia parser written in js.

Best regards.

'Chronological' output within a section

The parse method nicely yields arrays of sentences and links, but I see no way of knowing which lists were at which position. Is this loss of information deliberate, am I missing something, or was this just never an issue for anybody? A simple solution would seem to be allowing list parsing to be disabled, so that lists are somehow included in the sentences.

doesn't seem to work from the command line

i installed this globally and tried using it in the console, and got an error where fetch_text.js was failing because it hadn't been called with a callback function.

support Links with #Hash anchors

There is a Wikipedia-internal link with an anchor link in the first paragraph of the DE article for "Köln", like [[a#b|c]] (see the API output here and search for "Gemeinden"). Contrary to expectation, this does not get properly parsed as "c" but is essentially returned as it is.

Support advanced {{convert}} syntax

Running the following script:

var wtf_wikipedia = require('wtf_wikipedia');
wtf_wikipedia.from_api("Paris", function(page) {
    var parsed = wtf_wikipedia.parse(page); // causes the crash
    console.log(parsed);
});

Results in the process taking up 1 GB of RAM, then quickly crashing with the message
FATAL ERROR: CALL_AND_RETRY_2 Allocation failed - process out of memory Aborted (core dumped)

Running Ubuntu 14.04

Paragraph Support

Thanks for the lib, I'm finding it very useful.

I want to preserve the newlines ("\n") in the text, so that I can convert content to markdown, and paragraphs are shown correctly.

If you are willing to do that it would be great; otherwise, could you please direct me to where I can affect that.

option to keep some formatting

The preProcess function removes bold/italic etc. I propose adding a parameter which can be used to keep them.
I think a viable parameter could be:
"keep italic/bold"

Hard error on some queries

Given a query of 'test' or 'blah', the following error with stack gets thrown:

TypeError: data.text.forEach is not a function
    at Object.plaintext (.../node_modules/wtf_wikipedia/src/index.js:198:15)
    at wtf.from_api (.../lib/commands/wiki.js:24:22)
    at Request._callback (.../node_modules/wtf_wikipedia/src/lib/fetch_text.js:28:7)
    at Request.self.callback (.../node_modules/request/request.js:186:22)
    at emitTwo (events.js:106:13)
    at Request.emit (events.js:191:7)
    at Request.<anonymous> (.../node_modules/request/request.js:1060:10)
    at emitOne (events.js:96:13)
    at Request.emit (events.js:188:7)
    at IncomingMessage.<anonymous> (.../node_modules/request/request.js:980:12)

Rolling back to 0.5.0 resolved this issue on the same queries so I'm assuming the issue came about in the latest update.

Parse each table entry into object

Hey there! Right now each table cell gets parsed into its own key in the Map that gets returned; it would be great to get the entire entry as an object and grab the columns from there.

Question: How performant / fast is the parser?

Hello,

I want to build something like a WikiGame, but with custom styles for the article. Do you think this is a good use case for the library or should I just fetch the rendered html from the API and style that myself?
Is this library fast enough to parse a huge Wikipedia page in the browser, or is it rather inefficient?

Thanks,
Marc

Removing valid listings during parse.infobox stage

Example setup
Information from https://en.wikivoyage.org/w/index.php?title=Barcelona&action=edit&section=5

===Visitor information===
* {{listing
| name=Tourist office at Plaça de Catalunya | alt= | url=http://www.barcelonaturisme.com/wv3/en/page/38/tourist-information-points.html | email=
| address=Plaça de Catalunya, 17-S | lat=41.3868027 | long=2.1707225 | directions=Metro: L1, L3. Bus: 9, 22, 28, 42, 47, 58, 66, 67, 68. Train: R4
| phone= | tollfree= | fax=
| hours=8:30am-8:30pm | price=
| lastedit=2015-10-22
| content=This is the main tourist office in the city.
}}

The other tourist offices can be found at Plaça de Sant Jaume, Ciutat, 2 Ajuntament de Barcelona. (City Hall.) Opening time: Monday to Friday: 8.30am-8.30pm. Saturday: 9am-7pm. Sunday and public holidays: 9am-2pm.; Estació de Sants, Plaça dels Països Catalans. How to get there: Metro: L5,L3. Bus: 63,68. Opening time:  daily, 8am-8pm. and Aeroport del Prat. Terminal 1 and 2. Opening time: Daily, 9am-9pm.  All are closed on 1st January and 25th December. For a full list of tourist information points check the link above.

The department store El Corte Ingles publishes a free street map for tourists. You can pick a copy at the store, or at one of the many hotels in the city.
Turisme de Barcelona

Converts to http://i.imgur.com/6qB3Vz2.png

All information from this {{listing}} is nowhere to be found after the "infobox stage" - I guess in the part where we remove templates.

Normal * lists are ok - we save them to the section's list array.

In short:
We lose a lot of information about nice objects from {{}} templates - "do", "see" etc.

External link text

Hello!

While parsing a sentence we keep a nice array of links (both external and internal), but sometimes it is relatively hard to map them back to the actual sentence.

Example:
First section of https://en.wikivoyage.org/w/index.php?title=Barcelona/Eixample&action=edit
We have

The '''[http://w110.bcn.cat/portal/site/Eixample Eixample]''' is the quarter designed during the middle of the 19th century by Ildefons Cerdà, expanding the medieval city of [[Barcelona]] into space left empty for defense outside the city walls.

Which translates to:

text:"The ''' Eixample''' is the quarter designed during the middle of the 19th century by Ildefons Cerdà, expanding the medieval city of Barcelona into space left empty for defense outside the city walls."
links: Array(2) [Object, Object]
0:Object {type: "external", site: "http://w110.bcn.cat/portal/site/Eixample", text: ""}
1:Object {page: "Barcelona", text: "Barcelona"}

When I try to print this sentence, should I:

  1. Use some regexp to find the first (second/third) of the ''' ''' templates and insert the first link there, if its type is external?
  2. Find the text by regexp if the type is page (or other), and insert a link to the wikipage with that text (Barcelona)?

Birth date and date of death disappeared

Hi,

I use the plaintext method to retrieve the text of articles about people.
When this text contains the birth date and/or the date of death, these dates are missing.

For example in the article :

Raoul Dautry est un ingénieur, dirigeant d'entreprises publiques et homme politique français, né le 16 septembre 1880 à Montluçon (Allier) et décédé le 21 août 1951 à Lourmarin (Vaucluse).

And the query result is :

Raoul Dautry est un ingénieur, dirigeant d'entreprises publiques et homme politique français, né le à Montluçon (Allier) et décédé le à Lourmarin (Vaucluse).

The two dates are missing.

Is it an error, or do I need to do something to get the dates?

Thanks.

strange error related to keys on node 8

let arr = Object.keys(data.sections).map(k => {
^

TypeError: Cannot convert undefined or null to object
at Function.keys ()
at Object.plaintext (/home/saurabh/node_modules/wtf_wikipedia/src/index.js:25:20)
at onFileContent (/home/saurabh/Desktop/comprehension_burden/wikipedia_w2v/wiki_xml_to_jsonv2.js:10:22)
at /home/saurabh/Desktop/comprehension_burden/wikipedia_w2v/wiki_xml_to_jsonv2.js:45:9
at tryToString (fs.js:513:3)
at FSReqWrap.readFileAfterClose [as oncomplete] (fs.js:501:12)

Update: here is the stackoverflow question I posted, along with code:
https://stackoverflow.com/questions/45839226/easiest-way-to-convert-wikitext-to-plaintext

Table parsing

Great job, it works well. But one thing I noticed is that it's ignoring tables other than the infobox.

example : http://en.wikipedia.org/wiki/List_of_British_films_of_2014

{
  "type": "page",
  "text": {
    "Intro": [
      {
        "text": "A list of British films scheduled for release in 2014",
        "links": [
          {
            "page": "2014 in film",
            "src": "2014"
          }
        ]
      }
    ]
  },
  "categories": [
    "Lists of British films by year|2014",
    "2014 in British cinema|Films",
    "Lists of 2014 films by country or language"
  ],
  "images": [],
  "infobox": {}
}

Also, if the table has a rowspan or a colspan, it simply ignores that data.

interlanguage links interfering with cookbook wikibook text

links that look like [[Cookbook:Casserole|casserole]] are identified as interlanguage links and removed, which also removes the word casserole from the text.

added:
const languages = require('../../../data/languages');
at the top and modified line 24 (now 25) of src/parse/section/image/index.js to:
if (site && i18n.dictionary[site] === undefined && languages[site] != undefined) {
so that languages are checked.

Might be a better way, thought I'd flag it.

Fails to correctly parse a Wikibooks page

  • Library version: 0.5.1g
  • Node.js version: v6.4.0

I encountered two problems while attempting to parse the markup on this page:

  • The parser stops parsing after the gdbserver section.
  • Code blocks are not indicated as such, and are just seen as regular text.

The original input:

When QEMU is running, it provides a monitor console for interacting with QEMU. Through various commands, the monitor allows you to inspect the running guest OS, change removable media and USB devices, take screenshots and audio grabs, and control various aspects of the virtual machine.

The monitor is accessed from within QEMU by holding down the Ctrl and Alt keys (or whatever the "mouse grab" keystrokes are), and pressing <code>Shift-2</code>. Once in the monitor, <code>Shift-1</code> switches back to the guest OS. Typing <code>help</code> or <code>?</code> in the monitor brings up a list of all commands.
Alternatively the monitor can be redirected to using the <code>-monitor <dev></code> command line option.
Using <code>-monitor stdio</code> will send the monitor to the standard output, this is most useful when using qemu on the command line.

==Help and information==

===help===
* <tt>help [''command'']</tt> or <tt>? [''command'']</tt>
With no arguments, the help command lists all commands available. For more detail about another command, type <tt>help ''command''</tt>, e.g.
 (qemu) help info
On a small screen / VM window, the list of commands will scroll off the screen too quickly to let you read them. To scroll back and forth so that you can read the whole list, hold down the control key and press Page Up and Page Down.

===info===
* <tt>info ''option''</tt>
Show information on some aspect of the guest OS. Available options are:
* <tt>block</tt> &ndash; block devices such as hard drives, floppy drives, cdrom
* <tt>blockstats</tt> &ndash; read and write statistics on block devices
* <tt>capture</tt> &ndash; active capturing (audio grabs)
* <tt>history</tt> &ndash; console command history
* <tt>irq</tt> &ndash; statistics on interrupts (if compiled into QEMU)
* <tt>jit</tt> &ndash; statistics on QEMU's Just In Time compiler
* <tt>kqemu</tt> &ndash; whether the kqemu kernel module is being utilised
* <tt>mem</tt> &ndash; list the active virtual memory mappings
* <tt>mice</tt> &ndash; mouse on the guest that is receiving events
* <tt>network</tt> &ndash; network devices and VLANs
* <tt>pci</tt> &ndash; PCI devices being emulated
* <tt>pcmcia</tt> &ndash; [[w:PC Card|PCMCIA card]] devices
* <tt>pic</tt> &ndash; state of i8259 (PIC)
* <tt>profile</tt> &ndash; info on the internal profiler, if compiled into QEMU
* <tt>registers</tt> &ndash; the CPU registers
* <tt>snapshots</tt> &ndash; list the VM snapshots
* <tt>tlb</tt> &ndash; list the TLB ([[w:Translation lookaside buffer|Translation Lookaside Buffer]]), i.e. mappings between physical memory and virtual memory
* <tt>usb</tt> &ndash; USB devices on the virtual USB hub
* <tt>usbhost</tt> &ndash; USB devices on the host OS
* <tt>uuid</tt> &ndash; Unique id of the VM
* <tt>version</tt> &ndash; QEMU version number
* <tt>vnc</tt> &ndash; [[w:Virtual Network Computing|VNC]] information

==Devices==

===change===
* <tt>change ''device'' ''setting''</tt>
The <code>change</code> command allows you to change removable media (like CD-ROMs), change the display options for a VNC, and change the password used on a VNC.

When you need to change the disc in a CD or DVD drive, or switch between different [[w:ISO image|.iso]] files, find the name of the CD or DVD drive using <code>info</code> and use <code>change</code> to make the change.
 (qemu) info block
 ide0-hd0: type=hd removable=0 file=/path/to/winxp.img
 ide0-hd1: type=hd removable=0 file=/path/to/pagefile.raw
 ide1-hd1: type=hd removable=0 file=/path/to/testing_data.img
 ide1-cd0: type=cdrom removable=1 locked=0 file=/dev/sr0 ro=1 drv=host_device
 floppy0: type=floppy removable=1 locked=0 [not inserted]
 sd0: type=floppy removable=1 locked=0 [not inserted]
 (qemu) change ide1-cd0 /path/to/my.iso
 (qemu) change ide1-cd0 /dev/sr0 host_device

===eject===
* <tt>eject [-f] ''device''</tt>
Use the <code>eject</code> command to release the device or file connected to the removable media device specified. The <code>-f</code> parameter can be used to force it if it initially refuses!

===usb_add===
Add a host file as USB flash device
( you need to create in advance the host file: dd if=/dev/zero of=/tmp/disk.usb bs=1024k count=32 )

usb_add disk:/tmp/disk.usb

===usb_del===
use info usb to get the usb device list
<br>
(qemu)info usb<br>
Device 0.1, Speed 480 Mb/s, Product XXXXXX<br>
Device 0.2, Speed 12 Mb/s, Product XXXXX<br>
<br>
(qemu)usb_del 0.2<br>
<br>
This deletes the device

===mouse_move===
Sends Mouse Movevment events to guest.
mouse_move dx dy [dz] -- send mouse move events.
Example:
[qemu]mouse_move -20 20

===mouse_button===

===mouse_set index===

===sendkey keys===
You can emulate keyboard events through sendkey command. The syntax is: sendkey keys. To get a list of keys, type <tt>sendkey [tab]</tt>. Examples: 
* <tt>sendkey a</tt>
* <tt>sendkey shift-a</tt>
* <tt>sendkey ctrl-u</tt>
* <tt>sendkey ctrl-alt-f1</tt>

As of QEMU 0.12.5 there are:

{| border=0
|shift
|shift_r
|alt
|alt_r
|altgr
|altgr_r
|-
|ctrl
|ctrl_r
|menu
|esc
|1
|2
|-
|3
|4
|5
|6
|7
|8
|-
|9
|0
|minus
|equal
|backspace
|tab
|-
|q
|w
|e
|r
|t
|y
|-
|u
|i
|o
|p
|ret
|a
|-
|s
|d
|f
|g
|h
|j
|-
|k
|l
|z
|x
|c
|v
|-
|b
|n
|m
|comma
|dot
|slash
|-
|asterisk
|spc
|caps_lock
|f1
|f2
|f3
|-
|f4
|f5
|f6
|f7
|f8
|f9
|-
|f10
|num_lock
|scroll_lock
|kp_divide
|kp_multiply
|kp_subtract
|-
|kp_add
|kp_enter
|kp_decimal
|sysrq
|kp_0
|kp_1
|-
|kp_2
|kp_3
|kp_4
|kp_5
|kp_6
|kp_7
|-
|kp_8
|kp_9
|<
|f11
|f12
|print
|-
|home
|pgup
|pgdn
|end
|left
|up
|-
|down
|right
|insert
|delete
|-
|}

==Screen and audio grabs==

===screendump===
* <tt>screendump ''filename''</tt>
Capture a screendump and save into a [[w:Netpbm format|PPM]] image file.

===wavcapture===

===stopcapture===

==Virtual machine==

===commit===
* <tt>commit ''device''</tt> or <tt>commit all</tt>
When running QEMU with the <code>-snapshot</code> option, commit changes to the device, or all devices.

===quit===
* <tt>quit</tt> or <tt>q</tt>
Quit QEMU '''immediately'''.

===savevm===
* <tt>savevm</tt> name
Save the virtual machine as the tag 'name'. Not all filesystems support this. raw does not, but qcow2 does.

===loadvm===
* <tt>loadvm</tt> name
Load the virtual machine tagged 'name'. This can also be done on the command line: <tt>-loadvm</tt> name

With the <tt>info snapshots</tt> command, you can request a list of available machines.

===delvm===

Remove the virtual machine tagged 'name'.

===stop===
Suspend execution of VM

===cont===
Reverse a previous stop command - resume execution of VM.

===system_reset===
This has an effect similar to the physical reset button on a PC. Warning: Filesystems may be left in an unclean state.

===system_powerdown===
This has an effect similar to the physical power button on a modern PC. The VM will get an ACPI shutdown request and usually shutdown cleanly.

===log===
* <tt>log ''option''</tt>

===logfile===
* <tt>logfile ''filename''</tt>
Write logs to specified file instead of the default path, <code>/tmp/qemu.log</code> .

===gdbserver===
Starts a remote debugger session for the GNU debugger (gdb).  To connect to it from the host machine, run the following commands:
 shell$ gdb qemuKernelFile
 (gdb) target remote localhost:1234

===x===
''x /format address''

Displays memory at the specified virtual address using the specified format.

Refer to the xp section for details on ''format'' and ''address''.

===xp===
''xp /format address''

Displays memory at the specified physical address using the specified format.

''format'': Used to specify the output format the displayed memory.  The format is broken down as ''/[count][data_format][size]
* count: number of item to display (base 10)
* data_format: 'x' for hex, 'd' for decimal, 'u' for unsigned decimal, 'o' for octal, 'c' for char and 'i' for (disassembled) processor instructions
* size: 'b' for 8 bits, 'h' for 16 bits, 'w' for 32 bits or 'g' for 64 bits. On x86 'h' and 'w' can select instruction disassembly code formats.

''address'':
* Direct address, for example: 0x20000
* Register, for example: $eip

Example - Display 3 instructions on an x86 processor starting at the current instruction:
<pre>
(qemu) xp /3i $eip
</pre>

Example - Display the last 20 words on the stack for an x86 processor:
<pre>
(qemu) xp /20wx $esp
</pre>

===print===
Print (or p), evaluates and prints the expression given to it.
The result will be printed in hexadecimal, but decimal can also be used in the expression.
If the result overflows it will wrap around.
To use a the value in a CPU register use $<register name>.  
The name of the register should be lower case.  You can see registers with the info registers command.

Example of qemu simulating an i386.
 (qemu) print 16
 0x10
 (qemu) print 16 + 0x10
 0x20
 (qemu) print $eax
 0xc02e4000
 (qemu) print $eax + 2
 0xc02e4000
 (qemu) print ($eax + 2) * 2
 0x805c8004
 (qemu) print 0x80000000 * 2
 0

More information on the architecture specific register names can be found from the below qemu source file

http://git.qemu.org/?p=qemu.git;a=blob;f=monitor.c;h=1266ba06fb032cb0e7c9dbaa1b6d22cd9047c6b4;hb=HEAD#l3044

===sum===

===memsave===
'''Usage:''' memsave <address> <size> <filename>

==Links==
Monitor in QEMU documentation: http://wiki.qemu.org/download/qemu-doc.html#pcsys_005fmonitor
{{Auto category}}

The resulting data:

{ type: 'page',
  text: 
   { Intro: 
      [ { text: 'When QEMU is running, it provides a monitor console for interacting with QEMU.',
          links: undefined },
        { text: 'Through various commands, the monitor allows you to inspect the running guest OS, change removable media and USB devices, take screenshots and audio grabs, and control various aspects of the virtual machine.',
          links: undefined },
        { text: 'The monitor is accessed from within QEMU by holding down the Ctrl and Alt keys (or whatever the "mouse grab" keystrokes are), and pressing  will send the monitor to the standard output, this is most useful when using qemu on the command line.',
          links: undefined } ],
     help: 
      [ { text: 'With no arguments, the help command lists all commands available.',
          links: undefined },
        { text: 'For more detail about another command, type help command, e.g.',
          links: undefined },
        { text: '(qemu) help info', links: undefined },
        { text: 'On a small screen / VM window, the list of commands will scroll off the screen too quickly to let you read them.',
          links: undefined },
        { text: 'To scroll back and forth so that you can read the whole list, hold down the control key and press Page Up and Page Down.',
          links: undefined } ],
     info: 
      [ { text: 'Show information on some aspect of the guest OS. Available options are:',
          links: undefined } ],
     change: 
      [ { text: 'The  to make the change.', links: undefined },
        { text: '(qemu) info block', links: undefined },
        { text: 'ide0-hd0: type=hd removable=0 file=/path/to/winxp.img',
          links: undefined },
        { text: 'ide0-hd1: type=hd removable=0 file=/path/to/pagefile.raw',
          links: undefined },
        { text: 'ide1-hd1: type=hd removable=0 file=/path/to/testing_data.img',
          links: undefined },
        { text: 'ide1-cd0: type=cdrom removable=1 locked=0 file=/dev/sr0 ro=1 drv=host_device',
          links: undefined },
        { text: 'floppy0: type=floppy removable=1 locked=0 [not inserted]',
          links: undefined },
        { text: 'sd0: type=floppy removable=1 locked=0 [not inserted]',
          links: undefined },
        { text: '(qemu) change ide1-cd0 /path/to/my.iso',
          links: undefined },
        { text: '(qemu) change ide1-cd0 /dev/sr0 host_device',
          links: undefined } ],
     eject: 
      [ { text: 'Use the  parameter can be used to force it if it initially refuses!',
          links: undefined } ],
     usb_add: 
      [ { text: 'Add a host file as USB flash device',
          links: undefined },
        { text: '( you need to create in advance the host file: dd if=/dev/zero of=/tmp/disk.usb bs=1024k count=32 )',
          links: undefined },
        { text: 'usb_add disk:/tmp/disk.usb', links: undefined } ],
     usb_del: 
      [ { text: 'use info usb to get the usb device list',
          links: undefined },
        { text: '(qemu)info usb', links: undefined },
        { text: 'Device 0.1, Speed 480 Mb/s, Product XXXXXX',
          links: undefined },
        { text: 'Device 0.2, Speed 12 Mb/s, Product XXXXX',
          links: undefined },
        { text: '(qemu)usb_del 0.2', links: undefined },
        { text: 'This deletes the device', links: undefined } ],
     mouse_move: 
      [ { text: 'Sends Mouse Movevment events to guest.',
          links: undefined },
        { text: 'mouse_move dx dy [dz] send mouse move events.',
          links: undefined },
        { text: 'Example:', links: undefined },
        { text: '[qemu]mouse_move -20 20', links: undefined } ],
     'sendkey keys': 
      [ { text: 'You can emulate keyboard events through sendkey command.',
          links: undefined },
        { text: 'The syntax is: sendkey keys.', links: undefined },
        { text: 'To get a list of keys, type sendkey [tab] .',
          links: undefined },
        { text: 'Examples:', links: undefined },
        { text: 'As of QEMU 0.12.5 there are:', links: undefined } ],
     screendump: 
      [ { text: 'Capture a screendump and save into a PPM image file.',
          links: [ { page: 'W:Netpbm format', src: 'PPM' } ] } ],
     commit: 
      [ { text: 'When running QEMU with the  option, commit changes to the device, or all devices.',
          links: undefined } ],
     quit: [ { text: 'Quit QEMU immediately.', links: undefined } ],
     savevm: 
      [ { text: 'Save the virtual machine as the tag \'name\'.',
          links: undefined },
        { text: 'Not all filesystems support this.', links: undefined },
        { text: 'raw does not, but qcow2 does.', links: undefined } ],
     loadvm: 
      [ { text: 'Load the virtual machine tagged \'name\'.',
          links: undefined },
        { text: 'This can also be done on the command line: -loadvm name',
          links: undefined },
        { text: 'With the info snapshots command, you can request a list of available machines.',
          links: undefined } ],
     delvm: 
      [ { text: 'Remove the virtual machine tagged \'name\'.',
          links: undefined } ],
     stop: [ { text: 'Suspend execution of VM', links: undefined } ],
     cont: 
      [ { text: 'Reverse a previous stop command - resume execution of VM.',
          links: undefined } ],
     system_reset: 
      [ { text: 'This has an effect similar to the physical reset button on a PC. Warning: Filesystems may be left in an unclean state.',
          links: undefined } ],
     system_powerdown: 
      [ { text: 'This has an effect similar to the physical power button on a modern PC. The VM will get an ACPI shutdown request and usually shutdown cleanly.',
          links: undefined } ],
     logfile: 
      [ { text: 'Write logs to specified file instead of the default path,  .',
          links: undefined } ],
     gdbserver: 
      [ { text: 'Starts a remote debugger session for the GNU debugger (gdb).',
          links: undefined },
        { text: 'To connect to it from the host machine, run the following commands:',
          links: undefined },
        { text: 'shell$ gdb qemuKernelFile', links: undefined },
        { text: '(gdb) target remote localhost:1234', links: undefined } ] },
  categories: [],
  images: [],
  infobox: {},
  infobox_template: '',
  tables: 
   [ [ [ 'ctrl', 'ctrl_r', 'menu', 'esc', '1', '2' ],
       [ '3', '4', '5', '6', '7', '8' ],
       [ '9', '0', 'minus', 'equal', 'backspace', 'tab' ],
       [ 'q', 'w', 'e', 'r', 't', 'y' ],
       [ 'u', 'i', 'o', 'p', 'ret', 'a' ],
       [ 's', 'd', 'f', 'g', 'h', 'j' ],
       [ 'k', 'l', 'z', 'x', 'c', 'v' ],
       [ 'b', 'n', 'm', 'comma', 'dot', 'slash' ],
       [ 'asterisk', 'spc', 'caps_lock', 'f1', 'f2', 'f3' ],
       [ 'f4', 'f5', 'f6', 'f7', 'f8', 'f9' ],
       [ 'f10',
         'num_lock',
         'scroll_lock',
         'kp_divide',
         'kp_multiply',
         'kp_subtract' ],
       [ 'kp_add', 'kp_enter', 'kp_decimal', 'sysrq', 'kp_0', 'kp_1' ],
       [ 'kp_2', 'kp_3', 'kp_4', 'kp_5', 'kp_6', 'kp_7' ],
       [ 'kp_8', 'kp_9', '<', 'f11', 'f12', 'print' ],
       [ 'home', 'pgup', 'pgdn', 'end', 'left', 'up' ],
       [ 'down', 'right', 'insert', 'delete' ],
       [],
       '-1': [ 'shift', 'shift_r', 'alt', 'alt_r', 'altgr', 'altgr_r' ] ] ],
  translations: {} }

Long tables parsed incorrectly

It appears that large tables can fail to be parsed.

This is about the most minimal workable example I can come up with:

const wtfWikipedia = require('wtf_wikipedia')

function areTablesParsedCorrectly(charCount) {
    const longString = 'a'.repeat(charCount)

    const input = `
{| class="wikitable"
|-
| ${longString}
|}`

    var obj = wtfWikipedia.parse(input)
    return obj.tables.length > 0
}

console.log(areTablesParsedCorrectly(7975))
console.log(areTablesParsedCorrectly(7976))

Based on the numbers above, I think this happens when the markup between the braces hits 8000 chars (ignoring newlines). The expected behaviour is that the contents of the table is parsed into the 'tables' element. The actual behaviour (when above this string length) is that the table contents is parsed into sections/list/text.

Reproduced with wtf_wikipedia version 1.0.1, node v7.4.0.

Fetch wiki pages by id?

Mind adding support for fetching wiki pages by id as well? Currently you can only do this by the page title.

Here is a suggestion:

Update index.js line 538:

    var from_api=function(page_identifier, lang_or_wikiid, cb){
      if(typeof lang_or_wikiid=="function"){
        cb= lang_or_wikiid
        lang_or_wikiid="en"
      }
      cb= cb || console.log
      lang_or_wikiid=lang_or_wikiid||"en"
      if(!fetch){//no http method, on the client side
        return cb(null)
      }
      fetch(page_identifier, lang_or_wikiid, cb);
    }

Update fetch_text.js line 7:

var fetch=function(page_identifier, lang_or_wikiid, cb){
  lang_or_wikiid = lang_or_wikiid || 'en';

  var identifier_type  = (Number.isInteger(page_identifier)) ? 'curid' : 'title';
  var url;
  if (site_map[lang_or_wikiid]) {
    url=site_map[lang_or_wikiid]+'/w/index.php?action=raw&'+identifier_type+'='+page_identifier;
  } else {
    url='http://'+lang_or_wikiid+'.wikipedia.org/w/index.php?action=raw&'+identifier_type+'='+page_identifier;
  }
  request({
    uri: url,
  }, function(error, response, body) {
    cb(body);
  });
};
