Giter Club home page Giter Club logo

parse_wiktionary_en's Introduction

Parse Wiktionary for en.wiktionary.org

Parse dictionary pages from the English language edition of Wiktionary into structured data.

Parse Wiki Text

This following information applies to all language editions of Parse Wiktionary. For information specific to each language edition, see its documentation.

Introduction

Wiktionary is a dictionary with millions of entries containing a wide variety of data about words and phrases in many languages. The dictionary data is distributed under a free license, allowing it to be reused in other applications. Unfortunately it's written in a format that prevents using it in other applications. The format is designed only to be transformed into the exact HTML format displayed in Wiktionary itself, not to be parsed into semantically meaningful data that can be used for other purposes or displayed in other formats.

The format does however contain enough structurally meaningful data to allow most of it with great difficulty to be parsed into structured data. Parse Wiktionary does the challenging task of parsing entries from Wiktionary into a structured format that can easily be used to query details about entries and use them for different purposes and present them in different formats. Because each language edition of Wiktionary unfortunately has a completely different format, there is a different edition of Parse Wiktionary for each edition of Wiktionary and they all have a different output format. Currently Parse Wiktionary exists for the English (en.wiktionary.org), German (de.wiktionary.org) and Czech (cs.wiktionary.org) editions of Wiktionary.

Limitations

Different parts of the information in Wiktionary are written in different formats that vary in regularity and complexity.

  • Some parts have a highly regular format. These parts are parsed into semantic data directly representing the facts stated in the entry, not the format in which they are stated.
  • Some parts have a format of manageable complexity but are not entirely consistent. Such parts are not parsed into semantic data, but parsed into a free form document. These documents are a sequence of elements where any of the supported elements can occur and can come in any order. There is also a kind of element that represents wiki text that could not be parsed. The document may be displayed by displaying all elements in order, or may be further parsed into semantic data in those cases the document complies with a regular enough format.
  • Some parts don't have a consistent format or have a too complex and dynamic format that changes over time as templates are added or edited. These parts are not parsed at all. Parse Wiktionary may however put flags in the output indicating the existence of such parts.

In all editions of Wiktionary, the headings follow a regular format. Headings are therefore parsed semantically. The content of sections however may or may not be in a regular format depending on the section. Many sections are therefore parsed as a free form document. That free form document is however stored in a field with a semantic meaning. This means that even though the content of the documents is not semantic, they are organized in a semantic way allowing applications to choose what sections to take and what to do with each section.

The long term goal is to eliminate all these limitations and parse all information in Wiktionary as structured semantic data. This will however require cooperation from Wiktionary editors. A standard format could be created for each section, and authors encouraged to follow the standard format. Parse Wiktionary could be integrated in Wiktionary and validate entries as they are being edited, showing warnings about anything that doesn't conform to the standard format. More data can also be transferred to Wikidata which is already designed from the beginning to store semantic data.

parse_wiktionary_en's People

Contributors

brkalmar avatar

Stargazers

 avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.