wikipedia-word-frequency's Introduction

Wikipedia word frequency generator

This script processes Wikipedia article dumps from https://dumps.wikimedia.org/enwiki/ and gathers word frequency distribution data. It uses wikiextractor to extract the raw text, then strips punctuation marks and normalizes Unicode dashes and apostrophes. It then discards words that contain a digit and keeps only words that were used in at least 3 different articles.
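The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in gather_wordfreq.py: the names `tokenize` and `gather_word_frequency`, the exact sets of dash and apostrophe characters, the token regex, and the lowercasing step are all assumptions made for the example. The input is assumed to be an iterable of plain-text articles (for instance, text already extracted by wikiextractor).

```python
import re
from collections import Counter

# Assumed character sets; the real script may normalize a different set.
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015\u2212"
APOSTROPHES = "\u2018\u2019\u02bc\u0060\u00b4"
NORMALIZE = str.maketrans(
    {**{c: "-" for c in DASHES}, **{c: "'" for c in APOSTROPHES}}
)

# A token is a run of letters/digits, optionally joined by single
# hyphens or apostrophes; all other punctuation is dropped.
TOKEN_RE = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def tokenize(text):
    """Yield normalized words, skipping any word that contains a digit."""
    # Lowercasing is an assumption of this sketch.
    text = text.translate(NORMALIZE).lower()
    for token in TOKEN_RE.findall(text):
        if not any(ch.isdigit() for ch in token):
            yield token


def gather_word_frequency(articles, min_articles=3):
    """Count word uses, keeping words seen in at least min_articles articles."""
    word_uses = Counter()  # total occurrences of each word
    word_docs = Counter()  # number of distinct articles containing each word
    for text in articles:
        words = list(tokenize(text))
        word_uses.update(words)
        word_docs.update(set(words))
    return {w: n for w, n in word_uses.items() if word_docs[w] >= min_articles}
```

In this sketch, `word_docs` holds the per-article count used for the "at least 3 different articles" filter, while the returned dictionary holds total occurrences.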

The script was inspired by this article, whose data was unfortunately very inaccurate: it contained punctuation marks and other errors.

Usage

The script needs Python 3. On macOS, there is a known bug with Python 3.8, so you will need to use Python 3.7 or lower.

Install requirements:

pip install -r requirements.txt

Download the current Wikipedia dumps for the desired language:

WIKI=enwiki
wget -np -r --accept-regex \
  "https:\/\/dumps\.wikimedia\.org\/${WIKI}\/latest\/${WIKI}-latest-pages-articles[0-9]*\.xml\.bz2" \
  https://dumps.wikimedia.org/${WIKI}/latest/

Note that for enwiki (as of April 2023) this will require about 19 GB of free space.

Parse dumps and save results:

python ./gather_wordfreq.py dumps.wikimedia.org/${WIKI}/latest/*.bz2 > wordfreq.txt

Pre-generated word frequency data

The word frequency data for English, Spanish, French, German, Italian, Portuguese, Dutch, Arabic, Polish, Egyptian, Japanese, Russian, Cebuano, Swedish, Ukrainian, Vietnamese, Chinese, Waray, Afrikaans & Swahili are provided at results.

English results:

  • Total unique words appearing in at least 3 articles: 2747823
  • Top 20 most popular words: the, of, in, and, a, to, was, is, on, for, as, with, by, he, that, at, from, his, it, an.

wikipedia-word-frequency's People

Contributors

arefinnomi, borisdayma, dainternetdude, ilyasemenov, sts10, tbm

wikipedia-word-frequency's Issues

Recommendation to edit for calculating context diversity?

Hi there, I'm interested in modifying the script to calculate the number of different documents in which words appear (e.g., how many Wikipedia articles does the word "DOG" appear in). Have you considered this, or do you have a recommended approach to modifying the script for this purpose? Wanted to check in before attempting the changes myself. Appreciate your consideration.

Docs could make it clearer what file to get

The docs could make it clearer which XML file is needed.

For Cebuano, I see two options:

  • cebwiki-latest-pages-articles-multistream.xml.bz2
  • cebwiki-latest-pages-articles.xml.bz2

The instructions say:

wget -np -r --accept-regex 'https:\/\/dumps\.wikimedia\.org\/enwiki\/latest\/enwiki-latest-pages-articles[0-9]+\..*'

which suggests that the multistream one is wrong and I need the normal one. Your regex won't match that since it has no digit.

I guess enwiki-latest-pages-articles[0-9]*\.xml.bz2 might be better, with a note to replace en with whatever language you want.
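The difference between the two patterns can be checked with a quick sketch. It uses Python's `re` module on the file name alone, so it only illustrates the digit quantifier and does not reproduce how wget matches `--accept-regex` against full URLs. The helper name `patterns` is hypothetical, and the wiki prefix is parameterized here for illustration.

```python
import re


def patterns(wiki):
    """Build the old ([0-9]+) and suggested ([0-9]*) file-name patterns."""
    prefix = re.escape(f"{wiki}-latest-pages-articles")
    old = re.compile(prefix + r"[0-9]+\..*")
    new = re.compile(prefix + r"[0-9]*\.xml.bz2")
    return old, new


old, new = patterns("cebwiki")
names = [
    "cebwiki-latest-pages-articles.xml.bz2",
    "cebwiki-latest-pages-articles-multistream.xml.bz2",
]

# Map each file name to (matches old pattern, matches suggested pattern).
matches = {
    name: (bool(old.fullmatch(name)), bool(new.fullmatch(name)))
    for name in names
}
for name, (old_ok, new_ok) in matches.items():
    print(f"{name}: old={old_ok} new={new_ok}")
```

With these inputs, the old pattern matches neither file because it requires at least one digit, while the suggested pattern matches the plain dump and still skips the multistream one.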

the word "a" seems to not show up in the list of words created in 2015.

Does the script discard single letter words? "a" is definitely a word. Luckily this actually has no bearing on what i need the word frequencies for, but you may want to consider that if you plan to use your script.

On a separate note, this little tool saved me a ton of trouble, so thanks for sharing it on github. I don't even need the tool itself seeing as you provided a sample.

Cebuano is missing

The README says that a pre-generated list for Cebuano is there but it's actually missing from what I can tell.

Do you think you could add it?

Lots of repeated words.

There are lots of duplicates, as well as some meaningless words like "aaaah", "aah", etc. It would be better if you (or someone) created a separate list containing only the words that are present in Wiktionary.
Anyways thanks for this work.
