
# Top OpenSubtitles Sentences

This repository provides cleaned frequency lists of the most common sentences and words for all the 62 languages in the OpenSubtitles2018 corpus, plus customizable Python code which reproduces these lists.

## Lists of the most common sentences and words

Clicking on a link in the sentences (or words) column of the table below brings up a list of the 10,000 most common sentences (or 30,000 most common words) in the language on the corresponding row. All lists can also be found in the bld directory.

The numbers in the sentences (or words) column give the total number of sentences (or words), including duplicates, on which the linked lists are based. These numbers differ from the ones given for the underlying corpus at OpenSubtitles2018, primarily because they exclude the sentences and words which are removed when cleaning. Each list also contains a count column, which gives the number of times that particular sentence/word occurs in the underlying corpus.

| code | language | sentences | words |
| --- | --- | ---: | ---: |
| af | Afrikaans | 60,668 | 369,123 |
| ar | Arabic | 77,863,954 | 360,421,023 |
| bg | Bulgarian | 92,696,697 | 469,206,909 |
| bn | Bengali | 623,609 | 2,829,891 |
| br | Breton | 22,902 | 117,571 |
| bs | Bosnian | 33,769,306 | 164,287,955 |
| ca | Catalan | 604,672 | 3,645,619 |
| cs | Czech | 134,738,901 | 640,319,278 |
| da | Danish | 29,963,011 | 158,465,190 |
| de | German | 40,648,529 | 218,781,994 |
| el | Greek | 120,254,008 | 639,221,011 |
| en | English | 424,332,313 | 2,558,565,196 |
| eo | Esperanto | 91,573 | 445,995 |
| es | Spanish | 211,510,220 | 1,137,050,844 |
| et | Estonian | 27,378,143 | 125,927,295 |
| eu | Basque | 1,011,460 | 4,226,325 |
| fa | Persian | 12,361,713 | 64,326,270 |
| fi | Finnish | 51,726,104 | 202,514,234 |
| fr | French | 105,626,854 | 639,468,672 |
| gl | Galician | 304,788 | 1,844,553 |
| he | Hebrew | 84,803,287 | 400,628,507 |
| hi | Hindi | 125,360 | 765,808 |
| hr | Croatian | 112,249,406 | 538,052,737 |
| hu | Hungarian | 102,656,500 | 451,839,266 |
| hy | Armenian | 2,430 | 24,197 |
| id | Indonesian | 22,396,206 | 103,580,137 |
| is | Icelandic | 1,930,726 | 9,578,051 |
| it | Italian | 103,588,080 | 581,515,381 |
| ja | Japanese | 3,034,473 | 18,494,956 |
| ka | Georgian | 274,057 | 1,261,959 |
| kk | Kazakh | 4,051 | 14,257 |
| ko | Korean | 2,062,345 | 7,408,813 |
| lt | Lithuanian | 2,110,571 | 8,178,690 |
| lv | Latvian | 612,122 | 2,565,490 |
| mk | Macedonian | 7,707,280 | 38,379,332 |
| ml | Malayalam | 505,786 | 1,752,604 |
| ms | Malay | 3,769,707 | 17,475,629 |
| nl | Dutch | 103,995,910 | 595,792,461 |
| no | Norwegian | 12,866,036 | 66,979,416 |
| pl | Polish | 233,638,062 | 1,049,551,703 |
| pt | Portuguese | 117,679,690 | 623,827,834 |
| pt_br | Portuguese, Brazil | 250,231,504 | 1,313,425,238 |
| ro | Romanian | 191,620,920 | 1,051,216,598 |
| ru | Russian | 43,563,555 | 213,758,183 |
| si | Sinhala | 943,726 | 4,266,309 |
| sk | Slovak | 15,958,574 | 77,096,449 |
| sl | Slovenian | 59,309,241 | 267,371,023 |
| sq | Albanian | 3,549,383 | 18,697,836 |
| sr | Serbian | 165,175,285 | 807,672,359 |
| sv | Swedish | 35,955,299 | 188,647,795 |
| ta | Tamil | 34,263 | 141,693 |
| te | Telugu | 24,027 | 107,890 |
| th | Thai | 8,530,650 | 54,028,834 |
| tl | Tagalog | 18,487 | 103,149 |
| tr | Turkish | 172,028,191 | 694,495,389 |
| uk | Ukrainian | 1,199,790 | 5,654,100 |
| ur | Urdu | 38,672 | 266,345 |
| vi | Vietnamese | 5,069,885 | 30,297,828 |
| ze_en | English, ze | 6,282,966 | 42,270,214 |
| ze_zh | Chinese, ze | 7,093,112 | 59,730,730 |
| zh_cn | Chinese | 27,167,013 | 231,881,660 |
| zh_tw | Chinese, Taiwan | 9,799,690 | 79,287,019 |

Here "ze" stands for subtitle files containing dual Chinese and English subtitles.

## The underlying corpus

This repository is based on the OpenSubtitles2018 corpus, which is part of the OPUS collection. In particular, the untokenized corpus files linked in the rightmost column ("language IDs") of the first table here are used as the primary source files. These contain the raw corpus text as a collection of XML subtitle files downloaded from www.opensubtitles.org (see also their newer site www.opensubtitles.com). Optionally, one can instead use the pre-parsed and pre-tokenized source data files.

OpenSubtitles is an online community where anyone can upload and download subtitles. At the time of writing, 6,404,981 subtitles are available; 3,735,070 of these are included in the OpenSubtitles2018 corpus, which contains a total of 22.10 billion tokens (see the first columns in the second table of the preceding link for a per-language breakdown). See the following article for a detailed description of the corpus:

P. Lison and J. Tiedemann (2016) OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

## Python code

### Usage

First download this repository and install the dependencies listed in pyproject.toml (Python 3.8+ and a few Python packages). If you're using Poetry, this can be done automatically by running `poetry install --all-extras` from the command line at the root directory of the downloaded repository. This also installs extra tokenizers for Japanese, Thai, and Vietnamese. If you are only interested in sentences, or only plan to use a simpler regex tokenizer, a minimal install with `poetry install --without words` will also do.

Next, adjust the Settings section of `src/top_open_subtitles_sentences.py` to your liking, optionally reading the Info section, before running the whole file.

### Features

All 62 languages are supported (see the Info section of `src/top_open_subtitles_sentences.py` for the full list).

The Python code will by default first download the source data, parse it into a temporary file, and then construct the lists of most common sentences and (optionally) words from it. The lists also include the number of times each sentence or word occurs in the corpus.
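In rough outline, the final counting step works like the following self-contained sketch (illustrative function name and toy data only; the actual code also applies the cleaning steps listed below):

```python
from collections import Counter

def most_common_sentences(lines, n=3):
    """Count how often each entry occurs and return the n most common,
    as (entry, count) pairs. A sketch of the frequency-counting step."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)

# Toy stand-in for the parsed corpus, one sentence per line.
parsed = ["Hello.", "Thank you.", "Hello.", "Thank you.", "Okay.", "Hello."]
print(most_common_sentences(parsed))
# [('Hello.', 3), ('Thank you.', 2), ('Okay.', 1)]
```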

The following cleaning steps are performed (comment out the respective lines in the code to skip them):

- Whitespace, dashes, and other undesirable characters are stripped from the beginning and end of each sentence.
- Entries consisting of only punctuation and numbers are removed.
- Entries starting with a parenthesis or ending with a colon are removed, as these typically don't contain speech.
- Entries containing Latin characters are removed for languages not using the Latin alphabet.
- Sentences ending the same way up to " .?!¿¡" are combined into the most common form.
- Words of differing cases are combined.
- Sentences in `src/extra_settings/extra_sentences_to_exclude.csv` are excluded.
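A few of these steps can be sketched in stand-alone form (hypothetical function name and simplified character sets; the repository's actual implementation differs in detail):

```python
import re

def clean_entry(sentence):
    """Apply simplified versions of some of the cleaning steps above.
    Returns the cleaned entry, or None when it should be removed."""
    # Strip whitespace, dashes, and quote characters from both ends.
    s = sentence.strip(" \t\"'-–—")
    # Remove entries consisting only of punctuation and numbers,
    # i.e. entries containing no letter.
    if not re.search(r"[^\W\d_]", s):
        return None
    # Remove entries starting with a parenthesis or ending with a colon,
    # as these typically don't contain speech.
    if s.startswith("(") or s.endswith(":"):
        return None
    return s

print(clean_entry("- Hello."))    # Hello.
print(clean_entry("(APPLAUSE)"))  # None
```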

spaCy is used to split sentences into words (tokenization).
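For the minimal install, the simpler regex tokenizer mentioned in the Usage section can be approximated as follows (an illustrative pattern, not necessarily the repository's exact one):

```python
import re

def regex_tokenize(sentence):
    """Split a sentence into words by matching runs of word characters
    (letters, digits, underscore). Note this splits clitics like "don't"
    at the apostrophe, which spaCy's tokenizers handle more carefully."""
    return re.findall(r"\w+", sentence)

print(regex_tokenize("I don't know."))  # ['I', 'don', 't', 'know']
```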

## Related work

There exist at least two similar but older repositories (based on older versions of the corpus and older Python versions) which parse the OpenSubtitles corpus but do not extract lists of the most frequent sentences and words: AlJohri/OpenSubtitles and domerin0/opensubtitles-parser.

The orgtre/google-books-ngram-frequency repository constructs similar lists of the most frequent words and sequences of up to five words (ngrams) in the much larger Google Books Ngram Corpus.

## Limitations and known problems

If a movie or episode has several different subtitle files for a given language in the raw corpus, then all of them are used when constructing the sentence/word lists. Since more popular movies and episodes tend to have more subtitle files, the resulting lists can hence be viewed as popularity-weighted. The option `one_subtitle_per_movie` changes this behavior.

Many of the subtitle files are not in the same language as the movie or episode they subtitle. They are hence translations, often from English, and don't represent typical sentence and word usage in the subtitle language. Using the option `original_language_only`, one can restrict to subtitles whose language matches the original language of the movie/episode. However, for most languages only a small fraction of files contain original-language metadata matching the subtitle language: for English 31% of files match, Spanish 3%, French 5%, Italian 1%, German 5%, Swedish 2%, Dutch 0%, Japanese 4%, Korean 3%, and Chinese 0%. The directory `bld/original_language_only` contains such restricted lists for a few selected languages. To test how representative the translations are, `src/tests.py` computes the frequency rank correlations between these lists and the main unrestricted ones. For the top 1000 English and Spanish sentences the correlation is 0.7, while for German it is only 0.5. The corresponding correlations for words are all at least 0.9.
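The rank-correlation check can be illustrated with a minimal pure-Python Spearman coefficient (valid only when there are no tied counts; `src/tests.py` may use library routines instead):

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    def ranks(vals):
        # Rank 0 for the largest value, 1 for the next largest, and so on.
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy counts for the same five words in two different frequency lists.
main = [100, 80, 60, 40, 20]
restricted = [45, 90, 50, 85, 10]
print(round(spearman(main, restricted), 2))  # 0.3
```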

The sentence and word lists still contain entries, such as proper names, which for many purposes are probably better excluded. #TODO clean the lists more.

The code in this repository is licensed under the Creative Commons Attribution 3.0 Unported License. The sentence and word lists come with the same license as the underlying corpus. Issue reports and pull requests are most welcome!

