
# Top OpenSubtitles Sentences

This repository provides cleaned frequency lists of the most common sentences and words for all the 62 languages in the OpenSubtitles2018 corpus, plus customizable Python code which reproduces these lists.

## Lists of the most common sentences and words

Clicking on a link in the sentences (or words) column of the table below brings up a list of the 10,000 most common sentences (or 30,000 most common words) in the language on the corresponding row. All lists can also be found in the bld directory.

The numbers in the sentences (or words) column give the total number of sentences (or words), including duplicates, on which the linked lists are based. These numbers differ from the ones given for the underlying corpus at OpenSubtitles2018, primarily because they exclude the sentences and words which are removed when cleaning. Each list also contains a count column, which gives the number of times that particular sentence/word occurs in the underlying corpus.

| code | language | sentences | words |
| --- | --- | ---: | ---: |
| af | Afrikaans | 60,668 | 369,123 |
| ar | Arabic | 77,863,954 | 360,421,023 |
| bg | Bulgarian | 92,696,697 | 469,206,909 |
| bn | Bengali | 623,609 | 2,829,891 |
| br | Breton | 22,902 | 117,571 |
| bs | Bosnian | 33,769,306 | 164,287,955 |
| ca | Catalan | 604,672 | 3,645,619 |
| cs | Czech | 134,738,901 | 640,319,278 |
| da | Danish | 29,963,011 | 158,465,190 |
| de | German | 40,648,529 | 218,781,994 |
| el | Greek | 120,254,008 | 639,221,011 |
| en | English | 424,332,313 | 2,558,565,196 |
| eo | Esperanto | 91,573 | 445,995 |
| es | Spanish | 211,510,220 | 1,137,050,844 |
| et | Estonian | 27,378,143 | 125,927,295 |
| eu | Basque | 1,011,460 | 4,226,325 |
| fa | Persian | 12,361,713 | 64,326,270 |
| fi | Finnish | 51,726,104 | 202,514,234 |
| fr | French | 105,626,854 | 639,468,672 |
| gl | Galician | 304,788 | 1,844,553 |
| he | Hebrew | 84,803,287 | 400,628,507 |
| hi | Hindi | 125,360 | 765,808 |
| hr | Croatian | 112,249,406 | 538,052,737 |
| hu | Hungarian | 102,656,500 | 451,839,266 |
| hy | Armenian | 2,430 | 24,197 |
| id | Indonesian | 22,396,206 | 103,580,137 |
| is | Icelandic | 1,930,726 | 9,578,051 |
| it | Italian | 103,588,080 | 581,515,381 |
| ja | Japanese | 3,034,473 | 18,494,956 |
| ka | Georgian | 274,057 | 1,261,959 |
| kk | Kazakh | 4,051 | 14,257 |
| ko | Korean | 2,062,345 | 7,408,813 |
| lt | Lithuanian | 2,110,571 | 8,178,690 |
| lv | Latvian | 612,122 | 2,565,490 |
| mk | Macedonian | 7,707,280 | 38,379,332 |
| ml | Malayalam | 505,786 | 1,752,604 |
| ms | Malay | 3,769,707 | 17,475,629 |
| nl | Dutch | 103,995,910 | 595,792,461 |
| no | Norwegian | 12,866,036 | 66,979,416 |
| pl | Polish | 233,638,062 | 1,049,551,703 |
| pt | Portuguese | 117,679,690 | 623,827,834 |
| pt_br | Portuguese, Brazil | 250,231,504 | 1,313,425,238 |
| ro | Romanian | 191,620,920 | 1,051,216,598 |
| ru | Russian | 43,563,555 | 213,758,183 |
| si | Sinhala | 943,726 | 4,266,309 |
| sk | Slovak | 15,958,574 | 77,096,449 |
| sl | Slovenian | 59,309,241 | 267,371,023 |
| sq | Albanian | 3,549,383 | 18,697,836 |
| sr | Serbian | 165,175,285 | 807,672,359 |
| sv | Swedish | 35,955,299 | 188,647,795 |
| ta | Tamil | 34,263 | 141,693 |
| te | Telugu | 24,027 | 107,890 |
| th | Thai | 8,530,650 | 54,028,834 |
| tl | Tagalog | 18,487 | 103,149 |
| tr | Turkish | 172,028,191 | 694,495,389 |
| uk | Ukrainian | 1,199,790 | 5,654,100 |
| ur | Urdu | 38,672 | 266,345 |
| vi | Vietnamese | 5,069,885 | 30,297,828 |
| ze_en | English, ze | 6,282,966 | 42,270,214 |
| ze_zh | Chinese, ze | 7,093,112 | 59,730,730 |
| zh_cn | Chinese | 27,167,013 | 231,881,660 |
| zh_tw | Chinese, Taiwan | 9,799,690 | 79,287,019 |

Here "ze" stands for subtitle files containing dual Chinese and English subtitles.

## The underlying corpus

This repository is based on the OpenSubtitles2018 corpus, which is part of the OPUS collection. In particular, the untokenized corpus files linked in the rightmost column ("language IDs") of the first table here are used as the primary source files. These contain the raw corpus text as a collection of XML subtitle files downloaded from www.opensubtitles.org (see also their newer site www.opensubtitles.com). Optionally, one can instead use the pre-parsed and pre-tokenized source data files.

OpenSubtitles is an online community where anyone can upload and download subtitles. At the time of writing, 6,404,981 subtitles are available; 3,735,070 of these are included in the OpenSubtitles2018 corpus, which contains a total of 22.10 billion tokens (see the first columns in the second table of the preceding link for a per-language breakdown). See the following article for a detailed description of the corpus:

P. Lison and J. Tiedemann (2016) OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016).

## Python code

### Usage

First download this repository and install the dependencies listed in pyproject.toml (Python 3.8+ and a few Python packages). If you're using Poetry, this can be done automatically by running `poetry install --all-extras` from the command line at the root directory of the downloaded repository. This also installs extra tokenizers for Japanese, Thai, and Vietnamese. If you are only interested in sentences, or only plan to use a simpler regex tokenizer, a minimal install with `poetry install --without words` will also do.

Next, adjust the Settings section of `src/top_open_subtitles_sentences.py` to your liking, optionally reading the Info section, before running the whole file.

### Features

All 62 languages are supported (see the Info section of `src/top_open_subtitles_sentences.py` for the full list).

The Python code will by default first download the source data, parse it into a temporary file, and then construct the lists of most common sentences and (optionally) words from it. The lists also include the number of times each sentence or word occurs in the corpus.
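In rough outline, the final counting step works like the following self-contained sketch (illustrative function name and toy data only; the actual code also applies the cleaning steps listed below):

```python
from collections import Counter

def most_common_sentences(lines, n=3):
    """Count how often each entry occurs and return the n most common,
    as (entry, count) pairs. A sketch of the frequency-counting step."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common(n)

# Toy stand-in for the parsed corpus, one sentence per line.
parsed = ["Hello.", "Thank you.", "Hello.", "Thank you.", "Okay.", "Hello."]
print(most_common_sentences(parsed))
# [('Hello.', 3), ('Thank you.', 2), ('Okay.', 1)]
```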

The following cleaning steps are performed (comment out the respective lines in the code to skip them):

- Whitespace, dashes, and other undesirable characters are stripped from the beginning and end of each sentence.
- Entries consisting of only punctuation and numbers are removed.
- Entries starting with a parenthesis or ending with a colon are removed, as these typically don't contain speech.
- Entries containing Latin characters are removed for languages not using the Latin alphabet.
- Sentences ending the same way up to " .?!¿¡" are combined into the most common form.
- Words of differing cases are combined.
- Sentences in `src/extra_settings/extra_sentences_to_exclude.csv` are excluded.
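A few of these steps can be sketched in stand-alone form (hypothetical function name and simplified character sets; the repository's actual implementation differs in detail):

```python
import re

def clean_entry(sentence):
    """Apply simplified versions of some of the cleaning steps above.
    Returns the cleaned entry, or None when it should be removed."""
    # Strip whitespace, dashes, and quote characters from both ends.
    s = sentence.strip(" \t\"'-–—")
    # Remove entries consisting only of punctuation and numbers,
    # i.e. entries containing no letter.
    if not re.search(r"[^\W\d_]", s):
        return None
    # Remove entries starting with a parenthesis or ending with a colon,
    # as these typically don't contain speech.
    if s.startswith("(") or s.endswith(":"):
        return None
    return s

print(clean_entry("- Hello."))    # Hello.
print(clean_entry("(APPLAUSE)"))  # None
```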

spaCy is used to split sentences into words (tokenization).
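For the minimal install, the simpler regex tokenizer mentioned in the Usage section can be approximated as follows (an illustrative pattern, not necessarily the repository's exact one):

```python
import re

def regex_tokenize(sentence):
    """Split a sentence into words by matching runs of word characters
    (letters, digits, underscore). Note this splits clitics like "don't"
    at the apostrophe, which spaCy's tokenizers handle more carefully."""
    return re.findall(r"\w+", sentence)

print(regex_tokenize("I don't know."))  # ['I', 'don', 't', 'know']
```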

## Related work

There exist at least two similar but older repositories (based on older versions of the corpus and older Python versions) which parse the OpenSubtitles corpus but do not extract lists of the most frequent sentences and words: AlJohri/OpenSubtitles and domerin0/opensubtitles-parser.

The orgtre/google-books-ngram-frequency repository constructs similar lists of the most frequent words and sequences of up to five words (ngrams) in the much larger Google Books Ngram Corpus.

## Limitations and known problems

If a movie or episode has several different subtitle files for a given language in the raw corpus, then all of them are used when constructing the sentence/word lists. Since more popular movies and episodes tend to have more subtitle files, the resulting lists can hence be viewed as popularity-weighted. The option `one_subtitle_per_movie` changes this behavior.

Many of the subtitle files are not in the same language as the movie or episode they subtitle. They are hence translations, often from English, and don't represent typical sentence and word usage in the subtitle language. Using the option `original_language_only`, one can restrict to subtitles whose language matches the original language of the movie/episode. However, for most languages only a small fraction of files contain original-language metadata matching the subtitle language: for English 31% of files match, Spanish 3%, French 5%, Italian 1%, German 5%, Swedish 2%, Dutch 0%, Japanese 4%, Korean 3%, and Chinese 0%. The directory `bld/original_language_only` contains such restricted lists for a few selected languages. To test how representative the translations are, `src/tests.py` computes the frequency rank correlations between these lists and the main unrestricted ones. For the top 1000 English and Spanish sentences the correlation is 0.7, while for German it is only 0.5. The corresponding correlations for words are all at least 0.9.
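The rank-correlation check can be illustrated with a minimal pure-Python Spearman coefficient (valid only when there are no tied counts; `src/tests.py` may use library routines instead):

```python
def spearman(xs, ys):
    """Spearman rank correlation for equal-length sequences without ties."""
    def ranks(vals):
        # Rank 0 for the largest value, 1 for the next largest, and so on.
        order = sorted(range(len(vals)), key=lambda i: vals[i], reverse=True)
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Toy counts for the same five words in two different frequency lists.
main = [100, 80, 60, 40, 20]
restricted = [45, 90, 50, 85, 10]
print(round(spearman(main, restricted), 2))  # 0.3
```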

The sentence and word lists still contain entries, such as proper names, which for many purposes are probably better excluded. #TODO clean the lists more.

The code in this repository is licensed under the Creative Commons Attribution 3.0 Unported License. The sentence and word lists come with the same license as the underlying corpus. Issue reports and pull requests are most welcome!

