Spanish Unannotated Corpora

This repository gathers a compilation of Spanish-language corpora. The data is available for download on Zenodo.

Data

Number of lines: 300904000 (300M)

Number of tokens: 2996016962 (3B)

Number of chars: 18431160978 (18.4B)
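Counts like the ones above can be reproduced in a single streaming pass over the text. The following is a minimal sketch, not the method used to produce the published figures: it assumes tokens are whitespace-separated and that trailing newlines are not counted as characters, and the function name `corpus_stats` is hypothetical.

```python
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (lines, tokens, chars) for an iterable of text lines.

    Assumptions (not confirmed by the repository): tokens are split on
    whitespace, and the trailing newline is excluded from the char count.
    """
    n_lines = n_tokens = n_chars = 0
    for line in lines:
        text = line.rstrip("\n")
        n_lines += 1
        n_tokens += len(text.split())
        n_chars += len(text)
    return n_lines, n_tokens, n_chars


sample = ["hola mundo\n", "esto es una prueba\n"]
print(corpus_stats(sample))  # prints (2, 6, 28)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a multi-gigabyte corpus does not need to be loaded into memory.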

Sources

Spanish Wikis: These include Wikipedia, Wikinews, Wikiquote and more. They were first processed with wikiextractor (https://github.com/josecannete/wikiextractorforBERT) using the wiki dumps of 20/04/2019.

ParaCrawl: Spanish portion of ParaCrawl (http://opus.nlpl.eu/ParaCrawl.php)

EUBookshop: Spanish portion of EUBookshop (http://opus.nlpl.eu/EUbookshop.php)

MultiUN: Spanish portion of MultiUN (http://opus.nlpl.eu/MultiUN.php)

OpenSubtitles: Spanish portion of OpenSubtitles2018 (http://opus.nlpl.eu/OpenSubtitles-v2018.php)

DGT: Spanish portion of DGT (http://opus.nlpl.eu/DGT.php)

DOGC: Spanish portion of DOGC (http://opus.nlpl.eu/DOGC.php)

ECB: Spanish portion of ECB (http://opus.nlpl.eu/ECB.php)

EMEA: Spanish portion of EMEA (http://opus.nlpl.eu/EMEA.php)

Europarl: Spanish portion of Europarl (http://opus.nlpl.eu/Europarl.php)

GlobalVoices: Spanish portion of GlobalVoices (http://opus.nlpl.eu/GlobalVoices.php)

JRC: Spanish portion of JRC (http://opus.nlpl.eu/JRC-Acquis.php)

News-Commentary11: Spanish portion of NCv11 (http://opus.nlpl.eu/News-Commentary-v11.php)

TED: Spanish portion of TED (http://opus.nlpl.eu/TED2013.php)

UN: Spanish portion of UN (http://opus.nlpl.eu/UN.php)

Post-processing

Two post-processing scripts are included (corpus_processing.py and split_punctuation.py). The available data was processed with the first one only.

Using process_corpus.py, the text was:

  • Lowercased
  • Stripped of URLs
  • Stripped of listings
  • Normalized so that multiple spaces become a single one
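The four steps listed above can be sketched roughly as follows. This is an illustrative approximation, not the repository's actual script: the regular expressions and the name `clean_line` are assumptions, and "listing" is interpreted here as leading list markers (bullets or numbering), which may differ from what the original script removes.

```python
import re

# Assumed patterns; the real script may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
LIST_MARKER_RE = re.compile(r"^\s*(?:[-*•]+|\d+[.)])\s+")
MULTISPACE_RE = re.compile(r"[ \t]+")


def clean_line(line: str) -> str:
    """Apply the four cleaning steps to a single line of text."""
    line = line.lower()                   # lowercase
    line = URL_RE.sub("", line)           # remove URLs
    line = LIST_MARKER_RE.sub("", line)   # remove a leading list marker
    line = MULTISPACE_RE.sub(" ", line)   # collapse runs of spaces/tabs
    return line.strip()


print(clean_line("  1. Visita   https://example.com  HOY "))  # prints visita hoy
```

Collapsing whitespace last means the gaps left behind by removed URLs and list markers are cleaned up in the same pass.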

Contributors

josecannete
