
taiga_site's Introduction

Taiga corpus

Welcome to the taiga site repository!

Here, as well as on our website, you can explore our documentation, leave feedback, open issues, and create pull requests.

About the project

Taiga corpus is an ambitious project that aims to become the largest fully available web corpus built from open text sources. Taiga corpus is:

  • open source
  • big: about 6 billion words so far
  • organized into datasets suited to different machine learning tasks
  • made by linguists experienced in text crawling, parsing, and filtering
  • rich with metainformation
  • POS-tagged and syntactically annotated in the Universal Dependencies framework
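
As an illustration of working with Universal Dependencies annotation, here is a minimal sketch of a reader for CoNLL-U, the standard plain-text serialization of UD. That Taiga files use this exact layout is an assumption here, not something this README states; the function names and the sample sentence are hypothetical and are not taken from the corpus.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# The ten tab-separated columns of a CoNLL-U token line, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip malformed lines instead of failing
        sentence.append(dict(zip(FIELDS, cols)))
    if sentence:
        yield sentence


def upos_counts(sentences: Iterable[List[Dict[str, str]]]) -> Counter:
    """Count universal POS tags over ordinary word tokens only."""
    counts: Counter = Counter()
    for sent in sentences:
        for tok in sent:
            # Multiword ranges ("1-2") and empty nodes ("1.1") are skipped.
            if tok["id"].isdigit():
                counts[tok["upos"]] += 1
    return counts


# Hypothetical sample sentence, made up for illustration only.
SAMPLE = "\n".join([
    "# text = Кошка спит.",
    "\t".join(["1", "Кошка", "кошка", "NOUN", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "спит", "спать", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
    "",
])

if __name__ == "__main__":
    sentences = list(read_conllu(SAMPLE))
    print(len(sentences), "sentence(s)")
    print(upos_counts(sentences))
```

For real use, a dedicated CoNLL-U library is likely a better choice than this hand-rolled reader, which ignores sentence-level metadata and does not validate the annotation.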

Our motivation

A carefully constructed web corpus has far more potential applications than is traditionally assumed. The “web as corpus” paradigm has recently found a natural continuation in the “web as train set” formulation. Open-source websites offer ample opportunities for NLP developers and computational linguists, who nevertheless have to gather all the data themselves, repeating the same cleaning and de-duplication steps, because traditional web corpora provide only a search interface and give no access to the full data. The "Taiga" corpus project brings together the needs of developers, machine learning practitioners, and computational linguists: it is a web corpus for large-scale linguistic data analysis and for building practical NLP and NLU systems. Its main aim is to influence the culture of corpus research on the Russian language and to reflect the paradigm shift in linguistic methodology.

Project creators

Under the inspiring supervision of Olga Lyashevskaya

References:

  1. Shavrina T., Shapovalova O. (2017) To the methodology of corpus construction for machine learning: «Taiga» syntax tree corpus and parser. In Proc. of the "CORPORA2017" international conference, Saint Petersburg.
  2. Shavrina T. (2018) Differential approach to webcorpus construction. In Dialogue, Russian International Conference on Computational Linguistics, RSUH, Moscow.

taiga_site's People

Contributors

tatianashavrina
