nlp_datasets's Introduction

A non-exhaustive list of datasets for NLP tasks

A list of datasets I collected for NLP tasks. Fork or clone this repository to get all of the available datasets.

git clone https://github.com/nluninja/nlp_datasets

Available datasets

| Name | Description | Classes | Format | Language |
| --- | --- | --- | --- | --- |
| 20 Newsgroups dataset | file set arranged into 20 topic folders | see corpus page | files | en |
| The Anatomical Entity Mention (AnEM) corpus | PubMed dataset | Anatomical_system, Cell, Cellular_component, Developing_anatomical_structure, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism_subdivision, Organism_substance, Pathological_formation, Tissue | conll/iob2 | |
| AG News Topic dataset | News Topic Classification dataset - Antonio Gulli - UniPi | World, Sports, Business, Sci/Tech | csv | en |
| CoNLL 2003 | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
| emotions classification dataset | emotion classification dataset which contains tweets labeled into 6 categories | joy, sadness, anger, fear, love, surprise | csv | en |
| Georgetown University Multilayer corpus in CoNLL | CoNLL tagged corpus for entity extraction | 23 classes (person, substance, quantity, time, place, organization) | conll/iob2 | en |
| Relationship and Entity Extraction Evaluation Dataset in CoNLL | CoNLL tagged corpus for entity extraction | 21 classes (person, temporal, weapon, MilitaryPlatform, quantity, organization) | conll/iob2 | en |
| sentiment140 dataset | dataset which contains tweets labeled according to their polarity | negative, neutral, positive | csv | en |
| Toxic Comments dataset | Reviews Wikipedia comments labeled into 6 categories with score | toxic, severe_toxic, obscene, threat, insult, identity_hate | csv | en |
| WikiGold Dataset | named entity recognition dataset | People, Location, Organization, Misc | conll/iob2 | en |
| Wikipedia Movie Plots dataset | descriptions of movies from around the world scraped from Wikipedia | Genre Classes | csv | en |
| WNUT 17 Emerging Entities Dataset | Twitter/StackOverflow data for discovering emerging entities | Entity Classes | conll/iob2 | en |
| Yelp! Reviews | reviews dataset from Yelp! for classification/sentiment analysis tasks | ratings from 1 to 5 | csv | en |
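
Several of the corpora above are listed with the conll/iob2 format. As a rough illustration of how such a file can be read, here is a minimal Python sketch. It assumes one token per line, whitespace-separated columns with the word first and the IOB2 tag last, and blank lines (or `-DOCSTART-` lines) between sentences. The exact column layout of each file in this repository is not stated here, so check it before relying on this. The helper names `read_conll` and `iob2_to_spans` and the `SAMPLE` text are made up for this example.

```python
from typing import Iterable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, IOB2 tag)

# Made-up sample in a CoNLL-style layout (assumption: word first, tag last).
SAMPLE = """\
-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""


def read_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group CoNLL-style lines into sentences of (word, tag) pairs."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for raw in lines:
        line = raw.strip()
        # Blank lines and document markers close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def iob2_to_spans(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collapse IOB2 tags into (entity text, entity type) pairs."""
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    label: Optional[str] = None
    for word, tag in sentence:
        prefix, _, kind = tag.partition("-")
        # An I- tag only continues an entity of the same type.
        continues = prefix == "I" and kind == label
        if not continues and words:
            spans.append((" ".join(words), label or ""))
            words, label = [], None
        if prefix in ("B", "I"):
            words.append(word)
            label = kind
    if words:
        spans.append((" ".join(words), label or ""))
    return spans


if __name__ == "__main__":
    for sentence in read_conll(SAMPLE.splitlines()):
        print(iob2_to_spans(sentence))
```

To read a real file, pass an open file handle instead of `SAMPLE.splitlines()`, for example `read_conll(open(path, encoding="utf-8"))`, where `path` points to one of the conll/iob2 files.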

I appreciate contributions to this repo, so don't hesitate to submit a pull request, whether to fix a bug or to add a new dataset!

pull request https://github.com/nluninja/nlp_datasets

Use the corpus_template for uploading the new dataset. I look forward to seeing your contribution! 🙏 😘
