Giter Club home page Giter Club logo

inf2978's Introduction

Title:  Bag of Words Data Set

Abstract: This data set contains five text collections in the form of bags-of-words.

-----------------------------------------------------	

Data Set Characteristics: Text
Number of Instances: 8000000
Area: N/A
Attribute Characteristics: Integer
Number of Attributes: 100000
Date Donated: 2008-03-12
Associated Tasks: Clustering
Missing Values? N/A

-----------------------------------------------------		

Source:

David Newman
newman '@' uci.edu
University of California, Irvine

-----------------------------------------------------	

Data Set Information:

For each text collection, D is the number of documents, W is the
number of words in the vocabulary, and N is the total number of words
in the collection (below, NNZ is the number of nonzero counts in the
bag-of-words). After tokenization and removal of stopwords, the
vocabulary of unique words was truncated by only keeping words that
occurred more than ten times. Individual document names (i.e. a
identifier for each docID) are not provided for copyright reasons.

These data sets have no class labels, and for copyright reasons no
filenames or other document-level metadata.  These data sets are ideal
for clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag of words
file in sparse format) and vocab.*.txt (the vocab file).

Enron Emails:
orig source: www.cs.cmu.edu/~enron
D=39861
W=28102
N=6,400,000 (approx)

NIPS full papers:
orig source: books.nips.cc
D=1500
W=12419
N=1,900,000 (approx)

KOS blog entries:
orig source: dailykos.com
D=3430
W=6906
N=467714

NYTimes news articles:
orig source: ldc.upenn.edu
D=300000
W=102660
N=100,000,000 (approx)

PubMed abstracts:
orig source: www.pubmed.gov
D=8200000
W=141043
N=730,000,000 (approx)


-----------------------------------------------------	

Attribute Information:

The format of the docword.*.txt file is 3 header lines, followed by
NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---

The format of the vocab.*.txt file is line contains wordID=n.

inf2978's People

Contributors

lquatrin avatar lschirmer avatar

Watchers

 avatar James Cloos avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.