lquatrin / inf2978
Fundamentos de Data Science
Title: Bag of Words Data Set

Abstract: This data set contains five text collections in the form of bags-of-words.

-----------------------------------------------------

Data Set Characteristics: Text
Number of Instances: 8000000
Area: N/A
Attribute Characteristics: Integer
Number of Attributes: 100000
Date Donated: 2008-03-12
Associated Tasks: Clustering
Missing Values? N/A

-----------------------------------------------------

Source:

David Newman
newman '@' uci.edu
University of California, Irvine

-----------------------------------------------------

Data Set Information:

For each text collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (below, NNZ is the number of nonzero counts in the bag-of-words). After tokenization and removal of stopwords, the vocabulary of unique words was truncated by keeping only words that occurred more than ten times. Individual document names (i.e. an identifier for each docID) are not provided for copyright reasons.

These data sets have no class labels and, for copyright reasons, no filenames or other document-level metadata. They are well suited to clustering and topic modeling experiments.

For each text collection we provide docword.*.txt (the bag-of-words file in sparse format) and vocab.*.txt (the vocabulary file).

Enron Emails:
orig source: www.cs.cmu.edu/~enron
D=39861
W=28102
N=6,400,000 (approx)

NIPS full papers:
orig source: books.nips.cc
D=1500
W=12419
N=1,900,000 (approx)

KOS blog entries:
orig source: dailykos.com
D=3430
W=6906
N=467714

NYTimes news articles:
orig source: ldc.upenn.edu
D=300000
W=102660
N=100,000,000 (approx)

PubMed abstracts:
orig source: www.pubmed.gov
D=8200000
W=141043
N=730,000,000 (approx)

-----------------------------------------------------

Attribute Information:

The format of the docword.*.txt file is 3 header lines, followed by NNZ triples:
---
D
W
NNZ
docID wordID count
docID wordID count
docID wordID count
docID wordID count
...
docID wordID count
docID wordID count
docID wordID count
---

The format of the vocab.*.txt file is: line n contains the word with wordID=n.
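The docword.*.txt and vocab.*.txt formats described above can be read with a short script. The following is a minimal Python sketch, not part of the original data set distribution. The function names read_docword and read_vocab are hypothetical, the sample data is made up for illustration, and the sketch assumes that docID and wordID values start at 1.

```python
import io
from collections import defaultdict


def read_docword(stream):
    """Parse a docword.*.txt stream: 3 header lines (D, W, NNZ), then NNZ triples."""
    n_docs = int(stream.readline())
    n_words = int(stream.readline())
    nnz = int(stream.readline())
    counts = defaultdict(dict)  # docID -> {wordID: count}
    n_read = 0
    for line in stream:
        if not line.strip():
            continue  # tolerate blank lines, e.g. at the end of the file
        doc_id, word_id, count = map(int, line.split())
        counts[doc_id][word_id] = count
        n_read += 1
    if n_read != nnz:
        raise ValueError(f"expected {nnz} triples, read {n_read}")
    return n_docs, n_words, dict(counts)


def read_vocab(stream):
    """Map wordID -> word; line n holds wordID=n (assumption: IDs start at 1)."""
    return {n: line.strip() for n, line in enumerate(stream, start=1)}


# Tiny made-up sample, not real data from any of the five collections.
sample_docword = io.StringIO("2\n3\n4\n1 1 2\n1 3 1\n2 2 5\n2 3 1\n")
sample_vocab = io.StringIO("alpha\nbeta\ngamma\n")

n_docs, n_words, counts = read_docword(sample_docword)
vocab = read_vocab(sample_vocab)

# Translate the word IDs of document 1 back into words.
doc1_words = {vocab[w]: c for w, c in counts[1].items()}
print(n_docs, n_words, doc1_words)
```

To read a real collection, pass an open file handle instead of the in-memory sample, for example open("docword.kos.txt"). For the larger collections (NYTimes, PubMed) a dictionary of dictionaries may use too much memory, and a sparse matrix type is likely a better fit.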