jldbc / gutenberg
A content-based recommender system for books using the Project Gutenberg text corpus
Some of these files may no longer be necessary. Need to look closer and verify, but I think fred.txt is an out-of-date example output, the featurizer is now baked into MoreEfficientPreprocessing, and get_sentiment.py [the first one] is inferior to the second.
Add titles to the k-means output. Match each book title to its book id so the CSV format is (title, id, cluster). This would make the results easier to interpret.
I want to execute the kNN code directly, but I cannot find the all-data.csv file in the repo. Please share the all-data file.
See if we can use cosine similarity instead of Euclidean distance for the clustering. Supposedly that can work better for text clustering.
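A minimal sketch of one way to get cosine-based clustering without changing the k-means code itself: L2-normalize each row first, since for unit vectors the squared Euclidean distance equals 2·(1 − cosine similarity), so ordinary k-means on normalized rows clusters by angle. The toy matrix below is an assumption standing in for our TF-IDF output; scikit-learn is assumed available.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Toy word-score matrix (rows = books, cols = word scores) standing in
# for the real tfidf output.
X = np.array([[3.0, 0.1, 0.0],
              [6.0, 0.2, 0.0],   # same direction as row 0, larger magnitude
              [0.0, 0.1, 4.0]])

# L2-normalize each row: for unit vectors, squared Euclidean distance is
# 2 * (1 - cosine similarity), so plain k-means now groups by angle,
# not magnitude.
X_unit = normalize(X, norm="l2")

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_unit)
# Rows 0 and 1 land in the same cluster despite different magnitudes.
```

With raw Euclidean distance, rows 0 and 1 could be pulled apart just because one book is longer; the normalization step removes that length effect.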
Data needs to be scaled for k-means. Word scores should be bounded to [0, 1].
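A sketch of the [0, 1] scaling described above, assuming per-column min-max scaling of the word-score matrix before it reaches k-means (the toy input values are made up):

```python
import numpy as np

def minmax_scale(X):
    """Min-max scale each column of X into [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mins) / spans

scaled = minmax_scale([[1.0, 10.0],
                       [3.0, 30.0],
                       [5.0, 20.0]])
# column 0 -> [0.0, 0.5, 1.0], column 1 -> [0.0, 1.0, 0.5]
```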
Create a sample set of known books to test the clustering. Pick clusters of books that you know are similar, then run the code on them to test its efficacy.
Fix normalizing:
XML parsing:
Finish Streaming:
Format Outputs:
When I tried to replicate the TF-IDF code in step 3, it runs fine, but I'm not able to find the correct data in the output folder created by the PySpark code.
All I can see in the output folder is this:
(u'file:/home/system5/Documents/gutenberg-master/output_POS.txt', 0)
Just a test
The tf-idf script is splitting book titles on commas, which causes it to break. The quick fix from earlier was to remove commas from book titles, but this won't work once we start taking titles from online. The backslash character also causes it to break.
e.g. "Lincoln's Gettysburg address, delivered in 1863" (one title) => "Lincoln's Gettysburg address" and "delivered in 1863" as two separate titles, where neither partial title exists.
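One way to avoid splitting titles by hand: write and read the title field through Python's `csv` module, which quotes fields containing the delimiter, so embedded commas (and literal backslashes) survive the round trip. A minimal sketch, with the book id column assumed:

```python
import csv
import io

title = "Lincoln's Gettysburg address, delivered in 1863"

# Write a (book_id, title) row: csv quotes the title because it
# contains a comma, instead of letting it split into two fields.
buf = io.StringIO()
csv.writer(buf).writerow([42, title])

# Read it back: the title comes out as a single field.
row = next(csv.reader(io.StringIO(buf.getvalue())))
# row == ['42', "Lincoln's Gettysburg address, delivered in 1863"]
```

The same approach works for backslashes, since `csv` treats field contents literally rather than interpreting escapes.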
TF-IDF is time-expensive, so let's find a way to make it more efficient. All frequencies are kept local at the end of the first pass, then the score is calculated, so it requires multiple passes over the book data. Find a sample set sufficiently large that we can assume its IDF, since word frequency across all 50,000 books might not actually change things greatly. The challenge is finding a set representative of the whole.
NOTE: potentially just use the full Michigan dataset for IDF, then run TF-IDF on the full Gutenberg corpus using the Michigan IDF.
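A sketch of that idea: compute document frequencies once on a representative sample (e.g. the Michigan dataset), freeze the resulting IDF table, then score every book in the full corpus in a single pass against it. The tokenized toy documents and the fallback IDF of 0 for unseen words are assumptions.

```python
import math
from collections import Counter

def build_idf(sample_docs):
    """Document frequency -> IDF table from a representative sample."""
    df = Counter()
    for doc in sample_docs:
        df.update(set(doc))          # count each word once per document
    n = len(sample_docs)
    return {w: math.log(n / count) for w, count in df.items()}

def tfidf(doc, idf, default_idf=0.0):
    """One-pass TF-IDF for a single document using a frozen IDF table."""
    tf = Counter(doc)
    return {w: (c / len(doc)) * idf.get(w, default_idf)
            for w, c in tf.items()}

sample = [["the", "whale"], ["the", "war"], ["the", "peace"]]
idf = build_idf(sample)              # computed once, reused for every book
scores = tfidf(["whale", "whale", "the"], idf)
# "the" appears in every sample doc, so its IDF (and score) is 0;
# "whale" is rarer, so it gets a positive score.
```

Each book then needs only its own term counts, so the full Gutenberg run becomes embarrassingly parallel with no second pass over the corpus.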
Having trouble with the '|' character in the output of the sentiment analyzer. Pandas reads it as a delimiter, since output.txt is bar-delimited. Could we go without including these characters? They seem unimportant in 99.999% of literature.
Looking through its output, it actually could be important
It's really just the stems we care about here I think. Might make a slight improvement in clustering accuracy.
When running the TFIDF clustering code on the preprocessed folder, does it give me the exact CSV file with all the columns and rows (cluster size values: small, medium, large, etc.) just like those in the repo's data folder?
Use a dictionary to swap ID numbers for book titles. Books in the directory are read in sequentially, so just modify id_files.py so the return value is a dictionary. Move this function into tfidf.py to centralize tasks.
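A hypothetical sketch of the dictionary-returning version of id_files.py: since books are read from the directory sequentially, pair each sequential id with a title derived from the filename. The filename-as-title convention and the throwaway demo directory are assumptions.

```python
import os
import tempfile

def id_to_title(book_dir):
    """Map sequential book ids to titles derived from filenames."""
    return {book_id: os.path.splitext(fname)[0]
            for book_id, fname in enumerate(sorted(os.listdir(book_dir)))}

# Tiny demo with a throwaway directory standing in for the book folder.
demo_dir = tempfile.mkdtemp()
for name in ("a_tale_of_two_cities.txt", "moby_dick.txt"):
    open(os.path.join(demo_dir, name), "w").close()

mapping = id_to_title(demo_dir)
# mapping == {0: 'a_tale_of_two_cities', 1: 'moby_dick'}
```

Sorting the listing makes the id assignment deterministic, which matters if the same mapping is rebuilt later inside tfidf.py.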
Get the author. Get a guess at the year of publication if possible (I believe the author's lifetime will be included in the metadata we pull from the website; the mean of this will be a good enough guess).
Insert something into tfidf that cuts everything below a certain threshold. This simplifies the k-means step and removes some of the noise from our data.
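A minimal sketch of that threshold cut: drop any word whose TF-IDF score falls below a cutoff before the scores reach k-means. The 0.01 cutoff and the example scores are assumptions, not values from the repo.

```python
THRESHOLD = 0.01  # assumed cutoff; tune against clustering quality

def prune_scores(word_scores, threshold=THRESHOLD):
    """Drop low-signal words from a {word: tfidf_score} dict."""
    return {w: s for w, s in word_scores.items() if s >= threshold}

pruned = prune_scores({"whale": 0.41, "the": 0.0003, "ship": 0.02})
# pruned == {'whale': 0.41, 'ship': 0.02}
```

This also shrinks the feature vectors handed to k-means, which helps with the scaling/time concerns noted above.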