author-similarity's Issues
Periodic background classifier retraining
When new texts are added for training, retraining the classifier is a potential option. We need to balance the costs of training with the cost of having a suboptimal classifier. We already have a status field on the Classifier model, and we can use the untrained status to represent a classifier that should be retrained at some point in the future.
- When a new Text is added, set classifier.status to "untrained"
- Create a periodic celery task to run every half hour
- If the classifier is untrained, retrain it!
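The flag-then-retrain flow above can be sketched in plain Python. The `"untrained"` status value comes from the issue; the dict stand-in for the Classifier model and the `train` callable are assumptions for illustration:

```python
# Sketch of the periodic retraining check. In the real app this function
# body would live inside a Celery periodic task scheduled every 30 minutes
# via beat; here the classifier is a plain dict and `train` is injected.
def maybe_retrain(classifier, train):
    """Retrain only when a new Text has flagged the classifier as stale."""
    if classifier["status"] == "untrained":
        train(classifier)
        classifier["status"] = "trained"
        return True
    return False
```

Because the task is a cheap no-op when the status is already `"trained"`, running it every half hour costs almost nothing between retrains.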
Test accuracy for multiple chunk sizes
It looks like our accuracy is low for small input sizes. This might be because the training is biased towards bigger chunk sizes.
Write something to retrain and test accuracy for multiple chunk sizes.
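A minimal shape for that sweep, assuming a hypothetical `evaluate(chunk_size)` callable that rechunks the corpus, retrains, and returns an accuracy:

```python
def accuracy_by_chunk_size(evaluate, chunk_sizes):
    """Run one retrain/test cycle per chunk size.

    Returns (results, best) where results maps size -> accuracy and
    best is the size with the highest accuracy.
    """
    results = {size: evaluate(size) for size in chunk_sizes}
    best = max(results, key=results.get)
    return results, best
```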
Add average syllable to model
Average number of syllables per word
http://h6o6.com/2013/03/using-python-and-the-nltk-to-find-haikus-in-the-public-twitter-stream/
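The linked post uses the NLTK/CMU pronouncing dictionary for syllable counts; a dependency-free heuristic (count vowel groups, drop a trailing silent "e") gives a rough approximation for the feature:

```python
import re

def count_syllables(word):
    """Rough heuristic: count runs of consecutive vowels ('y' counts as a
    vowel), dropping a trailing silent 'e'. Less accurate than the CMU
    dictionary approach from the linked post, but has no dependencies."""
    word = word.lower()
    if word.endswith("e") and not word.endswith(("le", "ee")):
        word = word[:-1]
    groups = re.findall(r"[aeiouy]+", word)
    return max(1, len(groups))

def average_syllables_per_word(text):
    """The proposed feature: mean syllables per word across a chunk."""
    words = re.findall(r"[a-zA-Z]+", text)
    if not words:
        return 0.0
    return sum(count_syllables(w) for w in words) / float(len(words))
```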
Consider moving features out of Chunk model
Currently, fingerprints are stored as a massive list of fields on the Chunk model.
Instead, we could:
- Add a FeatureType model (name, description)
- Add a Feature model (id of linked Chunk/Text/Author, id of FeatureType, value[float])
- Ensure FeatureType table is populated on system startup
Pros:
- simplifies Chunk model a whole lot
- allows adding new features without migrations on Chunk model
- we can store fingerprints against Text/Authors instead of just Chunks
Cons:
- Probable performance implications for getting fingerprint for objects
- A whole bunch of changes
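The proposed schema is essentially an entity-attribute-value layout. A plain-Python sketch (the real thing would be two Django models; the field names come from the issue, the `fingerprint` helper is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureType:
    name: str
    description: str

@dataclass
class Feature:
    owner_id: int            # id of the linked Chunk/Text/Author
    feature_type: FeatureType
    value: float

def fingerprint(features, owner_id):
    """Collect one object's features into a {name: value} fingerprint."""
    return {f.feature_type.name: f.value
            for f in features if f.owner_id == owner_id}
```

The "probable performance implications" above are visible here: rebuilding a fingerprint means one row per feature instead of one wide row per chunk.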
Ensure a system classifier exists
It is possible for the Django app to be in a state where no Classifier instance exists, and certain tasks will fail.
We need to ensure that at least one exists at all times, e.g. after the first migration, on first startup, and when running scripts in the shell.
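The bootstrap is a get_or_create pattern. In Django it would be roughly `Classifier.objects.first() or Classifier.objects.create(...)` wired to a `post_migrate` handler or startup hook (the wiring is an assumption); stripped of the ORM it looks like:

```python
def ensure_classifier(query_first, create):
    """Idempotent bootstrap: reuse the existing classifier if there is
    one, otherwise create it. `query_first` and `create` stand in for
    the ORM lookup and constructor."""
    existing = query_first()
    if existing is not None:
        return existing
    return create()
```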
Design a front-end
Non-ASCII characters
Should handle non-ASCII characters rather than ignoring them
Add script to add a whole bunch of texts to Django at once
Given a directory structure like:
texts/
texts/author1/text1.txt
texts/author1/text2.txt
texts/author2/text1.txt
texts/author2/text2.txt
Write some code that:
for each author folder in "texts":
if the author doesn't exist: add it to Django
for each of the author's texts:
if the text doesn't exist: add it to Django
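The walk above, sketched with a dict standing in for the database (in the real project the `setdefault`/`add` calls would be Author/Text model lookups and saves):

```python
import os

def import_texts(root, authors):
    """Walk texts/<author>/<text>.txt and register anything missing.

    `authors` maps author name -> set of text filenames already known.
    Returns the (author, filename) pairs that were newly added.
    """
    added = []
    for author in sorted(os.listdir(root)):
        author_dir = os.path.join(root, author)
        if not os.path.isdir(author_dir):
            continue
        texts = authors.setdefault(author, set())   # add author if missing
        for filename in sorted(os.listdir(author_dir)):
            if filename.endswith(".txt") and filename not in texts:
                texts.add(filename)                 # add text if missing
                added.append((author, filename))
    return added
```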
Fix concurrency
Only store the classifier on the server worker
Right now, whichever worker picks up the periodic_retrain task stores the trained, pickled classifier on its own local filesystem, where no other process can use it.
Add way of finding classifier accuracy
Find accuracy using cross-validation and add a button to the admin to trigger it
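A minimal k-fold sketch of the cross-validation half, with hypothetical `train`/`predict` callables standing in for the real classifier (a library helper such as scikit-learn's cross-validation would also do the job):

```python
def k_fold_accuracy(samples, labels, train, predict, k=5):
    """Plain k-fold cross-validation: hold out each fold in turn,
    train on the rest, and average the per-fold accuracies."""
    n = len(samples)
    fold = max(1, n // k)
    scores = []
    for start in range(0, n, fold):
        test_idx = set(range(start, min(start + fold, n)))
        train_x = [samples[i] for i in range(n) if i not in test_idx]
        train_y = [labels[i] for i in range(n) if i not in test_idx]
        model = train(train_x, train_y)
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        scores.append(correct / float(len(test_idx)))
    return sum(scores) / len(scores)
```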
Improve fields shown on admin lists
- Text list should have links for authors
- Chunks should have links for texts and authors
- Classifier should show training status
Improve classification feedback
Improve classification feedback:
- fix "de" (probably "do")
- average texts/chunks
- compile list of interesting fields for output and only find the average for these
- combine POS categories into 7 main tags
Fix unicode
Switch from byte strings to unicode everywhere, and fix the problems that arise as a result.
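The boundary fix is to decode once, at input, so everything downstream only ever sees text. A sketch (the UTF-8 default is an assumption; real inputs may need per-text encoding settings):

```python
def to_text(raw, encoding="utf-8"):
    """Normalise input to unicode text instead of silently dropping
    non-ASCII content. Bytes are decoded; text passes through."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw
```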
Remove writing to file
No longer write chunks or fingerprints to file
Write startup script for dev environments
With Celery, we now have a whole bunch (four or so) of processes to start in order to run the webserver and the parallel machinery:
- django's live server
- celery
- celery's beat
- celery's cam
While all of these things have scripts for convenience, running it all still involves opening multiple terminals/tmux-panes, which is annoying. Ideally we could have one command which runs them all at once.
This is just for development; elegant solutions will exist for production, and I'll include the relevant systemd config files eventually.
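One way to get the single dev command is a small Python launcher that spawns every process and waits on them; the actual commands for the project's runserver/celery/beat/cam scripts are assumptions here:

```python
import subprocess

# Hypothetical process list; swap in the project's real convenience scripts.
DEV_COMMANDS = [
    ["python", "manage.py", "runserver"],
    ["celery", "-A", "project", "worker"],
    ["celery", "-A", "project", "beat"],
]

def launch_all(commands):
    """Start every command as a child process and return the handles."""
    return [subprocess.Popen(cmd) for cmd in commands]

def wait_all(procs):
    """Block until every child exits; returns their exit codes."""
    return [p.wait() for p in procs]
```

A tool like Foreman (or a tmux script) solves the same problem off the shelf, which may be preferable to maintaining a launcher.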
Implement front-end design
Get more training data
- Get more training data.
- Write something to convert it to .txt
- Obtain accuracy of system with more data and ensure it's still as good
- Reassess the list of function words on a larger dataset
Average chunks
We can speed up some operations by storing "average" chunks for Author and Text models.
Once a text has been added, the average of all its chunks should be computed and stored for that text.
If an author's texts have changed, the author's average chunk should be recomputed from its texts' chunks and stored.
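The averaging itself is an element-wise mean over equal-length fingerprint vectors; a sketch (vectors as plain lists here, whatever the Chunk fields actually are):

```python
def average_chunk(vectors):
    """Element-wise mean of equal-length fingerprint vectors: a Text's
    average over its chunks, or an Author's over its texts' chunks."""
    if not vectors:
        raise ValueError("nothing to average")
    n = float(len(vectors))
    return [sum(col) / n for col in zip(*vectors)]
```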
Fix reliance on files when processing
Currently, we're unable to parallelise anything that relies on the presence of some file on the worker machine's filesystem.
This currently causes an issue when adding new texts:
- When a new text is added, we call Text.save()
- After saving to the database, we call tasks.process_text
  - process_text assumes that it'll be able to find the Text and its contents, for chunking
  - BUT: the files are only present on the "main" instance
Everything that involves reading the file should be executed where we're sure the file exists (i.e. the Text.save() method).
We should also remove anything concerning clean_up....
process_chunk doesn't have the same issue, because it receives the actual chunk text content as an argument (still need to consider networking overheads here).
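Stripped of the ORM, the proposed fix looks like this: read and chunk where the file lives, then hand each worker the chunk text itself, mirroring how process_chunk already works. Character-based chunking and the `dispatch` callable are assumptions for illustration:

```python
def chunk_text(content, chunk_size):
    """Split text into fixed-size chunks (characters here; the real
    project may chunk by words instead)."""
    return [content[i:i + chunk_size]
            for i in range(0, len(content), chunk_size)]

def save_and_dispatch(content, dispatch, chunk_size=1000):
    """Everything that needs the file happens here, on the machine that
    has it (i.e. inside Text.save()); workers only ever receive chunk
    content, e.g. via tasks.process_chunk.delay(chunk)."""
    chunks = chunk_text(content, chunk_size)
    for chunk in chunks:
        dispatch(chunk)
    return len(chunks)
```

The networking trade-off noted for process_chunk applies here too: every chunk's full text travels through the broker.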