
author-similarity's People

Contributors

chriswait, eilidhhendry

author-similarity's Issues

Periodic background classifier retraining

When new texts are added for training, the classifier could be retrained. We need to balance the cost of retraining against the cost of keeping a suboptimal classifier. We already have a status field on the Classifier model, and we can use the "untrained" status to represent a classifier that should be retrained at some point in the future.

  • When a new Text is added, set classifier.status to "untrained"
  • Create a periodic Celery task that runs every half hour
  • If the classifier is untrained, retrain it! 👍
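A minimal sketch of the half-hourly check, with a plain dict standing in for the Classifier row and `retrain` standing in for the real retraining call (both assumptions); in practice this body would live in a Celery beat task scheduled with something like `crontab(minute="*/30")`:

```python
# Sketch of the periodic check a Celery beat task would run.
# `classifier` is a dict standing in for the Classifier model row,
# and `retrain` stands in for the real retraining call (assumptions).

def retrain_if_needed(classifier, retrain):
    if classifier["status"] == "untrained":
        retrain()
        classifier["status"] = "trained"
        return True
    return False  # nothing to do; keep the current classifier
```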

Test accuracy for multiple chunk sizes

It looks like our accuracy is low for small input sizes. This might be because the training is biased towards bigger chunk sizes.

Write something to retrain and test accuracy for multiple chunk sizes.
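A sketch of what that could look like; `train_and_score` is an assumed hook around the existing retrain-and-evaluate code:

```python
# Split a token list into fixed-size chunks, dropping any short tail.
def chunk_words(words, size):
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

# Retrain and score once per candidate chunk size. `train_and_score(size)`
# is an assumed hook around the existing retraining + accuracy code.
def accuracy_by_chunk_size(sizes, train_and_score):
    return {size: train_and_score(size) for size in sizes}
```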

Consider moving features out of Chunk model

Currently, fingerprints are stored as a massive list of fields on the Chunk model.

Instead, we could:

  • Add a FeatureType model (name, description)
  • Add a Feature model (id of linked Chunk/Text/Author, id of FeatureType, value[float])
  • Ensure FeatureType table is populated on system startup

Pros:

  • simplifies Chunk model a whole lot
  • allows adding new features without migrations on Chunk model
  • we can store fingerprints against Text/Authors instead of just Chunks

Cons:

  • Probable performance implications when retrieving an object's fingerprint
  • A whole bunch of changes
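As a rough runnable sketch of the lookup the proposed layout enables, with plain dicts standing in for the two ORM tables (the feature names here are illustrative assumptions):

```python
# In-memory stand-in for the proposed FeatureType / Feature tables.
feature_types = {
    1: {"name": "avg_word_length", "description": "mean token length"},
    2: {"name": "type_token_ratio", "description": "lexical diversity"},
}
features = [
    {"chunk_id": 7, "feature_type_id": 1, "value": 4.2},
    {"chunk_id": 7, "feature_type_id": 2, "value": 0.61},
]

# Reassemble a chunk's fingerprint as {feature name: value}.
def fingerprint(chunk_id):
    return {
        feature_types[f["feature_type_id"]]["name"]: f["value"]
        for f in features
        if f["chunk_id"] == chunk_id
    }
```

Adding a new feature then means inserting a FeatureType row rather than migrating the Chunk model.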

Ensure a system classifier exists

It is possible for the Django app to be in a state where there is no instance of Classifier, in which case certain tasks will fail.

We need to ensure that there's at least one at all times, e.g. after the first migrate, on first startup, or when running scripts in the shell.
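The invariant, sketched with a list standing in for the Classifier table; in Django this would be a `get_or_create` call wired to a `post_migrate` signal or startup hook (details are assumptions):

```python
# `store` stands in for the Classifier table; in Django this would be
# Classifier.objects.get_or_create(...) run after migrate / on startup.
def ensure_classifier(store):
    if not store:
        store.append({"status": "untrained"})  # default state is an assumption
    return store[0]
```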

Add script to add a whole bunch of texts to Django at once

Given a directory structure like:

texts/
texts/author1/text1.txt
texts/author1/text2.txt
texts/author2/text1.txt
texts/author2/text2.txt

Write some code that:

for each author folder in "texts":
    if the author doesn't exist: add it to Django
    for each of the author's texts:
        if the text doesn't exist: add it to Django
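A runnable sketch of that loop; `add_author` and `add_text` stand in for the Django `get_or_create` calls, which are assumptions here:

```python
import os

def import_texts(root, add_author, add_text):
    # Walk texts/<author>/<text>.txt and hand each author/text to Django.
    for author_name in sorted(os.listdir(root)):
        author_dir = os.path.join(root, author_name)
        if not os.path.isdir(author_dir):
            continue
        add_author(author_name)  # no-op if the author already exists
        for filename in sorted(os.listdir(author_dir)):
            if filename.endswith(".txt"):
                add_text(author_name, os.path.join(author_dir, filename))
```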

Improve classification feedback

  • fix "de" (probably "do")
  • average texts/chunks
  • compile list of interesting fields for output and only find the average for these
  • combine POS categories into 7 main tags
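For the last bullet, one possible mapping from fine-grained Penn Treebank tags down to coarse categories; the exact grouping into 7 tags is an assumption, not the project's final list:

```python
# Collapse fine-grained Penn Treebank POS tags into 7 coarse categories
# (noun, verb, adjective, adverb, pronoun, preposition, determiner).
# The grouping shown here is an assumption.
COARSE_TAGS = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb",
    "VBP": "verb", "VBZ": "verb", "MD": "verb",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
    "PRP": "pronoun", "PRP$": "pronoun", "WP": "pronoun", "WP$": "pronoun",
    "IN": "preposition",
    "DT": "determiner", "WDT": "determiner",
}

def coarse_tag(tag):
    return COARSE_TAGS.get(tag, "other")  # punctuation, symbols, etc.
```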

Fix unicode

Instead of using byte strings, change over to using unicode everywhere, and fix the problems that arise as a result.
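The pattern in brief: decode bytes to unicode at the boundary (file reads, request bodies), work in unicode internally, and only encode back to bytes on output. A tiny illustration:

```python
# Decode at the boundary: bytes from disk become unicode immediately,
# so everything downstream handles one consistent string type.
raw = b"caf\xc3\xa9"        # UTF-8 bytes as read from a file
text = raw.decode("utf-8")  # unicode from here on
```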

Write startup script for dev environments

With Celery, we now have a whole bunch (4 or so) of processes to start to run the webserver and parallel stuff:

  • django's live server
  • celery
  • celery's beat
  • celery's cam

While all of these things have scripts for convenience, running it all still involves opening multiple terminals/tmux-panes, which is annoying. Ideally we could have one command which runs them all at once.

This is just for development; elegant solutions will exist for production, and I'll include the relevant systemd config files eventually.
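One lightweight option for development is a Procfile run with a process manager such as honcho or foreman (`honcho start` launches everything in one terminal); the exact commands and app name below are assumptions based on the process list above:

```
web: python manage.py runserver
worker: celery -A author_similarity worker --loglevel=info
beat: celery -A author_similarity beat --loglevel=info
cam: python manage.py celerycam
```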

Get more training data

  • Get more training data.
  • Write something to convert it to .txt
  • Obtain accuracy of system with more data and ensure it's still as good
  • Reassess the list of function words on a larger dataset

Average chunks

We can speed up some operations by storing "average" chunks for Author and Text models.

Once a text has been added, the average of all its chunks should be computed and stored for the text.
If an author's set of texts has changed, the author's average chunk should be recomputed from its texts' chunks and stored.
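The averaging itself is just an element-wise mean over the chunks' fingerprint vectors; a sketch, assuming fingerprints are equal-length lists of floats:

```python
# Element-wise mean over equal-length fingerprint vectors; the result is
# the "average chunk" to store on the Text (or Author) row.
def average_fingerprint(fingerprints):
    n = float(len(fingerprints))
    return [sum(values) / n for values in zip(*fingerprints)]
```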

Fix reliance on files when processing

Currently, we're unable to parallelise anything that relies on the presence of some file on the worker machine's filesystem.

This currently causes an issue when adding new texts:

  • When a new text is added, we call Text.save()
  • After saving to the database, we call tasks.process_text
      • process_text assumes that it'll be able to find the Text and its contents, for chunking
      • BUT: the files are only present on the "main" instance

Everything that involves reading the file should be executed where we're sure the file exists (i.e. in the Text.save() method).
We should also remove anything concerning clean_up....

process_chunk doesn't have the same issue, because it receives the actual chunk text content as an argument (though we still need to consider the networking overhead here).
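A sketch of the shape of the fix: read the file where it is known to exist and pass the raw contents to the task, so workers never touch the filesystem. `process_text_task` and `save_text` stand in for the real Celery task and `Text.save()` (assumptions); in the real code the call would be `.delay(...)`:

```python
# Runs on any worker: operates only on the contents it was handed.
def process_text_task(text_id, contents):
    return len(contents.split())  # placeholder for the real chunking

# Runs where the file is guaranteed to exist (i.e. inside Text.save()).
def save_text(text_id, path):
    with open(path) as f:
        contents = f.read()
    return process_text_task(text_id, contents)  # .delay(...) with real Celery
```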
