
author-similarity's People

Contributors

chriswait, eilidhhendry

author-similarity's Issues

Periodic background classifier retraining

When new texts are added for training, the classifier could be retrained. We need to balance the cost of retraining against the cost of keeping a suboptimal classifier. We already have a status field on the Classifier model, and we can use the "untrained" status to represent a classifier that should be retrained at some point in the future.

  • When a new Text is added, set classifier.status to "untrained"
  • Create a periodic Celery task that runs every half hour
  • If the classifier is untrained, retrain it! 👍
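A minimal sketch of the half-hourly check, with a plain dict standing in for the Classifier row and `retrain` standing in for the real retraining call (both assumptions); in practice this body would live in a Celery beat task scheduled with something like `crontab(minute="*/30")`:

```python
# Sketch of the periodic check a Celery beat task would run.
# `classifier` is a dict standing in for the Classifier model row,
# and `retrain` stands in for the real retraining call (assumptions).

def retrain_if_needed(classifier, retrain):
    if classifier["status"] == "untrained":
        retrain()
        classifier["status"] = "trained"
        return True
    return False  # nothing to do; keep the current classifier
```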

Test accuracy for multiple chunk sizes

It looks like our accuracy is low for small input sizes. This might be because the training is biased towards bigger chunk sizes.

Write something to retrain and test accuracy for multiple chunk sizes.
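A sketch of what that could look like; `train_and_score` is an assumed hook around the existing retrain-and-evaluate code:

```python
# Split a token list into fixed-size chunks, dropping any short tail.
def chunk_words(words, size):
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

# Retrain and score once per candidate chunk size. `train_and_score(size)`
# is an assumed hook around the existing retraining + accuracy code.
def accuracy_by_chunk_size(sizes, train_and_score):
    return {size: train_and_score(size) for size in sizes}
```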

Consider moving features out of Chunk model

Currently, fingerprints are stored as a massive list of fields on the Chunk model.

Instead, we could:

  • Add a FeatureType model (name, description)
  • Add a Feature model (id of linked Chunk/Text/Author, id of FeatureType, value[float])
  • Ensure FeatureType table is populated on system startup

Pros:

  • simplifies Chunk model a whole lot
  • allows adding new features without migrations on Chunk model
  • we can store fingerprints against Text/Authors instead of just Chunks

Cons:

  • Probable performance implications when retrieving an object's fingerprint
  • A whole bunch of changes
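As a rough runnable sketch of the lookup the proposed layout enables, with plain dicts standing in for the two ORM tables (the feature names here are illustrative assumptions):

```python
# In-memory stand-in for the proposed FeatureType / Feature tables.
feature_types = {
    1: {"name": "avg_word_length", "description": "mean token length"},
    2: {"name": "type_token_ratio", "description": "lexical diversity"},
}
features = [
    {"chunk_id": 7, "feature_type_id": 1, "value": 4.2},
    {"chunk_id": 7, "feature_type_id": 2, "value": 0.61},
]

# Reassemble a chunk's fingerprint as {feature name: value}.
def fingerprint(chunk_id):
    return {
        feature_types[f["feature_type_id"]]["name"]: f["value"]
        for f in features
        if f["chunk_id"] == chunk_id
    }
```

Adding a new feature then means inserting a FeatureType row rather than migrating the Chunk model.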

Ensure a system classifier exists

It is possible for the Django app to be in a state where there is no instance of Classifier, in which case certain tasks will fail.

We need to ensure that there's at least one at all times, e.g. after the first migrate, on first startup, or when running scripts in the shell.
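The invariant, sketched with a list standing in for the Classifier table; in Django this would be a `get_or_create` call wired to a `post_migrate` signal or startup hook (details are assumptions):

```python
# `store` stands in for the Classifier table; in Django this would be
# Classifier.objects.get_or_create(...) run after migrate / on startup.
def ensure_classifier(store):
    if not store:
        store.append({"status": "untrained"})  # default state is an assumption
    return store[0]
```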

Add script to add a whole bunch of texts to Django at once

Given a directory structure like:

texts/
texts/author1/text1.txt
texts/author1/text2.txt
texts/author2/text1.txt
texts/author2/text2.txt

Write some code that:

for each author folder in "texts":
    if the author doesn't exist: add it to Django
    for each of the author's texts:
        if the text doesn't exist: add it to Django
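A runnable sketch of that loop; `add_author` and `add_text` stand in for the Django `get_or_create` calls, which are assumptions here:

```python
import os

def import_texts(root, add_author, add_text):
    # Walk texts/<author>/<text>.txt and hand each author/text to Django.
    for author_name in sorted(os.listdir(root)):
        author_dir = os.path.join(root, author_name)
        if not os.path.isdir(author_dir):
            continue
        add_author(author_name)  # no-op if the author already exists
        for filename in sorted(os.listdir(author_dir)):
            if filename.endswith(".txt"):
                add_text(author_name, os.path.join(author_dir, filename))
```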

Improve classification feedback

  • fix "de" (probably "do")
  • average texts/chunks
  • compile list of interesting fields for output and only find the average for these
  • combine POS categories into 7 main tags
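For the last bullet, one possible mapping from fine-grained Penn Treebank tags down to coarse categories; the exact grouping into 7 tags is an assumption, not the project's final list:

```python
# Collapse fine-grained Penn Treebank POS tags into 7 coarse categories
# (noun, verb, adjective, adverb, pronoun, preposition, determiner).
# The grouping shown here is an assumption.
COARSE_TAGS = {
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "VB": "verb", "VBD": "verb", "VBG": "verb", "VBN": "verb",
    "VBP": "verb", "VBZ": "verb", "MD": "verb",
    "JJ": "adjective", "JJR": "adjective", "JJS": "adjective",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
    "PRP": "pronoun", "PRP$": "pronoun", "WP": "pronoun", "WP$": "pronoun",
    "IN": "preposition",
    "DT": "determiner", "WDT": "determiner",
}

def coarse_tag(tag):
    return COARSE_TAGS.get(tag, "other")  # punctuation, symbols, etc.
```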

Fix unicode

Instead of using byte strings, change over to using unicode everywhere, and fix the problems that arise as a result.
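The pattern in brief: decode bytes to unicode at the boundary (file reads, request bodies), work in unicode internally, and only encode back to bytes on output. A tiny illustration:

```python
# Decode at the boundary: bytes from disk become unicode immediately,
# so everything downstream handles one consistent string type.
raw = b"caf\xc3\xa9"        # UTF-8 bytes as read from a file
text = raw.decode("utf-8")  # unicode from here on
```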

Write startup script for dev environments

With Celery, we now have a whole bunch (4 or so) of processes to start to run the webserver and parallel stuff:

  • django's live server
  • celery
  • celery's beat
  • celery's cam

While all of these things have scripts for convenience, running it all still involves opening multiple terminals/tmux-panes, which is annoying. Ideally we could have one command which runs them all at once.

This is just for development; elegant solutions will exist for production, and I'll include the relevant systemd config files eventually.
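One lightweight option for development is a Procfile run with a process manager such as honcho or foreman (`honcho start` launches everything in one terminal); the exact commands and app name below are assumptions based on the process list above:

```
web: python manage.py runserver
worker: celery -A author_similarity worker --loglevel=info
beat: celery -A author_similarity beat --loglevel=info
cam: python manage.py celerycam
```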

Get more training data

  • Get more training data.
  • Write something to convert it to .txt
  • Obtain accuracy of system with more data and ensure it's still as good
  • Reassess the list of function words on a larger dataset

Average chunks

We can speed up some operations by storing "average" chunks for Author and Text models.

Once a text has been added, the average of all its chunks should be computed and stored for the text.
If an author's set of texts has changed, the author's average chunk should be recomputed from its texts' chunks and stored.
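The averaging itself is just an element-wise mean over the chunks' fingerprint vectors; a sketch, assuming fingerprints are equal-length lists of floats:

```python
# Element-wise mean over equal-length fingerprint vectors; the result is
# the "average chunk" to store on the Text (or Author) row.
def average_fingerprint(fingerprints):
    n = float(len(fingerprints))
    return [sum(values) / n for values in zip(*fingerprints)]
```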

Fix reliance on files when processing

Currently, we're unable to parallelise anything that relies on the presence of some file on the worker machine's filesystem.

This currently causes an issue when adding new texts:

  • When a new text is added, we call Text.save()
  • After saving to the database, we call tasks.process_text
      • process_text assumes that it'll be able to find the Text and its contents, for chunking
      • BUT: the files are only present on the "main" instance

Everything that involves reading the file should be executed where we're sure the file exists (i.e. in the Text.save() method).
We should also remove anything concerning clean_up....

process_chunk doesn't have the same issue, because it receives the actual chunk text content as an argument (though we still need to consider the networking overhead here).
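A sketch of the shape of the fix: read the file where it is known to exist and pass the raw contents to the task, so workers never touch the filesystem. `process_text_task` and `save_text` stand in for the real Celery task and `Text.save()` (assumptions); in the real code the call would be `.delay(...)`:

```python
# Runs on any worker: operates only on the contents it was handed.
def process_text_task(text_id, contents):
    return len(contents.split())  # placeholder for the real chunking

# Runs where the file is guaranteed to exist (i.e. inside Text.save()).
def save_text(text_id, path):
    with open(path) as f:
        contents = f.read()
    return process_text_task(text_id, contents)  # .delay(...) with real Celery
```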
