ukgovdatascience.govuk-lda-tagger-lite's Introduction

Tag GOV.UK documents with the LDA algorithm

This project contains several experiments that used the latent Dirichlet allocation (LDA) machine learning algorithm to generate topics from pages on GOV.UK and tag those pages with the generated topics.

Nomenclature

Install requirements

The best way to run these scripts is inside the govuk-lda-tagger-image Docker container, which ensures that Python 2.7 and all the necessary dependencies are installed.

Try it out

Before running any script, set the EXPERIMENT_DIR environment variable to the folder in which you want your experiments to be saved. Inside a Docker container, this should default to /mnt/experiments so that the experiments folder can be mounted as a volume.
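As an illustration of how a script might resolve this setting, here is a minimal sketch. The function names `experiment_root` and `experiment_path` are hypothetical and are not taken from the project; only the EXPERIMENT_DIR variable and the /mnt/experiments path come from the text above.

```python
import os


def experiment_root(environ=os.environ):
    """Return the folder experiments are saved in, read from EXPERIMENT_DIR."""
    try:
        return environ["EXPERIMENT_DIR"]
    except KeyError:
        raise RuntimeError("EXPERIMENT_DIR must be set before running the scripts")


def experiment_path(name, environ=os.environ):
    """Join an experiment name onto the root folder."""
    return os.path.join(experiment_root(environ), name)


# A plain dict stands in for the real environment so the example is repeatable.
docker_env = {"EXPERIMENT_DIR": "/mnt/experiments"}
print(experiment_path("early_years", docker_env))
```

Failing early when the variable is missing is a design choice of this sketch; the real scripts may behave differently.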

The train_lda.py script is a command line interface (CLI) to the LDA tagger. You can customise the input dataset, the preprocessing, and the parameters passed to the underlying LDA library.

Generating topics and tags for early years

To derive topics from the early years data taken from the HTML pages, and tag every document with those topics:

train_lda.py import --experiment early_years input/early-years.csv

The --experiment option defines the name of the output directory under experiments. It defaults to a name generated from the current time.
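The fallback described above could look roughly like the following sketch. The timestamp format and both function names are assumptions made for illustration; the actual directory names produced by train_lda.py may differ.

```python
from datetime import datetime


def default_experiment_name(now=None):
    """Build a directory name from a timestamp (the format is an assumption)."""
    now = now or datetime.now()
    return now.strftime("%Y-%m-%d_%H-%M-%S")


def resolve_experiment_name(experiment=None, now=None):
    """Use the --experiment value if given, otherwise fall back to a timestamp."""
    return experiment if experiment else default_experiment_name(now)


print(resolve_experiment_name("early_years"))
print(resolve_experiment_name(None, datetime(2017, 3, 1, 9, 30, 0)))
```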

Using a curated dictionary

By default, the dictionary is generated from the corpus, excluding a number of predefined stopwords (defined in the stopwords directory). To use a curated dictionary instead, pass it with the --input-dictionary option:

train_lda.py import input/audits_with_content.csv --input-dictionary input/dictionary.txt
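To make the default behaviour concrete, the sketch below builds a dictionary from a small corpus while skipping stopwords. It is a plain-Python illustration of the idea, not the project's implementation: the stopword list, the tokeniser and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stopword list; the project keeps its own in the stopwords directory.
STOPWORDS = {"the", "and", "of", "to", "a", "for", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(texts, stopwords=STOPWORDS):
    """Map each non-stopword token to an integer id, most frequent first."""
    counts = Counter(
        token for text in texts for token in tokenize(text) if token not in stopwords
    )
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda token: (-counts[token], token))
    return {token: index for index, token in enumerate(ordered)}


corpus = [
    "Childcare and early years providers",
    "Funding for early years education",
]
dictionary = build_dictionary(corpus)
print(sorted(dictionary, key=dictionary.get))
```

A curated dictionary replaces this automatic step with a fixed word list, which gives more control over the terms the topics can be built from.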

Retraining using the same corpus

If you have already run an experiment but something went wrong, use the refine subcommand to train it again while reusing the corpus generated in the first run. The final argument is the name of the original experiment directory, which will be overwritten.

train_lda.py --numtopics 100 refine early_years

Using the GensimEngine class

In gensim_engine.py there is a class that can be used to train and run an LDA model programmatically.

This has the following API:

# Import the class from gensim_engine.py
from gensim_engine import GensimEngine

# Instantiate an object
engine = GensimEngine(documents, log=True)

# Train the model with the data provided
experiment = engine.train(number_of_topics=20)

# Tag all documents in the corpus
tags = experiment.tag()

documents is expected to be a list of dictionaries, where each dictionary has a base_path key and a text key.
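The expected shape of documents can be checked before training. The sketch below is a hypothetical helper, not part of gensim_engine.py, and the two sample documents are made up for illustration.

```python
REQUIRED_KEYS = ("base_path", "text")


def validate_documents(documents):
    """Raise ValueError if any document lacks a base_path or text key."""
    for index, document in enumerate(documents):
        missing = [key for key in REQUIRED_KEYS if key not in document]
        if missing:
            raise ValueError(
                "document %d is missing: %s" % (index, ", ".join(missing))
            )
    return documents


# Made-up sample input in the shape described above.
documents = [
    {"base_path": "/childcare-grant", "text": "Help paying for childcare"},
    {"base_path": "/early-years", "text": "Standards for early years providers"},
]
validate_documents(documents)
```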

Other scripts

When we started the project we created two simple scripts to test the libraries we used.

You can run either of these to see some sample topics.

Using Python's lda library

Run python run_lda.py to use the lda library to generate topics and categorise the documents listed in the input file.

Using Python's gensim library

Run python run_gensim.py to use the gensim library to generate topics and categorise the documents listed in the input file.

Fetching new data

Import indexable content from the search API

To fetch data from the search API, prepare a CSV input file containing a single column with the header URL, where each row holds the base_path of a link to fetch content for.

Then run the following command:

python import_indexable_content.py --environment https://www.gov.uk input_file.csv

This script outputs CSV rows with the title, description, indexable content, topic names and organisation names.
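The single-column input file described above can be generated with a few lines of Python. This is a sketch written for Python 3; the function name and the two base paths are illustrative assumptions, and the same file layout is what the PDF import expects.

```python
import csv
import io


def write_input_csv(base_paths, out_file):
    """Write one base_path per row under a single URL header."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["URL"])
    for base_path in base_paths:
        writer.writerow([base_path])


# An in-memory buffer keeps the example self-contained; pass an open file
# (for example open("input_file.csv", "w", newline="")) to write to disk.
buffer = io.StringIO()
write_input_csv(["/childcare-grant", "/early-years"], buffer)
print(buffer.getvalue())
```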

Import PDF data

To fetch PDF text from a number of GOV.UK base paths, prepare a CSV input file containing a single column with the header URL, where each row holds the base_path of a link to fetch content for.

Then, run the following command:

python fetch_pdf_content.py input_file.csv output_file.csv

The output file will include the same base paths and also the text found in all PDF attachments, merged into one big string.

Combine all the data

The Python tool csvkit can be used to combine the separate CSVs into one. Because the columns are very wide, you will need to increase the default maximum field size:

csvjoin -c url --maxfieldsize [a big number] all_audits_for_education.csv all_audits_for_education_with_pdf_data.csv > all_audits_for_education_with_pdf_and_indexable_content.csv
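The field size issue also applies when joining the files in Python directly, because the standard csv module rejects fields longer than 131072 characters by default. The sketch below shows a simple inner join on a shared column with the limit raised. It is written for Python 3, assumes each key appears at most once in the second file, and uses made-up data; it is not a replacement for csvjoin.

```python
import csv
import io
import sys

# Raise the per-field limit (131072 characters by default) so very wide text
# columns can be read. The cap avoids an OverflowError on some platforms.
csv.field_size_limit(min(sys.maxsize, 2**31 - 1))


def join_on_key(left_file, right_file, key="url"):
    """Inner-join two CSV files on a shared column, assuming unique keys."""
    right_rows = {row[key]: row for row in csv.DictReader(right_file)}
    joined = []
    for row in csv.DictReader(left_file):
        match = right_rows.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update((k, v) for k, v in match.items() if k != key)
            joined.append(merged)
    return joined


# Made-up data: one PDF text field is longer than the default limit.
audits = io.StringIO("url,title\n/a,Page A\n/b,Page B\n")
pdf_data = io.StringIO("url,pdf_text\n/a," + "x" * 200000 + "\n")
rows = join_on_key(audits, pdf_data)
print(len(rows), len(rows[0]["pdf_text"]))
```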

The resulting CSV can then be passed to data_import/combine_csv_columns.py to merge everything into one "words" column.

python data_import/combine_csv_columns.py < all_audits_for_education_with_pdf_and_indexable_content.csv > all_audits_for_education_words.csv
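The merging step can be pictured with the sketch below, which collapses every non-empty column of a row into a single "words" value. It is a guess at the general idea, not the code in data_import/combine_csv_columns.py: the function name, the choice to keep the url column and the sample row are all assumptions.

```python
def combine_columns(row, key="url", output="words"):
    """Collapse every non-empty column except the key into one text column."""
    # Relies on dicts preserving insertion order (Python 3.7+), so the text
    # is joined in the same order as the CSV columns.
    words = " ".join(
        value.strip()
        for column, value in row.items()
        if column != key and value and value.strip()
    )
    return {key: row[key], output: words}


# Made-up row, shaped like one line of the joined CSV.
row = {
    "url": "/childcare-grant",
    "title": "Childcare Grant",
    "description": "",
    "pdf_text": "Apply for funding",
}
combined = combine_columns(row)
print(combined["words"])
```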

Licence

MIT License
