classifyintents

This is a Python module that cleans and prepares GOV.UK survey data for classification by a machine learning algorithm. Training of the algorithm and prediction on new data are handled in the alphagov/classifyintentspipe repo. The module is built around the classifyintents.survey class and its associated methods.

To install this module using pip:

pip install git+https://github.com/alphagov/classifyintents.git

Alternatively, place the following line in your requirements.txt file:

git+https://github.com/alphagov/classifyintents.git

and run the command pip install -r requirements.txt as usual.

Requirements

  • Python >= 3.5

See requirements.txt for additional requirements.

Usage

Loading data

To begin, create an instance of the class:

from classifyintents import survey

intent = survey()

Load some raw data. The class expects an unedited CSV file downloaded from SurveyMonkey. Note that the load() method also cleans the column names and drops the sub-heading row that SurveyMonkey adds to the CSV.

intent.load('data.csv')

The data are stored as a pandas dataframe in the attribute intent.raw.
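
As a rough illustration of the kind of tidying described above (not the module's actual implementation), the sketch below reads a CSV with Python's standard csv module, normalises the header names, and skips the row directly under the header. The sample data, the tidy_name helper, and the normalisation rules are all assumptions made for this example.

```python
import csv
import io
import re

# Hypothetical sample: a header row, a sub-heading row, then two responses.
SAMPLE = (
    "RespondentID,Start Date,What were you trying to do?\n"
    ",,Open-Ended Response\n"
    "1001,2017-01-01 10:00:00,Renew my passport\n"
    "1002,2017-01-01 10:05:00,Find tax rates\n"
)


def tidy_name(name):
    """Lower-case a column name and replace runs of other characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")


def load_rows(handle):
    """Return (columns, rows), dropping the sub-heading row under the header."""
    reader = csv.reader(handle)
    columns = [tidy_name(name) for name in next(reader)]
    next(reader)  # discard the sub-heading row
    rows = [dict(zip(columns, row)) for row in reader]
    return columns, rows


columns, rows = load_rows(io.StringIO(SAMPLE))
```

The real load() method stores its result as a pandas dataframe; plain dictionaries are used here only to keep the example free of dependencies.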

Cleaning the raw data

The next step is to perform some cleaning of the raw data. This is accomplished in the clean_raw() method. The method does a number of things:

  • Creates a copy of the intent.raw dataframe, and calls this new dataframe intent.data.
  • Cleans up the messy column names inherited from the CSV using a dictionary called intent.raw_mapping.
    • Note that if the format of the survey or the names of questions are changed, breaking the class, a quick fix may be to update the intent.raw_mapping dictionary.
  • A number of new features are added to the data:
    • Time taken to complete the survey
    • Some simple features based on the free text
      • Number of characters in the string.
      • Ratios of capital letters and of exclamation marks to the total number of characters.
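
The added features listed above can be sketched in plain Python as follows. This is a minimal sketch and not the code used by clean_raw(): the function names, the returned keys, and the timestamp format are assumptions chosen for the example.

```python
from datetime import datetime


def seconds_taken(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Time taken to complete the survey, in seconds (timestamp format assumed)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def text_features(comment):
    """Return simple character-based features for one free-text response."""
    text = comment or ""
    n_chars = len(text)
    if n_chars == 0:
        # Avoid dividing by zero for empty or missing responses.
        return {"n_chars": 0, "capital_ratio": 0.0, "exclamation_ratio": 0.0}
    n_capitals = sum(1 for ch in text if ch.isupper())
    n_exclamations = text.count("!")
    return {
        "n_chars": n_chars,
        "capital_ratio": n_capitals / n_chars,
        "exclamation_ratio": n_exclamations / n_chars,
    }


features = text_features("HELP me!!")
duration = seconds_taken("2017-01-01 10:00:00", "2017-01-01 10:02:30")
```

Empty responses are given ratios of zero here; whether the module does the same is not stated, so treat that as a design choice of the sketch.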

Determining the org and section

The page that the user was visiting when they were asked to complete the survey is recorded in a cleaned field called full_url. In this step, the URLs are cleaned according to a number of rules, and the unique URLs are then extracted and queried using the GOV.UK content API. For each URL, this returns an organisation (org) and a section (section).

These data are then merged back into the intent.data dataframe. This step is completed with:

intent.api_lookup()

This step is verbose, and can take a while if there are a large number of URLs to look up.
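
The reason for extracting unique URLs first is that each page then needs to be looked up only once, however many responses mention it. The sketch below illustrates that pattern under stated assumptions: the cleaning rule (keep only the path), the lookup stand-in, and the example page and values are all hypothetical, and no real API call is made.

```python
from urllib.parse import urlsplit


def clean_url(url):
    """Reduce a full URL to its path, dropping the query string and fragment."""
    path = urlsplit(url).path
    return path.rstrip("/") or "/"


def lookup(path):
    """Stand-in for a content API call; returns an (org, section) pair."""
    fake_api = {"/example-page": ("example-org", "example-section")}
    return fake_api.get(path, (None, None))


def add_org_and_section(rows):
    """Look up each unique page once, then merge the result into every row."""
    paths = {clean_url(row["full_url"]) for row in rows}
    results = {path: lookup(path) for path in paths}
    merged = []
    for row in rows:
        org, section = results[clean_url(row["full_url"])]
        merged.append({**row, "org": org, "section": section})
    return merged


rows = [
    {"full_url": "https://www.gov.uk/example-page?step=1"},
    {"full_url": "https://www.gov.uk/example-page/"},
    {"full_url": "https://www.gov.uk/unknown-page"},
]
merged = add_org_and_section(rows)
```

Pages that the lookup does not recognise are given None for both fields in this sketch.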

Preparing the data for training or prediction

Assuming all has gone well so far, the next step is to prepare the data for training or prediction using a machine learning algorithm. This is done with the methods intent.trainer() and intent.predictor() respectively.

When calling intent.trainer(), a list of classes must be passed as an argument. As part of the method, all classes that are not specified in the list are combined into a single class, enabling one-versus-all (OVA) classification.
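
The effect of combining the unlisted classes can be sketched as below. This is an illustration only: the label names, the collapse_classes helper, and the "other" label are assumptions, and the real trainer() method works on the intent.data dataframe rather than a plain list.

```python
def collapse_classes(labels, keep, other="other"):
    """Keep the listed classes and map every other label to a single class."""
    keep = set(keep)
    return [label if label in keep else other for label in labels]


# Hypothetical labels: only "ok" is kept, everything else is merged.
labels = ["ok", "finding-general", "ok", "complaint", "none"]
collapsed = collapse_classes(labels, keep=["ok"])
```

With a single class in the list, the result is a two-class problem: the class of interest versus everything else.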

Using the predictor() method will remove the outcome class, if it was present.

The data are now ready for the application of a machine learning algorithm.

Contributors

augustt, ivyleavedtoadflax
