
classifyintents's Introduction


classifyintents

This is a Python module that cleans and prepares GOV.UK survey data for classification by a machine learning algorithm. Training of the algorithm and prediction on new data are handled in the alphagov/classifyintentspipe repo. The module is built around the classifyintents.survey class and its associated methods.

To install this module using pip:

pip install git+git://github.com/alphagov/classifyintents.git

Alternatively, place the following line in your requirements.txt file:

git+git://github.com/alphagov/classifyintents.git

and run the command pip install -r requirements.txt as usual.

Requirements

  • Python >= 3.5

See requirements.txt for additional requirements.

Usage

Loading data

To begin, create an instance of the class:

from classifyintents import survey

intent = survey()

Next, load some raw data. The class expects an unedited CSV file downloaded from SurveyMonkey. Note that the load() method also cleans the column names and drops the sub-heading row that SurveyMonkey adds to the CSV.

intent.load('data.csv')

The data are stored as a pandas dataframe in the attribute intent.raw.

Cleaning the raw data

The next step is to perform some cleaning of the raw data. This is accomplished in the clean_raw() method. The method does a number of things:

  • Creates a copy of the intent.raw dataframe, and calls this new dataframe intent.data.
  • The messy column names inherited from the csv are cleaned up using a dictionary called intent.raw_mapping.
    • Note that if the format of the survey or the names of questions are changed, breaking the class, a quick fix may be to update the intent.raw_mapping dictionary.
  • A number of new features are added to the data:
    • Time taken to complete the survey
    • Some simple features based on the free text
      • Number of characters in the string.
      • Ratios of capital letters and of exclamation marks to the total number of characters.
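
The free-text features listed above can be sketched roughly as follows. This is a minimal illustration, not the code used by clean_raw(): the function name string_features and the output column names n_chars, caps_ratio and exclam_ratio are assumptions made for this example.

```python
import pandas as pd


def string_features(text: pd.Series) -> pd.DataFrame:
    """Derive simple features from a free-text column (hypothetical helper)."""
    # Treat missing comments as empty strings so the .str accessor always works.
    text = text.fillna("").astype(str)
    n_chars = text.str.len()
    n_caps = text.str.count(r"[A-Z]")
    n_exclam = text.str.count("!")
    # Empty comments have no characters; mask them to avoid dividing by zero.
    denom = n_chars.where(n_chars > 0)
    return pd.DataFrame({
        "n_chars": n_chars,
        "caps_ratio": (n_caps / denom).fillna(0.0),
        "exclam_ratio": (n_exclam / denom).fillna(0.0),
    })


comments = pd.Series(["HELP me!", None, "fine"])
features = string_features(comments)
```

Empty or missing comments are given a ratio of zero here; that is a design choice for the sketch, and the module may handle them differently.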

Determining the org and section

The page that the user was visiting when they were asked to complete the survey is recorded in a cleaned field called full_url. In this step the URLs are cleaned according to a number of rules, and the unique URLs are then extracted and looked up using the GOV.UK content API. The lookup returns an organisation (org) and a section (section) for each URL.

These data are then merged back into the intent.data dataframe. This step is completed with:

intent.api_lookup()

This step is verbose, and can take a while if there are a large number of URLs to look up.
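
The reason for extracting unique URLs first is that each page then needs only one request, however many respondents visited it. The pattern can be sketched as below. This is not the implementation of api_lookup(): the helper lookup_and_merge, the stand-in fake_lookup (which replaces the real API request) and the example values are all hypothetical.

```python
import pandas as pd


def lookup_and_merge(data: pd.DataFrame, lookup) -> pd.DataFrame:
    """Look up each unique URL once, then merge the results back."""
    unique_urls = data["full_url"].dropna().unique()
    rows = []
    for url in unique_urls:
        org, section = lookup(url)
        rows.append({"full_url": url, "org": org, "section": section})
    results = pd.DataFrame(rows, columns=["full_url", "org", "section"])
    # A left merge keeps every survey response, even when the lookup found nothing.
    return data.merge(results, on="full_url", how="left")


calls = []


def fake_lookup(url):
    # Hypothetical stand-in for a content API request; records each call.
    calls.append(url)
    table = {"/vehicle-tax": ("DVLA", "driving")}
    return table.get(url, (None, None))


data = pd.DataFrame({"full_url": ["/vehicle-tax", "/vehicle-tax", "/unknown"]})
merged = lookup_and_merge(data, fake_lookup)
```

In this example three responses trigger only two lookups, and the response whose URL is not found keeps empty org and section values.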

Preparing the data for training or prediction

Assuming all has gone well so far, the next step is to prepare the data for training or prediction using a machine learning algorithm. This is done with the methods intent.trainer() and intent.predictor() respectively.

When calling intent.trainer(), a list of classes must be passed as an argument. As part of the method, all classes that are not in the list are merged into a single class, enabling one-versus-all (OVA) classification.
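
The class merging used for OVA classification can be illustrated with a short sketch. It is not the body of trainer(): the function collapse_classes, the label "other" and the example class names are assumptions chosen for illustration.

```python
import pandas as pd


def collapse_classes(labels: pd.Series, keep: list, other: str = "other") -> pd.Series:
    """Keep the listed classes and merge every remaining class into one label."""
    # Series.where keeps values where the condition holds and replaces the rest.
    return labels.where(labels.isin(keep), other)


# Hypothetical class names, for illustration only.
labels = pd.Series(["ok", "finding-general", "complaint", "ok", "service-problem"])
collapsed = collapse_classes(labels, keep=["ok"])
```

With keep=["ok"], the result contains only the two labels "ok" and "other", which is the shape a binary one-versus-all classifier expects.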

Using the predictor() method will remove the outcome class, if it was present.

The data are now ready for the application of a machine learning algorithm.

classifyintents's People

Contributors

ivyleavedtoadflax, augustt


classifyintents's Issues

Smartsurvey column titles have changed

In 0.6.1 the package expects the Q5 column to be titled Q5.1., whereas the latest downloads from SmartSurvey title it Q5. Overall, how did you feel about your visit to GOV.UK today?.

Output a more sensible date format

At present the output date format is MM/DD/YYYY HH:MM:SS. This causes a number of problems downstream, so dates should be saved in a less ambiguous format.
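
As one possible approach, the dates could be parsed with an explicit format and written out as ISO 8601 strings. The issue does not name a target format, so ISO 8601 is an assumption here, as are the sample values.

```python
import pandas as pd

# Sample values in the current MM/DD/YYYY HH:MM:SS output format.
raw = pd.Series(["03/04/2017 14:05:09", "12/31/2016 23:59:59"])

# An explicit format stops 03/04 from being read as 3 April.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S")

# ISO 8601 strings sort correctly as text and put the year first.
iso = parsed.dt.strftime("%Y-%m-%dT%H:%M:%S")
```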

Creation of string features fails on empty columns

When survey.clean_raw() is run on a survey.raw object containing columns that are entirely NaN, it fails because pandas infers the dtype of those columns as float64 rather than a string type.

There was an error converting strings to string length column!
Original error message:
AttributeError('Can only use .str accessor with string values, which use np.object_ dtype in pandas',)
There was an error cleaning the 1      NaN
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
8      NaN
9      NaN
10     NaN
11     NaN
12     NaN
13     NaN
14     NaN
15     NaN
16     NaN
17     NaN
18     NaN
19     NaN
20     NaN
21     NaN
22     NaN
23     NaN
24     NaN
25     NaN
26     NaN
27     NaN
28     NaN
29     NaN
30     NaN
        ..
2954   NaN
2955   NaN
2956   NaN
2957   NaN
2958   NaN
2959   NaN
2960   NaN
2961   NaN
2962   NaN
2963   NaN
2964   NaN
2965   NaN
2966   NaN
2967   NaN
2968   NaN
2969   NaN
2970   NaN
2971   NaN
2972   NaN
2973   NaN
2974   NaN
2975   NaN
2976   NaN
2977   NaN
2978   NaN
2979   NaN
2980   NaN
2981   NaN
2982   NaN
2983   NaN
Name: comment_other_where_for_help, dtype: float64 column.
Original error message:
AttributeError('Can only use .str accessor with string values, which use np.object_ dtype in pandas',)
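
The failure can be reproduced outside the module, and one possible workaround is to convert the column to strings before using the .str accessor. The sketch below is a suggestion only, not the fix adopted in the package; replacing NaN with an empty string is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

# A column that is entirely NaN is inferred as float64.
empty = pd.Series([np.nan, np.nan, np.nan], name="comment_other_where_for_help")

try:
    empty.str.len()
    failed = False
except AttributeError:
    # The .str accessor is not available on float64 columns.
    failed = True

# Possible workaround: cast to object and replace NaN with empty strings.
as_text = empty.astype(object).where(empty.notna(), "")
lengths = as_text.str.len()
```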

Implement gov.uk content api lookup as method

Make the lookup of organisation and section an automated step by including it in the survey class.

Will need to create blank org and section fields in case this step is not completed in order to ensure compatibility with the data coming from the classified survey response spreadsheets.

Make raw_mapping more robust

intent.raw_mapping is extremely sensitive to small changes to the format of the survey, see for instance #20.

This could be improved by adding some parsing of field names before matching (to remove whitespace, and punctuation, for example).
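
A minimal sketch of this idea follows. The functions normalise and rename_columns, and the one-entry mapping, are hypothetical and do not reflect the real contents of intent.raw_mapping.

```python
import re


def normalise(name: str) -> str:
    """Lower-case a column title and strip punctuation and extra whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def rename_columns(columns, mapping):
    """Match on normalised titles so small formatting changes still map."""
    lookup = {normalise(key): value for key, value in mapping.items()}
    # Columns without a match keep their original title.
    return [lookup.get(normalise(col), col) for col in columns]


# Hypothetical mapping entry, for illustration only.
mapping = {"Q5. Overall, how did you feel?": "satisfaction"}
columns = ["  Q5 Overall how did you feel ", "UserID"]
renamed = rename_columns(columns, mapping)
```

This tolerates changes in whitespace, case and punctuation, but not a reworded question, which would still need an update to the mapping.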

Add time taken feature

Add a feature for the time taken to complete the form, calculated by subtracting start_date from end_date.
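
The subtraction could look something like the sketch below. The column name time_taken, the choice of seconds as the unit and the sample dates are assumptions; the date format is the MM/DD/YYYY HH:MM:SS format mentioned in the date format issue.

```python
import pandas as pd

# Sample survey rows with start and end timestamps as strings.
data = pd.DataFrame({
    "start_date": ["03/04/2017 14:05:09"],
    "end_date": ["03/04/2017 14:07:39"],
})

fmt = "%m/%d/%Y %H:%M:%S"
start = pd.to_datetime(data["start_date"], format=fmt)
end = pd.to_datetime(data["end_date"], format=fmt)

# Subtracting datetimes gives a timedelta; store it as a number of seconds.
data["time_taken"] = (end - start).dt.total_seconds()
```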
