canosp2020's Issues

Refactor a branch of CrowdTruth to remove use of timing

The CrowdTruth library uses the timing associated with tag attribution events as a factor for assessing tagger and tag reliability. We currently clobber this functionality by passing a uniform time value for every tag/item/tagger triplet.

It would be ideal to remove the timing factor from the CrowdTruth code or refactor the code to optionally omit its usage in a custom fork.
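The current workaround can be sketched as follows (the column names are illustrative, not the exact CrowdTruth input schema):

```python
import pandas as pd

# Hypothetical judgment records: one row per tag/item/tagger triplet.
judgments = pd.DataFrame({
    "worker_id": ["w1", "w2", "w1"],
    "ticket_id": [101, 101, 102],
    "tag": ["crash", "crash", "sync"],
})

# Current workaround: give every judgment the same start/submit time so
# the timing factor contributes no signal to the reliability metrics.
UNIFORM_TIME = "2020-01-01T00:00:00"
judgments["started"] = UNIFORM_TIME
judgments["submitted"] = UNIFORM_TIME

# Every duration is therefore zero, i.e. identical for all workers.
durations = (pd.to_datetime(judgments["submitted"])
             - pd.to_datetime(judgments["started"]))
```

A fork that drops the timing factor would let us stop feeding these dummy values in.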

Migrate package requirements to Anaconda

Right now our requirements.txt is broken: it doesn't work on a fresh clone of the repo. We should be able to:

  1. Clone repo
  2. Install packages from requirements
  3. Run all of our scripts, notebooks, etc. without issue

We'll switch from pip's requirements.txt approach to using the Anaconda package manager.
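A minimal `environment.yml` sketch; the package names and versions here are illustrative, and the real list should be derived from what the scripts and notebooks actually import:

```yaml
name: canosp2020
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pandas
  - scikit-learn
  - spacy
  - nltk
  - jupyter
```

With this file in place, `conda env create -f environment.yml` followed by `conda activate canosp2020` replaces the `pip install -r requirements.txt` step.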

Determine a useful IDF (inverse document frequency) model for normalizing term frequency for support tickets

Possible normalization can be performed using the entirety of the SUMO corpus or a subset. Some strategies may be to:

  • use a sample of recent tickets
  • use a random subset of tickets
  • use a subset of tickets that is selected to ensure balanced representation of active Firefox versions
  • use a subset of tickets corresponding to only the current release version of Firefox

This issue should be closed by the production of a notebook comparing the different strategies and their downstream impact on some primitive models. The notebook should also specify which normalization strategy is recommended.
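A minimal sketch of the comparison, using sklearn's `TfidfVectorizer`; the two corpora here are toy stand-ins for the SUMO ticket subsets described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for two candidate normalization corpora (the real code
# would load SUMO ticket subsets; these strings are illustrative only).
recent_tickets = ["firefox crashes on startup", "firefox sync not working"]
random_tickets = ["cannot install addon", "firefox crashes when printing"]

def idf_table(corpus):
    """Fit an IDF model on a corpus and return a {term: idf} mapping."""
    vec = TfidfVectorizer().fit(corpus)
    return {term: vec.idf_[i] for term, i in vec.vocabulary_.items()}

recent_idf = idf_table(recent_tickets)
random_idf = idf_table(random_tickets)

# The same term receives different IDF weights depending on which subset
# was used to fit the model -- this is the effect the comparison notebook
# should quantify against downstream models.
print(recent_idf["firefox"], random_idf["firefox"])
```

Here "firefox" appears in every recent ticket but only one random ticket, so its IDF weight (and hence its contribution to any downstream model) differs between the two strategies.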

Sample file of 2000 tickets has 4 duplicates

The CSV file from #92 has duplicates:

cd ~/GIT/CANOSP2020/data
uniq -D mturk_tickets.csv | wc -l
8

This should be 0, not 8, right?

Please:

  1. remove the duplicates
  2. add a test or ensure that all 2000 tickets are unique
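A minimal sketch of both steps with pandas; the data and column names are illustrative stand-ins for `mturk_tickets.csv`:

```python
import pandas as pd

# Illustrative stand-in for mturk_tickets.csv (column names hypothetical).
tickets = pd.DataFrame({
    "sumo-ticket-title": ["a", "b", "a", "c"],
    "sumo-ticket-text": ["x", "y", "x", "z"],
})

# Step 1: remove the duplicates.
deduped = tickets.drop_duplicates().reset_index(drop=True)

# Step 2: the requested test -- fail loudly if any duplicates remain.
assert not deduped.duplicated().any()
```

In the real pipeline the assertion (or an equivalent unit test) would run on the full 2000-row file before it is shipped to MTurk.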

Harmonize text preprocessing workflows

Each of the different exploratory directions is using a lot of manual regex-based data cleaning.
We should consolidate that code into a utils file and use consistent preprocessing across all our different approaches.

Using t-SNE to implement a classifier

Goal

Human annotators tag support tickets with free-form text. However, many of these tags share similarities (typos, word variations, etc.), so we could cluster them into different groups.

Use the t-SNE model from sklearn to explore the possibility of using it to implement a classification algorithm.

Idea

  • Use sklearn to find segments among the variations of hand-annotated tags
  • Consider each segment as a class
  • Then apply conventional classification methods to the data (we could start with an NB classifier)
  • Also explore other classification methods, such as SVM, LSTM, RNN, etc.
  • Instead of clustering the tags, we could use t-SNE on the text (questions) and cluster that, though this may not be a good idea according to https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne
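The steps above can be sketched as follows. The tags are toy examples, and character n-grams stand in for the embedding (since many tags are not real words, word vectors are a poor fit):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Toy hand-annotated tags with typos/variants (illustrative data only).
tags = ["crash", "crashes", "crashing", "craash",
        "sync", "syncing", "synch", "sync error",
        "slow", "slowness", "very slow", "slow startup"]

# Character n-grams instead of word embeddings, since many tags are
# barely words.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(tags)

# Project to 2-D with t-SNE (perplexity must be < n_samples).
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(X.toarray())

# Cluster the embedding; each cluster would become one class label.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
print(dict(zip(tags, labels)))
```

One caveat worth noting for the classifier idea: t-SNE has no `transform` method for unseen data, so new tickets cannot be projected into an existing embedding, which is part of why the linked Stack Exchange answer discourages clustering on t-SNE output.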

Notes

  • word2vec or other pre-trained word embedding models won't work for us, since all we have is a bunch of tags that are often barely words.
  • char2vec might be handy, https://github.com/Lettria/Char2Vec


Data pre-processing

Define a data structure in Python.

Write a function in Python that takes a text argument and does the pre-processing work. The function should also implement "feature flags" that let us turn individual steps on and off:

  • Lemmatization
  • Removing punctuation
  • Removing HTML tags
  • Removing URLs
  • Extracting text from hashtags
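A regex-based sketch of the flagged function; the flag names are hypothetical, and the lemmatization step is stubbed out since the real version would call spacy or nltk:

```python
import re

def preprocess(text,
               strip_html=True,
               strip_urls=True,
               strip_punct=True,
               split_hashtags=True,
               lemmatize=False):
    """Pre-process a text, with each step behind a feature flag."""
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    if strip_urls:
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    if split_hashtags:
        text = re.sub(r"#(\w+)", r"\1", text)      # keep word, drop '#'
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    if lemmatize:
        pass  # stub: e.g. spacy " ".join(tok.lemma_ for tok in nlp(text))
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(preprocess("<p>Firefox #crash at http://example.com!</p>"))
# -> "Firefox crash at"
```

Keeping each step behind a keyword flag lets the different exploratory notebooks share one function while toggling only the steps they need.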

Prepare data aggregates for the SUMO ticket corpus

Useful aggregates that provide meaningful and actionable insights include:

  • most frequent words/word pairs used in tickets per Firefox release
  • most frequent operating systems. This is useful because we know Windows 10 and 7 are the most frequent, so if, e.g. at the start of a Firefox release, Windows 8 or OS X or Linux spikes up or down, we know something is wrong.
  • most frequent Firefox version tag. This is useful because if there's a sudden spike in an older version, then perhaps we have a problem in ESR, or a problem in the current release that is causing people to downgrade, etc.
  • most frequent Firefox "topic" tag, where "topic" is one of: Download, install and migration; Privacy and security settings; Customize controls, options and add-ons; Fix slowness, crashing, error messages and other problems; Tips and Tricks; Bookmarks; Cookies; Tabs; Websites; Firefox Sync; Other. This is useful because if one of these topic tags is trending at the start of or during a release, then we probably want to analyze those questions to see what is trending.
  • how often crash reports and about:support appear in the questions
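A minimal pandas sketch of these aggregates; the column names and rows are illustrative, not the real SUMO schema:

```python
import pandas as pd

# Illustrative ticket data (column names are hypothetical).
tickets = pd.DataFrame({
    "fx_version": ["73", "73", "72", "73", "68"],
    "os": ["Windows 10", "Windows 7", "Windows 10", "Linux", "Windows 10"],
    "topic": ["Fix slowness", "Bookmarks", "Fix slowness",
              "Firefox Sync", "Fix slowness"],
})

# Most frequent OS / version / topic -- a sudden change in any of these
# distributions is the signal the aggregates are meant to surface.
print(tickets["os"].value_counts())
print(tickets["fx_version"].value_counts())
print(tickets["topic"].value_counts())

# Per-release topic counts, e.g. to spot a topic trending in one release.
print(tickets.groupby("fx_version")["topic"].value_counts())
```

Tracking these counts over time (rather than as one-off snapshots) is what makes the spikes described above detectable.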

How to use the Naive Bayes model

Based on my research, there are two ways to apply the Naive Bayes model when training our classifier.
The first is to embed the Naive Bayes model inside meta-classifiers, such as ClassifierChain, MLkNN, RandomForestClassifier, etc.
The other is pretty simple: we use Naive Bayes directly and fit our data into it.
Which approach is better for our project?
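A sketch of both approaches with sklearn; the texts and labels are toy data, and `ClassifierChain` is used here as one example of the meta-classifier route (it targets the multi-label case, where one ticket can carry several tags):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import ClassifierChain

# Toy ticket texts (illustrative data only).
texts = ["firefox crashes on startup", "sync lost my bookmarks",
         "crash after update", "bookmarks not syncing"]
X = CountVectorizer().fit_transform(texts)

# Approach 1: Naive Bayes used directly (single label per ticket).
y_single = ["crash", "sync", "crash", "sync"]
nb = MultinomialNB().fit(X, y_single)

# Approach 2: Naive Bayes embedded in a meta-classifier -- here a
# ClassifierChain, with one binary column per possible tag.
y_multi = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # [crash, sync]
chain = ClassifierChain(MultinomialNB()).fit(X, y_multi)

print(nb.predict(X))
print(chain.predict(X))
```

The practical difference: the direct approach assumes exactly one tag per ticket, while the meta-classifier approach handles multiple tags, which may matter given how the MTurk taggers work.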

Consider removing textpipe and just sticking with spacy and nltk

We are not really getting anything out of textpipe besides stripping out the HTML tags.

It is a pain to install on our workstations, and it is hard to process large amounts of data with it at once.

I propose we replace textpipe with BeautifulSoup + spacy for the preprocessing work.
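The one job textpipe is doing for us is a one-liner with BeautifulSoup; a sketch (the HTML string is illustrative):

```python
from bs4 import BeautifulSoup

html = "<p>Firefox <b>crashes</b> when I open a new tab.</p>"

# BeautifulSoup handles the HTML stripping textpipe was doing for us.
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
print(text)  # Firefox crashes when I open a new tab.
```

The tokenization/lemmatization side would then go through spacy (or nltk) as in the rest of the preprocessing work, with no textpipe dependency.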

Add num2words to preprocessing workflow

Because some of the tokens left after our current preprocessing are numbers, it would be good to convert those numbers into words. For example, we would convert the number 70 to "seventy". After converting all the numbers to words, we don't have to treat "70" and "seventy" as different tokens, since they express the same meaning to the user.

Ingest 4490 tickets tagged by Mechanical Turk

I believe this issue makes issue #106 obsolete. I will close #106.

From our 1500 support tickets, Mechanical Turk has created this CSV file:

13march2020-630pm-Batch_3952790_batch_results.csv

Please ingest into our pipeline:

  1. Please only ingest the tickets with column Q, "AssignmentStatus", set to "Approved". Please do not ingest the tickets that are set to "Rejected". I believe this is approximately 1200 tickets.
  2. The tags are in column AD, "Answer tags"
  3. The tagger id is in column P, "WorkerId"
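The filtering step can be sketched as follows; the CSV content is a toy stand-in for the real batch file, showing only the columns named above (the real file has many more):

```python
import io
import pandas as pd

# Toy stand-in for the MTurk batch results CSV.
csv_text = """WorkerId,AssignmentStatus,Answer tags
w1,Approved,crash;startup
w2,Rejected,sync
w3,Approved,bookmarks
"""

batch = pd.read_csv(io.StringIO(csv_text))

# Step 1: keep only Approved assignments, drop Rejected ones.
approved = batch[batch["AssignmentStatus"] == "Approved"]

# Steps 2-3: pull out the tags and the tagger ids for the pipeline.
records = approved[["WorkerId", "Answer tags"]]
print(records)
```

On the real file, a sanity check that `len(approved)` is roughly the 1200 mentioned above would catch a wrong column or value early.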

Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions

Requirements:

  1. The CSV file should have the following headers (sample file with 9 questions):
    sumo-ticket-title,sumo-ticket-text
  2. The ticket text should have the HTML parsed out to plain text.
  3. The file should include the 200 tickets we have already tagged.
  4. And then a further 1800 selected randomly from our giant file of support tickets.
  5. There may be some overlap with item 3. Please make sure the file has exactly 2000 unique tickets.
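A pandas sketch of requirements 3-5; the data is a small toy demo (the real run would use the 200 tagged tickets, the giant pool, and n_total=2000), and the helper name is hypothetical:

```python
import pandas as pd

def build_batch(tagged, pool, n_total):
    """Start from the already-tagged tickets, then top up with random
    pool tickets until exactly n_total unique rows remain."""
    batch = tagged.drop_duplicates()
    # Drop pool rows that duplicate the tagged set (the overlap in req 5).
    extra = pool.drop_duplicates().merge(batch, how="left", indicator=True)
    extra = extra[extra["_merge"] == "left_only"].drop(columns="_merge")
    # Sample only as many extra rows as needed to reach n_total.
    needed = n_total - len(batch)
    batch = pd.concat([batch, extra.sample(n=needed, random_state=0)])
    assert len(batch) == n_total and not batch.duplicated().any()
    return batch

# Toy demo: 3 "already tagged" tickets + a pool of 10, target 8 unique.
tagged = pd.DataFrame({"sumo-ticket-title": ["t1", "t2", "t3"],
                       "sumo-ticket-text": ["a", "b", "c"]})
pool = pd.DataFrame({"sumo-ticket-title": [f"t{i}" for i in range(1, 11)],
                     "sumo-ticket-text": list("abcdefghij")})
batch = build_batch(tagged, pool, 8)
```

Sampling from the pool only after removing the overlap, then asserting the final count, guarantees exactly n_total unique tickets rather than "about" that many.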
