canosp2020's Issues

Refactor a branch of CrowdTruth to remove use of timing

The CrowdTruth library uses the timing associated with tag attribution events as a factor for assessing tagger and tag reliability. We currently clobber this functionality by passing a uniform time value for every tag/item/tagger triplet.

It would be ideal to remove the timing factor from the CrowdTruth code or refactor the code to optionally omit its usage in a custom fork.
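The current workaround can be sketched as follows (the column names are illustrative, not the exact CrowdTruth input schema):

```python
import pandas as pd

# Hypothetical judgment records: one row per tag/item/tagger triplet.
judgments = pd.DataFrame({
    "worker_id": ["w1", "w2", "w1"],
    "ticket_id": [101, 101, 102],
    "tag": ["crash", "crash", "sync"],
})

# Current workaround: give every judgment the same start/submit time so
# the timing factor contributes no signal to the reliability metrics.
UNIFORM_TIME = "2020-01-01T00:00:00"
judgments["started"] = UNIFORM_TIME
judgments["submitted"] = UNIFORM_TIME

# Every duration is therefore zero, i.e. identical for all workers.
durations = (pd.to_datetime(judgments["submitted"])
             - pd.to_datetime(judgments["started"]))
```

A fork that drops the timing factor would let us stop feeding these dummy values in.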

Migrate package requirements to Anaconda

Right now our requirements.txt is broken: it doesn't work on a fresh clone of the repo. We should be able to:

  1. Clone repo
  2. Install packages from requirements
  3. Run all of our scripts, notebooks, etc. without issue

We'll switch from pip's requirements.txt approach to using the Anaconda package manager.
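A minimal `environment.yml` sketch; the package names and versions here are illustrative, and the real list should be derived from what the scripts and notebooks actually import:

```yaml
name: canosp2020
channels:
  - conda-forge
dependencies:
  - python=3.7
  - pandas
  - scikit-learn
  - spacy
  - nltk
  - jupyter
```

With this file in place, `conda env create -f environment.yml` followed by `conda activate canosp2020` replaces the `pip install -r requirements.txt` step.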

Determine a useful IDF (inverse document frequency) model for normalizing term frequency for support tickets

Possible normalization can be performed using the entirety of the SUMO corpus or a subset. Some strategies may be to:

  • use a sample of recent tickets
  • use a random subset of tickets
  • use a subset of tickets that is selected to ensure balanced representation of active Firefox versions
  • use a subset of tickets corresponding to only the current release version of Firefox

This issue should be closed by the production of a notebook comparing the different strategies and their downstream impact on some primitive models. The notebook should also specify which normalization strategy is recommended.
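A minimal sketch of the comparison, using sklearn's `TfidfVectorizer`; the two corpora here are toy stand-ins for the SUMO ticket subsets described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for two candidate normalization corpora (the real code
# would load SUMO ticket subsets; these strings are illustrative only).
recent_tickets = ["firefox crashes on startup", "firefox sync not working"]
random_tickets = ["cannot install addon", "firefox crashes when printing"]

def idf_table(corpus):
    """Fit an IDF model on a corpus and return a {term: idf} mapping."""
    vec = TfidfVectorizer().fit(corpus)
    return {term: vec.idf_[i] for term, i in vec.vocabulary_.items()}

recent_idf = idf_table(recent_tickets)
random_idf = idf_table(random_tickets)

# The same term receives different IDF weights depending on which subset
# was used to fit the model -- this is the effect the comparison notebook
# should quantify against downstream models.
print(recent_idf["firefox"], random_idf["firefox"])
```

Here "firefox" appears in every recent ticket but only one random ticket, so its IDF weight (and hence its contribution to any downstream model) differs between the two strategies.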

Sample file of 2000 tickets has 4 duplicates

The CSV file from #92 has duplicates:

cd ~/GIT/CANOSP2020/data
uniq -D mturk_tickets.csv | wc -l
8

This should be 0, not 8, right?

Please:

  1. remove the duplicates
  2. add a test or ensure that all 2000 tickets are unique
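A minimal sketch of both steps with pandas; the data and column names are illustrative stand-ins for `mturk_tickets.csv`:

```python
import pandas as pd

# Illustrative stand-in for mturk_tickets.csv (column names hypothetical).
tickets = pd.DataFrame({
    "sumo-ticket-title": ["a", "b", "a", "c"],
    "sumo-ticket-text": ["x", "y", "x", "z"],
})

# Step 1: remove the duplicates.
deduped = tickets.drop_duplicates().reset_index(drop=True)

# Step 2: the requested test -- fail loudly if any duplicates remain.
assert not deduped.duplicated().any()
```

In the real pipeline the assertion (or an equivalent unit test) would run on the full 2000-row file before it is shipped to MTurk.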

Harmonize text preprocessing workflows

Each of the different exploratory directions is using a lot of manual regex-based data cleaning.
We should consolidate that code into a utils file and use consistent preprocessing across all our different approaches.

Using t-SNE to implement a classifier

Goal

Human annotators tag support tickets with free-form text. However, many of these tags share similarities (typos, word variations, etc.), so we could cluster them into different groups.

Use the t-SNE model from sklearn to explore the possibility of using it to implement a classification algorithm.

Idea

  • Use sklearn to find segments among the variations of hand-annotated tags
  • Consider each segment as a class
  • Then apply conventional classification methods to the data (we could start with an NB classifier)
  • Also explore other classification methods, such as SVM, LSTM, RNN, etc.
  • Instead of clustering the tags, we could use t-SNE on the text (questions) and cluster that, though this may not be a good idea according to https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne
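The steps above can be sketched as follows. The tags are toy examples, and character n-grams stand in for the embedding (since many tags are not real words, word vectors are a poor fit):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

# Toy hand-annotated tags with typos/variants (illustrative data only).
tags = ["crash", "crashes", "crashing", "craash",
        "sync", "syncing", "synch", "sync error",
        "slow", "slowness", "very slow", "slow startup"]

# Character n-grams instead of word embeddings, since many tags are
# barely words.
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3)).fit_transform(tags)

# Project to 2-D with t-SNE (perplexity must be < n_samples).
emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(X.toarray())

# Cluster the embedding; each cluster would become one class label.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
print(dict(zip(tags, labels)))
```

One caveat worth noting for the classifier idea: t-SNE has no `transform` method for unseen data, so new tickets cannot be projected into an existing embedding, which is part of why the linked Stack Exchange answer discourages clustering on t-SNE output.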

Notes

  • word2vec or other pre-trained word embedding models won't work for us, since all we have is a bunch of tags that are often barely words.
  • char2vec might be handy, https://github.com/Lettria/Char2Vec


Data pre-processing

Define a data structure in Python.

Write a function in Python that takes a text argument and does the pre-processing work. The function should also implement "feature flags" that let us turn individual steps on and off:

  • Lemmatization
  • Removing punctuation
  • Removing HTML tags
  • Removing URLs
  • Extracting text from hashtags
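A regex-based sketch of the flagged function; the flag names are hypothetical, and the lemmatization step is stubbed out since the real version would call spacy or nltk:

```python
import re

def preprocess(text,
               strip_html=True,
               strip_urls=True,
               strip_punct=True,
               split_hashtags=True,
               lemmatize=False):
    """Pre-process a text, with each step behind a feature flag."""
    if strip_html:
        text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    if strip_urls:
        text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    if split_hashtags:
        text = re.sub(r"#(\w+)", r"\1", text)      # keep word, drop '#'
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)       # drop punctuation
    if lemmatize:
        pass  # stub: e.g. spacy " ".join(tok.lemma_ for tok in nlp(text))
    return re.sub(r"\s+", " ", text).strip()       # collapse whitespace

print(preprocess("<p>Firefox #crash at http://example.com!</p>"))
# -> "Firefox crash at"
```

Keeping each step behind a keyword flag lets the different exploratory notebooks share one function while toggling only the steps they need.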

Prepare data aggregates for the SUMO ticket corpus

Useful aggregates that provide meaningful and actionable insights include:

  • most frequent words/word pairs used in tickets per Firefox release
  • most frequent operating systems. This is useful because we know Windows 10 and 7 are the most frequent, so if, e.g. at the start of a Firefox release, Windows 8 or OS X or Linux spikes up or down, we know something is wrong.
  • most frequent Firefox version tag. This is useful because if there's a sudden spike in an older version, then perhaps we have a problem in ESR, or a problem in the current release that is causing people to downgrade, etc.
  • most frequent Firefox "topic" tag, where "topic" is one of: Download, install and migration; Privacy and security settings; Customize controls, options and add-ons; Fix slowness, crashing, error messages and other problems; Tips and Tricks; Bookmarks; Cookies; Tabs; Websites; Firefox Sync; Other. This is useful because if one of these topic tags is trending at the start of or during a release, then we probably want to analyze those questions to see what is trending.
  • how often crash reports and about:support appear in the questions
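A minimal pandas sketch of these aggregates; the column names and rows are illustrative, not the real SUMO schema:

```python
import pandas as pd

# Illustrative ticket data (column names are hypothetical).
tickets = pd.DataFrame({
    "fx_version": ["73", "73", "72", "73", "68"],
    "os": ["Windows 10", "Windows 7", "Windows 10", "Linux", "Windows 10"],
    "topic": ["Fix slowness", "Bookmarks", "Fix slowness",
              "Firefox Sync", "Fix slowness"],
})

# Most frequent OS / version / topic -- a sudden change in any of these
# distributions is the signal the aggregates are meant to surface.
print(tickets["os"].value_counts())
print(tickets["fx_version"].value_counts())
print(tickets["topic"].value_counts())

# Per-release topic counts, e.g. to spot a topic trending in one release.
print(tickets.groupby("fx_version")["topic"].value_counts())
```

Tracking these counts over time (rather than as one-off snapshots) is what makes the spikes described above detectable.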

How to use the Naive Bayes model

Based on my research, there are two ways to apply the Naive Bayes model when training our classifier.
The first is to embed the Naive Bayes model inside meta-classifiers, such as ClassifierChain, MLkNN, RandomForestClassifier, etc.
The other is pretty simple: we use Naive Bayes directly and fit our data into it.
Which approach is better for our project?
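A sketch of both approaches with sklearn; the texts and labels are toy data, and `ClassifierChain` is used here as one example of the meta-classifier route (it targets the multi-label case, where one ticket can carry several tags):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.multioutput import ClassifierChain

# Toy ticket texts (illustrative data only).
texts = ["firefox crashes on startup", "sync lost my bookmarks",
         "crash after update", "bookmarks not syncing"]
X = CountVectorizer().fit_transform(texts)

# Approach 1: Naive Bayes used directly (single label per ticket).
y_single = ["crash", "sync", "crash", "sync"]
nb = MultinomialNB().fit(X, y_single)

# Approach 2: Naive Bayes embedded in a meta-classifier -- here a
# ClassifierChain, with one binary column per possible tag.
y_multi = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])  # [crash, sync]
chain = ClassifierChain(MultinomialNB()).fit(X, y_multi)

print(nb.predict(X))
print(chain.predict(X))
```

The practical difference: the direct approach assumes exactly one tag per ticket, while the meta-classifier approach handles multiple tags, which may matter given how the MTurk taggers work.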

Consider removing textpipe and just sticking with spacy and nltk

We are not really getting anything out of textpipe besides stripping out the HTML tags.

It is a pain to install on our workstations, and it is hard to process large amounts of data with it at once.

I propose we replace textpipe with BeautifulSoup + spacy for the preprocessing work.
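The one job textpipe is doing for us is a one-liner with BeautifulSoup; a sketch (the HTML string is illustrative):

```python
from bs4 import BeautifulSoup

html = "<p>Firefox <b>crashes</b> when I open a new tab.</p>"

# BeautifulSoup handles the HTML stripping textpipe was doing for us.
text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
print(text)  # Firefox crashes when I open a new tab.
```

The tokenization/lemmatization side would then go through spacy (or nltk) as in the rest of the preprocessing work, with no textpipe dependency.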

Add num2words to preprocessing workflow

Because some of the tokens left after our current preprocessing are numbers, it would be good to convert those numbers into words. For example, we would convert the number 70 to "seventy". After converting all the numbers to words, we don't have to treat "70" and "seventy" as different tokens, since they express the same meaning to the user.

Ingest 4490 tickets tagged by Mechanical Turk

I believe this issue makes issue #106 obsolete. I will close #106.

From our 1500 support tickets, Mechanical Turk has created this CSV file:

13march2020-630pm-Batch_3952790_batch_results.csv

Please ingest into our pipeline:

  1. Please only ingest the tickets with column Q, "AssignmentStatus", set to "Approved". Please do not ingest the tickets that are set to "Rejected". I believe this is approximately 1200 tickets.
  2. The tags are in column AD, "Answer tags"
  3. The tagger id is in column P, "WorkerId"
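The filtering step can be sketched as follows; the CSV content is a toy stand-in for the real batch file, showing only the columns named above (the real file has many more):

```python
import io
import pandas as pd

# Toy stand-in for the MTurk batch results CSV.
csv_text = """WorkerId,AssignmentStatus,Answer tags
w1,Approved,crash;startup
w2,Rejected,sync
w3,Approved,bookmarks
"""

batch = pd.read_csv(io.StringIO(csv_text))

# Step 1: keep only Approved assignments, drop Rejected ones.
approved = batch[batch["AssignmentStatus"] == "Approved"]

# Steps 2-3: pull out the tags and the tagger ids for the pipeline.
records = approved[["WorkerId", "Answer tags"]]
print(records)
```

On the real file, a sanity check that `len(approved)` is roughly the 1200 mentioned above would catch a wrong column or value early.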

Create a CSV file with title and text in plaintext for Amazon Mechanical Turk with 2000 unique support questions

Requirements:

  1. The CSV file should have the following headers (sample file with 9 questions):
    sumo-ticket-title,sumo-ticket-text
  2. The ticket text should have the HTML parsed out to plain text.
  3. The file should include the 200 tickets we have already tagged.
  4. And then a further 1800 selected randomly from our giant file of support tickets.
  5. There may be some overlap with item 3. Please make sure the file has exactly 2000 unique tickets.
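A pandas sketch of requirements 3-5; the data is a small toy demo (the real run would use the 200 tagged tickets, the giant pool, and n_total=2000), and the helper name is hypothetical:

```python
import pandas as pd

def build_batch(tagged, pool, n_total):
    """Start from the already-tagged tickets, then top up with random
    pool tickets until exactly n_total unique rows remain."""
    batch = tagged.drop_duplicates()
    # Drop pool rows that duplicate the tagged set (the overlap in req 5).
    extra = pool.drop_duplicates().merge(batch, how="left", indicator=True)
    extra = extra[extra["_merge"] == "left_only"].drop(columns="_merge")
    # Sample only as many extra rows as needed to reach n_total.
    needed = n_total - len(batch)
    batch = pd.concat([batch, extra.sample(n=needed, random_state=0)])
    assert len(batch) == n_total and not batch.duplicated().any()
    return batch

# Toy demo: 3 "already tagged" tickets + a pool of 10, target 8 unique.
tagged = pd.DataFrame({"sumo-ticket-title": ["t1", "t2", "t3"],
                       "sumo-ticket-text": ["a", "b", "c"]})
pool = pd.DataFrame({"sumo-ticket-title": [f"t{i}" for i in range(1, 11)],
                     "sumo-ticket-text": list("abcdefghij")})
batch = build_batch(tagged, pool, 8)
```

Sampling from the pool only after removing the overlap, then asserting the final count, guarantees exactly n_total unique tickets rather than "about" that many.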
