log-recommender's Issues

Consider words like naïve and Café as English

Currently, all words that contain non-ASCII chars are considered non-English. However, there are English words, like naïve and Café, that contain non-ASCII chars.

A possible solution is to remove accents, e.g.:
Café -> Cafe
Naïve -> Naive
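
A minimal sketch of the accent-removal step, using Python's standard unicodedata module (the function name is illustrative):

import unicodedata

def strip_accents(word):
    # Decompose each char (e.g. 'é' -> 'e' + combining accent),
    # then drop the combining marks.
    normalized = unicodedata.normalize('NFKD', word)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

assert strip_accents('naïve') == 'naive'
assert strip_accents('Café') == 'Cafe'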

Implement a separate script for vocabulary size calculation

logrec.dataprep.vocabsize.py

Currently, the vocabulary of the preprocessed datasets is calculated as the first step of language model training. Recently, some logic has been added for calculating the vocab size with that script without training the whole model. However, it is still quite a cumbersome procedure.
That's why it totally makes sense to have a separate script for this. An additional feature could be saving information about the vocab size for 1, 2, 5, 15, 50, etc. percent of the dataset. Moreover, multiple files should be processed in parallel.

Along with the vocabulary size, the words that comprise the vocabulary and how frequently they occur should be returned. Saving this information for each percentage of the data is not necessary; only the number of OOV words (non-ASCII ones, to begin with) should be saved for the different percentages.
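
A rough sketch of what such a script could look like, assuming the preprocessed files are plain whitespace-tokenized text (the file pattern and the percentage steps are illustrative):

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

PERCENTS = [1, 2, 5, 15, 50, 100]

def count_tokens(path):
    # Token frequencies of one preprocessed file.
    with path.open(encoding='utf-8') as f:
        return Counter(f.read().split())

def build_vocab(dataset_dir):
    files = sorted(Path(dataset_dir).rglob('*.parsed'))
    checkpoints = {max(1, len(files) * p // 100): p for p in PERCENTS}
    totals = Counter()
    with ProcessPoolExecutor() as pool:  # files are processed in parallel
        for i, counts in enumerate(pool.map(count_tokens, files), 1):
            totals.update(counts)
            if i in checkpoints:  # report vocab size at 1, 2, 5, ... percent
                print(f'{checkpoints[i]}% of files: vocab size = {len(totals)}')
    return totals  # the words themselves and their frequencies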

So, the subtasks would be:

  • Implement the script
  • Remove existing logic from lang_model.py script that allows running 'lang model training' in calc-vocab-size mode only
  • Implement calling this script from cli (logrec vocab build) hlibbabii/log-recommender-cli#2

Add parsing of log statements to parsing-preprocessing framework

Currently, the following line:
logger.info("The value is " + val);

will be parsed as

[ProcessableToken("logger"), '.', ProcessableToken('info'), StringLiteral([ProcessableToken("The"), ProcessableToken("value"), ProcessableToken("is")]), '+', ProcessableToken('val')]

In other words, logger and info are treated the same as any other identifier.

Since we are interested in logging statements, it would make sense to treat them in a special way.
So, instead, the original expression could be parsed as something like:

LogStatement(level=info, text=[ProcessableToken("The"), ProcessableToken("value"), ProcessableToken("is"), VarPlaceholder()], vars=[ProcessableToken('val')])
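
A minimal sketch of what such a parse-tree node could look like (the class and field definitions below are illustrative, not the project's actual model classes):

from dataclasses import dataclass, field

@dataclass
class VarPlaceholder:
    """Marks where a variable is spliced into the log text."""

@dataclass
class LogStatement:
    level: str                                 # e.g. 'info', 'warn'
    text: list = field(default_factory=list)   # tokens of the literal message, with placeholders
    vars: list = field(default_factory=list)   # the interpolated variables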

Review dictionaries (en + non-eng)

The dictionaries are stored under dicts/ directory.

It would be a good idea to review them, check where they come from, double-check the licenses under which they are distributed, and, if necessary, replace/remove them or add new dictionaries.

There is a certain set of words that is not split during same-case splitting

Examples:

'einstance', 'destructee', 'executionspecification', 'combinedfragment', 'interactionoperand', 'coregion', 'durationobservation', 'timeobservation', 'commentlink', 'constraintlink'
'propertyeditor', 'propertyeditorclass', 'readmethod', 'writemethod', 'customizerclass', 'lnsep', 'zic', 'zoneinfo', 'csyntax', 'representatives'

LM05: Calc non-English words stats for each file

For each file the following values should be calculated:

  • total tokens code
  • total tokens code (unique)
  • non-eng tokens code
  • non-eng tokens code (unique)
  • total tokens strings
  • total tokens strings (unique)
  • non-eng tokens strings
  • non-eng tokens strings (unique)
  • total tokens comments
  • total tokens comments (unique)
  • non-eng tokens comments
  • non-eng tokens comments (unique)
  • non-eng examples code
  • non-eng examples strings
  • non-eng examples comments
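
In other words, the same four counters plus a sample of examples for each token category. A hypothetical accumulator sketch (is_english is a stand-in for the project's existing ASCII-based check):

from dataclasses import dataclass, field

def is_english(token):
    return token.isascii()  # stand-in for the real check

@dataclass
class TokenStats:
    total: int = 0
    unique: set = field(default_factory=set)
    non_eng: int = 0
    non_eng_unique: set = field(default_factory=set)
    non_eng_examples: list = field(default_factory=list)

    def add(self, token):
        self.total += 1
        self.unique.add(token)
        if not is_english(token):
            self.non_eng += 1
            self.non_eng_unique.add(token)
            if len(self.non_eng_examples) < 10:
                self.non_eng_examples.append(token)

# One accumulator per category for each file: code, strings, comments.
stats = {cat: TokenStats() for cat in ('code', 'strings', 'comments')}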

Steps

[x] Save information about the file names that were parsed:
<project_name>.filenames files contain the names of the files that were parsed into the <project_name>.parsed files (line n in the <project_name>.filenames file corresponds to line n in the <project_name>.parsed file)

  • nostrcom - break it down into 2 different preprocessing params

For classification use only projects that contain logs

Currently, there is a bunch of projects that contain no log statements. There are two possible reasons for that:

  • The logging is done poorly in those projects. Obviously, we shouldn't use such projects for classifier training. Therefore, when generating a dataset for classification, some logic should be implemented that ignores projects with no logs.
  • Our framework was unable to recognize the logging patterns in certain projects. In this case, the regexes for catching log statements should be improved.

The statistics about 1) the percentage of projects without logs at all, and 2) the proportion of files with and without logs within the projects that do have some logging, should be continuously monitored.
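
A sketch of the filtering step under these assumptions (the regex below is a simplistic stand-in for the framework's real log-detection patterns):

import re
from pathlib import Path

LOG_CALL = re.compile(r'\b(?:log|logger)\.(?:trace|debug|info|warn|error)\s*\(')

def has_logs(project_dir):
    return any(LOG_CALL.search(f.read_text(errors='ignore'))
               for f in Path(project_dir).rglob('*.java'))

def projects_for_classification(projects):
    kept = [p for p in projects if has_logs(p)]
    # Monitor the share of projects that get filtered out.
    print(f'{100 * (len(projects) - len(kept)) / len(projects):.1f}% of projects have no logs')
    return kept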

Implement generation of dataset for log position training

import java.mypackage

int a() {
log.warn("!");
}
int b() {
log.info("?");
}

Given the file, the following positive case should be generated (consisting of 2 subcases for different directions):

--> import java.mypackage int a() {
} int b() { log.info("?"); } <--

Negative cases

--> import java.mypackage int a()
{ } int b() { log.info("?"); } <--
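
A hypothetical sketch of how such cases could be generated from a tokenized file (the names and the negative-sampling strategy are assumptions):

def make_cases(tokens, log_positions, shift=1):
    # tokens: the file's token stream with log statements removed;
    # log_positions: indices where a log statement originally stood.
    cases = []
    for pos in log_positions:
        # Positive: context before and after the true log position.
        cases.append((tokens[:pos], tokens[pos:], True))
        # Negative: the split point moved off the true position.
        cases.append((tokens[:pos - shift], tokens[pos - shift:], False))
    return cases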

Fix Unicode decode errors

During parsing, some files fail to parse because of UnicodeDecodeError.

A possible reason is that those files are not encoded in UTF-8. If that is so, maybe there is a way to 'guess' the encoding; if not, we should at least calculate the percentage of such files to include in the report.
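
One possible approach, assuming the third-party chardet package may be used (the function name is illustrative):

import chardet

def read_with_guessed_encoding(path):
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        guess = chardet.detect(raw)['encoding']
        # If no encoding can be guessed, count the file toward the report.
        return raw.decode(guess) if guess else None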

Try marking split words as <s> w1 w2 w3 <s> instead of w1 <sep> w2 <sep> w3

Currently, if we split some word, e.g. MyFavoriteClass, we represent it as my <cc_sep> favorite <cc_sep> class

There is an idea to try an alternative way to mark split words:
<s> my favorite class <s>
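
A minimal sketch of the two marking styles (the marker strings are the ones from this issue):

def mark_with_separators(subwords):
    # Current style: my <cc_sep> favorite <cc_sep> class
    return ' <cc_sep> '.join(subwords)

def mark_with_boundaries(subwords):
    # Proposed style: <s> my favorite class <s>
    return '<s> ' + ' '.join(subwords) + ' <s>'

assert mark_with_separators(['my', 'favorite', 'class']) == 'my <cc_sep> favorite <cc_sep> class'
assert mark_with_boundaries(['my', 'favorite', 'class']) == '<s> my favorite class <s>'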

Subtasks:

  • implement another preprocessing option, based on which split words will be marked one way or the other
  • add support to the cli

Make travis ci builds run faster

Currently, travis ci builds take around 12 mins to run. The reason for that is the downloading of anaconda packages. This could be improved if there were a way to cache the anaconda dependencies.

Time until completion logs are useless

Such log lines can be found in different scripts (parsing, preprocessing, vocab calculation, etc.):

INFO:root:Time elapsed: 568.12 s, estimated time until completion: 97.

They are fine if the execution starts from the beginning, but if the task has been interrupted and restarted, the predicted time is far from the truth.
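
A sketch of a fix under this assumption: measure throughput only from the resume point, not from index 0 (the names are illustrative):

import time

def eta_seconds(resume_time, done_since_resume, remaining):
    # Base the per-item rate on work done after the restart, not on
    # items that were already finished before the interruption.
    if done_since_resume == 0:
        return float('inf')
    per_item = (time.time() - resume_time) / done_since_resume
    return per_item * remaining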

Implement char pair encoding (modification of bpe - byte pair encoding)

See “Neural Machine Translation of Rare Words with Subword Units”
Sennrich, Haddow, Birch, ACL 2016 for details

We split camel-case and underscore identifiers and mark the (sub)word boundaries so that bpe doesn't merge across them, split everything into characters, and take the target vocabulary size as input to stop when it is reached. Maybe one more condition needs to be met along with the vocab size, so that frequent words that need to be represented as whole words are included in the final dictionary. As a result, we have a list of merges. At test time, we split a word into characters and then reassemble it using that list.
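
A compact sketch of the merge-learning loop in the style of Sennrich et al., stopping on vocabulary size as proposed above (all names are illustrative):

import re
from collections import Counter

def pair_stats(vocab):
    # vocab maps a space-separated symbol sequence to its corpus frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_merges(vocab, target_vocab_size):
    merges = []
    while len({s for word in vocab for s in word.split()}) < target_vocab_size:
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        vocab = apply_merge(best, vocab)
    return merges

# Words come pre-split into characters; '</w>' marks a (sub)word boundary,
# so merges never cross it.
vocab = {'m y</w>': 5, 'f a v o r i t e</w>': 3, 'c l a s s</w>': 4}
print(learn_merges(vocab, 20))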

Advantage of bpe over the training of a nn: we can specify exactly which vocab size we want to get!
