log-recommender's Issues

Consider words like naïve and Café as English

Currently, all words that contain non-ASCII chars are considered non-English. However, there are English words, like naïve and Café, that contain non-ASCII chars.

A possible solution is to remove accents, e.g.:
Café -> Cafe
Naïve -> Naive
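
A minimal sketch of the accent-removal step, using Python's standard unicodedata module (the function name is illustrative):

import unicodedata

def strip_accents(word):
    # Decompose each char (e.g. 'é' -> 'e' + combining accent),
    # then drop the combining marks.
    normalized = unicodedata.normalize('NFKD', word)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

assert strip_accents('naïve') == 'naive'
assert strip_accents('Café') == 'Cafe'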

Implement a separate script for vocabulary size calculation

logrec.dataprep.vocabsize.py

Currently, the vocabulary of the preprocessed datasets is calculated as the first step of language model training. Recently, some logic has been added for calculating the vocab size with that script without training the whole model. However, it is still quite a cumbersome procedure.
That's why it totally makes sense to have a separate script for this. An additional feature could be saving information about the vocab size for 1, 2, 5, 15, 50, etc. percent of the dataset. Moreover, multiple files should be processed in parallel.

Along with the vocabulary size, the words that comprise the vocabulary and how frequently they occur should be returned. Saving this information for each percentage of the data is not necessary; only the number of OOV words (non-ASCII ones, to begin with) should be saved for the different percentages.
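
A rough sketch of what such a script could look like, assuming the preprocessed files are plain whitespace-tokenized text (the file pattern and the percentage steps are illustrative):

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

PERCENTS = [1, 2, 5, 15, 50, 100]

def count_tokens(path):
    # Token frequencies of one preprocessed file.
    with path.open(encoding='utf-8') as f:
        return Counter(f.read().split())

def build_vocab(dataset_dir):
    files = sorted(Path(dataset_dir).rglob('*.parsed'))
    checkpoints = {max(1, len(files) * p // 100): p for p in PERCENTS}
    totals = Counter()
    with ProcessPoolExecutor() as pool:  # files are processed in parallel
        for i, counts in enumerate(pool.map(count_tokens, files), 1):
            totals.update(counts)
            if i in checkpoints:  # report vocab size at 1, 2, 5, ... percent
                print(f'{checkpoints[i]}% of files: vocab size = {len(totals)}')
    return totals  # the words themselves and their frequencies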

So, the subtasks would be:

  • Implement the script
  • Remove existing logic from lang_model.py script that allows running 'lang model training' in calc-vocab-size mode only
  • Implement calling this script from cli (logrec vocab build) hlibbabii/log-recommender-cli#2

Add parsing of log statements to parsing-preprocessing framework

Currently, the following line:
logger.info("The value is " + val);

will be parsed as

[ProcessableToken("logger"), '.', ProcessableToken('info'), StringLiteral([ProcessableToken("The"), ProcessableToken("value"), ProcessableToken("is")]), '+', ProcessableToken('val')]

In other words, logger and info are treated the same as any other identifier.

Since we are interested in logging statements, it would make sense to treat them in a special way.
So, instead, the original expression could be parsed as something like:

LogStatement(level=info, text=[ProcessableToken("The"), ProcessableToken("value"), ProcessableToken("is"), VarPlaceholder()], vars=[ProcessableToken('val')])
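
A minimal sketch of what such a parse-tree node could look like (the class and field definitions below are illustrative, not the project's actual model classes):

from dataclasses import dataclass, field

@dataclass
class VarPlaceholder:
    """Marks where a variable is spliced into the log text."""

@dataclass
class LogStatement:
    level: str                                 # e.g. 'info', 'warn'
    text: list = field(default_factory=list)   # tokens of the literal message, with placeholders
    vars: list = field(default_factory=list)   # the interpolated variables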

Review dictionaries (en + non-eng)

The dictionaries are stored under dicts/ directory.

It would be a good idea to review them, check where they come from, double-check the licenses under which they are distributed, and, if necessary, replace/remove them or add new dictionaries.

There is a certain set of words that is not split during same-case splitting

Examples:

'einstance', 'destructee', 'executionspecification', 'combinedfragment', 'interactionoperand', 'coregion', 'durationobservation', 'timeobservation', 'commentlink', 'constraintlink'
'propertyeditor', 'propertyeditorclass', 'readmethod', 'writemethod', 'customizerclass', 'lnsep', 'zic', 'zoneinfo', 'csyntax', 'representatives'

LM05: Calc non-English words stats for each file

For each file the following values should be calculated:

  • total tokens code
  • total tokens code (unique)
  • non-eng tokens code
  • non-eng tokens code (unique)
  • total tokens strings
  • total tokens strings (unique)
  • non-eng tokens strings
  • non-eng tokens strings (unique)
  • total tokens comments
  • total tokens comments (unique)
  • non-eng tokens comments
  • non-eng tokens comments (unique)
  • non-eng examples code
  • non-eng examples strings
  • non-eng examples comments
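
In other words, the same four counters plus a sample of examples for each token category. A hypothetical accumulator sketch (is_english is a stand-in for the project's existing ASCII-based check):

from dataclasses import dataclass, field

def is_english(token):
    return token.isascii()  # stand-in for the real check

@dataclass
class TokenStats:
    total: int = 0
    unique: set = field(default_factory=set)
    non_eng: int = 0
    non_eng_unique: set = field(default_factory=set)
    non_eng_examples: list = field(default_factory=list)

    def add(self, token):
        self.total += 1
        self.unique.add(token)
        if not is_english(token):
            self.non_eng += 1
            self.non_eng_unique.add(token)
            if len(self.non_eng_examples) < 10:
                self.non_eng_examples.append(token)

# One accumulator per category for each file: code, strings, comments.
stats = {cat: TokenStats() for cat in ('code', 'strings', 'comments')}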

Steps

[x] Save information about the file names that were parsed:
<project_name>.filenames files contain the names of the files that were parsed into the <project_name>.parsed files (line n in the <project_name>.filenames file corresponds to line n in the <project_name>.parsed file)

  • nostrcom - break it down into 2 different preprocessing params

For classification use only projects that contain logs

Currently, there is a bunch of projects that contain no log statements. There are two possible reasons for that:

  • The logging is done poorly in those projects. Obviously, we shouldn't use such projects for classifier training. Therefore, when generating a dataset for classification, some logic should be implemented that ignores projects with no logs.
  • Our framework was unable to recognize the logging patterns in certain projects. In this case, the regexes for catching log statements should be improved.

The statistics about 1) the percentage of projects without logs at all, and 2) the proportion of files with and without logs within the projects that do have some logging, should be continuously monitored.
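
A sketch of the filtering step under these assumptions (the regex below is a simplistic stand-in for the framework's real log-detection patterns):

import re
from pathlib import Path

LOG_CALL = re.compile(r'\b(?:log|logger)\.(?:trace|debug|info|warn|error)\s*\(')

def has_logs(project_dir):
    return any(LOG_CALL.search(f.read_text(errors='ignore'))
               for f in Path(project_dir).rglob('*.java'))

def projects_for_classification(projects):
    kept = [p for p in projects if has_logs(p)]
    # Monitor the share of projects that get filtered out.
    print(f'{100 * (len(projects) - len(kept)) / len(projects):.1f}% of projects have no logs')
    return kept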

Implement generation of dataset for log position training

import java.mypackage

int a() {
log.warn("!");
}
int b() {
log.info("?");
}

Given the file, the following positive case should be generated (consisting of 2 subcases for different directions):

--> import java.mypackage int a() {
} int b() { log.info("?"); } <--

Negative cases

--> import java.mypackage int a()
{ } int b() { log.info("?"); } <--
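
A hypothetical sketch of how such cases could be generated from a tokenized file (the names and the negative-sampling strategy are assumptions):

def make_cases(tokens, log_positions, shift=1):
    # tokens: the file's token stream with log statements removed;
    # log_positions: indices where a log statement originally stood.
    cases = []
    for pos in log_positions:
        # Positive: context before and after the true log position.
        cases.append((tokens[:pos], tokens[pos:], True))
        # Negative: the split point moved off the true position.
        cases.append((tokens[:pos - shift], tokens[pos - shift:], False))
    return cases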

Fix Unicode decode errors

During parsing, some files fail to parse because of UnicodeDecodeError.

A possible reason is that those files are not encoded in UTF-8. If that is so, maybe there is a way to 'guess' the encoding; if not, we should at least calculate the percentage of such files to include in the report.
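
One possible approach, assuming the third-party chardet package may be used (the function name is illustrative):

import chardet

def read_with_guessed_encoding(path):
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        guess = chardet.detect(raw)['encoding']
        # If no encoding can be guessed, count the file toward the report.
        return raw.decode(guess) if guess else None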

Try marking split words as <s> w1 w2 w3 <s> instead of w1 <sep> w2 <sep> w3

Currently, if we split some word, e.g. MyFavoriteClass, we represent it as my <cc_sep> favorite <cc_sep> class

There is an idea to try an alternative way to mark split words:
<s> my favorite class <s>
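
A minimal sketch of the two marking styles (the marker strings are the ones from this issue):

def mark_with_separators(subwords):
    # Current style: my <cc_sep> favorite <cc_sep> class
    return ' <cc_sep> '.join(subwords)

def mark_with_boundaries(subwords):
    # Proposed style: <s> my favorite class <s>
    return '<s> ' + ' '.join(subwords) + ' <s>'

assert mark_with_separators(['my', 'favorite', 'class']) == 'my <cc_sep> favorite <cc_sep> class'
assert mark_with_boundaries(['my', 'favorite', 'class']) == '<s> my favorite class <s>'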

Subtasks:

  • implement another preprocessing option, based on which split words will be marked one way or the other
  • add support to the cli

Make travis ci builds run faster

Currently, travis ci builds take around 12 mins to run. The reason for that is the downloading of anaconda packages. This could be improved if there were a way to cache the anaconda dependencies.

Time until completion logs are useless

Such log lines can be found in different scripts (parsing, preprocessing, vocab calculation, etc.):

INFO:root:Time elapsed: 568.12 s, estimated time until completion: 97.

They are fine if the execution starts from the beginning, but if the task has been interrupted and restarted, the predicted time is far from the truth.
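
A sketch of a fix under this assumption: measure throughput only from the resume point, not from index 0 (the names are illustrative):

import time

def eta_seconds(resume_time, done_since_resume, remaining):
    # Base the per-item rate on work done after the restart, not on
    # items that were already finished before the interruption.
    if done_since_resume == 0:
        return float('inf')
    per_item = (time.time() - resume_time) / done_since_resume
    return per_item * remaining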

Implement char pair encoding (modification of bpe - byte pair encoding)

See “Neural Machine Translation of Rare Words with Subword Units”
Sennrich, Haddow, Birch, ACL 2016 for details

We split camel-case and underscore identifiers and mark the (sub)word boundaries so that bpe doesn't merge across them, split everything into characters, and take the target vocabulary size as input to stop when it is reached. Maybe one more condition needs to be met along with the vocab size, so that frequent words that need to be represented as whole words are included in the final dictionary. As a result, we have a list of merges. At test time, we split a word into characters and then reassemble it using that list.
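
A compact sketch of the merge-learning loop in the style of Sennrich et al., stopping on vocabulary size as proposed above (all names are illustrative):

import re
from collections import Counter

def pair_stats(vocab):
    # vocab maps a space-separated symbol sequence to its corpus frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    # Replace every adjacent occurrence of the pair with the merged symbol.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    merged = ''.join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

def learn_merges(vocab, target_vocab_size):
    merges = []
    while len({s for word in vocab for s in word.split()}) < target_vocab_size:
        pairs = pair_stats(vocab)
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        vocab = apply_merge(best, vocab)
    return merges

# Words come pre-split into characters; '</w>' marks a (sub)word boundary,
# so merges never cross it.
vocab = {'m y</w>': 5, 'f a v o r i t e</w>': 3, 'c l a s s</w>': 4}
print(learn_merges(vocab, 20))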

Advantage of bpe over the training of a nn: we can specify exactly which vocab size we want to get!
