PSIMiner

JetBrains Research

PSIMiner — a tool for processing PSI trees from the IntelliJ Platform. PSI trees contain code syntax trees as well as functions to work with them, and therefore can be used to enrich code representation using static analysis algorithms of modern IDEs.

PSIMiner is a plugin for IntelliJ IDEA that runs the IDE in headless mode and creates datasets for ML pipelines.

Complete documentation for the different parts is stored in the docs folder.

Installation

PSIMiner requires Java 11 to work correctly. Make sure Gradle uses the correct version; all other dependencies will be installed automatically.

Use ./gradlew build (or gradlew.bat build on Windows) to build the tool.

Usage

There are predefined run configurations compatible with IntelliJ IDEA. Open or import the project in the IDE and run the tool on the test data or start the tests. You can modify these configurations to suit your needs.

It is also possible to run the tool through the CLI. The easiest way is to use the predefined shell script (Unix systems only):

./psiminer.sh $dataset_path $output_folder $JSON_config
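For example, a run on a local dataset might look like the line below; the config file name here is only an illustration, use one of the JSON files shipped in the configs folder or your own:

./psiminer.sh ./dataset ./output ./configs/my-config.json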

Logs

On each run, PSIMiner automatically stores logs in the user's home directory. Check ~/psiminer.log (or something like C:\Users\yourusername\psiminer.log on Windows) and attach it when reporting a problem.

Configuration

PSIMiner is completely configured by JSON. Check the examples in the configs folder.

Logically, PSIMiner consists of the following parts (full documentation for each of them is in the docs folder):

  • Tree transformations — an interface for enriching trees with new information and performing other useful manipulations, e.g. resolving types or excluding whitespace.
  • Filters — an interface for removing bad trees from the data, e.g. trees that are too big.
  • Label extractor — an interface that defines how labels are extracted from raw trees, e.g. extracting the method name for each method.
  • Storage — an interface that defines how trees should be saved on disk, e.g. the code2seq format or the JSONL format.

There are also a few top-level fields that define the parser and pipeline options, for example, which language to process.
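To give a feel for how these parts map onto the JSON, here is a purely illustrative sketch; every field name and value below is an assumption rather than the actual schema, so check the examples in the configs folder and the docs folder for the real option names:

{
    "language": "Java",
    "treeTransformations": [
        { "name": "exclude whitespace" }
    ],
    "filters": [
        { "name": "by tree size", "maxSize": 1000 }
    ],
    "labelExtractor": { "name": "method name" },
    "storage": { "name": "code2seq", "pathLength": 9, "pathWidth": 2 }
}

Each block corresponds to one of the parts listed above: transformations enrich or clean the trees, filters drop unwanted ones, the label extractor produces the target, and the storage defines the output format.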

Additional preprocessing

If you turn on additional preprocessing:

  • ✅ more projects will be opened successfully by IDEA
  • ⚠️ files in your original dataset will be changed

More about additional preprocessing

Language support

Currently, PSIMiner supports Java and Kotlin datasets. However, we developed the tool with the possibility of extending it to new languages, and since PSI trees support a large number of languages, adding a new language requires implementing only a few interfaces.

Be aware that some tree transformations can't be adapted to new languages automatically and therefore require manual work to support a new language.

If you would like to see support for new languages, don't hesitate to create an issue requesting them, or even implement them yourself and open a pull request.

Use as dependency

You can reuse different parts of PSIMiner inside your own tool, e.g. a plugin for model inference. To add the core part of the tool (without the CLI dependency), add the following code to your build.gradle.kts file:

dependencies {
    implementation("org.jetbrains.research.psiminer:psiminer-core") {
        version {
            branch = "main"
        }
    }
}

Remember that PSIMiner is a plugin for IntelliJ IDEA and, therefore, can be integrated only into another plugin.

Citation

The paper dedicated to PSIMiner was published at MSR'21. If you use PSIMiner in your academic work, please cite it.

@inproceedings{spirin_psiminer,
  author={Spirin, Egor and Bogomolov, Egor and Kovalenko, Vladimir and Bryksin, Timofey},
  booktitle={2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR)}, 
  title={PSIMiner: A Tool for Mining Rich Abstract Syntax Trees from Code}, 
  year={2021},
  pages={13-17},
  doi={10.1109/MSR52588.2021.00014}
}


Contributors

dimart, egor-bogomolov, furetur, illided, koushik1703, malodetz, max-martynov, spirinegor


Issues

Support graph mining for Python

As a part of mining CodeSearchNet graphs (#31), we should be able to mine Python data. It should include:

  • Extraction of PSI for Python code
  • Extraction of basic edges (AST, next token)
  • Data flow edges
  • Control flow edges

Support graph mining for Ruby

As a part of mining CodeSearchNet graphs (#31), we should be able to mine Ruby data. It should include:

  • Extraction of PSI for Ruby code
  • Extraction of basic edges (AST, next token)
  • Data flow edges
  • Control flow edges

parsing errors

Does it provide parsing errors given some buggy code?

gradle build error, need help

Hi All,

I downloaded the source code and opened it with IDEA version 2020.3.

It starts building, but fails. The log is:

A problem occurred configuring root project 'psiminer'.

Could not resolve all artifacts for configuration ':classpath'.
Could not resolve org.jetbrains.intellij.plugins:gradle-intellij-plugin:1.1.4.
Required by:
project : > org.jetbrains.intellij:org.jetbrains.intellij.gradle.plugin:1.1.4
> No matching variant of org.jetbrains.intellij.plugins:gradle-intellij-plugin:1.1.4 was found. The consumer was configured to find a runtime of a library compatible with Java 8, packaged as a jar, and its dependencies declared externally, as well as attribute 'org.gradle.plugin.api-version' with value '7.0.1' but:
- Variant 'apiElements' capability org.jetbrains.intellij.plugins:gradle-intellij-plugin:1.1.4 declares a library, packaged as a jar, and its dependencies declared externally:
- Incompatible because this component declares an API of a component compatible with Java 11 and the consumer needed a runtime of a component compatible with Java 8
- Other compatible attribute:
- Doesn't say anything about org.gradle.plugin.api-version (required '7.0.1')
- Variant 'runtimeElements' capability org.jetbrains.intellij.plugins:gradle-intellij-plugin:1.1.4 declares a runtime of a library, packaged as a jar, and its dependencies declared externally:
- Incompatible because this component declares a component compatible with Java 11 and the consumer needed a component compatible with Java 8
- Other compatible attribute:
- Doesn't say anything about org.gradle.plugin.api-version (required '7.0.1')

  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

It seems that it fails to download dependencies.

Could anyone help?

Thanks!
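The key line in the log is that the build requires Java 11 while Gradle was started with a Java 8 runtime. One possible workaround (an assumption, not an official fix) is to point Gradle at a local JDK 11 installation, either via the Gradle JVM setting in IDEA or in gradle.properties; the path below is a placeholder:

# gradle.properties — point the build at a JDK 11 installation (path is a placeholder)
org.gradle.java.home=/path/to/jdk-11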

Support graph mining for JS

As a part of mining CodeSearchNet graphs (#31), we should be able to mine JS data. It should include:

  • Extraction of PSI for JS code
  • Extraction of basic edges (AST, next token)
  • Data flow edges
  • Control flow edges

Support graph mining for PHP

As a part of mining CodeSearchNet graphs (#31), we should be able to mine PHP data. It should include:

  • Extraction of PSI for PHP code
  • Extraction of basic edges (AST, next token)
  • Data flow edges
  • Control flow edges

Psiminer doesn't work for large trees

The java-med dataset has a file train/stanfordnlp__CoreNLP/src/edu/stanford/nlp/process/PTBLexer.java containing 76,704 lines of code. IDEA parses this file as a single PsiPlainText element, and psiminer does the same. I think the miner should skip such files with a warning.

Incomplete path contexts

While evaluating a code2seq model on the apache__hbase and wildfly__wildfly projects from the test part of the java-med dataset, preprocessed via psiminer (see the config), I got the following errors:

wildfly__wildfly
Global seed set to 7
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing: 0it [00:00, ?it/s][W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Testing:  93%|████████████████████████████████████████████████████████████████▍    | 14/15 [00:13<00:00,  1.14it/s]Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ubuntu/fine-tuning/fine-tuning-ml-models/scripts/test_all.py", line 42, in <module>
    test_all(args.dataset, args.model, args.results)
  File "/home/ubuntu/fine-tuning/fine-tuning-ml-models/scripts/test_all.py", line 25, in test_all
    metrics = test_single(model_path, os.path.join(PREPROCESSED_DATASETS_DIR, project_name))
  File "/home/ubuntu/fine-tuning/fine-tuning-ml-models/scripts/test_single.py", line 21, in test_single
    results = test(model_path, project_path, batch_size=1)
  File "dependencies/code2seq_repo/code2seq/test.py", line 57, in test
    return trainer.test(model, datamodule=data_module)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in test
    results = self._run(model)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 795, in dispatch
    self.accelerator.start_evaluating(self)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 99, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 148, in start_evaluating
    self._results = trainer.run_stage()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 806, in run_stage
    return self.run_evaluate()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1049, in run_evaluate
    eval_loop_results = self.run_evaluation()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 954, in run_evaluation
    for batch_idx, batch in enumerate(dataloader):
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "dependencies/code2seq_repo/code2seq/dataset/path_context_dataset.py", line 82, in __getitem__
    splitted_contexts = [self._split_context(str_contexts[i]) for i in context_indexes]
  File "dependencies/code2seq_repo/code2seq/dataset/path_context_dataset.py", line 82, in <listcomp>
    splitted_contexts = [self._split_context(str_contexts[i]) for i in context_indexes]
  File "dependencies/code2seq_repo/code2seq/dataset/path_context_dataset.py", line 51, in _split_context
    from_token, path_nodes, to_token = context.split(",")
ValueError: not enough values to unpack (expected 3, got 2)
Exception ignored in: <function tqdm.__del__ at 0x7ff4422eeaf0>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/tqdm/std.py", line 1122, in __del__
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/tqdm/std.py", line 1335, in close
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/tqdm/std.py", line 1514, in display
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/tqdm/std.py", line 1125, in __repr__
  File "/home/ubuntu/anaconda3/envs/fine-tuning-env/lib/python3.8/site-packages/tqdm/std.py", line 1475, in format_dict
TypeError: cannot unpack non-iterable NoneType object 

It seems that part of the path context is missing. Note that it is also the last line of the .c2s file, so the preprocessed file looks incomplete.
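A quick way to confirm this is to scan the generated .c2s file for malformed contexts before feeding it to the model. The sketch below is not part of PSIMiner; it assumes the usual code2seq layout, where each line is a label followed by space-separated contexts of the form from_token,path_nodes,to_token, and the file path is a placeholder:

import java.io.File

// Minimal validation sketch (not part of PSIMiner): report .c2s lines whose
// contexts do not split into exactly three comma-separated parts.
fun main() {
    File("wildfly__wildfly/test.c2s").useLines { lines ->          // placeholder path
        lines.forEachIndexed { index, line ->
            val contexts = line.trim().split(" ").drop(1)          // drop the label
            val broken = contexts.filter { it.split(",").size != 3 }
            if (broken.isNotEmpty()) {
                println("line ${index + 1}: ${broken.size} malformed context(s), e.g. '${broken.first()}'")
            }
        }
    }
}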

Change plugin-utilities dependency declaration

!!!Actions required!!!
We have published the current plugin-utilities lib version from the master branch to the Space Maven repository and plan to modify the API in a week (so the master branch will become invalid). Please replace the git-based installation of the plugin-utilities lib in your build.gradle.kts with a standard dependency declaration by adding the Space repository and declaring the required dependencies:

repositories {
    maven("https://packages.jetbrains.team/maven/p/big-code/bigcode")
}

dependencies {
    implementation("org.jetbrains.research:plugin-utilities-core:1.0")
}

Support graph mining for Go

As a part of mining CodeSearchNet graphs (#31), we should be able to mine Go data. It should include:

  • Extraction of PSI for Go code
  • Extraction of basic edges (AST, next token)
  • Data flow edges
  • Control flow edges
