
STRICT Replication Package

Search Keyword Identification for Concept Location using Graph-Based Term Weighting

Abstract: During maintenance, software developers deal with numerous change requests that are written in an unstructured fashion using natural language texts. These texts describe the change requirements using various domain-related concepts. Software developers need to find appropriate keywords from these texts so that they can identify the relevant locations in the source code using a search technique. Once such locations are identified, they can implement the requested changes there. Studies suggest that developers often perform poorly in choosing the right keywords from a change request. In this article, we propose a novel technique -- STRICT -- that (1) identifies suitable keywords from a change request using three graph-based term weighting algorithms -- TextRank, POSRank, and WK-Core -- and (2) then delivers an appropriate query using query quality analysis and machine learning. Our approach determines a term's importance based not only on its co-occurrences with other important terms but also on its syntactic relationships and cohesion with them. Experiments using 955 change requests from 22 Java-based subject systems show that STRICT can offer better search queries than the baseline queries (i.e., preprocessed versions of the request texts) for 44%--63% of all requests. Our queries also achieve 20% higher accuracy, 10% higher precision, and 7% higher reciprocal rank than the baseline queries. Comparisons with six existing approaches from the literature demonstrate that our approach outperforms them in improving the baseline queries, achieving 14% higher accuracy, 15% higher precision, and 13% higher reciprocal rank than those approaches.
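
The graph-based weighting at the heart of STRICT can be illustrated with a minimal TextRank-style sketch: terms become nodes, co-occurrence within a sliding window adds edges, and a PageRank-style iteration scores each term. The window size, damping factor, iteration count, function name, and sample text below are illustrative assumptions, not STRICT's actual configuration (which also incorporates POSRank and WK-Core):

```python
from collections import defaultdict

def textrank_keywords(terms, window=2, damping=0.85, iters=50):
    # Build an undirected co-occurrence graph: terms appearing within
    # `window` positions of each other are connected by an edge.
    graph = defaultdict(set)
    for i, t in enumerate(terms):
        for j in range(i + 1, min(i + window + 1, len(terms))):
            if terms[j] != t:
                graph[t].add(terms[j])
                graph[terms[j]].add(t)
    # PageRank-style iteration: a term is important if it co-occurs
    # with other important terms.
    score = {t: 1.0 for t in graph}
    for _ in range(iters):
        score = {t: (1 - damping) + damping * sum(
            score[u] / len(graph[u]) for u in graph[t]) for t in graph}
    return sorted(score, key=score.get, reverse=True)

# Illustrative change-request text (made up for this sketch).
terms = "null pointer exception when parser reads empty config file".split()
print(textrank_keywords(terms)[:3])
```

The top-ranked terms would then be assembled into a search query after query quality analysis.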

Subject Systems (22)

  • adempiere-3.1.0 (12)
  • apache-nutch-1.8 (9)
  • apache-nutch-2.1 (17)
  • atunes-1.10.0 (16)
  • bookkeeper-4.1.0 (21)
  • commons-math-3-3.0 (16)
  • derby-10.9.1.0 (24)
  • ecf (68)
  • eclipse-2.0 (17)
  • eclipse.jdt.core (30)
  • eclipse.jdt.debug (84)
  • eclipse.jdt.ui (242)
  • eclipse.pde.ui (182)
  • jedit-4.2 (10)
  • lang (42)
  • mahout-0.4 (16)
  • mahout-0.8 (15)
  • math (60)
  • openjpa-2.0.1 (16)
  • tika-1.3 (18)
  • time (7)
  • tomcat70 (33)

Total: 955

Experimental Data

  • Baseline/query : Baseline queries extracted from the 955 change requests. They use the title, description, structured tokens, and whole texts of the requests. query-whole is our chosen baseline.

  • Baseline/rank : Query Effectiveness (QE) of the baseline queries (Method-level granularity).

  • Baseline/rank-class : Query Effectiveness (QE) of the baseline queries (Document-level granularity).

  • ChangeReqs : 955 change requests from the 22 subject systems.

  • Corpus/method.7z : Corpus containing the method bodies from source code documents of all systems. Please decompress before use.

  • Corpus/norm-method.7z : Corpus containing the normalized method bodies from all the systems. Please decompress before use.

  • Corpus/*.ckeys : Corpus document-index key mapping. This is an encoding of original methods into numbers.

  • Goldset : Ground truth for 955 change requests.

  • Lucene/index-method : Lucene index for concept location.

  • SelectedBug : IDs of the 955 change requests under study.

  • SelectedBug-HQB : IDs of the 225 change requests leading to good baseline queries.

  • SelectedBug-LQB : IDs of the 730 change requests leading to poor baseline queries.

  • tokens* : Tokens generated from project source code and change requests for the token-splitting algorithm, Samurai.

  • samurai-data : Metadata required by Samurai.

  • models : POS tagging models used by the Stanford CoreNLP library.

  • pp-data : Stop words used for pre-processing.

  • strict.lib : Contains the dependencies used by the proposed technique - STRICT.
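
The tokens* and samurai-data files support Samurai, a frequency-based token-splitting algorithm. As a rough illustration of what token splitting produces, here is a minimal camelCase/underscore splitter; the real Samurai algorithm additionally uses term-frequency tables (the samurai-data files above) to split same-case concatenations, which this sketch omits. The function name split_identifier is ours, not part of the package:

```python
import re

def split_identifier(token):
    # Split on underscores/dollar signs, then on camelCase boundaries,
    # keeping acronym runs (e.g., "HTTP") together; lowercase the result.
    parts = re.split(r'[_$]+', token)
    words = []
    for part in parts:
        words += re.findall(
            r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', part)
    return [w.lower() for w in words if w]

print(split_identifier("getHTTPResponseCode"))
# ['get', 'http', 'response', 'code']
```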

Source Code

Please check the source code repository for details.

Proposed & Existing Techniques

  • Proposed-STRICT/query : Queries suggested by the proposed technique.

  • Proposed-STRICT/rank : Query Effectiveness (QE) of the suggested queries (Method-level granularity).

  • Proposed-STRICT/rank-class : Query Effectiveness (QE) of the suggested queries (Document-level granularity).

  • Proposed-STRICT/Query-Difficulty-Model : Query Difficulty model, predictions, class labels, and other materials.

  • Proposed-STRICT/Parameter-Tuning : Data related to parameter tuning of STRICT.

  • TF/query : Queries suggested by TF.

  • TF/rank : Query Effectiveness (QE) of the TF-based queries (Method-level granularity).

  • TF/rank-class : Query Effectiveness (QE) of the TF-based queries (Document-level granularity).

  • IDF/query : Queries suggested by IDF.

  • IDF/rank : Query Effectiveness (QE) of the IDF-based queries (Method-level granularity).

  • IDF/rank-class : Query Effectiveness (QE) of the IDF-based queries (Document-level granularity).

  • TF-IDF/query : Queries suggested by TF-IDF.

  • TF-IDF/rank : Query Effectiveness (QE) of the TF-IDF-based queries (Method-level granularity).

  • TF-IDF/rank-class : Query Effectiveness (QE) of the TF-IDF-based queries (Document-level granularity).

  • Kevic/query : Queries suggested by Kevic & Fritz.

  • Kevic/rank : Query Effectiveness (QE) of the Kevic & Fritz queries (Method-level granularity).

  • Kevic/rank-class : Query Effectiveness (QE) of the Kevic & Fritz queries (Document-level granularity).

  • Kevic/model : Machine learning model & prediction of search terms.

  • Rocchio/query : Queries suggested by Rocchio.

  • Rocchio/rank : Query Effectiveness (QE) of the Rocchio queries (Method-level granularity).

  • Rocchio/rank-class : Query Effectiveness (QE) of the Rocchio queries (Document-level granularity).

  • Scanniello/rank : Query Effectiveness (QE) of the Scanniello et al. queries (Method-level granularity).

  • Scanniello/rank-class : Query Effectiveness (QE) of the Scanniello et al. queries (Document-level granularity).

  • Scanniello/Method-PageRank : PageRank score calculated for Scanniello et al.

  • Scanniello/CFG* : CFG extracted from the source code of subject systems using java-callgraph.

  • Rahman & Roy/query : Queries suggested by our earlier work - Rahman & Roy, SANER 2017.

  • Rahman & Roy/rank : Query Effectiveness (QE) of the queries (Method-level granularity).

  • Rahman & Roy/rank-class : Query Effectiveness (QE) of the queries (Document-level granularity).

License & Others

  • README
  • LICENSE

Previously Accepted Papers

STRICT: Information Retrieval Based Search Term Identification for Concept Location

Mohammad Masudur Rahman, Chanchal K. Roy

Download this paper: PDF

TextRank based search term identification for software change tasks

Mohammad Masudur Rahman, Chanchal K. Roy

Download this paper: PDF

Please cite our work as

@INPROCEEDINGS{saner2017masud,
author={Mohammad Masudur Rahman and C. K. Roy},
booktitle={Proc. SANER},
title={STRICT: Information Retrieval Based Search Term Identification for Concept Location},
year={2017},
pages={79--90} }

Download this paper: PDF

@INPROCEEDINGS{saner2015masud,
author={Mohammad Masudur Rahman and C. K. Roy},
booktitle={Proc. SANER},
title={TextRank based search term identification for software change tasks},
year={2015},
pages={540--544} }

Download this paper: PDF

Related Projects: ACER, BLIZZARD, and QUICKAR

Something not working as expected?

Please contact Masud Rahman ([email protected])

or

Create a new issue for further information.

Contributors

masud-technope

Issues

Data Cleansing Manual

Phase I: Steps to be followed

  1. Select an issue report like this one.
  2. Determine whether it discusses a bug or a new feature.
  3. For example, this is a new feature, but this is a bug.
  4. Create a spreadsheet for each subject system, and mark its issue IDs as either new feature (NF) or bug (B).
  5. We have 8 subject systems, so please create separate files for the systems.
  6. We have 2,885 issue reports from the 8 systems. This should take a few days.

Phase II: Steps to be followed

  1. Check the changed files for each bug report or feature request. For example, this is the changeset for this bug report.

  2. Count the number of changed Java files in each change set.

  3. If it is only one, then it is a valid changeset.

  4. If the count<=5, take a close look at the files. Do they look related to the bug report or feature request? If yes, mark it as a valid changeset. You should spend at most 3 minutes on this.

  5. If the count>5, take a closer look at the changeset. Are the files really related to the bug report/feature request? In my experience, such changesets often include files that are unrelated to the bug fix or the feature implementation. If you find that a changeset contains one or more such files, mark it as a bloated changeset. For example, this is definitely a bloated changeset for issue report #263537. You should spend at most 6 minutes on this. If you cannot decide within 6 minutes, just mark it as a bloated changeset.

  6. The output format for each issue report entry:
    BugID, #ChangedFiles, #Valid/Bloated
    Please create separate files for individual subject systems.

  7. You can use codes such as Valid Changeset=VC and Bloated Changeset=BC.
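
The decision procedure of Phase II can be summarized as a small rule-based sketch. The function name label_changeset and the related_check predicate (standing in for the manual relatedness judgment in steps 4 and 5) are hypothetical names for illustration; the per-changeset time budgets are omitted:

```python
def label_changeset(changed_java_files, related_check):
    # Encodes the manual protocol above: a single changed file is always
    # a valid changeset; otherwise the changeset is valid (VC) only if
    # every changed file looks related to the bug report or feature
    # request, and bloated (BC) otherwise.
    if len(changed_java_files) == 1:
        return "VC"
    if all(related_check(f) for f in changed_java_files):
        return "VC"
    return "BC"

# A one-file changeset is valid regardless of the relatedness check.
print(label_changeset(["Foo.java"], lambda f: False))  # VC
```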

Tasks for the Participants

Available Features

The prototype currently offers four distinct features. We have provided 12 bug reports from the ECF repository.

  • How to execute your own query for a specific Bug ID?
  • How to execute the baseline queries for a specific change request?
  • How to get suggested queries from STRICT for a single change request?
  • How to get suggested queries from STRICT for all the change requests?

Tasks to be Performed

Using the features above, the students should answer the following questions:

Note: **QE** = *Query Effectiveness*. The lower the QE, the better the query.
However, QE = -1 means *RESULT NOT FOUND*, i.e., the worst query.
  • T1: Report your output for each of the sample commands provided in the README file.
  • T2: What is the best query you can come up with for each change request/bug report? The better your query, the higher your grade. Use the available features above to answer this question.
  • T3: What is the rank improvement between your query and the best-performing baseline query? Suppose your QE is 10 and the baseline QE is 20; then your rank improvement is 10.
  • T4: Provide your best query for each change request/bug report that is the shortest in length, i.e., contains the fewest keywords.
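
The QE metric and the rank improvement asked for in T3 can be computed as follows; query_effectiveness is a hypothetical helper name, and the file names are made up for illustration:

```python
def query_effectiveness(ranked_docs, goldset):
    # QE is the 1-based rank of the first relevant document in the
    # retrieved list; -1 means no relevant document was retrieved
    # (RESULT NOT FOUND), i.e., the worst possible query.
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in goldset:
            return rank
    return -1

goldset = {"ContainerFactory.java"}
baseline_qe = query_effectiveness(
    ["A.java", "B.java", "ContainerFactory.java"], goldset)      # QE = 3
my_qe = query_effectiveness(["ContainerFactory.java"], goldset)  # QE = 1
print(baseline_qe - my_qe)  # rank improvement: 2
```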
