
STRICT Replication Package

Search Keyword Identification for Concept Location using Graph-Based Term Weighting

Abstract: During maintenance, software developers deal with numerous change requests that are written in an unstructured fashion using natural language texts. These texts describe the change requirements using various domain-related concepts. Software developers need to find appropriate keywords from these texts so that they can identify the relevant locations in the source code using a search technique. Once such locations are identified, they can implement the requested changes there. Studies suggest that developers often perform poorly in choosing the right keywords from a change request. In this article, we propose a novel technique -- STRICT -- that (1) identifies suitable keywords from a change request using three graph-based term weighting algorithms -- TextRank, POSRank, and WK-Core -- and (2) then delivers an appropriate query using query quality analysis and machine learning. Our approach determines a term's importance based not only on its co-occurrences with other important terms but also on its syntactic relationships and cohesion with them. Experiments using 955 change requests from 22 Java-based subject systems show that STRICT can offer better search queries than the baseline queries (i.e., preprocessed versions of the request texts) for 44%--63% of all requests. Our queries also achieve 20% higher accuracy, 10% higher precision, and 7% higher reciprocal rank than the baseline queries. Comparisons with six existing approaches from the literature demonstrate that our approach outperforms them in improving the baseline queries, achieving 14% higher accuracy, 15% higher precision, and 13% higher reciprocal rank than those approaches.
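
The graph-based weighting at the heart of STRICT can be illustrated with a minimal TextRank-style sketch: terms become nodes, co-occurrence within a sliding window adds edges, and a PageRank-style iteration scores each term. The window size, damping factor, iteration count, function name, and sample text below are illustrative assumptions, not STRICT's actual configuration (which also incorporates POSRank and WK-Core):

```python
from collections import defaultdict

def textrank_keywords(terms, window=2, damping=0.85, iters=50):
    # Build an undirected co-occurrence graph: terms appearing within
    # `window` positions of each other are connected by an edge.
    graph = defaultdict(set)
    for i, t in enumerate(terms):
        for j in range(i + 1, min(i + window + 1, len(terms))):
            if terms[j] != t:
                graph[t].add(terms[j])
                graph[terms[j]].add(t)
    # PageRank-style iteration: a term is important if it co-occurs
    # with other important terms.
    score = {t: 1.0 for t in graph}
    for _ in range(iters):
        score = {t: (1 - damping) + damping * sum(
            score[u] / len(graph[u]) for u in graph[t]) for t in graph}
    return sorted(score, key=score.get, reverse=True)

# Illustrative change-request text (made up for this sketch).
terms = "null pointer exception when parser reads empty config file".split()
print(textrank_keywords(terms)[:3])
```

The top-ranked terms would then be assembled into a search query after query quality analysis.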

Subject Systems (22)

  • adempiere-3.1.0 (12)
  • apache-nutch-1.8 (9)
  • apache-nutch-2.1 (17)
  • atunes-1.10.0 (16)
  • bookkeeper-4.1.0 (21)
  • commons-math-3-3.0 (16)
  • derby-10.9.1.0 (24)
  • ecf (68)
  • eclipse-2.0 (17)
  • eclipse.jdt.core (30)
  • eclipse.jdt.debug (84)
  • eclipse.jdt.ui (242)
  • eclipse.pde.ui (182)
  • jedit-4.2 (10)
  • lang (42)
  • mahout-0.4 (16)
  • mahout-0.8 (15)
  • math (60)
  • openjpa-2.0.1 (16)
  • tika-1.3 (18)
  • time (7)
  • tomcat70 (33)

Total: 955

Experimental Data

  • Baseline/query : Baseline queries extracted from the 955 change requests. They use the title, description, structured tokens, and whole texts of the requests. query-whole is our chosen baseline.

  • Baseline/rank : Query Effectiveness (QE) of the baseline queries (Method-level granularity).

  • Baseline/rank-class : Query Effectiveness (QE) of the baseline queries (Document-level granularity).

  • ChangeReqs : 955 change requests from the 22 subject systems.

  • Corpus/method.7z : Corpus containing the method bodies from source code documents of all systems. Please decompress before use.

  • Corpus/norm-method.7z : Corpus containing the normalized method bodies from all the systems. Please decompress before use.

  • Corpus/*.ckeys : Corpus document-index key mapping. This is an encoding of original methods into numbers.

  • Goldset : Ground truth for 955 change requests.

  • Lucene/index-method : Lucene index for concept location.

  • SelectedBug : IDs of the 955 change requests under study.

  • SelectedBug-HQB : IDs of the 225 change requests leading to good baseline queries.

  • SelectedBug-LQB : IDs of the 730 change requests leading to poor baseline queries.

  • tokens* : Tokens generated from project source code and change requests for the token-splitting algorithm, Samurai.

  • samurai-data : Metadata required by Samurai.

  • models : POS tagging models used by the Stanford CoreNLP library.

  • pp-data : Stop words used for pre-processing.

  • strict.lib : Contains the dependencies used by the proposed technique - STRICT.
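
The tokens* and samurai-data files support Samurai, a frequency-based token-splitting algorithm. As a rough illustration of what token splitting produces, here is a minimal camelCase/underscore splitter; the real Samurai algorithm additionally uses term-frequency tables (the samurai-data files above) to split same-case concatenations, which this sketch omits. The function name split_identifier is ours, not part of the package:

```python
import re

def split_identifier(token):
    # Split on underscores/dollar signs, then on camelCase boundaries,
    # keeping acronym runs (e.g., "HTTP") together; lowercase the result.
    parts = re.split(r'[_$]+', token)
    words = []
    for part in parts:
        words += re.findall(
            r'[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+', part)
    return [w.lower() for w in words if w]

print(split_identifier("getHTTPResponseCode"))
# ['get', 'http', 'response', 'code']
```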

Source Code

Please check the source code repository for details.

Proposed & Existing Techniques

  • Proposed-STRICT/query : Queries suggested by the proposed technique.

  • Proposed-STRICT/rank : Query Effectiveness (QE) of the suggested queries (Method-level granularity).

  • Proposed-STRICT/rank-class : Query Effectiveness (QE) of the suggested queries (Document-level granularity).

  • Proposed-STRICT/Query-Difficulty-Model : Query Difficulty model, predictions, class labels, and other materials.

  • Proposed-STRICT/Parameter-Tuning : Data related to parameter tuning of STRICT.

  • TF/query : Queries suggested by TF.

  • TF/rank : Query Effectiveness (QE) of the TF-based queries (Method-level granularity).

  • TF/rank-class : Query Effectiveness (QE) of the TF-based queries (Document-level granularity).

  • IDF/query : Queries suggested by IDF.

  • IDF/rank : Query Effectiveness (QE) of the IDF-based queries (Method-level granularity).

  • IDF/rank-class : Query Effectiveness (QE) of the IDF-based queries (Document-level granularity).

  • TF-IDF/query : Queries suggested by TF-IDF.

  • TF-IDF/rank : Query Effectiveness (QE) of the TF-IDF-based queries (Method-level granularity).

  • TF-IDF/rank-class : Query Effectiveness (QE) of the TF-IDF-based queries (Document-level granularity).

  • Kevic/query : Queries suggested by Kevic & Fritz.

  • Kevic/rank : Query Effectiveness (QE) of the Kevic & Fritz queries (Method-level granularity).

  • Kevic/rank-class : Query Effectiveness (QE) of the Kevic & Fritz queries (Document-level granularity).

  • Kevic/model : Machine learning model & prediction of search terms.

  • Rocchio/query : Queries suggested by Rocchio.

  • Rocchio/rank : Query Effectiveness (QE) of the Rocchio queries (Method-level granularity).

  • Rocchio/rank-class : Query Effectiveness (QE) of the Rocchio queries (Document-level granularity).

  • Scanniello/rank : Query Effectiveness (QE) of the Scanniello et al. queries (Method-level granularity).

  • Scanniello/rank-class : Query Effectiveness (QE) of the Scanniello et al. queries (Document-level granularity).

  • Scanniello/Method-PageRank : PageRank score calculated for Scanniello et al.

  • Scanniello/CFG* : CFG extracted from the source code of subject systems using java-callgraph.

  • Rahman & Roy/query : Queries suggested by our earlier work - Rahman & Roy, SANER 2017.

  • Rahman & Roy/rank : Query Effectiveness (QE) of the queries (Method-level granularity).

  • Rahman & Roy/rank-class : Query Effectiveness (QE) of the queries (Document-level granularity).

License & Others

  • README
  • LICENSE

Previously Accepted Papers

STRICT: Information Retrieval Based Search Term Identification for Concept Location

Mohammad Masudur Rahman, Chanchal K. Roy

Download this paper: PDF

TextRank based search term identification for software change tasks

Mohammad Masudur Rahman, Chanchal K. Roy

Download this paper: PDF

Please cite our work as

@INPROCEEDINGS{saner2017masud,
author={Mohammad Masudur Rahman and C. K. Roy},
booktitle={Proc. SANER},
title={STRICT: Information Retrieval Based Search Term Identification for Concept Location},
year={2017},
pages={79--90} }

Download this paper: PDF

@INPROCEEDINGS{saner2015masud,
author={Mohammad Masudur Rahman and C. K. Roy},
booktitle={Proc. SANER},
title={TextRank based search term identification for software change tasks},
year={2015},
pages={540--544} }

Download this paper: PDF

Related Projects: ACER, BLIZZARD, and QUICKAR

Something not working as expected?

Please contact Masud Rahman ([email protected])

or

Create a new issue for further information.

Contributors

masud-technope

Issues

Data Cleansing Manual

Phase I: Steps to be followed

  1. Select an issue report like this one.
  2. Determine whether it discusses a bug or a new feature.
  3. For example, this is a new feature, but this is a bug.
  4. Create a spreadsheet for each subject system, and mark its issue IDs as either new feature (NF) or bug (B).
  5. We have 8 subject systems, so please create separate files for the systems.
  6. We have 2,885 issue reports from the 8 systems. This should take a few days.

Phase II: Steps to be followed

  1. Check the changed files for each bug report or feature request. For example, this is the changeset for this bug report.

  2. Count the number of changed Java files in each change set.

  3. If it is only one, then it is a valid changeset.

  4. If the count<=5, take a close look at the files. Do they look related to the bug report or feature request? If yes, mark it as a valid changeset. You should spend at most 3 minutes on this.

  5. If the count>5, take a closer look at the changeset. Are the files really related to the bug report/feature request? In my experience, such changesets often include files that are unrelated to the bug fix or the feature implementation. If you find that a changeset contains one or more such files, mark it as a bloated changeset. For example, this is definitely a bloated changeset for issue report #263537. You should spend at most 6 minutes on this. If you cannot decide within 6 minutes, just mark it as a bloated changeset.

  6. The output format for each issue report entry:
    BugID, #ChangedFiles, #Valid/Bloated
    Please create separate files for individual subject systems.

  7. You can use codes such as Valid Changeset=VC and Bloated Changeset=BC.
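
The decision procedure of Phase II can be summarized as a small rule-based sketch. The function name label_changeset and the related_check predicate (standing in for the manual relatedness judgment in steps 4 and 5) are hypothetical names for illustration; the per-changeset time budgets are omitted:

```python
def label_changeset(changed_java_files, related_check):
    # Encodes the manual protocol above: a single changed file is always
    # a valid changeset; otherwise the changeset is valid (VC) only if
    # every changed file looks related to the bug report or feature
    # request, and bloated (BC) otherwise.
    if len(changed_java_files) == 1:
        return "VC"
    if all(related_check(f) for f in changed_java_files):
        return "VC"
    return "BC"

# A one-file changeset is valid regardless of the relatedness check.
print(label_changeset(["Foo.java"], lambda f: False))  # VC
```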

Tasks for the Participants

Available Features

The prototype currently offers four distinct features. We have provided 12 bug reports from the ECF repository.

  • How to execute your own query for a specific Bug ID?
  • How to execute the baseline queries for a specific change request?
  • How to get suggested queries from STRICT for a single change request?
  • How to get suggested queries from STRICT for all the change requests?

Tasks to be Performed

Using the features above, the students should answer the following questions:

Note: **QE** = *Query Effectiveness*. The lower the QE, the better the query.
However, QE = -1 means *RESULT NOT FOUND*, i.e., the worst query.
  • T1: Report your output for each of the sample commands provided in the README file.
  • T2: What is the best query you can come up with for each change request/bug report? The better your query, the higher your grade. Use the available features above to answer this question.
  • T3: What is the rank improvement between your query and the best-performing baseline query? Suppose your QE is 10 and the baseline QE is 20; then your rank improvement is 10.
  • T4: Provide your best query for each change request/bug report that is the shortest in length, i.e., contains the fewest keywords.
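
The QE metric and the rank improvement asked for in T3 can be computed as follows; query_effectiveness is a hypothetical helper name, and the file names are made up for illustration:

```python
def query_effectiveness(ranked_docs, goldset):
    # QE is the 1-based rank of the first relevant document in the
    # retrieved list; -1 means no relevant document was retrieved
    # (RESULT NOT FOUND), i.e., the worst possible query.
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in goldset:
            return rank
    return -1

goldset = {"ContainerFactory.java"}
baseline_qe = query_effectiveness(
    ["A.java", "B.java", "ContainerFactory.java"], goldset)      # QE = 3
my_qe = query_effectiveness(["ContainerFactory.java"], goldset)  # QE = 1
print(baseline_qe - my_qe)  # rank improvement: 2
```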
