sourcerercc's People

Contributors

crista, danhper, dyanguci, hexcles, hsajnani, paridhisirohi, pedromartins4, rayrzh, saini, vizigin

sourcerercc's Issues

Tokenizer-muse features -> tokenizer

The tokenizer-muse branch contains various bug fixes (including a serious control-flow bug) as well as improved logging and fault tolerance. These changes have to be applied to the generic tokenizer.

Why can't controller.py run? Could anybody help me?

When I ran python controller.py, the following exception came out:

search will be carried out with 2 nodes
loading previous run state
previous run state 1
current state: 1
flushing current state 1
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/restore-gtpm.sh
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/runnodes.sh init 1
Traceback (most recent call last):
  File "controller.py", line 180, in
    controller.execute()
  File "controller.py", line 144, in execute
    raise ScriptControllerException("error during init.")
__main__.ScriptControllerException: error during init.

How can I deal with this? Could you help me?

Something goes wrong in step 2 when detecting file clones

When I run SourcererCC to detect clones in a whole project, it works well. But when I detect only one file, it blocks.
I downloaded openssl from GitHub and used two versions of it to test SourcererCC. When the tested project is whole, SourcererCC works and I can get the result. But when I only detect the same files across the two versions, step 2 produces no result, as the screenshot showed.
[screenshot]

Hmm, can SourcererCC only detect a whole project? Or is there something wrong with how I run the program? I ran SourcererCC following pipeline1.txt in the virtual machine.

Reproducing results of ICSE 16 paper

Hi, I am trying to reproduce the results of https://arxiv.org/pdf/1512.06448.pdf.

In my setup, I am using the file-level tokenizer, and I've changed MIN_TOKENS to 1 in sourcerer-cc.properties,

# Ignore all files outside these bounds
MIN_TOKENS=1
MAX_TOKENS=500000

as well as changed runnodes.sh's threshold threshold="${3:-7}".

Using BigCloneEval with the flags "-st both -mit 50 -mil 6" and the default clone matcher, I'm getting the following results for type-1 and type-2 clones:

Type-1: 34301 / 35787 = 0.9584765417609746
Type-2: 3334 / 4573 = 0.7290618849770392

According to the ICSE paper, SourcererCC is able to get 1.0 on Type-1, and 0.98 on Type-2.

Is there any step in particular that I missed, or is there another configuration to change, in order to reproduce the ICSE paper's results?

collector.py doesn't generate query_*, only report.csv

I have the same problem as this issue. It says SUCCESS: Search Completed on all nodes but generates only report.csv in clone-detector/NODE_*/output8.0.

In the VM it runs; the Python version in the VM is 2.7.12+.

I ran controller.py with python2 (2.7.18), but the problem that was fixed in that issue persists for me.

Failed to test the tokenizer with tokenizer-sample-input

The output files are all empty:

*** Starting priority projects...
*** Starting regular projects...
Starting new process 0
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <1,./tokenizer-sample-input/aesthetic-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/aesthetic-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/aesthetic-master.zip
[INFO] (MainThread) Project finished <1,./tokenizer-sample-input/aesthetic-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.001463micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <2,./tokenizer-sample-input/OffsetAnimator-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/OffsetAnimator-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/OffsetAnimator-master.zip
[INFO] (MainThread) Project finished <2,./tokenizer-sample-input/OffsetAnimator-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.000711micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Process 0 finished. 0 files in 0s.
Process 0 finished, 0 files processed (1). Current total: 0
Starting new process 0
*** No more projects to process. Waiting for children to finish...
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <3,./tokenizer-sample-input/ResourceInspector-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/ResourceInspector-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/ResourceInspector-master.zip
[INFO] (MainThread) Project finished <3,./tokenizer-sample-input/ResourceInspector-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.000759micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <4,./tokenizer-sample-input/zachtaylor-JPokemon.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/zachtaylor-JPokemon.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/zachtaylor-JPokemon.zip
[INFO] (MainThread) Project finished <4,./tokenizer-sample-input/zachtaylor-JPokemon.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.002997micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Process 0 finished. 0 files in 0s.
Process 0 finished, 0 files processed (1). Current total: 0
*** All done. 0 files in 0:00:00.020079

SourcererCC uses an old eproperties library

Hi, guys

We tried to apply SourcererCC in our deployment infrastructure, but it depends on an old library that is not supported under Java 8 update 201. I looked at the library's page and it looks dead (last update May 15, 2012). Do you have plans to fix this in SourcererCC?

How to understand the result generated by block-level detection

Hi
Recently I figured out the previous problem and got the block-level detection results for the ten projects in my study. But I have a lot of trouble understanding them.
For example:
[screenshot]
I just don't know what b11 means. How do I find the corresponding source code from this result?
And this one:
[screenshot]
I guess 11 and 12 represent the projects, but what does 2 refer to? Since the ten projects are represented by 11 to 20, I can't understand what 2 means.
This information is not covered in the README file, which is why I am asking for help.
Thanks~
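For tracing pair IDs back to source programmatically, one option is to join the pair file with the tokenizer's bookkeeping output. Below is a minimal sketch, assuming a simple "id,path" bookkeeping layout; file names and column positions are assumptions, not SourcererCC's documented format.

```python
import csv

def load_bookkeeping(path):
    """Map a block/file id to its source path, assuming 'id,path' CSV rows
    (hypothetical layout; adjust to the actual bookkeeping file)."""
    mapping = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            if len(row) >= 2:
                mapping[row[0]] = row[1]
    return mapping

def resolve_pairs(pairs_path, bookkeeping):
    """Yield (path_a, path_b) per clone-pair line. Lines may be 'idA,idB'
    or 'projA,blockA,projB,blockB' depending on the run."""
    with open(pairs_path) as fh:
        for line in fh:
            parts = line.strip().split(",")
            if len(parts) == 2:
                a, b = parts
            elif len(parts) == 4:
                a, b = parts[1], parts[3]
            else:
                continue  # skip malformed lines
            yield bookkeeping.get(a, a), bookkeeping.get(b, b)
```

With the right bookkeeping file, each numeric pair then resolves to two file paths that can be inspected manually.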

Changing the value of threshold

When I want to change the threshold value from 8 to 5, I get this error:
Traceback (most recent call last):
  File "controller.py", line 202, in
    controller.execute()
  File "controller.py", line 79, in execute
    command_params, self.full_file_path("Log_init.out"), self.full_file_path("Log_init.err"))
  File "controller.py", line 171, in run_command
    universal_newlines=True
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 8] Exec format error
What should I do to fix it? Please help me in this matter.

Feature request: daemon support

Hi, does SourcererCC have any support for running as a daemon?
For example, SCC would be launched as a background process after loading the dataset. The daemon could then be given queries and detect clones for each query without reloading the dataset. The resulting clones would then be sent back to whoever issued the query.

Question on <parentId, blockId>

The source code [not the pre-built jar file] seems to expect a tuple with three fields, as opposed to the two mentioned in the doc: <parentId, blockId>. What is the third field? Can someone clarify? Or is this a recent bug?

String[] bagAndTokens = s.split("@#@");
String[] bagMetadata = bagAndTokens[0].split(",");
String functionId = bagMetadata[0];
String bagId = bagMetadata[1];
int bagSize = Integer.parseInt(bagMetadata[2]);
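Judging from the snippet, the metadata before the @#@ separator is comma-separated and the third field is read as the bag size. A minimal Python sketch of the same parsing, assuming that layout:

```python
def parse_bag(line):
    """Parse one serialized bag: '<parentId>,<bagId>,<bagSize>@#@<tokens>'.
    Field layout is inferred from the Java snippet above, not from docs."""
    metadata, _, tokens = line.partition("@#@")
    fields = metadata.split(",")
    parent_id, bag_id = fields[0], fields[1]
    bag_size = int(fields[2])  # the third field the question asks about
    return parent_id, bag_id, bag_size, tokens
```

So the doc's <parentId, blockId> appears to be missing the size field that the deserializer expects.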

0 and 1 token clones not detected

I'm running SourcererCC on some really simple test data: among others, a couple of empty files and two instances of a file containing only one (identical) token. I have set
MIN_TOKENS=0
and
MAX_TOKENS=2000000000
in sourcerer-cc.properties.
Clones with two or more tokens are detected, but not those with 0 or 1 token. Is this inherent in the algorithm, a feature of the clone detector, or might it be a bug? Attached is my blocks.file, obtained by following the README instructions; irrelevant lines are removed. (".txt" needed to be appended before GitHub would let me upload the file.)
blocks.file.txt

Bug with traversing subfolders in file-level tokenizer

The generic file-level tokenizer (tokenizers/file-level) has problems with deep hierarchies of project folders and their subfolders.

Let's say I have input dataset of files for tokenization in "project-folder" (PATH_proj_paths=project-folder) and it looks like this:

$ tree project-folder
project-folder
|-- sub
|   |-- subsub
|   |   `-- index.js
|   `-- util.js
`-- test2.js

2 directories, 3 files

When I run python tokenizer.py folder, it does find all the files in subfolders; however, it tries to tokenize the found filenames relative to the root directory:

[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>

I am submitting a PR with a fix. (cc @pedromartins4)
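The underlying fix presumably joins each filename with the directory it was found in during the walk, rather than with the root. A minimal sketch (not the PR's actual code):

```python
import os

def collect_files(root):
    """Return file paths under root, joining each name with the directory
    os.walk found it in (dirpath), not with the root itself."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))  # e.g. root/sub/util.js
    return sorted(found)
```

With this, project-folder/sub/util.js keeps its subfolder prefix instead of collapsing to project-folder/util.js.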

Generate links to GitHub from found clone pairs based on bookkeeping

I'd be great to be able to easily re-check a pair of clones from the output directory. Based on bookkeeping we should be able to get file paths of the clones.

Even more awesome would be to generate a link to the files on GitHub based on metadata (like default_branch, namespace, reponame) from dataset (if it's github dataset).

I've written a simple script that works on my sample dataset of top 1000 GitHub JS repos. It's in SourcererCC fomat. I'll generalize it when I have more time: jakubzitny@0111c6a

Sample output from it looks like this:

$ ./githubpair.sh 
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/commentConvert.js
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/shout.js
===========
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/shout.js
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/escapeHtml.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/text.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/nodejs-mongodb-mongoose-restify/app/js/libs/require/text.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/text.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/cross-domain/js/libs/require/text.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/main.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/nodejs-mongodb-mongoose-restify/app/js/main.js
===========
<...>

Some examples of JS clones: jakubzitny@5d20786735f14f5f73af4a82a6c6c90d.
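The link construction itself is straightforward. A minimal sketch, assuming the dataset metadata provides namespace, repo name, default branch, and the in-repo file path (all parameter names are hypothetical, not the linked script's API):

```python
def github_blob_url(namespace, reponame, default_branch, path):
    """Build a GitHub blob URL from dataset metadata (hypothetical fields)."""
    return "https://github.com/{}/{}/blob/{}/{}".format(
        namespace, reponame, default_branch, path.lstrip("/"))
```

For example, ("jsdoc3", "jsdoc", "master", "plugins/test/specs/shout.js") yields one of the URLs shown in the sample output above.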

The DéjàVu website cannot be accessed

DéjàVu is a supporting web tool that allows quick and simple clone analysis; it can be found here.

The "here" link cannot be accessed.
Could you please give a new website address for DéjàVu?

How to create the clone mapping in C or C++?

SourcererCC is a great tool for clone detection! However, I still have several questions about it and DejaVu.

Your website says 'we have created a mapping between file clones in four languages: Java, C++, JavaScript and Python.', and I am interested in finding code clones among C++ code files. However, when doing the tokenizing, I haven't found a file named extractCFunction.py to do the parsing.

By the way, can I find block clones without the method structure? For instance, for a few lines of statements?

Thank you very much!

Help. Is there any tutorial for incremental SCC?

It seems that incremental code clone detection is supported by SCC. In most cases, I don't need to go over the existing inventory of code; only the incremental changes should be detected. It would be much appreciated if anybody could provide the pipeline for it.

Remove config files from git

As an improvement, I'd suggest removing the config files from the repository. I am not sure how you work with git, but usually you commit only a sample config, and each contributor/user then copies the file to the correct location and changes its contents. This way no useless unstaged changes will show up in git.

I'd also add all "standard" output locations to .gitignore.

For this project the config files would be at least

  • clone-detector/sourcerer-cc.properties
  • tokenizers/file-level/config.ini

And the locations for ignore would be

  • *.log
  • clone-detector/dist/
  • clone-detector/fwdindex/
  • clone-detector/index/
  • clone-detector/gtpm/
  • tokenizers/file-level/bookkeeping_files/
  • tokenizers/file-level/bookkeeping_projs/
  • tokenizers/file-level/projects_success.txt
  • tokenizers/file-level/project_starting_index.txt
  • tokenizers/file-level/projects_fail.txt
  • tokenizers/file-level/mirror_repo
  • tokenizers/file-level/tokens
  • tokenizers/file-level/project-list.txt
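The ignore list above maps directly onto a .gitignore fragment; a sketch (possibly incomplete):

```gitignore
*.log
clone-detector/dist/
clone-detector/fwdindex/
clone-detector/index/
clone-detector/gtpm/
tokenizers/file-level/bookkeeping_files/
tokenizers/file-level/bookkeeping_projs/
tokenizers/file-level/projects_success.txt
tokenizers/file-level/project_starting_index.txt
tokenizers/file-level/projects_fail.txt
tokenizers/file-level/mirror_repo
tokenizers/file-level/tokens
tokenizers/file-level/project-list.txt
```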

Is it possible that a type-1 clone won't be reported as a clone by SourcererCC?

Hi,

I recently faced a problem: two functions with identical token content, which can be manually recognized as a type-1 clone, were not reported by SourcererCC. I tried troubleshooting but couldn't find the reason.
I can show you the token content. Can you shed some light on this?
Thanks.

11,100003000000,20,12,407aab7539aad9635a5258199248d490@#@function@@::@@1,target@@::@@3,onlyOwner@@::@@1,uint256@@::@@1,Transfer@@::@@2,balances@@::@@1,0@@::@@1,mintToken@@::@@1,address@@::@@1,owner@@::@@2,mintedAmount@@::@@5,_totalSupply@@::@@1
12,100003000001,5,5,cc777ed82a99633e2ac159baae382dbb@#@function@@::@@1,owner@@::@@1,sender@@::@@1,owned@@::@@1,msg@@::@@1
12,100013000001,7,6,a6a1eefb5b11fbdc3e46acef1f275263@#@function@@::@@1,newOwner@@::@@2,transferOwnership@@::@@1,onlyOwner@@::@@1,address@@::@@1,owner@@::@@1
12,100023000001,8,7,b27f467b7e068c194a64699308168a25@#@function@@::@@1,owner@@::@@2,sender@@::@@1,Fiocoin@@::@@1,msg@@::@@1,_totalSupply@@::@@1,balances@@::@@1
12,100033000001,8,7,d589f4759ef658596d45ebb0327bf62b@#@function@@::@@1,totalSupply@@::@@2,returns@@::@@1,constant@@::@@1,_totalSupply@@::@@1,uint256@@::@@1,return@@::@@1
12,100043000001,11,10,4d402391021f605517f589e36d53e787@#@function@@::@@1,constant@@::@@1,uint256@@::@@1,_owner@@::@@2,balances@@::@@1,returns@@::@@1,address@@::@@1,return@@::@@1,balance@@::@@1,balanceOf@@::@@1
12,100053000001,43,21,c4592b5b8ca4e4888307d25378d9d28f@#@function@@::@@1,return@@::@@2,_amount@@::@@7,Transfer@@::@@1,_to@@::@@5,balances@@::@@5,address@@::@@1,false@@::@@1,else@@::@@1,true@@::@@1,throw@@::@@1,if@@::@@2,sender@@::@@4,success@@::@@1,uint256@@::@@1,transfer@@::@@1,frozenAccount@@::@@1,0@@::@@1,returns@@::@@1,bool@@::@@1,msg@@::@@4
12,100063000001,47,21,881ca3adfc368b2d75808038edf1487e@#@function@@::@@1,return@@::@@2,Transfer@@::@@1,address@@::@@2,_to@@::@@5,balances@@::@@5,allowed@@::@@2,false@@::@@1,_from@@::@@6,else@@::@@1,_amount@@::@@9,if@@::@@1,sender@@::@@2,success@@::@@1,uint256@@::@@1,true@@::@@1,0@@::@@1,returns@@::@@1,bool@@::@@1,transferFrom@@::@@1,msg@@::@@2
12,100073000001,21,15,176a627a59bc4163aec3898e4e53017d@#@function@@::@@1,_spender@@::@@3,return@@::@@1,sender@@::@@2,success@@::@@1,uint256@@::@@1,approve@@::@@1,address@@::@@1,returns@@::@@1,bool@@::@@1,allowed@@::@@1,msg@@::@@2,Approval@@::@@1,_amount@@::@@3,true@@::@@1
12,100083000001,14,11,c3cdf51a44c8d370a574cd29f62f2975@#@function@@::@@1,_spender@@::@@2,constant@@::@@1,uint256@@::@@1,address@@::@@2,_owner@@::@@2,returns@@::@@1,allowed@@::@@1,return@@::@@1,remaining@@::@@1,allowance@@::@@1
12,100093000001,20,12,407aab7539aad9635a5258199248d490@#@function@@::@@1,target@@::@@3,onlyOwner@@::@@1,uint256@@::@@1,Transfer@@::@@2,balances@@::@@1,0@@::@@1,mintToken@@::@@1,address@@::@@1,owner@@::@@2,mintedAmount@@::@@5,_totalSupply@@::@@1

Last stage problems

I made the file "blocks.file".
After everything, it failed with this error:

2018-02-14 20:27:35,844 main ERROR Unable to move file /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log to /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log: java.nio.file.NoSuchFileException /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log -> /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log
2018-02-14 20:27:35,846 main ERROR Unable to copy file /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log to /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log: java.nio.file.NoSuchFileException /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log
org.apache.lucene.store.NoSuchDirectoryException: directory '/Users/saintnik/GitProjects/SourcererCC/clone-detector/fwdindex/1' does not exist
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:219)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:743)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at com.mondego.indexbased.CodeSearcher.<init>(CodeSearcher.java:47)
    at com.mondego.indexbased.SearchManager.initSearchEnv(SearchManager.java:820)
    at com.mondego.indexbased.SearchManager.main(SearchManager.java:377)
org.apache.lucene.store.NoSuchDirectoryException: directory '/Users/saintnik/GitProjects/SourcererCC/clone-detector/fwdindex/1' does not exist
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:219)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:743)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at com.mondego.indexbased.CodeSearcher.<init>(CodeSearcher.java:47)
    at com.mondego.indexbased.SearchManager.initSearchEnv(SearchManager.java:820)
    at com.mondego.indexbased.SearchManager.main(SearchManager.java:377)
NODE_1 FAILED - Job 62732 exited with a status of 1
NODE_2 FAILED - Job 62733 exited with a status of 1

Q: How to resolve the controller.execute() error: One or more nodes failed during Step Search.

I'm following the tutorial on the Git page to run SourcererCC clone detection on the sample projects provided.
The error occurs when I run the controller.py file.
The output shows:

search will be carried out with 1 nodes
loading previous run state
previous run state 5
current state: 1
flushing current state 1
current state: 2
flushing current state 2
current state: 3
flushing current state 3
current state: 4
flushing current state 4
current state: 5
flushing current state 5
running new command /Users/yongjinc/Desktop/SourcererCC-master/clone-detector/runnodes.sh search 1
Traceback (most recent call last):
  File "controller.py", line 180, in
    controller.execute()
  File "controller.py", line 133, in execute
    Check Log_search.log for more details. grep for FAILED in the log file")
__main__.ScriptControllerException: One or more nodes failed during Step Search.
Check Log_search.log for more details. grep for FAILED in the log file

Why does the detection take 3 minutes?

There are only four files in the package, so I thought clone detection would be very quick, but it takes me 3 minutes, most of it in runnodes.sh. Is this normal, or can I change some configuration to make it faster?

Bug report: tokenizer fails to handle the final line of the project list when the newline character is missing

Steps to reproduce

When preparing the file assigned to FILE_projects_list in config.ini, DO NOT end it with an empty line:

path/to/project1.zip
path/to/project2.zip

Run python tokenizer.py zip, then check the log file and see:

[INFO] (MainThread) Starting zip project <1, path/to/project1.zip> (process 0)
...
[INFO] (MainThread) Starting zip project <2, path/to/project2.zi> (process 0)

The path of the last project is handled incorrectly, which results in the project not being found.


This may be caused by proj_paths.append(line[:-1]) in tokenizers/file-level/tokenizer.py.

I recommend using line.strip() instead of line[:-1].
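The recommended change can be sketched as follows; strip() handles a final line without a trailing newline, whereas line[:-1] chops off that line's last character:

```python
def read_project_list(path):
    """Read project paths, one per line. strip() handles both
    newline-terminated lines and a final line with no newline,
    unlike line[:-1], which truncates the last character."""
    proj_paths = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                proj_paths.append(line)
    return proj_paths
```

On the example above, the last entry is read as path/to/project2.zip instead of the truncated path/to/project2.zi.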

cat: 'clone-detector/NODE_*/output8.0/query_*': No such file or directory

I ran all the steps according to the README.

When I run

python controller.py

it outputs:

search will be carried out with 1 nodes
loading previous run state
previous run state 0
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/execute.sh 1
current state: 1
flushing current state 1
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/backup-gtpm.sh
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh init 1
current state: 2
flushing current state 2
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh index 1
current state: 3
flushing current state 3
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/move-index.sh
current state: 4
flushing current state 4
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/execute.sh 1
current state: 5
flushing current state 5
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh search 1
current state: 0
flushing current state 0
SUCCESS: Search Completed on all nodes

Then, when I run cat clone-detector/NODE_*/output8.0/query_* > results.pairs

the error is

cat: 'clone-detector/NODE_*/output8.0/query_*': No such file or directory

Only report.csv is present in clone-detector/NODE_*/output8.0.

Corrupt lines in pair file

I have run the SourcererCC clone detector on a little more than 35,000,000 files. The resulting clone-pair file consists of more than 18,000,000,000 lines. Of these, 5 lines contain more than 4 comma-separated numbers (4 being the expected format):

263694,263710,455981,41668,70616
591916,1015368,508215,591934,1015376,192522,333749
14702,100025479,527866,914862,100025719,706877,1213095
502505,200858502537,200858458,1527027,102616237
1454158,2021454205,202495178,785203,101352033

The first one is located on line 1604224 in query_3clones_index_WITH_FILTER.txt, which is attached in zipped format (split in 3 since I cannot upload files larger than 10MB). query_3clones_index_WITH_FILTER_1.txt.gz query_3clones_index_WITH_FILTER_2.txt.gz query_3clones_index_WITH_FILTER_3.txt.gz

The server that I ran on went down a couple of times, so one could imagine that 263694,<parts of an ID> was written before the crash and the next clone pair was written onto the same line. However, I don't think that's the case: since SourcererCC restarts from the last line logged in recovery.txt, I see two possibilities:

  1. The last line logged in recovery.txt is the last line before the one that was processed when the server went down. Then the second number of the line should end with the first number of the line, which is not the case.
  2. The last line processed (and giving rise to an output line) before the crash is not the last one logged in recovery.txt. Then the first line to be processed after recovery should already have been processed before the crash. Then we should find another line ending with 455981,41668,70616, which I can't.

My blocks file is 7.9 GB, so I won't attach it, but let me know if you need more information!

Error: no module named javalang

Hello,
we tried to run SourcererCC on a zipped Java project and got this error:

Traceback (most recent call last):
  File "tokenizer.py", line 12, in
    import javalang
ImportError: No module named javalang

Then we tried with a different Java project and got the same error.

Why did I fail when executing "python controller.py"?

I am trying to reproduce the experiments using the original materials.
Everything went well until I executed "python controller.py":
[screenshot]
How can I solve this problem?
Is it related to my Python version or something else?

Where can I get the clone pairs' details?

I have got report.csv, blocksclones_index_WITH_FILTER.txt, and tokensclones_index_WITH_FILTER.txt.

report.csv looks like this:

index_time globalTokenPositionCreationTime num_candidates num_clonePairs total_run_time searchTime timeSpentInSearchingCandidates timeSpentInProcessResult operation sortTime_during_indexing
691 194 0 0 913 0 0 0 index 1
0 34 0 1 2183 2043 0 0 search  
812 166 0 0 987 0 0 0 index 1
995 315 0 0 1318 0 0 0 index 34
0 51 0 495 3240 3066 0 0 search  
0 154 0 495 3456 2914 0 0 search  
0 63 0 495 2941 2776 0 0 search  
0 50 0 495 2975 2825 0 0 search  

The num_clonePairs is 495, so where are the details? tokensclones_index_WITH_FILTER.txt is empty, and blocksclones_index_WITH_FILTER.txt looks like this:
1453,1457
1457,1458
1453,1458
1464,1465
1468,1469
1471,1472
1456,1457
1479,1480
1486,1487
1490,1491
1488,1490
1488,1491
1488,1492
1488,1493
1488,1494
1488,1495
1489,1490
1491,1492
1491,1493
1491,1494
1491,1495
1493,1494
1493,1495
1492,1493
1492,1494
1492,1495
1490,1492
1494,1495
1502,1508
1505,1506
1506,1507
1503,1509
1524,1525
1523,1524
1523,1525
1540,1541
1545,1546
1545,1547
1545,1548
1546,1547
1546,1548
1547,1548

Problematic tokens from tokenizer?

Some tokens from some of the tokenized files seem to be problematic for SourcererCC.

Here is an example of stderr when the indexing fails; the content is e.g. weird whitespace or characters like ||:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
    at noindex.CloneHelper.deserialise(Unknown Source)
    at indexbased.SearchManager.doIndex(Unknown Source)
    at indexbased.SearchManager.main(Unknown Source)

While indexing, there are a lot of EXCEPTION CAUGHT messages coming from ArrayIndexOutOfBoundsExceptions caught in CloneHelper.java. I'm not sure whether that is a problem.

Also, while searching I am getting a lot of ERROR: more that one doc found. some error here. messages.

Maybe these problems come from some small things in the tokenization process; do you have any ideas what they might be? For now, I will update the handling of weird whitespace and see whether it helps.
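One possible mitigation (an assumption on my part, not the project's actual fix) is to normalize token text before serialization, so that unusual whitespace and characters colliding with the separators cannot corrupt the serialized bag format:

```python
import re

def sanitize_token(token):
    """Collapse exotic whitespace into single spaces and drop substrings
    that collide with the '@#@' and '@@::@@' separators used in the
    serialized bag format (separators taken from examples in these issues)."""
    token = re.sub(r"\s+", " ", token).strip()  # normalize all whitespace
    token = token.replace("@#@", "").replace("@@::@@", "")
    return token
```

Sanitizing at tokenization time would keep the downstream split("@#@") / split(",") parsing from hitting ArrayIndexOutOfBoundsException on malformed records.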

Q: What is wrong in this error log?

search will be carried out with 2 nodes
loading previous run state
/home/michael/SourcererCC-master/clone-detector/scriptinator_metadata.scc doesn't exist, creating one with state EXECUTE_1
previous run state 0
running new command /home/michael/SourcererCC-master/clone-detector/execute.sh 1
Traceback (most recent call last):
  File "controller.py", line 180, in
    controller.execute()
  File "controller.py", line 147, in execute
    "error in execute.sh script while preparing for init step.")
__main__.ScriptControllerException: error in execute.sh script while preparing for init step.

Why does scriptinator_metadata.scc not exist? This is running on Ubuntu.

Is the bookkeeping file used by SourcererCC?

Hi, the README is not clear about where to put the bookkeeping headers.file.

Does SourcererCC actually use this file, or is it just for manually checking the block/file that was found?

Failed to run block-level tokenizer

I set up a Python 3.7 environment using conda and installed the dependencies from requirements.txt.
I used test-env and set config.ini to python.
But I failed to run the block-level tokenizer; I kept getting this warning:
join() argument must be str or bytes, not 'ZipInfo'

The output looks like this:

GO
'zipblocks'format
*** Starting priority projects...
*** Starting regular projects...
Starting new process 0
*** No more projects to process. Waiting for children to finish...
GO
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <11,test-env/2Shirt-SpellBurner.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/2Shirt-SpellBurner.zip
[INFO] (MainThread) Successfully ran process_zip_ball test-env/2Shirt-SpellBurner.zip
[INFO] (MainThread) Project finished <11,test-env/2Shirt-SpellBurner.zip> (process 0)
[INFO] (MainThread)  (0): Total: 0:00:00.001971micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <12,test-env/2xyo-indicator-ip.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/2xyo-indicator-ip.zip
[INFO] (MainThread) Attempting to process_file_contents test-env/2xyo-indicator-ip.zip\indicator-ip-master/test.py
[WARNING] (MainThread) Unable to open zip on <test-env/2xyo-indicator-ip.zip> (process 0)
[WARNING] (MainThread) join() argument must be str or bytes, not 'ZipInfo'
[INFO] (MainThread) Project finished <12,test-env/2xyo-indicator-ip.zip> (process 0)
[INFO] (MainThread)  (0): Total: 0:00:00.002480micros | Zip: -1 Read: -1 Separators: -1micros Tokens: -1micros Write: -1micros Hash: -1 regex: -1
[INFO] (MainThread) Starting zip project <13,test-env/3demax-Take-a-break.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/3demax-Take-a-break.zip
[INFO] (MainThread) Attempting to process_file_contents test-env/3demax-Take-a-break.zip\Take-a-break-master/examples/appmenu.py
[WARNING] (MainThread) Unable to open zip on <test-env/3demax-Take-a-break.zip> (process 0)
[WARNING] (MainThread) join() argument must be str or bytes, not 'ZipInfo'
[INFO] (MainThread) Project finished <13,test-env/3demax-Take-a-break.zip> (process 0)
[INFO] (MainThread)  (0): Total: 0:00:00.002480micros | Zip: -1 Read: -1 Separators: -1micros Tokens: -1micros Write: -1micros Hash: -1 regex: -1
[INFO] (MainThread) Process 0 finished. 2 files in 0s.
Process 0 finished, 2 files processed (3000002). Current total: 2
*** All done. 2 files in 0:00:00.145329

I even used diff.txt. I know that's not the solution.
Has anyone else run into the same problem?
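The warning suggests os.path.join is being handed ZipInfo objects directly. A hedged sketch of the likely fix (not the tokenizer's actual code): ZipFile.infolist() yields ZipInfo objects, so the join must use each entry's filename attribute, which is a str.

```python
import os
import zipfile

def list_zip_members(zip_path):
    """Return display paths for zip members. infolist() yields ZipInfo
    objects; joining on info.filename (a str) avoids the TypeError
    'join() argument must be str or bytes, not ZipInfo'."""
    with zipfile.ZipFile(zip_path) as zf:
        return [os.path.join(zip_path, info.filename) for info in zf.infolist()]
```

Passing the ZipInfo object itself to os.path.join reproduces the exact warning shown in the log above.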
