mondego / sourcerercc
Sourcerer's Code Clone project
License: GNU General Public License v3.0
The tokenizer-muse contains various bug fixes (including a serious control-flow bug) as well as improved logging and fault tolerance. These need to be applied to the generic tokenizer.
In your paper, you specify a minimum length of six lines, but I cannot find where that setting lives.
Where can I specify it?
When I was running python controller.py, the following exception came out:
search will be carried out with 2 nodes
loading previous run state
previous run state 1
current state: 1
flushing current state 1
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/restore-gtpm.sh
running new command /mnt/hgfs/G/SourcererCC-master/clone-detector/runnodes.sh init 1
Traceback (most recent call last):
File "controller.py", line 180, in <module>
controller.execute()
File "controller.py", line 144, in execute
raise ScriptControllerException("error during init.")
__main__.ScriptControllerException: error during init.
How can I deal with this problem? Could you help me?
Hello Dear Authors,
Can you please point out where the partial-index algorithm and sub-block-level indexing are implemented in your project? Also, where can I find the GTP used to sort tokens?
Thank you.
When I run SourcererCC to detect clones in a whole project, it works well. But when I detect only one file, it blocks.
I fetched openssl from GitHub and used two versions of it to test SourcererCC. When the tested project is whole, SourcererCC works and I can get a result. But when I only compare the same files between the two versions during step 2, it produces no result, as the picture shows.
So, can SourcererCC only detect whole projects? Or is there something wrong with how I run the program? I run SourcererCC following pipeline1.txt in the virtual machine.
Hi,
I am a little confused when I run runnodes.sh: what does the parameter 'threshold' mean? Is it the similarity threshold?
Hi, I am trying to reproduce the results of https://arxiv.org/pdf/1512.06448.pdf.
In my setup, I am using the file-level tokenizer. I've changed MIN_TOKENS in sourcerer-cc.properties to 1,
# Ignore all files outside these bounds
MIN_TOKENS=1
MAX_TOKENS=500000
and changed the threshold in runnodes.sh to threshold="${3:-7}".
Using BigCloneEval, I'm using these flags "-st both -mit 50 -mil 6". The default clone matcher is used. I'm getting the following results for type-1 and type-2 clones:
Type-1: 34301 / 35787 = 0.9584765417609746
Type-2: 3334 / 4573 = 0.7290618849770392
According to the ICSE paper, SourcererCC is able to get 1.0 on Type-1, and 0.98 on Type-2.
Is there any step in particular that I missed, or is there another configuration to change, in order to reproduce the ICSE paper's results?
I have the same problem as this issue: it says SUCCESS: Search Completed on all nodes but generates only report.csv in clone-detector/NODE_*/output8.0.
In the VM it runs, and the Python version in the VM is 2.7.12+. I ran controller.py with python2 (2.7.18), but the problem isn't fixed, even though that issue suggests it can be fixed this way.
The output files are all empty:
*** Starting priority projects...
*** Starting regular projects...
Starting new process 0
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <1,./tokenizer-sample-input/aesthetic-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/aesthetic-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/aesthetic-master.zip
[INFO] (MainThread) Project finished <1,./tokenizer-sample-input/aesthetic-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.001463micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <2,./tokenizer-sample-input/OffsetAnimator-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/OffsetAnimator-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/OffsetAnimator-master.zip
[INFO] (MainThread) Project finished <2,./tokenizer-sample-input/OffsetAnimator-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.000711micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Process 0 finished. 0 files in 0s.
Process 0 finished, 0 files processed (1). Current total: 0
Starting new process 0
*** No more projects to process. Waiting for children to finish...
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <3,./tokenizer-sample-input/ResourceInspector-master.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/ResourceInspector-master.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/ResourceInspector-master.zip
[INFO] (MainThread) Project finished <3,./tokenizer-sample-input/ResourceInspector-master.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.000759micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <4,./tokenizer-sample-input/zachtaylor-JPokemon.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball ./tokenizer-sample-input/zachtaylor-JPokemon.zip
[INFO] (MainThread) Successfully ran process_zip_ball ./tokenizer-sample-input/zachtaylor-JPokemon.zip
[INFO] (MainThread) Project finished <4,./tokenizer-sample-input/zachtaylor-JPokemon.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.002997micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Process 0 finished. 0 files in 0s.
Process 0 finished, 0 files processed (1). Current total: 0
*** All done. 0 files in 0:00:00.020079
Hi, guys
We tried to apply SourcererCC to our deployment infrastructure, but it depends on an old library that is not supported by Java 8 Update 201. I looked here and it looks dead (last update: May 15, 2012). Do you have plans to fix this in SourcererCC?
Hi
Recently I figured out the previous problem and got the block-level detection results for the ten projects in my study. But I am having a lot of trouble understanding them.
For example:
I just don't know what b11 means. How can I find the corresponding source code from this result?
And this one
I guess 11 and 12 represent the projects, but what does 2 refer to? Since the ten projects are represented by 11 to 20, I can't understand what 2 means.
This information cannot be found in the README file, which is why I've come here for help.
Thanks~
When I want to change the value of the threshold from 8 to 5, I get this error:
Traceback (most recent call last):
File "controller.py", line 202, in <module>
controller.execute()
File "controller.py", line 79, in execute
command_params, self.full_file_path("Log_init.out"), self.full_file_path("Log_init.err"))
File "controller.py", line 171, in run_command
universal_newlines=True
File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
raise child_exception
OSError: [Errno 8] Exec format error
What should I do to fix it? Please help me in this matter.
Hi, does SourcererCC have any support for running as a daemon?
For example, SCC is launched as a background process after loading the dataset. The daemon can then be given queries and detect clones for each query without reloading the dataset. The resulting clones are then sent back to whoever initiated the query.
The source code [not the pre-built jar file] seems to expect a tuple with three fields, as opposed to the two mentioned in the doc: <parentId, blockId>. What is the third field? Can someone clarify? Or is this a recent bug?
String[] bagAndTokens = s.split("@#@");
String[] bagMetadata = bagAndTokens[0].split(",");
String functionId = bagMetadata[0];
String bagId = bagMetadata[1];
int bagSize = Integer.parseInt(bagMetadata[2]);
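For reference, here is a hedged Python sketch of the three-field header the Java snippet above reads (function id, bag id, bag size). It mirrors the split logic shown, but it is an illustration, not the project's own parser:

```python
def parse_bag_header(line):
    """Parse a clone-detector input line of the form
    '<functionId>,<bagId>,<bagSize>@#@tok1@@::@@freq1,tok2@@::@@freq2,...'
    (mirrors the '@#@' and ',' splits in the Java snippet above)."""
    metadata, _, tokens_part = line.partition("@#@")
    function_id, bag_id, bag_size = metadata.split(",")[:3]
    tokens = {}
    for entry in tokens_part.split(","):
        tok, _, freq = entry.partition("@@::@@")
        tokens[tok] = int(freq)
    return function_id, bag_id, int(bag_size), tokens

fid, bid, size, toks = parse_bag_header(
    "1,2,5@#@function@@::@@1,owner@@::@@2,msg@@::@@2")
assert (fid, bid, size) == ("1", "2", 5)
assert toks == {"function": 1, "owner": 2, "msg": 2}
```

Note that the tokenizer output quoted later in this thread carries extra metadata fields (e.g. a hash), which would explain a third value beyond <parentId, blockId>.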
I'm running SourcererCC on some really simple test data: among others, a couple of empty files and two instances of a file containing only one (identical) token. I have set
MIN_TOKENS=0
and
MAX_TOKENS=2000000000
in sourcerer-cc.properties.
Clones with two tokens or more are detected, but not those with 0 or 1 tokens. Is this inherent in the algorithm, a feature of the clone detector, or might it be a bug? Attached is my blocks.file, obtained following the README instructions. Irrelevant lines are removed. (".txt" needed to be added before GitHub would let me upload the file.)
blocks.file.txt
The generic file-level tokenizer (tokenizers/file-level) has problems with deep hierarchies of project folders and their subfolders.
Let's say I have an input dataset of files for tokenization in "project-folder" (PATH_proj_paths=project-folder), and it looks like this:
$ tree project-folder
project-folder
|-- sub
| |-- subsub
| | `-- index.js
| `-- util.js
`-- test2.js
2 directories, 3 files
When I run python tokenizer.py folder, it does find all the files in subfolders; however, it tries to tokenize the found filenames as if they were in the root directory:
[INFO] (MainThread) File projects_success.txt no found
[INFO] (MainThread) Process 1
[INFO] (MainThread) Starting file <3,0,project-folder/test2.js>
[INFO] (MainThread) Starting file <3,1,project-folder/util.js>
[ERROR] (MainThread) File not found <3,1,project-folder/util.js>
[INFO] (MainThread) Starting file <3,2,project-folder/index.js>
[ERROR] (MainThread) File not found <3,2,project-folder/index.js>
I am submitting a PR with a fix. (cc @pedromartins4)
The description can be found here.
https://github.com/Mondego/SourcererCC/issues/26
@dyangUCI @pedromartins4
Yeah, I used the samples in this repository (test-env). The three projects are zipped, so I executed the command "python tokenizer.py zipblocks". But as I said, the document under /file_block_stats ("file-stats") is empty. I don't know what is wrong.
How does one support nested functions/closures accurately?
It'd be great to be able to easily re-check a pair of clones from the output directory. Based on the bookkeeping, we should be able to get the file paths of the clones.
Even more awesome would be generating a link to the files on GitHub based on metadata (like default_branch, namespace, reponame) from the dataset (if it's a GitHub dataset).
I've written a simple script that works on my sample dataset of the top 1000 GitHub JS repos. It's in SourcererCC format. I'll generalize it when I have more time: jakubzitny@0111c6a
Sample output from it looks like this:
$ ./githubpair.sh
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/commentConvert.js
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/shout.js
===========
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/shout.js
https://github.com/jsdoc3/jsdoc/blob/master/plugins/test/specs/escapeHtml.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/text.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/nodejs-mongodb-mongoose-restify/app/js/libs/require/text.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/text.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/cross-domain/js/libs/require/text.js
===========
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/modular-backbone/js/main.js
https://github.com/thomasdavis/backbonetutorials/blob/gh-pages/examples/nodejs-mongodb-mongoose-restify/app/js/main.js
===========
<...>
Some examples of JS clones: jakubzitny@5d20786735f14f5f73af4a82a6c6c90d.
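The link-generation idea above can be reduced to a small helper. The field names (namespace, reponame, default_branch) follow the issue text and are assumptions about the dataset schema, not a documented API:

```python
def github_blob_url(namespace, reponame, default_branch, file_path):
    """Build a GitHub blob link from dataset metadata fields
    (field names taken from the issue text; illustrative only)."""
    return "https://github.com/{}/{}/blob/{}/{}".format(
        namespace, reponame, default_branch, file_path)

url = github_blob_url("jsdoc3", "jsdoc", "master",
                      "plugins/test/specs/shout.js")
assert url == ("https://github.com/jsdoc3/jsdoc/blob/master/"
               "plugins/test/specs/shout.js")
```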
DéjàVu is a supporting web tool that allows quick and simple clone analysis, and "can be found here".
But "here" cannot be accessed.
Could you please provide a new link for DéjàVu?
SourcererCC is a great tool for clone mapping! However, I still have several questions about it and DejaVu.
You said 'we have created a mapping between file clones in four languages: Java, C++, JavaScript and Python' on your website, and I am interested in finding code clones among C++ code files. However, when tokenizing, I could not find a file named extractCFunction.py to do the parsing.
By the way, can I find block clones without method structure, for instance in a few lines of statements?
Thank you very much!
It seems that incremental code clone detection is supported by SCC. In most cases, I don't need to go over the existing inventory code; only the incremental code should be detected. It would be much appreciated if anybody could provide the pipeline for this.
As an improvement, I'd suggest removing the config files from the repository. I am not sure how you work with git, but usually you only commit a sample config, and each contributor/user then copies the file to the correct location and changes its contents. This way no useless unstaged changes will show up in git.
I'd also add all "standard" output locations to .gitignore.
For this project the config files would be at least
clone-detector/sourcerer-cc.properties
tokenizers/file-level/config.ini
And the locations to ignore would be
Hi,
I recently faced a problem: two functions with identical token content, which would manually be recognized as a type-1 clone, were not reported by SourcererCC. I tried troubleshooting but couldn't find the reason.
I can show you the token content. Can you shed some light on this?
Thanks.
11,100003000000,20,12,407aab7539aad9635a5258199248d490@#@function@@::@@1,target@@::@@3,onlyOwner@@::@@1,uint256@@::@@1,Transfer@@::@@2,balances@@::@@1,0@@::@@1,mintToken@@::@@1,address@@::@@1,owner@@::@@2,mintedAmount@@::@@5,_totalSupply@@::@@1 12,100003000001,5,5,cc777ed82a99633e2ac159baae382dbb@#@function@@::@@1,owner@@::@@1,sender@@::@@1,owned@@::@@1,msg@@::@@1 12,100013000001,7,6,a6a1eefb5b11fbdc3e46acef1f275263@#@function@@::@@1,newOwner@@::@@2,transferOwnership@@::@@1,onlyOwner@@::@@1,address@@::@@1,owner@@::@@1 12,100023000001,8,7,b27f467b7e068c194a64699308168a25@#@function@@::@@1,owner@@::@@2,sender@@::@@1,Fiocoin@@::@@1,msg@@::@@1,_totalSupply@@::@@1,balances@@::@@1 12,100033000001,8,7,d589f4759ef658596d45ebb0327bf62b@#@function@@::@@1,totalSupply@@::@@2,returns@@::@@1,constant@@::@@1,_totalSupply@@::@@1,uint256@@::@@1,return@@::@@1 12,100043000001,11,10,4d402391021f605517f589e36d53e787@#@function@@::@@1,constant@@::@@1,uint256@@::@@1,_owner@@::@@2,balances@@::@@1,returns@@::@@1,address@@::@@1,return@@::@@1,balance@@::@@1,balanceOf@@::@@1 12,100053000001,43,21,c4592b5b8ca4e4888307d25378d9d28f@#@function@@::@@1,return@@::@@2,_amount@@::@@7,Transfer@@::@@1,_to@@::@@5,balances@@::@@5,address@@::@@1,false@@::@@1,else@@::@@1,true@@::@@1,throw@@::@@1,if@@::@@2,sender@@::@@4,success@@::@@1,uint256@@::@@1,transfer@@::@@1,frozenAccount@@::@@1,0@@::@@1,returns@@::@@1,bool@@::@@1,msg@@::@@4 12,100063000001,47,21,881ca3adfc368b2d75808038edf1487e@#@function@@::@@1,return@@::@@2,Transfer@@::@@1,address@@::@@2,_to@@::@@5,balances@@::@@5,allowed@@::@@2,false@@::@@1,_from@@::@@6,else@@::@@1,_amount@@::@@9,if@@::@@1,sender@@::@@2,success@@::@@1,uint256@@::@@1,true@@::@@1,0@@::@@1,returns@@::@@1,bool@@::@@1,transferFrom@@::@@1,msg@@::@@2 
12,100073000001,21,15,176a627a59bc4163aec3898e4e53017d@#@function@@::@@1,_spender@@::@@3,return@@::@@1,sender@@::@@2,success@@::@@1,uint256@@::@@1,approve@@::@@1,address@@::@@1,returns@@::@@1,bool@@::@@1,allowed@@::@@1,msg@@::@@2,Approval@@::@@1,_amount@@::@@3,true@@::@@1 12,100083000001,14,11,c3cdf51a44c8d370a574cd29f62f2975@#@function@@::@@1,_spender@@::@@2,constant@@::@@1,uint256@@::@@1,address@@::@@2,_owner@@::@@2,returns@@::@@1,allowed@@::@@1,return@@::@@1,remaining@@::@@1,allowance@@::@@1 12,100093000001,20,12,407aab7539aad9635a5258199248d490@#@function@@::@@1,target@@::@@3,onlyOwner@@::@@1,uint256@@::@@1,Transfer@@::@@2,balances@@::@@1,0@@::@@1,mintToken@@::@@1,address@@::@@1,owner@@::@@2,mintedAmount@@::@@5,_totalSupply@@::@@1
I made the file "blocks.file".
In the end, it failed with this error:
2018-02-14 20:27:35,844 main ERROR Unable to move file /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log to /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log: java.nio.file.NoSuchFileException /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log -> /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log
2018-02-14 20:27:35,846 main ERROR Unable to copy file /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log to /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-6.log: java.nio.file.NoSuchFileException /Users/saintnik/GitProjects/SourcererCC/clone-detector/SCC_LOGS/2018-02/scc-02-14-2018-7.log
org.apache.lucene.store.NoSuchDirectoryException: directory '/Users/saintnik/GitProjects/SourcererCC/clone-detector/fwdindex/1' does not exist
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:219)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:743)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at com.mondego.indexbased.CodeSearcher.<init>(CodeSearcher.java:47)
    at com.mondego.indexbased.SearchManager.initSearchEnv(SearchManager.java:820)
    at com.mondego.indexbased.SearchManager.main(SearchManager.java:377)
org.apache.lucene.store.NoSuchDirectoryException: directory '/Users/saintnik/GitProjects/SourcererCC/clone-detector/fwdindex/1' does not exist
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:219)
    at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:243)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:743)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:52)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:66)
    at com.mondego.indexbased.CodeSearcher.<init>(CodeSearcher.java:47)
    at com.mondego.indexbased.SearchManager.initSearchEnv(SearchManager.java:820)
    at com.mondego.indexbased.SearchManager.main(SearchManager.java:377)
NODE_1 FAILED - Job 62732 exited with a status of 1
NODE_2 FAILED - Job 62733 exited with a status of 1
I'm following the tutorial on the Git page to run SourcererCC clone detection on the sample projects provided.
The error occurs when I run controller.py.
The output shows:
search will be carried out with 1 nodes
loading previous run state
previous run state 5
current state: 1
flushing current state 1
current state: 2
flushing current state 2
current state: 3
flushing current state 3
current state: 4
flushing current state 4
current state: 5
flushing current state 5
running new command /Users/yongjinc/Desktop/SourcererCC-master/clone-detector/runnodes.sh search 1
Traceback (most recent call last):
File "controller.py", line 180, in <module>
controller.execute()
File "controller.py", line 133, in execute
Check Log_search.log for more details. grep for FAILED in the log file")
__main__.ScriptControllerException: One or more nodes failed during Step Search.
Check Log_search.log for more details. grep for FAILED in the log file
There are only four files in the package, so I expected clone detection to be very quick, but it took 3 minutes, most of it in runnodes.sh. Is this normal, or can I change some configuration to make it faster?
http://mondego.ics.uci.edu/projects/SourcererCC/
This is the website given in the SourcererCC paper. I want to reproduce the paper, but I cannot download some of the tools. Can you provide new links for them?
When preparing the file assigned to FILE_projects_list in config.ini, DO NOT end it with an empty line:
path/to/project1.zip
path/to/project2.zip
Run python tokenizer.py zip, then check the log file and see:
[INFO] (MainThread) Starting zip project <1, path/to/project1.zip> (process 0)
...
[INFO] (MainThread) Starting zip project <2, path/to/project2.zi> (process 0)
The path of the last project is handled incorrectly, which results in the project not being found.
This may be caused by proj_paths.append(line[:-1]) in tokenizers/file-level/tokenizer.py.
I recommend using line.strip() instead of line[:-1].
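The difference can be seen in a tiny repro (hypothetical paths, matching the log above where the final path loses its last character when the file lacks a trailing newline):

```python
# Lines as read from a project list whose last line has no trailing newline.
lines = ["path/to/project1.zip\n", "path/to/project2.zip"]

bad = [line[:-1] for line in lines]      # chops the last char of the final path
good = [line.strip() for line in lines]  # removes only surrounding whitespace

assert bad == ["path/to/project1.zip", "path/to/project2.zi"]
assert good == ["path/to/project1.zip", "path/to/project2.zip"]
```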
I ran all the steps according to the README.
When I run python controller.py, it outputs:
search will be carried out with 1 nodes
loading previous run state
previous run state 0
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/execute.sh 1
current state: 1
flushing current state 1
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/backup-gtpm.sh
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh init 1
current state: 2
flushing current state 2
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh index 1
current state: 3
flushing current state 3
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/move-index.sh
current state: 4
flushing current state 4
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/execute.sh 1
current state: 5
flushing current state 5
running new command /home/v-ensh/workspace/SourcererCC/clone-detector/runnodes.sh search 1
current state: 0
flushing current state 0
SUCCESS: Search Completed on all nodes
Then, when we run cat clone-detector/NODE_*/output8.0/query_* > results.pairs, the error is
cat: 'clone-detector/NODE_/output8.0/query_': No such file or directory
Only the file report.csv is present in clone-detector/NODE_*/output8.0.
I have run the SourcererCC clone detector on a little more than 35,000,000 files. The resulting clone-pair file consists of over 18,000,000,000 lines. Of these, 5 lines contain more than 4 comma-separated numbers (which is the expected format):
263694,263710,455981,41668,70616
591916,1015368,508215,591934,1015376,192522,333749
14702,100025479,527866,914862,100025719,706877,1213095
502505,200858502537,200858458,1527027,102616237
1454158,2021454205,202495178,785203,101352033
The first one is located on line 1604224 in query_3clones_index_WITH_FILTER.txt, which is attached in zipped format (split into 3 parts since I cannot upload files larger than 10 MB): query_3clones_index_WITH_FILTER_1.txt.gz query_3clones_index_WITH_FILTER_2.txt.gz query_3clones_index_WITH_FILTER_3.txt.gz
The server that I ran on went down a couple of times, so one could imagine that 263694,<parts of an ID> was written before the crash and the next clone pair was written on the same line. However, I don't think that's the case: since SourcererCC starts from the last line logged in recovery.txt, I see two possibilities:
455981,41668,70616
, which I can't. My blocks file is 7.9 GB, so I won't attach it, but let me know if you need more information!
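For anyone hitting the same thing, a small script can locate such malformed lines before further processing. A minimal sketch, assuming the expected format is four comma-separated numeric fields as described above (the helper name is mine):

```python
def malformed_lines(lines, expected_fields=4):
    """Return (line_number, line) pairs that do not match the
    assumed clone-pair format of `expected_fields` numeric fields."""
    bad = []
    for n, line in enumerate(lines, start=1):
        fields = line.strip().split(",")
        if len(fields) != expected_fields or not all(f.isdigit() for f in fields):
            bad.append((n, line.strip()))
    return bad

sample = [
    "1453,1457,1464,1465",
    "263694,263710,455981,41668,70616",  # 5 fields -> flagged as malformed
]
assert malformed_lines(sample) == [(2, "263694,263710,455981,41668,70616")]
```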
Hello,
We tried to run SourcererCC on a zipped Java project and we got this error:
Traceback (most recent call last):
File "tokenizer.py", line 12, in <module>
import javalang
ImportError: No module named javalang
Then we tried with a different Java project and got the same error.
I can get the clone pairs in the result.pair file, but there is no similarity score for each clone pair. How can I get the score?
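The pair output does not carry a score, but if you keep the token bags you can recompute a similarity yourself. A hedged sketch, assuming the overlap measure from the SourcererCC paper (shared tokens divided by the size of the larger bag); this is an illustration, not the project's own code:

```python
from collections import Counter

def overlap_similarity(bag_a, bag_b):
    """Token-overlap similarity between two token multisets,
    normalized by the larger bag (assumed measure; illustrative)."""
    a, b = Counter(bag_a), Counter(bag_b)
    overlap = sum((a & b).values())  # multiset intersection size
    return overlap / max(sum(a.values()), sum(b.values()))

assert overlap_similarity("aabc", "abcc") == 0.75
assert overlap_similarity("abc", "abc") == 1.0
```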
I have got report.csv, blocksclones_index_WITH_FILTER.txt, and tokensclones_index_WITH_FILTER.txt.
The report.csv looks like this:
index_time | globalTokenPositionCreationTime | num_candidates | num_clonePairs | total_run_time | searchTime | timeSpentInSearchingCandidates | timeSpentInProcessResult | operation | sortTime_during_indexing |
---|---|---|---|---|---|---|---|---|---|
691 | 194 | 0 | 0 | 913 | 0 | 0 | 0 | index | 1 |
0 | 34 | 0 | 1 | 2183 | 2043 | 0 | 0 | search | |
812 | 166 | 0 | 0 | 987 | 0 | 0 | 0 | index | 1 |
995 | 315 | 0 | 0 | 1318 | 0 | 0 | 0 | index | 34 |
0 | 51 | 0 | 495 | 3240 | 3066 | 0 | 0 | search | |
0 | 154 | 0 | 495 | 3456 | 2914 | 0 | 0 | search | |
0 | 63 | 0 | 495 | 2941 | 2776 | 0 | 0 | search | |
0 | 50 | 0 | 495 | 2975 | 2825 | 0 | 0 | search |
The num_clonePairs is 495. So where are the details? The tokensclones_index_WITH_FILTER.txt is empty; the blocksclones_index_WITH_FILTER.txt looks like this:
1453,1457
1457,1458
1453,1458
1464,1465
1468,1469
1471,1472
1456,1457
1479,1480
1486,1487
1490,1491
1488,1490
1488,1491
1488,1492
1488,1493
1488,1494
1488,1495
1489,1490
1491,1492
1491,1493
1491,1494
1491,1495
1493,1494
1493,1495
1492,1493
1492,1494
1492,1495
1490,1492
1494,1495
1502,1508
1505,1506
1506,1507
1503,1509
1524,1525
1523,1524
1523,1525
1540,1541
1545,1546
1545,1547
1545,1548
1546,1547
1546,1548
1547,1548
Some tokens in some of the tokenized files seem problematic for SourcererCC.
Here is an example of stderr when the indexing fails; the offending content is e.g. weird whitespace or characters like ||:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
at noindex.CloneHelper.deserialise(Unknown Source)
at indexbased.SearchManager.doIndex(Unknown Source)
at indexbased.SearchManager.main(Unknown Source)
While indexing, there are a lot of EXCEPTION CAUGHT messages coming from caught ArrayIndexOutOfBoundsExceptions in CloneHelper.java. I'm not sure whether that's a problem or not.
Also, while searching I am getting a lot of "ERROR: more that one doc found. some error here." messages.
Maybe these problems come from some small things in the tokenization process; do you have any ideas what it might be? For now, I will update the handling of weird whitespace and see if that helps.
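As a stopgap on the tokenizer side, one could scrub tokens of whitespace and delimiter characters before serialization. A hypothetical sketch (not the project's code; the delimiter characters `,` and `@` are assumed from the blocks.file format shown elsewhere in this thread):

```python
import re

def sanitize_token(tok):
    """Strip characters that would break a ','/'@#@'-delimited
    serialization: embedded whitespace, commas, and '@' signs.
    Hypothetical cleanup, not SourcererCC's own logic."""
    tok = re.sub(r"\s+", "", tok)  # drop embedded/weird whitespace
    return tok.replace(",", "").replace("@", "")

assert sanitize_token("fo o,b@ar") == "foobar"
assert sanitize_token("clean") == "clean"
```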
How do I locate the cloned code's line numbers in block-level mode?
search will be carried out with 2 nodes
loading previous run state
/home/michael/SourcererCC-master/clone-detector/scriptinator_metadata.scc doesn't exist, creating one with state EXECUTE_1
previous run state 0
running new command /home/michael/SourcererCC-master/clone-detector/execute.sh 1
Traceback (most recent call last):
File "controller.py", line 180, in <module>
controller.execute()
File "controller.py", line 147, in execute
"error in execute.sh script while preparing for init step.")
__main__.ScriptControllerException: error in execute.sh script while preparing for init step.
Why doesn't scriptinator_metadata.scc exist? This is running on Ubuntu.
Hi, the README is not clear about where to put the bookkeeping headers.file.
Does SourcererCC actually use this file, or is it just for manually checking the block/file that was found?
I set up a Python 3.7 environment using conda and installed the dependencies from requirements.txt.
I used test-env and set config.ini to python.
But I failed to run the block-level tokenizer. I kept getting this warning:
join() argument must be str or bytes, not 'ZipInfo'
The output looks like this:
GO
'zipblocks'format
*** Starting priority projects...
*** Starting regular projects...
Starting new process 0
*** No more projects to process. Waiting for children to finish...
GO
[INFO] (MainThread) Process 0 starting
[INFO] (MainThread) Starting zip project <11,test-env/2Shirt-SpellBurner.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/2Shirt-SpellBurner.zip
[INFO] (MainThread) Successfully ran process_zip_ball test-env/2Shirt-SpellBurner.zip
[INFO] (MainThread) Project finished <11,test-env/2Shirt-SpellBurner.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.001971micros | Zip: 0 Read: 0 Separators: 0micros Tokens: 0micros Write: 0micros Hash: 0 regex: 0
[INFO] (MainThread) Starting zip project <12,test-env/2xyo-indicator-ip.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/2xyo-indicator-ip.zip
[INFO] (MainThread) Attempting to process_file_contents test-env/2xyo-indicator-ip.zip\indicator-ip-master/test.py
[WARNING] (MainThread) Unable to open zip on <test-env/2xyo-indicator-ip.zip> (process 0)
[WARNING] (MainThread) join() argument must be str or bytes, not 'ZipInfo'
[INFO] (MainThread) Project finished <12,test-env/2xyo-indicator-ip.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.002480micros | Zip: -1 Read: -1 Separators: -1micros Tokens: -1micros Write: -1micros Hash: -1 regex: -1
[INFO] (MainThread) Starting zip project <13,test-env/3demax-Take-a-break.zip> (process 0)
[INFO] (MainThread) Attempting to process_zip_ball test-env/3demax-Take-a-break.zip
[INFO] (MainThread) Attempting to process_file_contents test-env/3demax-Take-a-break.zip\Take-a-break-master/examples/appmenu.py
[WARNING] (MainThread) Unable to open zip on <test-env/3demax-Take-a-break.zip> (process 0)
[WARNING] (MainThread) join() argument must be str or bytes, not 'ZipInfo'
[INFO] (MainThread) Project finished <13,test-env/3demax-Take-a-break.zip> (process 0)
[INFO] (MainThread) (0): Total: 0:00:00.002480micros | Zip: -1 Read: -1 Separators: -1micros Tokens: -1micros Write: -1micros Hash: -1 regex: -1
[INFO] (MainThread) Process 0 finished. 2 files in 0s.
Process 0 finished, 2 files processed (3000002). Current total: 2
*** All done. 2 files in 0:00:00.145329
I even tried diff.txt, though I know that's not the solution.
Has anyone else met the same problem?
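The warning quoted above is the TypeError message that os.path.join raises on Python 3.6/3.7 when handed a zipfile.ZipInfo object instead of a filename string; Python 2 tolerated looser string handling. A minimal, hypothetical repro of the failure mode, assuming the tokenizer joins the zip path with entries from infolist():

```python
import io
import os
import zipfile

# Build a small in-memory zip with one member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("pkg/test.py", "print('hi')\n")

with zipfile.ZipFile(buf) as zf:
    info = zf.infolist()[0]      # a ZipInfo object, not a string
    raised = False
    try:
        os.path.join("project.zip", info)  # TypeError on Python 3
    except TypeError:
        raised = True
    # Joining with the entry's filename attribute works instead.
    fixed = os.path.join("project.zip", info.filename)

assert raised
assert fixed.endswith("test.py")
```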