platisd / duplicate-code-detection-tool
A simple Python3 tool to detect similarities between files within a repository
License: MIT License
Hello,
I'm using:
> duplicate_code_detection.py --ignore-directories general/util/async/file/test -d general
in a Git Bash terminal on Windows, but the files in general/util/async/file/test are not ignored.
Example values of the variables involved:
ignore_directories = ['general/util/async/file/test']
files_to_ignore = ['general/util/async/file/test\\util_async_file_test\\File_test.cpp', ...]
source_code_files = [..., 'general\\util\\async\\file\\test\\util_async_file_test\\File_test.cpp',...]
I fixed it. Do you want me to submit a PR? :-)
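The values above suggest the mismatch comes from comparing paths that use different separators. A minimal sketch of one possible fix, assuming both sides are normalized to forward slashes before comparison; `normalize` and `is_ignored` are hypothetical names for illustration, not functions taken from the tool:

```python
import posixpath


def normalize(path):
    """Return the path with forward slashes and no redundant segments."""
    return posixpath.normpath(path.replace("\\", "/"))


def is_ignored(source_file, ignore_directories):
    """Check whether source_file lies inside any of the ignored directories."""
    file_path = normalize(source_file)
    for directory in ignore_directories:
        # The trailing slash prevents "test" from also matching "testing".
        prefix = normalize(directory) + "/"
        if file_path.startswith(prefix):
            return True
    return False


ignore_directories = ["general/util/async/file/test"]
source_file = "general\\util\\async\\file\\test\\util_async_file_test\\File_test.cpp"
print(is_ignored(source_file, ignore_directories))  # True
```

Normalizing explicitly, rather than relying on the platform's separator, keeps the behavior the same under Git Bash, cmd, and Linux.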
It would be better to reuse (overwrite) the existing comment (the one that starts with Duplicate code detection tool report) to avoid being overloaded by bot comments.
I'm thinking about being able to compare SQL written by data analysts and present the similarity results in a familiar form, e.g. a table.
I use the tool separately for different parts of my project. My workflow looks like the one below and checks 3 parts of my code:
name: "Duplicate code"
on:
  issue_comment:
    types:
      - created
permissions:
  contents: read
  pull-requests: write
jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-latest
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection')
    steps:
      - name: Check for duplicate code(core)
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "core"
          file_extensions: "java, py"
          one_comment: true
      - name: Check for duplicate code(datasource)
        if: always()
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "dataSources"
          file_extensions: "java, py"
          one_comment: true
      - name: Check for duplicate code(session, shared)
        if: always()
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "session, shared"
          file_extensions: "java, py"
          one_comment: true
I would like to have one comment in the PR for each part (3 comments in total), and when the workflow runs again, it should update those 3 comments.
For now, it can only update the latest comment, so it creates one comment and then edits it twice. But if I choose to create a new comment every time, there will be too many comments.
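One way this request could work is to give each part its own marker and update the comment that carries it. The sketch below only illustrates the selection step, assuming comments arrive oldest first as dictionaries with `id` and `body` keys; the marker format and the function name `find_comment_to_update` are hypothetical, not part of the action:

```python
def find_comment_to_update(comments, marker):
    """Return the id of the newest comment whose body starts with marker, or None."""
    for comment in reversed(comments):
        if comment["body"].startswith(marker):
            return comment["id"]
    return None


# Hypothetical comment data; a real run would fetch this from the GitHub API.
comments = [
    {"id": 1, "body": "<!-- duplicate-code: core -->\nreport for core"},
    {"id": 2, "body": "<!-- duplicate-code: dataSources -->\nreport for dataSources"},
    {"id": 3, "body": "Looks good to me"},
]

print(find_comment_to_update(comments, "<!-- duplicate-code: core -->"))     # 1
print(find_comment_to_update(comments, "<!-- duplicate-code: session -->"))  # None
```

A `None` result would mean that no comment exists yet for that part, so a new one should be created instead of edited.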
It would be nice to have an option to run the comparison not on all of the files within the directory, but only on a filtered subset of them.
Implementation idea: an input argument holding a regular expression that is applied to the list of files.
If you don't mind, I would volunteer to implement it too.
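The idea can be sketched in a few lines, assuming the file list is a list of path strings and the expression is applied with `re.search`; `filter_files` is a hypothetical name, not an existing function or option of the tool:

```python
import re


def filter_files(files, pattern):
    """Keep only the files whose path matches the regular expression."""
    regex = re.compile(pattern)
    return [f for f in files if regex.search(f)]


files = ["core/a.py", "core/a_test.py", "core/b.java", "docs/readme.md"]

# Python files under core/, excluding those whose name ends in _test.
print(filter_files(files, r"^core/.*(?<!_test)\.py$"))  # ['core/a.py']
```

Using `re.search` rather than `re.fullmatch` lets short patterns such as `\.py$` work without having to describe the whole path.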
I'm getting high similarity scores, even though when I analyze the files with grep -Fxf FILE1 FILE2, the matching lines are just comments. It would be nice to be able to ignore comments.
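A rough sketch of what such an option might do, assuming only whole-line comments are dropped before comparison; block comments and trailing comments are left alone, and the prefixes have to be chosen per language (in C or C++, for example, `#` introduces preprocessor directives rather than comments). `strip_comment_lines` is a hypothetical helper, not part of the tool:

```python
def strip_comment_lines(source, prefixes):
    """Remove lines that consist only of a single-line comment."""
    kept = []
    for line in source.splitlines():
        # str.startswith accepts a tuple and checks each prefix in turn.
        if line.strip().startswith(prefixes):
            continue
        kept.append(line)
    return "\n".join(kept)


cpp_source = "// License header\nint x = 1; // trailing comment is kept\n"
print(strip_comment_lines(cpp_source, ("//",)))
```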
The main business logic in duplicate_code_detection.py lacks unit tests that run automatically on CI.
Let's introduce some!
I don't know how to use it. I have a Django project that runs in Docker in my local development environment, and I need to run this tool within it. How?
I am running the tool on a legacy repo that has lots of duplication. So much, in fact, that the output is larger than GitHub allows in a comment:
Posting results to GitHub failed with code: 422
{"message":"Validation Failed","errors":[{"resource":"IssueComment","code":"unprocessable","field":"data","message":"Body is too long (maximum is 65536 characters)"}],"documentation_url":"https://docs.github.com/rest/reference/issues#create-an-issue-comment"}
Can the output be clamped in size to avoid this error?
My temporary workaround is to increase the ignore_below value.
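Clamping could look roughly like the sketch below, assuming the limit is the 65536 characters quoted in the error message; the function name and the truncation notice are hypothetical:

```python
MAX_COMMENT_LENGTH = 65536  # limit quoted in the GitHub error message above
TRUNCATION_NOTICE = "\n\n(report truncated)"


def clamp_comment(body, limit=MAX_COMMENT_LENGTH):
    """Shorten body so that it fits within limit characters."""
    if len(body) <= limit:
        return body
    # Leave room for the notice so the final length never exceeds the limit.
    return body[: limit - len(TRUNCATION_NOTICE)] + TRUNCATION_NOTICE


report = "x" * 100000
print(len(clamp_comment(report)))  # 65536
```

A plain character cut can land in the middle of a markdown table, so a fuller version would probably cut at a line boundary instead.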
I get this error when trying to run the action on a private repo (when I set the repo to public, it works just fine):
fatal: could not read Password for 'https://***@github.com': No such device or address
I've looked around and it seems like it's the same issue as this one
Here is a log of my run:
Run platisd/[email protected]
with:
  github_token: ***
  project_root_dir: projects
  directories: .
  fail_above: 70
  ignore_below: 0
  file_extensions: h, hpp, c, cpp, cc, java, py, cs
  warn_above: 100
/usr/bin/docker run --name a95ec8fa99b7c7894e8abe71ec2eb60de7cf_d2f098 --label 29a95e --workdir /github/workspace --rm -e INPUT_GITHUB_TOKEN -e INPUT_PROJECT_ROOT_DIR -e INPUT_DIRECTORIES -e INPUT_FAIL_ABOVE -e INPUT_IGNORE_BELOW -e INPUT_IGNORE_DIRECTORIES -e INPUT_FILE_EXTENSIONS -e INPUT_WARN_ABOVE -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e GITHUB_STEP_SUMMARY -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/testing-git-hub-actions/testing-git-hub-actions":"/github/workspace" 29a95e:c8fa99b7c7894e8abe71ec2eb60de7cf
Cloning into 'itaykraise-vayyar/testing-git-hub-actions'...
fatal: could not read Password for 'https://***@github.com': No such device or address
What do you think?
Is it possible for you to provide an output markdown file that I can publish as a check run using another GitHub Action? Alternatively, if you could publish the same markdown table as a check run, that would also be good enough for me.
What the title says, let's fix it
From the source code, I see that gensim is used here. Using gensim for text similarity checks is cool, but is it suitable for code similarity checks? For example, I want to compare two Unity projects. All scripts must follow C# syntax, with many C# keywords, and because they use the Unity Engine, the same framework statements such as 'using UnityEngine' appear everywhere. Can gensim ignore these, so that they do not cause false matches?
Hello,
To help with analyzing the results, it would be nice to add an option --with-loc that adds the lines-of-code count for each file.
The outputs would be:
Code duplication probability for general\bar\EventLoop.cpp,351
--------------------------------------------------------------
File,#LoC Similarity (%)
--------------------------------------------------------------
general\util\async\EventLoop.cpp,351 100.00
general\foo\EventLoop.cpp,351 100.00
File A,#LoC A,File B,#LoC B,Similarity
general\bar\EventLoop.cpp,351,general\foo\EventLoop.cpp,351,100.0
general\bar\EventLoop.cpp,351,general\util\async\EventLoop.cpp,351,100.0
{
    "general\\bar\\EventLoop.cpp": {
        "#LoC": 351,
        "general\\foo\\EventLoop.cpp": {
            "#LoC": 351,
            "similarity": 100.0
        },
        "general\\util\\async\\EventLoop.cpp": {
            "#LoC": 351,
            "similarity": 100.0
        }
    }
}
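The count itself could be computed along these lines. The proposal does not define what counts as a line of code, so this sketch assumes non-blank lines; `count_lines_of_code` is a hypothetical helper, not an existing function of the tool:

```python
def count_lines_of_code(text):
    """Count the non-blank lines in a source file's contents."""
    return sum(1 for line in text.splitlines() if line.strip())


sample = "int main() {\n\n    return 0;\n}\n"
print(count_lines_of_code(sample))  # 3
```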