duplicate-code-detection-tool's Issues

--ignore-directories option does not work on gitbash (Windows)

Hello,
I'm using:
> duplicate_code_detection.py --ignore-directories general/util/async/file/test -d general
in gitbash terminal on Windows, but the files in general/util/async/file/test are not ignored.

Value example of involved variables:

ignore_directories = ['general/util/async/file/test']
files_to_ignore = ['general/util/async/file/test\\util_async_file_test\\File_test.cpp', ...]
source_code_files = [..., 'general\\util\\async\\file\\test\\util_async_file_test\\File_test.cpp',...]

I fixed it, do you want me to submit a PR ? :-)
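
The values above differ only in their path separators: the ignore entry uses `/` while the collected files use `\`. A minimal sketch of one possible fix, which normalizes both sides before comparing. The helper names `normalize` and `is_ignored` are hypothetical and not taken from the tool, and blindly replacing backslashes assumes no file name contains a literal backslash:

```python
def normalize(path):
    """Use forward slashes and drop trailing separators so paths compare equal."""
    return path.replace("\\", "/").rstrip("/")


def is_ignored(file_path, ignore_directories):
    """Return True if file_path lies inside any of the ignored directories."""
    normalized_file = normalize(file_path)
    for directory in ignore_directories:
        # Append "/" so that "test" does not also match "tester".
        prefix = normalize(directory) + "/"
        if normalized_file.startswith(prefix):
            return True
    return False


ignore_directories = ["general/util/async/file/test"]
windows_path = "general\\util\\async\\file\\test\\util_async_file_test\\File_test.cpp"
print(is_ignored(windows_path, ignore_directories))
```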

Reuse of the previous comment

It would be better to reuse (overwrite) the existing comment (the one that starts with "Duplicate code detection tool report") to avoid being flooded with bot comments.

Output as a CSV?

I'm thinking about comparing SQL written by data analysts and presenting the similarity results in a familiar form, e.g. a table.
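
A rough sketch of what such an export could look like, assuming the similarity results are available as a nested dictionary of the form {file A: {file B: percentage}}. That structure, the column names, the function name `similarities_to_csv` and the sample file names are all assumptions made for illustration:

```python
import csv
import io


def similarities_to_csv(similarities):
    """Flatten {file_a: {file_b: percent}} into CSV text, one row per pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["File A", "File B", "Similarity"])
    for file_a, matches in similarities.items():
        for file_b, percent in matches.items():
            writer.writerow([file_a, file_b, percent])
    return buffer.getvalue()


# Hypothetical result for two SQL files.
example = {"queries/a.sql": {"queries/b.sql": 87.5}}
print(similarities_to_csv(example))
```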

feature request: update comments for different detections

I use the tool separately for different parts of my project. My workflow looks like the one below and checks three parts of my code:

name: "Duplicate code"

on:
  issue_comment:
    types:
      - created
permissions:
  contents: read
  pull-requests: write

jobs:
  duplicate-code-check:
    name: Check for duplicate code
    runs-on: ubuntu-latest
    if: github.event.issue.pull_request && contains(github.event.comment.body, 'run_duplicate_code_detection')
    steps:
      - name: Check for duplicate code(core)
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "core"
          file_extensions: "java, py"
          one_comment: true

      - name: Check for duplicate code(datasource)
        if: always()
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "dataSources"
          file_extensions: "java, py"
          one_comment: true

      - name: Check for duplicate code(session, shared)
        if: always()
        uses: platisd/duplicate-code-detection-tool@master
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          directories: "session, shared"
          file_extensions: "java, py"
          one_comment: true

I would like to have one comment in the PR for each part (three comments in total). When the workflow runs again, it should update these three comments.

For now, it can only update the latest comment, so it creates one comment and edits it twice. If I choose to create a new comment every time instead, there will be too many comments.
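
One way this could work is to tag each report with a hidden marker naming the part it belongs to, then look for that marker before posting. The sketch below only shows the lookup. The marker format, the function names and the shape of the comment dictionaries are hypothetical, and fetching or updating comments through the GitHub API is left out:

```python
def marker_for(part_name):
    """Build a hidden HTML marker that identifies the comment for one part."""
    return "<!-- duplicate-code-report:" + part_name + " -->"


def find_comment_to_update(comments, part_name):
    """Return the id of the existing comment for part_name, or None."""
    marker = marker_for(part_name)
    for comment in comments:
        if marker in comment["body"]:
            return comment["id"]
    return None


# Hypothetical comments, shaped as a list of {"id", "body"} dictionaries.
comments = [
    {"id": 1, "body": marker_for("core") + "\nreport for core"},
    {"id": 2, "body": marker_for("dataSources") + "\nreport for dataSources"},
]
print(find_comment_to_update(comments, "dataSources"))
print(find_comment_to_update(comments, "session, shared"))
```

A result of None would mean that no comment exists yet for that part, so a new one would be created.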

Feature request: Option to filter some files from input directory

It would be nice to have an option to run the comparison on a filtered subset of the files in the directory rather than on all of them.
Implementation idea: an input argument containing a regular expression that is applied to the list of files.

If you don't mind, I would volunteer to implement it too.
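
A minimal sketch of the idea, assuming the file list is a plain list of path strings. The function name `filter_files` and the sample paths are hypothetical:

```python
import re


def filter_files(files, pattern):
    """Keep only the files whose path matches the regular expression."""
    regex = re.compile(pattern)
    return [path for path in files if regex.search(path)]


files = ["src/main.cpp", "src/main_test.cpp", "docs/readme.md"]
# Keep .cpp files under src/ except those ending in _test.cpp.
print(filter_files(files, r"^src/(?!.*_test\.cpp$).*\.cpp$"))
```

Using `search` rather than `match` lets the caller decide whether to anchor the pattern with `^` and `$`.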

No module named 'gensim'

python duplicate_code_detection.py -d smartcar_shield/src
Traceback (most recent call last):
  File "E:\nbu\similar\code\duplicate-code-detection-tool\duplicate_code_detection.py", line 10, in <module>
    import gensim
ModuleNotFoundError: No module named 'gensim'

Reports files without duplicated code

Hi, I've been using this GH Action with ignore_below: 20, but the report still shows up even when no similarities are found at all.

Example: (screenshot taken 2022-07-29 at 14:56:51)

I might be doing something wrong, but I copied the configuration directly from the provided example.

Thanks

feature request: Ignore comments

I'm getting high similarity scores, even though when I analyze the files with grep -Fxf FILE1 FILE2 it's just comments. It would be nice to ignore comments.
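
As an illustration of what ignoring comments could mean in practice, the sketch below strips C-style comments from source text before it is compared. It is deliberately naive and not the tool's behavior: the pattern and the function name are assumptions, and the pattern does not understand string literals, so a `//` inside a string would also be cut:

```python
import re

# Matches // line comments and /* block */ comments.
C_STYLE_COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)


def strip_c_style_comments(source):
    """Remove C-style comments from source text before comparing files."""
    return C_STYLE_COMMENT.sub("", source)


code = "int a = 1; // counter\n/* license\n text */\nint b = 2;\n"
print(strip_c_style_comments(code))
```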

How to use it

I don't know how to use it. I have a Django project that runs in Docker in my local development environment, and I need to run this tool within it. How?

Github Action fails to post when too many characters

I am running on a legacy repo and it has lots of duplication. So much in fact that the output file is larger than Github allows to be posted to a comment.

Posting results to GitHub failed with code: 422
{"message":"Validation Failed","errors":[{"resource":"IssueComment","code":"unprocessable","field":"data","message":"Body is too long (maximum is 65536 characters)"}],"documentation_url":"https://docs.github.com/rest/reference/issues#create-an-issue-comment"}

Can the output be clamped in size to avoid this error?

My temporary workaround is to increase the ignore_below value.
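
A sketch of how the body could be clamped before posting, assuming the limit applies to the character count as the error message suggests. The function name and the wording of the notice are hypothetical, and cutting at a fixed position may split a table row in the report:

```python
MAX_COMMENT_LENGTH = 65536  # limit quoted in the error message above
TRUNCATION_NOTICE = "\n\n(report truncated to fit the comment size limit)"


def clamp_comment(body, limit=MAX_COMMENT_LENGTH):
    """Cut body so that the body plus a notice never exceeds limit characters."""
    if len(body) <= limit:
        return body
    keep = limit - len(TRUNCATION_NOTICE)
    return body[:keep] + TRUNCATION_NOTICE


long_report = "x" * 100000
print(len(clamp_comment(long_report)))
```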

clone doesn't work for private repositories

I get this error when trying to run the action on a private repo (when I set the repo to public it works just fine)
fatal: could not read Password for 'https://***@github.com': No such device or address

I've looked around and it seems like it's the same issue as this one

Here is a log of my run:

Run platisd/duplicate-code-detection-tool@…
  with:
    github_token: ***
    project_root_dir: projects
    directories: .
    fail_above: 70
    ignore_below: 0
    file_extensions: h, hpp, c, cpp, cc, java, py, cs
    warn_above: 100
/usr/bin/docker run --name a95ec8fa99b7c7894e8abe71ec2eb60de7cf_d2f098 --label 29a95e --workdir /github/workspace --rm -e INPUT_GITHUB_TOKEN -e INPUT_PROJECT_ROOT_DIR -e INPUT_DIRECTORIES -e INPUT_FAIL_ABOVE -e INPUT_IGNORE_BELOW -e INPUT_IGNORE_DIRECTORIES -e INPUT_FILE_EXTENSIONS -e INPUT_WARN_ABOVE -e HOME -e GITHUB_JOB -e GITHUB_REF -e GITHUB_SHA -e GITHUB_REPOSITORY -e GITHUB_REPOSITORY_OWNER -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RETENTION_DAYS -e GITHUB_RUN_ATTEMPT -e GITHUB_ACTOR -e GITHUB_WORKFLOW -e GITHUB_HEAD_REF -e GITHUB_BASE_REF -e GITHUB_EVENT_NAME -e GITHUB_SERVER_URL -e GITHUB_API_URL -e GITHUB_GRAPHQL_URL -e GITHUB_REF_NAME -e GITHUB_REF_PROTECTED -e GITHUB_REF_TYPE -e GITHUB_WORKSPACE -e GITHUB_ACTION -e GITHUB_EVENT_PATH -e GITHUB_ACTION_REPOSITORY -e GITHUB_ACTION_REF -e GITHUB_PATH -e GITHUB_ENV -e GITHUB_STEP_SUMMARY -e RUNNER_OS -e RUNNER_ARCH -e RUNNER_NAME -e RUNNER_TOOL_CACHE -e RUNNER_TEMP -e RUNNER_WORKSPACE -e ACTIONS_RUNTIME_URL -e ACTIONS_RUNTIME_TOKEN -e ACTIONS_CACHE_URL -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/home/runner/work/_temp/_github_home":"/github/home" -v "/home/runner/work/_temp/_github_workflow":"/github/workflow" -v "/home/runner/work/_temp/_runner_file_commands":"/github/file_commands" -v "/home/runner/work/testing-git-hub-actions/testing-git-hub-actions":"/github/workspace" 29a95e:c8fa99b7c7894e8abe71ec2eb60de7cf
Cloning into 'itaykraise-vayyar/testing-git-hub-actions'...
fatal: could not read Password for 'https://***@github.com': No such device or address

What do you think?

Request to provide output markdown file

@platisd

Is it possible for you to provide an output Markdown file that I can publish as a check run using another GitHub Action? Alternatively, if you could publish the same Markdown table as a check run, that would also be good enough for me.

Is it suitable for code similarity checks?

From the source code, I see that gensim is used. Using gensim for text similarity is cool, but is it suitable for checking code similarity? For example, I want to compare two Unity projects: all scripts must follow C# syntax and share many C# keywords, and because they use the Unity engine, they all contain the same framework statements such as 'using UnityEngine'. Can gensim ignore these, so that they don't cause false matches?
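
Part of the answer depends on how tokens are weighted. If the comparison uses TF-IDF weighting (an assumption here, since the issue only states that gensim is used), a token that appears in every file gets an inverse document frequency of zero and stops contributing to the score. The pure-Python sketch below illustrates that effect with log2(N / df); the two token lists are made-up examples and the function is not part of gensim:

```python
import math


def inverse_document_frequency(documents):
    """Map each token to log2(N / df), where df counts documents containing it."""
    total = len(documents)
    document_frequency = {}
    for tokens in documents:
        for token in set(tokens):
            document_frequency[token] = document_frequency.get(token, 0) + 1
    return {
        token: math.log2(total / count)
        for token, count in document_frequency.items()
    }


# Hypothetical token lists for two Unity scripts.
scripts = [
    ["using", "UnityEngine", "Player", "Jump"],
    ["using", "UnityEngine", "Enemy", "Patrol"],
]
weights = inverse_document_frequency(scripts)
print(weights["UnityEngine"], weights["Player"])
```

Here the shared token gets weight 0.0 and the script-specific token gets 1.0, so boilerplate common to all files would matter less than the code that differs.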

Feature proposal: add an option to add LoC in outputs

Hello,
To help with analyzing the results, it would be nice to add an option --with-loc that adds the lines-of-code count for each file.
The outputs would be:

  • text
Code duplication probability for general\bar\EventLoop.cpp,351
--------------------------------------------------------------
                   File,#LoC                     Similarity (%)
--------------------------------------------------------------
general\util\async\EventLoop.cpp,351                 100.00
general\foo\EventLoop.cpp,351                        100.00
  • csv
File A,#LoC A,File B,#LoC B,Similarity
general\bar\EventLoop.cpp,351,general\foo\EventLoop.cpp,351,100.0
general\bar\EventLoop.cpp,351,general\util\async\EventLoop.cpp,351,100.0
  • json
{
    "general\\bar\\EventLoop.cpp": {
        "#LoC": 351,
        "general\\foo\\EventLoop.cpp": {
            "#LoC": 351,
            "similarity": 100.0
        },
        "general\\util\\async\\EventLoop.cpp": {
            "#LoC": 351,
            "similarity": 100.0
        }
    }
}
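
A sketch of how the count behind such an option could be obtained, assuming "LoC" means every line in the file, including blank lines and comments. The function names are hypothetical, and the temporary file exists only to make the example runnable:

```python
import os
import tempfile


def count_lines(path):
    """Count the lines in a text file, including blank and comment lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for _ in handle)


def label_with_loc(path):
    """Format a file as 'path,loc', mirroring the text output proposed above."""
    return path + "," + str(count_lines(path))


with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "EventLoop.cpp")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("int main() {\n    return 0;\n}\n")
    print(label_with_loc(sample))
```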
