compare50 is currently under active development.
This is compare50, a fast and extensible plagiarism-detection tool.
License: GNU General Public License v3.0
Process SpawnProcess-4:
Traceback (most recent call last):
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
self.run()
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
call_item = call_queue.get(block=True)
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'Preprocessor' on <module '__main__' (built-in)>
It looks like something changed with pickling in Python 3.8.
A quick workaround for now: run compare50 with --debug.
I assume this isn't intentional, right?
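For context: Python 3.8 changed the default multiprocessing start method on macOS from fork to spawn, and spawn-based workers receive their work items by pickling, which stores classes by reference to an importable module. Anything defined only in a `__main__` that workers cannot re-import fails exactly like the traceback above. A minimal sketch of the underlying constraint (names are illustrative, not compare50's code):

```python
import pickle

class Preprocessor:
    """Top-level class: picklable by reference, so spawn workers can load it."""
    pass

def make_local_class():
    class LocalPreprocessor:  # defined inside a function: no importable path
        pass
    return LocalPreprocessor

# A top-level class round-trips fine.
roundtripped = pickle.loads(pickle.dumps(Preprocessor()))
print(type(roundtripped).__name__)  # Preprocessor

# A class without an importable qualified name cannot be pickled, which is
# analogous to "Can't get attribute 'Preprocessor' on <module '__main__'>".
try:
    pickle.dumps(make_local_class()())
except (pickle.PicklingError, AttributeError):
    print("pickling failed as expected")
```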
import compare50
import pygments.token
tok = compare50.Token(start=0, end=4, val=" ", type=pygments.token.Text)
print(list(compare50.preprocessors.split_on_whitespace([tok]))) # prints []
It appears that markdown code blocks mess up compare50's span ranges: the starting index of some tokens is reset whenever a new code block (```LANGUAGE) is encountered, presumably because the tokenizer treats separate code blocks as separate source files. For example:
a/foo.md:
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = true})
function love.draw()
push:start()
--draw here
push:finish()
end
```
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
windowWidth, windowHeight = windowWidth*.7, windowHeight*.7 --make the window a bit smaller than the screen itself
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = false})
function love.draw()
push:start()
--draw here
push:finish()
end
```
b/foo.md:
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = true})
function love.draw()
push:start()
--draw here
push:finish()
end
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
windowWidth, windowHeight = windowWidth*.7, windowHeight*.7 --make the window a bit smaller than the screen itself
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = false})
function love.draw()
push:start()
--draw here
push:finish()
end
```
Running compare50 on these produces the following:
$ compare50 --passes structure --verbose a/foo.md b/foo.md
...
Sorry, something's wrong! Let [email protected] know!
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/compare50/__main__.py", line 351, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/compare50/__main__.py", line 340, in main
pass_to_results[pass_] = _api.compare(scores, ignored_files, pass_)
File "/usr/local/lib/python3.7/site-packages/compare50/_api.py", line 76, in compare
for comparison in pass_.comparator.compare(scores, ignored_files):
File "/usr/local/lib/python3.7/site-packages/compare50/comparators/_winnowing.py", line 133, in compare
span_matches += _api.expand(index_a.compare(index_b), tokens_a, tokens_b)
File "/usr/local/lib/python3.7/site-packages/compare50/_api.py", line 231, in expand
span_tree_a.addi(new_span_a.start, new_span_a.end)
File "/usr/local/lib/python3.7/site-packages/intervaltree/intervaltree.py", line 330, in addi
return self.add(Interval(begin, end, data))
File "/usr/local/lib/python3.7/site-packages/intervaltree/intervaltree.py", line 313, in add
" {0}".format(interval)
ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(335, 173)
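The failing call tries to insert a span whose end (173) precedes its start (335), which intervaltree rejects as a null interval; that is consistent with the per-code-block offset resets described above producing negative-length spans. A hypothetical defensive filter (not compare50's actual fix, which would need to correct the offsets rather than drop spans):

```python
def valid_spans(spans):
    """Drop degenerate spans (end <= start) before building an interval
    tree; such spans can appear when token offsets reset mid-file."""
    return [(start, end) for start, end in spans if end > start]

print(valid_spans([(0, 10), (335, 173), (50, 50)]))  # [(0, 10)]
```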
This requires adding distro matches to the output of compare, which requires explicit matching against the distro indices. Should this explicit match happen only for the raw text? Should this be configurable?
Perhaps use < and > for previous/next match, and << and >> for previous/next file?
When clicking "Next" in match_#.html, the left side scrolls to the next match more quickly than the right side. Not sure if that's intentional, but for scanning matches quickly it would be more efficient if both sides scrolled at the same time.
The frontend should have a view for sorting the result pairs based on scores for different passes and a view that shows a pair of submissions side by side with shared fragments highlighted. Clicking a highlighted fragment should show a hyperlinked list of matching fragments in both submissions. There should be a toggle for turning on and off highlighting of fragments from different passes.
The JSON data will have the following schema:
{
    "files": [file paths...],
    "groups": [[file indices for submission group], ...],
    "results": [
        {
            "subs": [first submission group index, second submission group index],
            "passes": {
                pass name: [
                    score,
                    [
                        [[[file index A, start, stop] for fragment matching hash in first submission],
                         [[file index B, start, stop] for fragment matching hash in second submission]]
                        for each hash shared among the submissions
                    ]
                ],
                ...
            }
        },
        ...
    ]
}
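For concreteness, here is a hypothetical instance of that schema as a Python literal, with two single-file submissions, one pass, and one shared hash (all file names, scores, and offsets invented):

```python
# Hypothetical instance of the proposed schema; all values are invented.
result = {
    "files": ["a/foo.c", "b/foo.c"],
    "groups": [[0], [1]],          # each submission group holds one file
    "results": [
        {
            "subs": [0, 1],        # indices into "groups"
            "passes": {
                "structure": [
                    0.87,          # score for this pass
                    [
                        # one shared hash: fragments in each submission
                        [[[0, 12, 48]],   # file index 0, offsets 12-48
                         [[1, 30, 66]]],  # file index 1, offsets 30-66
                    ],
                ],
            },
        },
    ],
}
print(result["results"][0]["passes"]["structure"][0])  # 0.87
```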
This should dramatically reduce memory usage and possibly improve performance due to better locality.
e.g. if they ask us to compare a folder with many subfolders they may have intended to add /* or similar.
Similarly we may want to blacklist/warn about certain file extensions. Like, "do you really want to include this pdf in the submission?"
Being able to rank the output by any pass, and by a combined ranking, would both be somewhat helpful. The first should be pretty easy; the second is a bit trickier.
At the moment, this might need to be done with Ajax in order to POST both files as File objects to https://cs50.readthedocs.io/render/, e.g. by creating them from Strings first.
This probably has to be a separate script that parses a collection of index.html files produced by compare50. But it helps detect patterns of students overcollaborating across many psets.
For example, if there are three matched regions, and I manually click on match 2, then I click "next", I would expect to be taken to match 3. Instead, match 2 is highlighted again.
For each matching area, show how many files match ("2 files" means just the current two files), similar to etector. Etector gives details in a tooltip and gives more unique matches a larger font size; the former is very helpful, but the latter can make comparing files difficult.
Knowing how unique a match is would be very helpful for determining which cases to refer and for articulating to the committee how improbable the similarity is.
... is any freeform string (for now).
# cs50: ...
// cs50: ...
/* cs50: ...
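A sketch of how such directives might be recognized (the regex and the handling of the `cs50:` marker here are illustrative, not compare50's implementation):

```python
import re

# Match "cs50: ..." after a #, //, or /* comment opener; the captured
# group is the freeform string.
DIRECTIVE = re.compile(r"(?:#|//|/\*)\s*cs50:\s*(.*)")

m = DIRECTIVE.search("// cs50: ignore this region")
print(m.group(1))  # ignore this region
```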
Ignore .files?
Try to decode, ignore when not possible => show warning on result page?
Benedict mentioned this in today's meeting, just so that we could have a history of past runs.
Code/text is gray, while line numbers are black. This should be inverted, please!
Is it necessary to embed, e.g., Proxima-Nova and Consolas rather than just relying on the browser's built-in fonts?
compare50 cash/submissions/* -a cash/archives/2012/fall/* cash/archives/2013/fall/* cash/archives/2014/fall/* cash/archives/2015/fall/* cash/archives/2015/spring/* cash/archives/2016/fall/* cash/archives/2017/fall/* cash/archives/2018/fall/* cash/archives/2018/spring/* cash/archives/honeypots/03052019/*
Can I just do cash/archives/* even though that does not constitute one submission?
If a set of submissions forms a large cluster, and the threshold increases such that the cluster breaks up into smaller clusters, hovering over one of the submissions still highlights all of the submissions in the original cluster rather than the new smaller one. This is usually not ideal because the original cluster is often very large and therefore not too meaningful, since the threshold starts low.
This keeps happening for the homepage assignment. @crossroads1112 @Jelleas is there a workaround to stop it from timing out? I assume this will probably take the longest to run since some students have a lot of .html files and compare50 is trying to compare each student's html files against every other .html file in the archives and submissions folder.
Even on large TVs, when reviewing code it would be helpful to leverage the full viewport width.
Request from the board: also show common approaches to a problem, i.e., what most students do. This would help answer the question of why it is telling that an approach differs in a suspected plagiarism case.
Problem: We store submissions like so:
<student>/<problem>__<timestamp>
For this we have a small script that selects just the last submitted problem for comparison. Would be nice if we could pipe the result from said script to compare50.
Based on experience, we always started with "exact" when reviewing.
e.g. \cdot for tab characters and trailing spaces.
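A minimal sketch of rendering such markers (the middle-dot character and the function name are arbitrary choices, not compare50's):

```python
def show_whitespace(line):
    """Replace tabs and trailing spaces with a visible marker ('·')."""
    stripped = line.rstrip(" ")
    trailing = len(line) - len(stripped)
    return stripped.replace("\t", "·") + "·" * trailing

print(show_whitespace("foo\tbar  "))  # foo·bar··
```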
$ cli50
$ sudo pip install compare50
$ ls -l /usr/local/lib/python3.7/site-packages/compare50/comparators/
total 1436
-rw-r--r-- 1 root root 74 Aug 7 13:06 __init__.py
drwxr-xr-x 2 root root 4096 Aug 7 13:06 __pycache__/
-rw-r--r-- 1 root root 4078 Aug 7 13:06 _misspellings.py
-rw-r--r-- 1 root root 13639 Aug 7 13:06 _winnowing.py
-rw------- 1 root root 1439228 Aug 7 13:06 english_dictionary.txt
Requested by Erin
Do -x and -i not handle exact matches?