compare50 is currently under active development.
This is compare50, a fast and extensible plagiarism-detection tool.
License: GNU General Public License v3.0
Process SpawnProcess-4:
Traceback (most recent call last):
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
self.run()
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/concurrent/futures/process.py", line 233, in _process_worker
call_item = call_queue.get(block=True)
File "/usr/local/var/pyenv/versions/3.8.0/lib/python3.8/multiprocessing/queues.py", line 116, in get
return _ForkingPickler.loads(res)
AttributeError: Can't get attribute 'Preprocessor' on <module '__main__' (built-in)>
It looks like something changed with pickling in Python 3.8.
A quick workaround for now: run compare50 with --debug.
I assume this isn't intentional, right?
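For context: Python 3.8 changed the default multiprocessing start method on macOS from fork to spawn, and spawn-based workers receive their work items by pickling, which stores classes by reference to an importable module. Anything defined only in a `__main__` that workers cannot re-import fails exactly like the traceback above. A minimal sketch of the underlying constraint (names are illustrative, not compare50's code):

```python
import pickle

class Preprocessor:
    """Top-level class: picklable by reference, so spawn workers can load it."""
    pass

def make_local_class():
    class LocalPreprocessor:  # defined inside a function: no importable path
        pass
    return LocalPreprocessor

# A top-level class round-trips fine.
roundtripped = pickle.loads(pickle.dumps(Preprocessor()))
print(type(roundtripped).__name__)  # Preprocessor

# A class without an importable qualified name cannot be pickled, which is
# analogous to "Can't get attribute 'Preprocessor' on <module '__main__'>".
try:
    pickle.dumps(make_local_class()())
except (pickle.PicklingError, AttributeError):
    print("pickling failed as expected")
```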
import compare50
import pygments.token
tok = compare50.Token(start=0, end=4, val=" ", type=pygments.token.Text)
print(list(compare50.preprocessors.split_on_whitespace([tok]))) # prints []
It appears that markdown code blocks mess up compare50's span ranges: the starting index of some tokens is reset whenever a new code block (```LANGUAGE) is encountered, presumably because the tokenizer treats separate code blocks as separate source files. For example:
a/foo.md:
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = true})
function love.draw()
push:start()
--draw here
push:finish()
end
```
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
windowWidth, windowHeight = windowWidth*.7, windowHeight*.7 --make the window a bit smaller than the screen itself
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = false})
function love.draw()
push:start()
--draw here
push:finish()
end
```
b/foo.md:
```lua
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = true})
function love.draw()
push:start()
--draw here
push:finish()
end
local push = require "push"
local gameWidth, gameHeight = 1080, 720 --fixed game resolution
local windowWidth, windowHeight = love.window.getDesktopDimensions()
windowWidth, windowHeight = windowWidth*.7, windowHeight*.7 --make the window a bit smaller than the screen itself
push:setupScreen(gameWidth, gameHeight, windowWidth, windowHeight, {fullscreen = false})
function love.draw()
push:start()
--draw here
push:finish()
end
```
Running compare50 on these produces the following:
$ compare50 --passes structure --verbose a/foo.md b/foo.md
...
Sorry, something's wrong! Let [email protected] know!
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/site-packages/compare50/__main__.py", line 351, in <module>
main()
File "/usr/local/lib/python3.7/site-packages/compare50/__main__.py", line 340, in main
pass_to_results[pass_] = _api.compare(scores, ignored_files, pass_)
File "/usr/local/lib/python3.7/site-packages/compare50/_api.py", line 76, in compare
for comparison in pass_.comparator.compare(scores, ignored_files):
File "/usr/local/lib/python3.7/site-packages/compare50/comparators/_winnowing.py", line 133, in compare
span_matches += _api.expand(index_a.compare(index_b), tokens_a, tokens_b)
File "/usr/local/lib/python3.7/site-packages/compare50/_api.py", line 231, in expand
span_tree_a.addi(new_span_a.start, new_span_a.end)
File "/usr/local/lib/python3.7/site-packages/intervaltree/intervaltree.py", line 330, in addi
return self.add(Interval(begin, end, data))
File "/usr/local/lib/python3.7/site-packages/intervaltree/intervaltree.py", line 313, in add
" {0}".format(interval)
ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(335, 173)
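The failing call tries to insert a span whose end (173) precedes its start (335), which intervaltree rejects as a null interval; that is consistent with the per-code-block offset resets described above producing negative-length spans. A hypothetical defensive filter (not compare50's actual fix, which would need to correct the offsets rather than drop spans):

```python
def valid_spans(spans):
    """Drop degenerate spans (end <= start) before building an interval
    tree; such spans can appear when token offsets reset mid-file."""
    return [(start, end) for start, end in spans if end > start]

print(valid_spans([(0, 10), (335, 173), (50, 50)]))  # [(0, 10)]
```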
This requires adding distro matches to the output of compare, which requires explicit matching against the distro indices. Should this explicit match happen only for the raw text? Should this be configurable?
Perhaps use < and > for previous/next match, and << and >> for previous/next file?
When clicking "Next" in match_#.html, the left side scrolls to the next match more quickly than the right side. Not sure if that's intentional, but for scanning matches quickly it would be more efficient if both sides scrolled at the same time.
The frontend should have a view for sorting the result pairs based on scores for different passes and a view that shows a pair of submissions side by side with shared fragments highlighted. Clicking a highlighted fragment should show a hyperlinked list of matching fragments in both submissions. There should be a toggle for turning on and off highlighting of fragments from different passes.
The JSON data will have the following schema:
{
    "files": [file paths...],
    "groups": [[file indices for submission group], ...],
    "results": [
        {
            "subs": [first submission group index, second submission group index],
            "passes": {
                pass name: [
                    score,
                    [
                        [[[file index A, start, stop] for fragment matching hash in first submission],
                         [[file index B, start, stop] for fragment matching hash in second submission]]
                        for each hash shared among the submissions
                    ]
                ],
                ...
            }
        },
        ...
    ]
}
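For concreteness, here is a hypothetical instance of that schema as a Python literal, with two single-file submissions, one pass, and one shared hash (all file names, scores, and offsets invented):

```python
# Hypothetical instance of the proposed schema; all values are invented.
result = {
    "files": ["a/foo.c", "b/foo.c"],
    "groups": [[0], [1]],          # each submission group holds one file
    "results": [
        {
            "subs": [0, 1],        # indices into "groups"
            "passes": {
                "structure": [
                    0.87,          # score for this pass
                    [
                        # one shared hash: fragments in each submission
                        [[[0, 12, 48]],   # file index 0, offsets 12-48
                         [[1, 30, 66]]],  # file index 1, offsets 30-66
                    ],
                ],
            },
        },
    ],
}
print(result["results"][0]["passes"]["structure"][0])  # 0.87
```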
This should dramatically reduce memory usage and possibly improve performance due to better locality.
e.g. if they ask us to compare a folder with many subfolders they may have intended to add /* or similar.
Similarly we may want to blacklist/warn about certain file extensions. Like, "do you really want to include this pdf in the submission?"
Being able to rank the output by any pass, and by a combined ranking, would both be somewhat helpful. The first should be pretty easy; the second is a bit trickier.
At the moment, this might need to be done with Ajax in order to POST both files as File objects to https://cs50.readthedocs.io/render/, e.g. by creating them from Strings first.
This probably has to be a separate script that parses a collection of index.html files produced by compare50. But it helps detect patterns of students overcollaborating across many psets.
For example, if there are three matched regions, and I manually click on match 2, then I click "next", I would expect to be taken to match 3. Instead, match 2 is highlighted again.
For each matching area, show how many files match ("2 files" means just the current two files), similar to etector. Etector gives details in a tooltip and gives more unique matches a larger font size; the former is very helpful, but the latter can make comparing files difficult.
Knowing how unique a match is would be very helpful for determining which cases to refer and for articulating to the committee how improbable the similarity is.
... is any freeform string (for now).
# cs50: ...
// cs50: ...
/* cs50: ...
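A sketch of how such directives might be recognized (the regex and the handling of the `cs50:` marker here are illustrative, not compare50's implementation):

```python
import re

# Match "cs50: ..." after a #, //, or /* comment opener; the captured
# group is the freeform string.
DIRECTIVE = re.compile(r"(?:#|//|/\*)\s*cs50:\s*(.*)")

m = DIRECTIVE.search("// cs50: ignore this region")
print(m.group(1))  # ignore this region
```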
Ignore .files?
Try to decode, ignore when not possible => show warning on result page?
Benedict mentioned this in today's meeting, just so that we could have a history of past runs.
Code/text is gray, while line numbers are black. This should be inverted, please!
Is it necessary to embed, e.g., Proxima-Nova and Consolas rather than just relying on the browser's built-in fonts?
compare50 cash/submissions/* -a cash/archives/2012/fall/* cash/archives/2013/fall/* cash/archives/2014/fall/* cash/archives/2015/fall/* cash/archives/2015/spring/* cash/archives/2016/fall/* cash/archives/2017/fall/* cash/archives/2018/fall/* cash/archives/2018/spring/* cash/archives/honeypots/03052019/*
Can I just do cash/archives/* even though that does not constitute one submission?
If a set of submissions forms a large cluster, and the threshold increases such that the cluster breaks up into smaller clusters, hovering over one of the submissions still highlights all of the submissions in the original cluster rather than the new smaller one. This is usually not ideal because the original cluster is often very large and therefore not too meaningful, since the threshold starts low.
This keeps happening for the homepage assignment. @crossroads1112 @Jelleas is there a workaround to stop it from timing out? I assume this will probably take the longest to run since some students have a lot of .html files and compare50 is trying to compare each student's html files against every other .html file in the archives and submissions folder.
Even on large TVs, when reviewing code it would be helpful to leverage the full viewport width.
Request from the board: also show common approaches to a problem, i.e., what most students do. This would help answer the question of why it is telling that an approach differs in a suspected plagiarism case.
Problem: We store submissions like so:
<student>/<problem>__<timestamp>
For this we have a small script that selects just the last submitted problem for comparison. Would be nice if we could pipe the result from said script to compare50.
Based on experience, we always started with "exact" when reviewing.
e.g. \cdot for tab characters and trailing spaces.
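A minimal sketch of rendering such markers (the middle-dot character and the function name are arbitrary choices, not compare50's):

```python
def show_whitespace(line):
    """Replace tabs and trailing spaces with a visible marker ('·')."""
    stripped = line.rstrip(" ")
    trailing = len(line) - len(stripped)
    return stripped.replace("\t", "·") + "·" * trailing

print(show_whitespace("foo\tbar  "))  # foo·bar··
```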
$ cli50
$ sudo pip install compare50
$ ls -l /usr/local/lib/python3.7/site-packages/compare50/comparators/
total 1436
-rw-r--r-- 1 root root 74 Aug 7 13:06 __init__.py
drwxr-xr-x 2 root root 4096 Aug 7 13:06 __pycache__/
-rw-r--r-- 1 root root 4078 Aug 7 13:06 _misspellings.py
-rw-r--r-- 1 root root 13639 Aug 7 13:06 _winnowing.py
-rw------- 1 root root 1439228 Aug 7 13:06 english_dictionary.txt
Requested by Erin
Do -x and -i not handle exact matches?