SnappetChallenge

My understanding of what needs to be done

Take the provided raw data and create a report (or several) that would be useful for a teacher.

Data analysis

I opened the provided files in a text editor and noticed that contrary to what the challenge description claims they are not actually equivalent: the JSON file is corrupted, some original characters, like the plus/minus sign (±), are replaced in it with the unicode replacement character (�), which gets inserted by some programs when they encounter an invalid byte sequence, see the screenshot for details:

The CSV file seems to be OK, so futher I worked with it.

I used R to analyse the data.

Scatter plot

Let's look at the scatter plot matrix, it may show interesting dependencies between columns. This plot shows several peculiarities about data, to some of which I haven't found an explanation.

Fact 1. The Correct column most often contains values 0 and 1, but sometimes it contains 3.

Why? I would expect an answer to be either incorrect (0) or correct (1).

Fact 2. As a rule, a correct answer gives a positive progress, whereas an incorrect one gives negative. The following answers break this rule.

Why punish people for answering correctly? Sounds kinda cruel :-).

Other observations

Fact 3. The users have only one chance to get a reward or a penalty (positive/negative progress respectively) for an exercise: on their first attempt at solving it. All subsequent attempts are always given zero progress.

Proof (I will refer to answers having a non-zero Progress value as "progress-inducing answers"):

It's likely that the provided data sample was taken from a real system. The following facts are the clues for that.

Fact 4. Correct answer rate for exercises is distributed as one would expect from tests where answers are chosen from a list: with peaks on rational numbers.

When tests are designed this way, people will often try to guess the correct answer, which leads to an average share of correct answers equal to 1/(number of options to choose from).

Fact 5. On average, incorrect answers are more often given to more difficult exercises.

This fact is clearly seen from this violin plot of difficulty distributions drawn separately for correct and incorrect answers. The horizontal lines are the 1st, 2nd and 3rd quartiles.

Fact 6. There's a positive correlation between the Progress given to a correct answer and the Difficulty associated with it. Likewise, the more difficult the exercise, the less amount of progress it tends to cost the user to answer it incorrectly.

Again, totally expected. After splitting the dataset in two by the Correct attribute, the correlation, which was hidden in the initial scatter plot, is clearly seen.

Fact 7. Answers don't fall on weekends.

Fact 8. During the day most answers happen between 6:30 and 12:00.

Normalization

Let's find out which functional dependencies exist between the attributes.

With a little bit of R-hacking and a little bit of time we get the answer:

                    x                 y
 1:        ExerciseId        Difficulty
 2:        ExerciseId            Domain
 3:        ExerciseId LearningObjective
 4:        ExerciseId           Subject
 5: LearningObjective            Domain
 6: LearningObjective           Subject
 7:    SubmitDateTime           Subject
 8: SubmittedAnswerId           Correct
 9: SubmittedAnswerId        Difficulty
10: SubmittedAnswerId            Domain
11: SubmittedAnswerId        ExerciseId
12: SubmittedAnswerId LearningObjective
13: SubmittedAnswerId          Progress
14: SubmittedAnswerId           Subject
15: SubmittedAnswerId    SubmitDateTime
16: SubmittedAnswerId            UserId

Just for reference, here is the R code that generated the above result, feel free to skip it, is's ugly and inefficient.

library("data.table")

counts <- fread("Data/work.csv", na.strings = "NULL", stringsAsFactors = TRUE, encoding = "Latin-1")
counts$UserId <- as.factor(counts$UserId)
counts$ExerciseId <- as.factor(counts$ExerciseId)
isFunction <- function(dtX) {
	return(all(counts[, .(AllSame = all(duplicated(.SD[, get(dtX[, y])])[-1L])), by = .(get(dtX[, x]))][, AllSame]))
}

funcs <- CJ(names(counts), names(counts))[V1 != V2,]
funcs <- funcs[, .(x = V1, y = V2)]
funcs[, r := .I]
funcs[, isF := isFunction(.SD), by = r] # Long calculation.
functions <- funcs[isF == TRUE, .(x, y)]
functions

Here is the graph of functional dependencies:

SubmittedAnswerId is the key of the relation, because it functionally determines all the other columns.

Let's see in which normal form the relation is.

It is in the first normal form (no lists in cells, values are indivisible units),
and also in the second normal form (no attribute depends only on part of the key, because the key consists of a single attribute).
There are transitive dependencies, e.g.: SubmittedAnswerId → ExerciseId → LearningObjective → Subject, so it's not in the 3rd normal form, i.e. the requirement of the 3rd normal form that a non-key attribute does not determine another non-key attribute is not satisfied.

Suspicious learning objectives

The fact that Domain does not functionally determine Subject might be a mistake. If only the two learning objectives shown below had the domain "Taalverzorging", as the other 20 from the same subject "Spelling" do, there would be a functional dependecy Domain → Subject.

Maybe there was an update anomaly due to the highly denormalized way the data is stored.

Proposed conceptual model

To make working with the data more convenient, I will store the data in the database in the normalized form, i.e. I will split the relation into several, each in the 3rd normal form, adding surrogate keys as needed. Here is the resulting conceptual model:

Implementation

Platform and libraries

To implement the requirements I took the opportunity to study what Microsoft has to offer in the way of creating single page applications and I found this template.

If I don't get hired, at least I'll learn something new :-).

Data import

The CsvImport project is intended to import data from the CSV format into the normalized DB.

Reports

I made one report: user statistics overview. To view it, build and run the app (see instructions below), and navigate to the "Overview" tab. Click the "Help" button to view the column descriptions.

What is missing

I didn't bother implementing some things that are usually present in real applications:

authentication and authorization,
logging,
unit tests,
a proper configuration file,
XML-documentation,
etc.

Running the application

The chosen platform is still in the preview state, so the installation of the prerequisites is a bit messy, sorry for that. I did my best to describe what needs to be installed and how.

Prerequisites

LocalDB
.NET Core SDK 1.0.0-rc4-004771
node.js
yarn
webpack installed as a global package (see below)

There are three ways to install them:

Manually. Click the links, download packages, install.
Via Visual Studio 2017. Run the VS2017 installer and set checkboxes for the node.js and .Net Core payloads. You'll have to install Yarn manually.

Via chocolatey.

choco install -y mssqlserver2014-sqllocaldb nodejs yarn
refreshenv

You'll have to install .NET Core SDK manually.

When yarn is already installed (make sure it's in PATH by running yarn --version), install webpack like this (in an admin console):

yarn global add webpack

Building

Execute src/App/build.cmd

Running

NB. Before running, build the application (see the Building section).

Also, before the first launch of the web application, populate the DB by executing src/CsvImport/build-and-run.cmd.

Either open the src/App/App.sln solution in Visual Studio 2017 and press F5 (or click the "Run" button), or

Navigate to the src/App folder in a console.
Execute dotnet run
Navigate to http://localhost:5000 in the browser.

PS: If you want to enable hot module reloading (auto-update of browser content on any source file change), set the ASPNETCORE_ENVIRONMENT environment variable to "Development" before running.

otto-gebb / snappetchallenge Goto Github PK