Show us something interesting using the provided dataset. You should use Python, along with any other tools or programming languages of your choice, so long as they are available for free (you may assume we have Microsoft Office). The use of additional, external datasets is welcome but not required.
Examples of interesting analyses include (but are not limited to):
- How effective is occupation as a predictor of whether a person will
enjoy an action film?
- Are there generalizations that can be made regarding opinions of films
released at different points in the viewer's life?
How you choose to convey your findings is completely up to you. Examples include: a Word document containing a written account of the analysis, a PowerPoint slide deck, a website, a Jupyter Notebook, etc.
The purpose of this challenge is to demonstrate your:
- Programming technique, focusing on data manipulation
- Ability to think through a problem from start to finish, and complete the work
- Ability to identify and present accurate findings in a clear, simple manner
The dataset you will be using is the MovieLens 100K Dataset. This dataset contains 100,000 movie ratings from 943 users on 1,682 films. It was collected by the GroupLens Research Project at the University of Minnesota over a seven-month period between September 19th, 1997 and April 22nd, 1998. More information on the dataset can be found in the README, located here: http://files.grouplens.org/datasets/movielens/ml-100k-README.txt.
The data can be downloaded through the following url: http://files.grouplens.org/datasets/movielens/ml-100k.zip
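As a starting point, here is a minimal sketch of reading the ratings file using only the Python standard library. It assumes the layout described in the linked README: a tab-separated file with the columns user id, item id, rating, and timestamp. The names `Rating`, `parse_ratings`, `mean_rating`, and `SAMPLE` are hypothetical helpers for illustration, and the sample rows are invented values in the same layout, not real records from the dataset.

```python
import csv
import io
from collections import namedtuple

# Hypothetical record type mirroring the four columns the README describes.
Rating = namedtuple("Rating", ["user_id", "item_id", "rating", "timestamp"])

# Invented rows in the assumed tab-separated layout (not real dataset records).
SAMPLE = "1\t10\t4\t880000000\n2\t10\t2\t880000100\n3\t20\t5\t880000200\n"


def parse_ratings(lines):
    """Parse tab-separated rating rows into a list of Rating tuples."""
    ratings = []
    for row in csv.reader(lines, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        user_id, item_id, rating, timestamp = row
        ratings.append(Rating(int(user_id), int(item_id), int(rating), int(timestamp)))
    return ratings


def mean_rating(ratings):
    """Return the arithmetic mean of the rating column."""
    return sum(r.rating for r in ratings) / len(ratings)


if __name__ == "__main__":
    rows = parse_ratings(io.StringIO(SAMPLE))
    print(len(rows), "ratings, mean", round(mean_rating(rows), 2))
```

To read the real file, pass an open file handle instead of the in-memory sample, for example `with open("ml-100k/u.data", newline="") as f: rows = parse_ratings(f)`. The path assumes the zip was extracted into the working directory. If you prefer pandas, `pandas.read_csv` with `sep="\t"` and explicit `names` would be a natural alternative.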
You have either 48 or 72 hours to complete your analysis depending on whether you choose to use the weekend or weeknights, respectively. This short time period is not meant to intimidate; rather, it is meant to be a reflection of our expectations for the depth of analysis in this challenge. We're not expecting academic-level research--just enough to make us say, "Hmm, that's interesting." The total time spent on this challenge should be around 6 to 8 hours, though more or less is fine.
Before the end of your time limit, please reply to this email with one of the following:
- (Preferred) A link to a GitHub repo containing all findings, presentation
materials, and code used in this challenge. If you choose this method,
please fork this repo and complete your work in the forked repo.
- A zip file (no larger than 25MB) containing all findings, presentation
materials, and code used in this challenge.
Additionally, your zip file/repo should contain a file called "README.md" which explains where your findings can be found and how to reproduce your analysis.
On your interview day, you will meet with a member of the Novos Growth team to discuss your analysis. Be prepared to answer the following questions during this time:
- If given more time, what else would you do?
- Why and how did you use the tools you used?
- How would your approach change if the dataset contained 1 billion
ratings? Which tools would you use?
- Let’s say that you wanted to reproduce your analysis on a regular
basis, for example, once a week in perpetuity. How would you accomplish this?
If you have any questions while completing the challenge, or find any bugs along the way, please create an issue in this repo and tag it appropriately. A member of the team will respond as soon as possible.
Good Luck!