NENT Tech test

For this tech test I had to implement three tasks related to data transformation. To keep their execution stable, I also developed a small suite of unit tests for each task.

Dependencies

For this tech test I have used Python 3.7 and Pandas for the data manipulation. The test was developed in a Python virtual environment created and managed with pipenv, a Python tool for virtual environment management. This means that the only requirements needed on your computer to run this project are Python 3.7 and pipenv==2020.11.15.

The project's actual dependencies are installed inside the virtual environment. Here is the full list, obtained via the command pipenv run pip freeze:

numpy==1.20.2
pandas==1.2.4
pyarrow==4.0.0
python-dateutil==2.8.1
pytz==2021.1
six==1.15.0

To install these dependencies in the virtual environment, run the following command before executing any code:

pipenv run pip install -r requirements.txt

Execution instructions

To execute any script in this project, run:

pipenv run python file_to_execute.py

Task 1

For this task I had to get the favorite sports, leagues and teams of each user in the dataset ViewingData.snappy.parquet. A few edge cases also had to be considered, such as a user having no profile_id in the dataset, in which case the user_id is used instead.

The expected output of this task has the following structure:

[profile_id: string (nullable = true),
favSports: array(struct(name: string, viewedMilliseconds: long)),
favLeagues: array(struct(name: string, viewedMilliseconds: long)),
favTeams: array(struct(name: string, viewedMilliseconds: long, seriesGuid: string, sport: string))]

My approach to this task was to group all rows in the dataset by their profile_id and then perform the necessary operations to get the favorite sports, leagues and teams, based on the total time spent watching each of them.
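
Below is a minimal pandas sketch of this grouping for the fav_sports column. The column names sport and viewed_milli_seconds are assumptions about the parquet schema; only profile_id and user_id are named in the task description.

import pandas as pd

# Minimal sketch, assuming the parquet file has columns profile_id, user_id,
# sport and viewed_milli_seconds (the last two names are assumptions).
df = pd.read_parquet("ViewingData.snappy.parquet")

# Rows without a profile_id fall back to the user_id instead.
df["profile_id"] = df["profile_id"].fillna(df["user_id"])

# Total viewing time per (profile, sport), most-watched sport first.
totals = (
    df.groupby(["profile_id", "sport"], as_index=False)["viewed_milli_seconds"]
      .sum()
      .sort_values(["profile_id", "viewed_milli_seconds"], ascending=[True, False])
)

# Collapse each profile's sports into the list-of-dicts structure described below.
fav_sports = (
    totals.rename(columns={"sport": "name"})
          .groupby("profile_id")
          .apply(lambda g: g[["name", "viewed_milli_seconds"]].to_dict("records"))
)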

The final output of this task is saved in a CSV file with 4 columns:

  • profile_id: The ID of the user in the platform.

  • fav_sports: a Python list of every sport the user has watched, ordered by the total time spent watching it. Each sport is stored as a Python dictionary with the attributes name and viewed_milli_seconds.

  • fav_leagues: a Python list of every league the user has watched, ordered by the total time spent watching it. Each league is stored as a Python dictionary with the attributes name and viewed_milli_seconds.

  • fav_teams: a Python list of every team the user has watched, ordered by the total time spent watching it. Each team is stored as a Python dictionary with the attributes name, sport, league and viewed_milli_seconds. Each team is counted individually, meaning that for every match in the dataset the names of both teams were extracted and the match's viewing time was added to each team's total.
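
As a hedged illustration of that per-team accounting, a sketch could melt each match row into one row per team; the column names home_team, away_team and league are assumptions, not the real schema:

import pandas as pd

df = pd.read_parquet("ViewingData.snappy.parquet")
df["profile_id"] = df["profile_id"].fillna(df["user_id"])

# Turn each match row into one row per team, so both teams are credited
# with the match's viewing time; home_team and away_team are assumed names.
teams = df.melt(
    id_vars=["profile_id", "sport", "league", "viewed_milli_seconds"],
    value_vars=["home_team", "away_team"],
    value_name="name",
)

fav_teams = (
    teams.groupby(["profile_id", "name", "sport", "league"], as_index=False)
         ["viewed_milli_seconds"].sum()
         .sort_values(["profile_id", "viewed_milli_seconds"], ascending=[True, False])
)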

Task 2

The second task consisted of counting the number of views each individual program in the dataset started_streams.csv gets, as well as the number of unique users that watch each program, and on which device, in which country and with which product type they do so.

The expected output of this task has the following structure:

[program_title,device_name,country_code,product_type,unique_users,content_count]

For this task, my approach was once again to group the rows of the dataset by program_title and then use aggregation functions to count the number of views and the number of unique users.

As for the country, product type and device, I assumed the task wanted the most common value of each for that specific program_title, since the task specification document did not explicitly say how to select those fields.
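
A minimal sketch of this aggregation is shown below; user_id is an assumed column name, while the other column names come from the expected output structure:

import pandas as pd

df = pd.read_csv("started_streams.csv")

# One row per program_title; device/country/product take their most common value,
# unique_users counts distinct users and content_count counts started streams.
summary = (
    df.groupby("program_title")
      .agg(
          device_name=("device_name", lambda s: s.mode().iloc[0]),
          country_code=("country_code", lambda s: s.mode().iloc[0]),
          product_type=("product_type", lambda s: s.mode().iloc[0]),
          unique_users=("user_id", "nunique"),
          content_count=("user_id", "size"),
      )
      .reset_index()
)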

Task 3

In this last task the objective was to get the top 10 most watched genres in the dataset started_streams.csv, as well as the hour with the most views for each of them.

The expected output of this task has the following structure:

[genre, unique_users, peak_hour (24-hour format, 00 to 23), peak_hour_unique_users]

The first thing I did for this task was to extract the hour from the time column into a new column, to make the following operations easier to implement.

Then, I once again grouped the rows in the dataset by genre and counted the number of unique users, calculated the peak_hour (most common hour for that genre) and the number of unique users that were watching at that peak_hour.
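
A minimal sketch of these steps, assuming the CSV has genre, user_id and time columns (user_id and the exact timestamp format are assumptions):

import pandas as pd

df = pd.read_csv("started_streams.csv")
df["hour"] = pd.to_datetime(df["time"]).dt.hour   # new hour column, 0-23

def genre_stats(g):
    peak_hour = g["hour"].mode().iloc[0]           # most common viewing hour
    return pd.Series({
        "unique_users": g["user_id"].nunique(),
        "peak_hour": peak_hour,
        "peak_hour_unique_users": g.loc[g["hour"] == peak_hour, "user_id"].nunique(),
    })

# Top 10 genres by unique users, each with its peak hour.
top_genres = (
    df.groupby("genre")
      .apply(genre_stats)
      .sort_values("unique_users", ascending=False)
      .head(10)
      .reset_index()
)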

Tests

Tests can be found in the test/ folder of the project in 3 different files, one for each task:

task_1_test.py
task_2_test.py
task_3_test.py

Tests can be run with:

pipenv run python task_X_test.py
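
For illustration only, one of these test files might be structured along the following lines; the module task_1 and the function top_sports_for_profile are hypothetical names, not the project's real API:

import unittest
import pandas as pd

from task_1 import top_sports_for_profile   # hypothetical module and function

class Task1Test(unittest.TestCase):
    def test_sports_ordered_by_viewing_time(self):
        df = pd.DataFrame({
            "profile_id": ["p1", "p1"],
            "sport": ["Football", "Hockey"],
            "viewed_milli_seconds": [1000, 5000],
        })
        result = top_sports_for_profile(df, "p1")
        # Hockey has the larger viewing time, so it should be listed first.
        self.assertEqual(result[0]["name"], "Hockey")

if __name__ == "__main__":
    unittest.main()   # allows running the file directly, as in the command above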
