
2022-08 ICRAR Data Science Exercise

General instructions

  1. Clone this repository into your own GitHub account.
  2. All your work will be done in your repository.
  3. Inspect the data and understand what is available.
  4. When going through the exercise, create git commits, making a history of changes that can be inspected later.
  5. The exercise should be completed in Python 3.8+.
  6. Your cloned git repository should be structured and populated following standard Python packaging practice, as for software that other people might reuse.
  7. At the end of the exercise, push your code to your GitHub repository and share it with the two people who contributed to the original repo. You can also push regularly before then if you want.

Details

The two objectives of this exercise are:

  1. Produce cleaned and consolidated data files suitable for quick loading by another process (which is not part of this exercise).
  2. Produce a visualisation showing significant events/features in the data.

Objective 1

You need to clean and consolidate the CSV data files in the data directory and produce three output files, one for each month. The cadence of the output data files needs to be strictly hourly (i.e. 00:00:00, 01:00:00, 02:00:00, etc.) and complete, i.e. 24 rows for every day in the monthly file. When you encounter missing rows in the original data files, use the following rules:

  1. If the top of the hour (00 minutes) entry is missing or empty, the data for 10 minutes to the hour should be used.
  2. If that is also unavailable, use the data for 10 minutes past the hour.
  3. If neither of these values is available, record NaNs for all the columns.

No matter what, the observe_time column should always contain the respective top of the hour value.
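
The fallback rules above can be sketched with pandas. This is a minimal illustration, not a required approach; it assumes the timestamp column is named observe_time, that timestamps within a file are unique, and that the frame covers whole days:

```python
import pandas as pd

def fill_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Reindex to a strict hourly cadence, applying the fallback rules:
    top of the hour, else 10 minutes to the hour, else 10 minutes past
    the hour, else NaN.  The returned observe_time column always holds
    the top-of-the-hour timestamp."""
    df = df.set_index("observe_time").sort_index()
    start = df.index.min().floor("D")
    end = df.index.max().ceil("D") - pd.Timedelta(hours=1)
    hours = pd.date_range(start, end, freq="h")  # 24 rows per day
    out = df.reindex(hours)  # exact top-of-the-hour rows first
    for offset in (pd.Timedelta(minutes=-10), pd.Timedelta(minutes=10)):
        fallback = df.reindex(hours + offset)
        fallback.index = hours  # relabel with the top-of-the-hour stamps
        out = out.fillna(fallback)  # fill only cells still missing
    return out.rename_axis("observe_time").reset_index()
```

Note the order of the fallbacks: `fillna` never overwrites a value that is already present, so 10-minutes-to takes precedence over 10-minutes-past, and any hour still empty after both passes stays NaN.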

The format of the output files is up to you. Choose whatever you think is appropriate for the data at hand to allow for quick loading times for additional steps you would expect in an ML pipeline.

Objective 2

For the visualisation, think about a situation where you have to explain and describe the data to somebody and verify that your cleaning actually worked as expected. What would you show, and how would you show it? Since we have deliberately told you nothing about what the data is, you will need to concentrate on what you can derive from looking at it, i.e. plotting it in various ways.

The Python package you use to produce the visualisation is up to you.
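
As one illustration of the kind of overview plot that can both describe the data and expose cleaning problems, here is a minimal matplotlib sketch on synthetic stand-in data (a time-series panel for trends and gaps, a histogram for the value distribution); any comparable package would do:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for one cleaned monthly file.
times = pd.date_range("2022-01-01", periods=31 * 24, freq="h")
rng = np.random.default_rng(0)
values = (np.sin(np.arange(len(times)) / 24 * 2 * np.pi)
          + rng.normal(0.0, 0.1, len(times)))

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
ax1.plot(times, values)    # raw series: trends, spikes, remaining gaps
ax1.set_title("Hourly values over the month")
ax2.hist(values, bins=50)  # distribution: outliers, suspicious clumps
ax2.set_title("Value distribution")
fig.tight_layout()
fig.savefig("overview.png")
```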
