
data-science-olympics-2019's Introduction

Data Science Olympics Event, 2019

This is the notebook I used for my final submission, which ranked 16th out of the 295 competitors who submitted a solution (top 6%).

We were given two datasets (requests_train.csv and requests_test.csv) and a tutorial notebook, which included the custom metric competition_scorer and a baseline submission that used an untuned logistic regression model with 3 features. Halfway through the competition, we were given another two datasets (individuals_train.csv and individuals_test.csv).

My approach was simple. I trained an LGBM (Light Gradient Boosting Machine) model on a subset of the train set using early stopping (with evaluations done on the remaining part of the train set), predicted on the test set, and submitted. I would normally perform cross-validation, but due to the size of the dataset (220,000 training rows) and the very limited amount of time available (2 hours), I only used a single validation set.

I'm genuinely surprised that such a simple solution did well. I attribute this to the following:

  • I took great care to ensure that the model's specifications matched those of the custom evaluation metric (this was given as a hint by the organisers);
  • LGBM handles missing values and categorical variables natively, and doesn't require feature rescaling. This allowed me to produce a good solution relatively quickly.
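One way to line up the model's validation with a custom metric, as the first point above describes, is to wrap the scorer so that early stopping evaluates it directly. The sketch below is hedged: `log_loss_scorer` is a hypothetical stand-in (the real competition_scorer is not reproduced here), and `make_feval` is an illustrative helper name. The only LightGBM convention it relies on is that a custom `feval` receives `(preds, eval_data)` and returns `(name, value, is_higher_better)`.

```python
import math


def log_loss_scorer(y_true, y_prob, eps=1e-15):
    """Hypothetical stand-in for competition_scorer: mean binary log loss."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def make_feval(scorer, name="competition_score", higher_is_better=False):
    """Adapt scorer(y_true, y_pred) to the (preds, eval_data) signature of a LightGBM feval."""

    def feval(preds, eval_data):
        # eval_data is the validation Dataset; get_label() returns its true labels.
        return name, scorer(eval_data.get_label(), preds), higher_is_better

    return feval
```

The wrapped function could then be passed as `feval=make_feval(log_loss_scorer)` to `lgb.train`, with `"metric": "None"` in the parameters so that early stopping tracks only the custom score. Whether the real metric is lower-is-better is an assumption here and should be checked against the organisers' definition.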

Had I had more time, I would have spent more of it on feature engineering, understanding the dataset, and interpreting the model's predictions. I can't explain why birth_month had a high feature importance, and I suspect that it may have harmed my score on the test set. Unfortunately, I spent too much time on the custom evaluation metric, which left little time for feature engineering and data exploration. To me, however, a reliable validation scheme is the most important thing, so I'm glad that's what I prioritised.

Constructive criticism is always welcome :)

data-science-olympics-2019's People

Contributors

marcogorelli
