amazon's Introduction

This repository contains a portion of our 1st place code from the Kaggle Amazon Employee Access Challenge.

My partner, Paul Duan, produced the other portion. At the time we also had code to blend our various model outputs.

see: https://www.kaggle.com/c/amazon-employee-access-challenge

and: https://www.kaggle.com/c/amazon-employee-access-challenge/leaderboard/private

Also included is an IPython notebook that uses the same data. I used this notebook to present a practical introduction to scikit-learn and the random forest algorithm to the Pittsburgh Python User Meetup group.

About the Data: The goal was to predict employee resource access grants from categorical job description data. The scoring metric is AUC (area under the ROC curve). There are only 9 categorical input feature columns, one of which is completely redundant. There are roughly 30,000 training rows and 50,000 test rows.
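To make the scoring metric concrete, here is a minimal sketch of AUC computed from its ranking definition: the probability that a randomly chosen positive row is scored above a randomly chosen negative row, with ties counting as half. The function name `auc` and the toy labels and scores are made up for illustration and are not part of this repository. The pairwise loop is quadratic, so for the real data a library routine such as scikit-learn's `roc_auc_score` is the practical choice.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise ranking.

    labels: iterable of 0/1 ground-truth values
    scores: iterable of predicted scores (higher means more likely 1)
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not competition data).
toy_labels = [1, 0, 1, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.1]
toy_auc = auc(toy_labels, toy_scores)
```

Only the ordering of the scores matters for this metric, which is why outputs from quite different models can be blended on a rank or probability scale without hurting the score.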

About the Code: The general strategy was to produce 2 feature sets: one categorical, to be modeled with decision-tree-based approaches, and a second consisting of a sparse matrix of binary features, created by binarizing all categorical values and all 2nd- and 3rd-order combinations of categorical values. The latter features could be modeled with logistic regression, SVMs, etc. The starting point for this latter set of code was provided on the forums by Miroslaw Horbal. The most critical modification I made to it was merging the most rarely occurring binary features into a much smaller number of features holding these rare values.
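The second feature set can be sketched in plain Python as three steps: build 2nd- and 3rd-order combinations of the columns, merge rare values, then binarize. This is an illustration of the idea only, not the code in logistic_features.py. The helper names (`add_combinations`, `merge_rare`, `binarize`), the `min_count` threshold, and the choice of a single shared `__RARE__` token per column are all assumptions, and the toy rows are made up.

```python
from collections import Counter
from itertools import combinations


def add_combinations(rows, max_order=3):
    """Append one tuple-valued feature per 2..max_order-way column combination."""
    n_cols = len(rows[0])
    index_sets = [idx for k in range(2, max_order + 1)
                  for idx in combinations(range(n_cols), k)]
    return [list(row) + [tuple(row[i] for i in idx) for idx in index_sets]
            for row in rows]


def merge_rare(rows, min_count=2, rare_token="__RARE__"):
    """Replace values seen fewer than min_count times in a column by one token."""
    n_cols = len(rows[0])
    counts = [Counter(row[j] for row in rows) for j in range(n_cols)]
    return [[row[j] if counts[j][row[j]] >= min_count else rare_token
             for j in range(n_cols)]
            for row in rows]


def binarize(rows):
    """One-hot encode: each (column, value) pair becomes one binary feature.

    Returns the active feature indices per row (a sparse representation)
    and the (column, value) -> index vocabulary.
    """
    vocab = {}
    encoded = []
    for row in rows:
        active = []
        for j, value in enumerate(row):
            key = (j, value)
            if key not in vocab:
                vocab[key] = len(vocab)
            active.append(vocab[key])
        encoded.append(sorted(active))
    return encoded, vocab


# Toy categorical data: 4 rows, 3 columns (hypothetical IDs).
toy_rows = [
    [10, 1, 7],
    [10, 1, 8],
    [10, 2, 7],
    [11, 3, 7],
]

combined = add_combinations(toy_rows)            # 3 original + 4 combined columns
encoded_all, vocab_all = binarize(combined)      # no rare-value merging
encoded_merged, vocab_merged = binarize(merge_rare(combined, min_count=2))
```

On the toy rows, merging shrinks the binary vocabulary considerably because most higher-order combinations occur only once. On the real data the same effect removes a large number of nearly empty columns that a linear model could otherwise overfit.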

ensemble.py generates the categorical features and runs the ensemble of tree models. logistic_features.py generates the binarized categorical data. It is currently set up to predict using a Naive Bayes algorithm (for speed), but a 1-line change could replace that with logistic regression, which should score much higher.
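The kind of 1-line change meant here might look like the sketch below. It is not an excerpt from logistic_features.py: the `BernoulliNB` variant, the `C=1.0` setting, the helper `fit_and_score`, and the tiny made-up matrix are all assumptions for illustration. Both scikit-learn estimators accept a sparse binary matrix and expose `predict_proba`, so only the line that constructs the model needs to change.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB

# Tiny made-up sparse binary matrix standing in for the binarized features.
X = csr_matrix(np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]))
y = np.array([1, 1, 0, 0, 1, 0])


def fit_and_score(model, X_train, y_train, X_test):
    """Fit the model and return the predicted probability of class 1."""
    model.fit(X_train, y_train)
    return model.predict_proba(X_test)[:, 1]


# Fast baseline (assumed variant of Naive Bayes):
model = BernoulliNB()
# The 1-line swap to the slower, typically stronger model:
# model = LogisticRegression(C=1.0)

nb_scores = fit_and_score(model, X, y, X)
lr_scores = fit_and_score(LogisticRegression(C=1.0), X, y, X)
```

The regularization strength `C` would normally be tuned with cross-validated AUC rather than left at its default.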

Requirements:

This code assumes you have the competition data (train.csv & test.csv) saved in the working directory.
I have not provided the data set in this repository. One would need to sign the competition rules agreement form before downloading from the competition links to run this code fully.

It was run most recently on Windows 8 with Python 2.7.5 and the following package versions:

scikit-learn 0.14.1

pandas 0.12.0

numpy 1.8

amazon's People

Contributors

bensolucky
