rain-part2's Issues

Probability distribution matching

Could be a wild goose chase here, but maybe you guys have experience that could take this somewhere...

After creating predictions, I did the following things:

  • used the R fitdistrplus package to fit a gamma distribution to the data
  • rank-ordered the predicted values from xgb, keeping each prediction paired with its ground truth
  • rank-ordered the best-fit gamma values and merged them into the ranked predictions
  • computed MAEs and tested blends

Each time, the MAE for the fitted distribution is close to xgb's but just a little higher. Contrary to what I expected, the values are such that blending does not seem to yield any improvement.
[image: MAE results for the xgb predictions, the fitted gamma values, and blends]
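For what it's worth, here is a minimal sketch of the procedure above in R. It isn't stated which values the gamma was fit to; this assumes the ground-truth values (actual), with pred being the xgb predictions on the same rows. Both names are placeholders; fitdistrplus is the only non-base dependency:

library(fitdistrplus)

# Fit a gamma to the ground-truth values (gamma requires strictly positive data)
fit <- fitdist(actual[actual > 0], "gamma")

# One gamma quantile per prediction, aligned by rank: the i-th smallest
# prediction receives the i-th smallest gamma value
n <- length(pred)
gamma_vals <- qgamma(ppoints(n), shape = fit$estimate["shape"], rate = fit$estimate["rate"])
matched <- gamma_vals[rank(pred, ties.method = "first")]

# Compare MAEs and test a simple blend
mae <- function(p) mean(abs(p - actual))
c(xgb = mae(pred), gamma = mae(matched), blend = mae(0.5 * pred + 0.5 * matched))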

Our New Team!

Welcome John and Thakur! Let's hope the three of us can do some fun stuff with this one. I wouldn't be surprised if some small ideas plus a lot of blending go a long way. Let's hope between the three of us we can find those small ideas and hopefully somebody can be working on this while the others are busy with other things (that's how these things go, usually).

Years ago, I told Thakur I really wanted to get him his master's status. We were so, so close in the credit one, just losing out on the last couple days of what turned into a sprint to the finish. Maybe this will be a top 10 finish for us.

There's no right/wrong organization, but I figure we can use this thread for any team related stuff.

For me, my main purpose here is to develop robust methods that make it easy to do competitions in a scalable way. Having finished fifth in Rain Part 1, I'd like to do well here too. But they've simplified the problem in ways that make the parts I did best on less important here. We'll see. Again, hopefully we will each provide a little value in our own ways.

Submission Management

Thread to discuss submission strategy. With 3 members and only 2 submissions per day, it won't be obvious how best to allocate them.

Logistic Regression

I've been spending a lot of time the last couple of days trying to get the right transformation function. And it occurs to me that what I really want is bins, and that the method I want is multinomial classification.
I might have pointed this out before, but this is the inspiration:
http://fastml.com/regression-as-classification/

That is still one of my favorite competitions, and it also used absolute error, so it might be relevant. That data set had a lot of duplicates, which could be an important reason the approach worked there and might not work here. But looking at how jagged our output values are (i.e., specific values occur often), our data might actually favor this method, and that should help keep rigid bins reasonable. Additionally, I realize this makes the problem solvable with H2O, so I'm going to give it a try.

Tasks

  • Start simple: 10-20 bins
    • Find bins
    • Create a method of translating bins to output: probabilities to the leading bin, or a specific value (see the sketch after this list)
    • Understand how this compares with existing methods
  • Add more: 30-50 bins
    • Same tasks
  • Decide if sub-bins are worthwhile
    • Taking the initial 10-20 bins and subdividing some might make sense, but I'm not sure what that would look like. Analyze where loss is being incurred and whether narrower buckets using the existing probabilities would make sense.
  • Decide if the entire thing is worthwhile
    • This might just be a bad idea
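Here is a minimal sketch of the bin-to-output translation in base R. It assumes y is the training target and probs is an n x K matrix of class probabilities from whatever multinomial model produces them (columns in bin order); all names are placeholders:

K <- 20
# Quantile-based bin edges; unique() drops edges duplicated by repeated values
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = K + 1)))
bins <- cut(y, breaks, include.lowest = TRUE)

# Represent each bin by the median target value inside it
bin_vals <- as.numeric(tapply(y, bins, median))

# "Probabilities to leading bin": take the most likely bin's value
lead_pred <- bin_vals[max.col(probs)]

# "Specific value": under absolute error the optimal point prediction is the
# weighted median of the bin values, not the probability-weighted mean
weighted_median <- function(p, v) {
  o <- order(v)
  v[o][which(cumsum(p[o]) >= 0.5)[1]]
}
med_pred <- apply(probs, 1, weighted_median, v = bin_vals)

The weighted median is what keeps this aligned with the competition's MAE metric.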

Getting Started

As I mentioned, I use these issues mainly to keep communication in a place with enhanced features (markdown: code, tables, etc.).
You'll probably get emails whenever I update one, but in case those render poorly, the issue directly on GitHub will likely have better formatting.

Experience with Features

Building on the other thread's overview of features, here is my experience with features so far. Most of these may be covered in your models but perhaps there is something new here?

This is the importance plot from xgb in R. As one might imagine, the precip estimate based on reflectivity (Ref) provides the most gain.

UPDATED 11/6

[image: xgb variable importance plot]

Here are the code descriptions. These are the summary expressions of a data.table aggregation, one row per Id (wrapped here as a call on a table I'll call train); rate, rateC, and timespans are derived per-record columns, sketched after the code:

features <- train[, .(
  bigflag = mean(bigflag, na.rm = T),
  precip = sum(timespans * rate, na.rm = T),
  precipC = sum(timespans * rateC, na.rm = T),
  ratemax = max(rate, na.rm = T),
  ratesd = sd(rate, na.rm = T),
  rd = mean(radardist_km, na.rm = T),
  rdxref = mean(radardist_km * Ref, na.rm = T),
  rdxrefc = mean(radardist_km * RefComposite, na.rm = T),
  records = .N,
  ref1 = mean(Ref_5x5_10th, na.rm = T),
  ref1sq = mean(Ref_5x5_10th^2, na.rm = T),
  ref5 = mean(Ref_5x5_50th, na.rm = T),
  ref5sq = mean(Ref_5x5_50th^2, na.rm = T),
  ref9sq = mean(Ref_5x5_90th^2, na.rm = T),
  refc1sq = mean(RefComposite_5x5_10th^2, na.rm = T),
  refc9sq = mean(RefComposite_5x5_90th^2, na.rm = T),
  refcdivrd = mean(RefComposite / radardist_km, na.rm = T),
  refcratio2 = mean((RefComposite_5x5_90th - RefComposite_5x5_10th) / RefComposite, na.rm = T),
  refcsd = sd(RefComposite, na.rm = T),
  refdiff = mean(Ref_5x5_50th - Ref, na.rm = T),
  refdivrd = mean(Ref / radardist_km, na.rm = T),
  refmissratio = sum(is.na(Ref)) / .N,
  refratio2 = mean((Ref_5x5_90th - Ref_5x5_10th) / Ref, na.rm = T),
  refsd = sd(Ref, na.rm = T),
  target = log1p(mean(Expected, na.rm = T)),
  wref = mean(timespans * Ref, na.rm = T),
  wrefc = mean(timespans * RefComposite, na.rm = T),
  zdr5 = mean(Zdr_5x5_50th, na.rm = T)
), by = Id]
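The derived columns rate, rateC, and timespans aren't defined in this thread; here is my assumption of how they'd be built per record, using the standard Marshall-Palmer conversion (Z = 200 R^1.6):

# Hours covered by each scan within the gauge-hour, from the minutes_past gaps
train[, timespans := diff(c(0, minutes_past)) / 60, by = Id]
# Rain rates (mm/hr) from reflectivity (dBZ) via Marshall-Palmer (an assumption here)
train[, rate := ((10^(Ref / 10)) / 200)^(1 / 1.6)]
train[, rateC := ((10^(RefComposite / 10)) / 200)^(1 / 1.6)]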

Script to produce sample_submission.csv

Maybe this was known/obvious to you already, but I came across this script today:

https://www.kaggle.com/captcalculator/how-much-did-it-rain-ii/marshall-palmer-in-r

Previously I was blending with the sample submission in the dark, thinking it helped me with the large values but not really knowing. Now I can run the script on my validation set and get an idea of what's going on. Happiness!
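For example, a quick check of whether the Marshall-Palmer baseline really does help on the large values, assuming mp is that script's prediction on my validation rows, pred ours, and actual the truth (cutoff is an arbitrary placeholder for "large"):

mae <- function(p, idx = rep(TRUE, length(p))) mean(abs(p[idx] - actual[idx]))

big <- actual > cutoff          # cutoff: placeholder threshold for large values
c(overall_xgb = mae(pred), overall_mp = mae(mp),
  big_xgb = mae(pred, big), big_mp = mae(mp, big),
  blend = mae(0.8 * pred + 0.2 * mp))   # example weight, not tuned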

By the way, much of this is new to me so if you see me doing anything else crazy, definitely call me on it!

Finding Outliers

So I've started looking at the 'problem within a problem': identifying IDs in the test set likely to have unreasonably high Expected (rainfall) values. As pointed out in the forums, these outliers are largely noise and are likely responsible for most of the MAE. Here is their contribution on a validation set using our best XGB model.

[image: contribution of outliers to validation-set MAE]

Identifying even a subset of the outliers could significantly reduce MAE! Here's how I've approached the problem so far:

  1. From validation results and the training data, identify a subset of outliers for further study
  2. Train a binary classifier to predict whether an ID is an outlier based on common characteristics hidden in the data
  3. Apply the model to a validation set and, for anything classified as an outlier, provide an alternative prediction for Expected (rainfall)
  4. Apply this composite model to a holdout set and compare MAE before and after

I've had good results locally, but no progress at all on the public LB, and I can't say why. I'll include details of the work to date in separate comments.
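For the record, a condensed sketch of steps 1-4 in R with xgboost. Here feats/valid_feats are per-ID feature matrices, expected is the training target, and high_cutoff/alt_value are placeholders to tune:

library(xgboost)

# 1. Label training outliers (high_cutoff is a placeholder threshold)
is_outlier <- as.numeric(expected > high_cutoff)

# 2. Binary classifier on the same per-ID features
clf <- xgboost(data = as.matrix(feats), label = is_outlier,
               nrounds = 100, objective = "binary:logistic", verbose = 0)

# 3. Composite prediction: swap in an alternative value for flagged IDs
p_out <- predict(clf, as.matrix(valid_feats))
composite <- ifelse(p_out > 0.5, alt_value, base_pred)

# 4. Compare MAE before and after on the holdout
mean(abs(base_pred - valid_actual))
mean(abs(composite - valid_actual))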

Precip-related Functions

Since 'precip', as I call it, is my most significant factor, I dove into it a bit today to see what's going on. No significant findings yet, but you guys were in Rain 1 and may have insight.

Apparently a ton of work has been done to establish the relationship between Z (reflectivity) and R (rainfall rate). There are many variations depending on geography, intensity of precipitation, and many other factors.

The short version is that I experimented with many equations (i.e., variations of Marshall-Palmer) but did not see any real improvement. I even tried using different constants for light rain vs. heavy rain, but the results were always similar. I'll probably try again after thinking about it a bit. Any insight is appreciated.

Here's a comparison of three widely used Z-R relationships.

[image: comparison of three widely used Z-R relationships]
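The general recipe behind all of them: pick constants a and b in Z = a * R^b, convert dBZ to Z with Z = 10^(dBZ/10), and invert for R. A sketch with three commonly cited parameter pairs; the three in the plot aren't named, so these are illustrative assumptions:

# Rain rate (mm/hr) from reflectivity (dBZ) under Z = a * R^b
zr_rate <- function(dbz, a, b) ((10^(dbz / 10)) / a)^(1 / b)

dbz <- seq(5, 60, by = 5)
data.frame(dbz,
           marshall_palmer = zr_rate(dbz, 200, 1.6),  # classic stratiform rain
           convective      = zr_rate(dbz, 300, 1.4),  # common NEXRAD convective default
           tropical        = zr_rate(dbz, 250, 1.2))  # Rosenfeld tropical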
