mlandry22 / rain-part2
How Much Did it Rain Pt 2 - Kaggle Competition
Building on the other thread's overview of features, here is my experience with features so far. Most of these may already be covered in your models, but perhaps there is something new here.
This is the importance plot from xgb in R. As one might imagine, the precipitation estimate based on reflectivity (Ref) provides the most gain.
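For reference, a minimal sketch of how such a plot can be produced with the xgboost R package, assuming a trained booster 'm' (a placeholder name, not our actual object):

library(xgboost)
imp <- xgb.importance(model = m)      # gain/cover/frequency per feature
xgb.plot.importance(imp, top_n = 20)  # bar chart of the top features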
UPDATED 11/6
Here are code descriptions:
library(data.table)

# One row per Id: aggregate the radar sweeps within each hour.
# 'timespans' (per-sweep time weights) and 'rate'/'rateC' (rain rates derived
# from Ref/RefComposite) are assumed to be precomputed columns of 'train'.
feats <- train[, .(
  bigflag = mean(bigflag, na.rm = T),
  precip = sum(timespans * rate, na.rm = T),
  precipC = sum(timespans * rateC, na.rm = T),
  ratemax = max(rate, na.rm = T),
  ratesd = sd(rate, na.rm = T),
  rd = mean(radardist_km, na.rm = T),
  rdxref = mean(radardist_km * Ref, na.rm = T),
  rdxrefc = mean(radardist_km * RefComposite, na.rm = T),
  records = .N,
  ref1 = mean(Ref_5x5_10th, na.rm = T),
  ref1sq = mean(Ref_5x5_10th^2, na.rm = T),
  ref5 = mean(Ref_5x5_50th, na.rm = T),
  ref5sq = mean(Ref_5x5_50th^2, na.rm = T),
  ref9sq = mean(Ref_5x5_90th^2, na.rm = T),
  refc1sq = mean(RefComposite_5x5_10th^2, na.rm = T),
  refc9sq = mean(RefComposite_5x5_90th^2, na.rm = T),
  refcdivrd = mean(RefComposite / radardist_km, na.rm = T),
  refcratio2 = mean((RefComposite_5x5_90th - RefComposite_5x5_10th) / RefComposite, na.rm = T),
  refcsd = sd(RefComposite, na.rm = T),
  refdiff = mean(Ref_5x5_50th - Ref, na.rm = T),
  refdivrd = mean(Ref / radardist_km, na.rm = T),
  refmissratio = sum(is.na(Ref)) / .N,
  refratio2 = mean((Ref_5x5_90th - Ref_5x5_10th) / Ref, na.rm = T),
  refsd = sd(Ref, na.rm = T),
  target = log1p(mean(Expected, na.rm = T)),
  wref = mean(timespans * Ref, na.rm = T),
  wrefc = mean(timespans * RefComposite, na.rm = T),
  zdr5 = mean(Zdr_5x5_50th, na.rm = T)
), by = Id]
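The aggregation above relies on a 'timespans' column. One plausible way to build it, credited each sweep with the time until the next sweep within the hour; this is a sketch of the idea, not necessarily the exact weighting used:

# Fraction of the hour attributed to each sweep (last sweep runs to minute 60).
train[, timespans := (c(minutes_past[-1], 60) - minutes_past) / 60, by = Id]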
I've been spending a lot of time the last couple of days trying to get the right transformation function. It occurs to me that what I really want is bins, and the method I want is multinomial classification.
I might have pointed this out before, but this is the inspiration:
http://fastml.com/regression-as-classification/
That is still one of my favorite competitions, and it also used absolute error, so it might be interesting. That data set had a lot of duplicates, which could be an important reason the approach worked there but might not work here. But looking at how jagged our output values are (i.e., specific values occur often), the data might be favorable to this method, and that should help keep rigid bins reasonable. I also realize this makes the problem solvable with H2O, so I'm going to give it a try. A minimal sketch of the idea follows.
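Here is a rough sketch of the binning idea with xgboost (the number of bins and column names are placeholders, not our actual setup): bin the log1p target into quantile bins, train a multinomial softprob model, and, since the metric is MAE, read off the weighted median of the bin midpoints rather than the weighted mean.

library(xgboost)

nbins  <- 20
breaks <- unique(quantile(feats$target, probs = seq(0, 1, length.out = nbins + 1)))
bins   <- cut(feats$target, breaks = breaks, include.lowest = TRUE, labels = FALSE) - 1
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # representative value per bin

X <- as.matrix(feats[, !c("Id", "target"), with = FALSE])
m <- xgboost(data = X, label = bins, nrounds = 200,
             objective = "multi:softprob", num_class = length(mids))

# For MAE, the optimal point prediction from class probabilities is the
# weighted median of the bin midpoints, not the weighted mean.
p <- matrix(predict(m, X), ncol = length(mids), byrow = TRUE)
pred <- apply(p, 1, function(w) mids[which(cumsum(w) >= 0.5 * sum(w))[1]])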
Tasks
Thread to discuss submission strategy. With 3 members and 2 submissions per day, it won't be obvious how to go about this.
As I mentioned, I use these issues mainly to keep our communication in a format with enhanced features (markdown: code, tables, etc.).
You'll probably get emails whenever I update it, but in case things render poorly, the Issue directly on Github will likely have better formatting.
Could be a wild goose chase here, but maybe you guys have experience that could take this somewhere...
After creating predictions, I ran a few comparisons. Each time, the MAE for the fitted distribution is close but just a little higher. Contrary to what I expected, the values are such that blending does not seem to yield any improvement.
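To make the "blending doesn't help" claim concrete, this is the kind of check I mean (a sketch; a validation table 'valid' with columns 'actual', 'pred_xgb', and 'pred_dist' is an assumption):

# Sweep blend weights and report validation MAE for each.
for (w in seq(0, 1, by = 0.1)) {
  blend <- w * valid$pred_xgb + (1 - w) * valid$pred_dist
  cat(sprintf("w = %.1f  MAE = %.5f\n", w, mean(abs(valid$actual - blend))))
}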
Maybe this was known/obvious to you already, but I came across this script today:
https://www.kaggle.com/captcalculator/how-much-did-it-rain-ii/marshall-palmer-in-r
Previously I was blending with the sample submission in the dark, thinking that it helped me with the large values but not really knowing. Now I can run it on my validation set and get an idea of what's going on. Happiness!
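The relation behind that script is the classic Marshall-Palmer Z-R: Z = 200 R^1.6. A sketch of applying it per Id on a validation split (the time weighting mirrors the 'timespans' idea above; column names are assumptions, and the linked script's exact weighting may differ):

# Marshall-Palmer: Z = 200 * R^1.6, with Z = 10^(dBZ / 10)
mp_rate <- function(dbz) (10^(dbz / 10) / 200)^(1 / 1.6)   # mm/hr

# Time-weighted hourly total per Id, ignoring missing reflectivity.
mp <- valid[, .(pred_mp = sum(timespans * mp_rate(Ref), na.rm = TRUE)), by = Id]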
By the way, much of this is new to me so if you see me doing anything else crazy, definitely call me on it!
Since 'precip', as I call it, is my most significant factor, I dove into it a bit today to see what's going on. No significant findings yet, but you guys were in Rain 1 and may have insight.
Apparently a ton of work has been done to establish the relationship between Z (reflectivity) and R (rainfall rate). There are many variations depending on geography, precipitation intensity, and other factors.
The short version is that I experimented with many equations (i.e., variations of Marshall-Palmer) but did not see any real improvement. I even tried using different constants for light rain vs. heavy rain, but the results were always similar. I'll probably try again after thinking about it a bit. Any insight is appreciated.
Here's a comparison of three widely used Z-R relationships.
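In code form, three commonly cited parameterizations of Z = a * R^b (the constants are the standard textbook values, included here for illustration):

# Invert Z = a * R^b for rain rate, with reflectivity given in dBZ.
zr_rate <- function(dbz, a, b) (10^(dbz / 10) / a)^(1 / b)

marshall_palmer <- function(dbz) zr_rate(dbz, 200, 1.6)   # stratiform rain
nexrad_default  <- function(dbz) zr_rate(dbz, 300, 1.4)   # NEXRAD convective
rosenfeld_trop  <- function(dbz) zr_rate(dbz, 250, 1.2)   # tropical rain

# mm/hr at a few reflectivities:
sapply(c(20, 30, 40), function(z)
  c(mp = marshall_palmer(z), nexrad = nexrad_default(z), tropical = rosenfeld_trop(z)))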
So I've started looking at the 'problem within a problem', which is to identify IDs in the test set likely to have unreasonably high Expected (rainfall) values. As pointed out in the forums, these outliers are largely noise and are likely responsible for most of the MAE. Here is the contribution from a validation set using our best XGB model.
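For reference, a sketch of how that contribution can be measured on a validation split (the 70 mm threshold is an arbitrary illustration, and 'valid' with 'actual' and 'pred' columns is an assumed table):

# Share of rows vs. share of total absolute error for suspected outliers.
thresh <- 70   # mm; hourly gauge totals above this are usually bad data
valid[, `:=`(outlier = actual > thresh, ae = abs(actual - pred))]
valid[, .(share_of_ids = .N / nrow(valid),
          share_of_mae = sum(ae) / sum(valid$ae)), by = outlier]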
Identifying even a subset of the outliers could significantly reduce MAE. I've had good results in the lab, but no progress at all on the public LB, and I can't say why. I'll include details of my approach so far in separate comments.
Welcome John and Thakur! Let's hope the three of us can do some fun stuff with this one. I wouldn't be surprised if some small ideas plus a lot of blending go a long way. Hopefully we can find those small ideas, and somebody can be working on this while the others are busy with other things (that's how these things usually go).
Years ago, I told Thakur I really wanted to get him his master's status. We were so, so close in the credit one, just losing out in the last couple days of what turned into a sprint to the finish. Maybe this will be a top-10 finish for us.
There's no right/wrong organization, but I figure we can use this thread for any team-related stuff.
For me, the main purpose here is to develop robust methods that make it easy to do competitions in a scalable way. Having finished fifth in Rain Part 1, I'd like to do well here too. But they've simplified the problem in ways that make the parts I did best on less of a factor here. We'll see. Again, hopefully we will each provide a little value in our own ways.