mlandry22 / rain-part2
How Much Did it Rain Pt 2 - Kaggle Competition
Building on the other thread's overview of features, here is my experience with features so far. Most of these may already be covered in your models, but perhaps there is something new here.
This is the importance plot from xgb in R. As one might imagine, the precipitation estimate based on reflectivity (Ref) provides the most gain.
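For reference, a minimal sketch of how such a plot can be produced with the xgboost R package, assuming a trained booster 'm' (a placeholder name, not our actual object):

library(xgboost)
imp <- xgb.importance(model = m)      # gain/cover/frequency per feature
xgb.plot.importance(imp, top_n = 20)  # bar chart of the top features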
UPDATED 11/6
Here are code descriptions:
library(data.table)

# One row per Id: aggregate the radar sweeps within each hour.
# 'timespans' (per-sweep time weights) and 'rate'/'rateC' (rain rates derived
# from Ref/RefComposite) are assumed to be precomputed columns of 'train'.
feats <- train[, .(
  bigflag = mean(bigflag, na.rm = T),
  precip = sum(timespans * rate, na.rm = T),
  precipC = sum(timespans * rateC, na.rm = T),
  ratemax = max(rate, na.rm = T),
  ratesd = sd(rate, na.rm = T),
  rd = mean(radardist_km, na.rm = T),
  rdxref = mean(radardist_km * Ref, na.rm = T),
  rdxrefc = mean(radardist_km * RefComposite, na.rm = T),
  records = .N,
  ref1 = mean(Ref_5x5_10th, na.rm = T),
  ref1sq = mean(Ref_5x5_10th^2, na.rm = T),
  ref5 = mean(Ref_5x5_50th, na.rm = T),
  ref5sq = mean(Ref_5x5_50th^2, na.rm = T),
  ref9sq = mean(Ref_5x5_90th^2, na.rm = T),
  refc1sq = mean(RefComposite_5x5_10th^2, na.rm = T),
  refc9sq = mean(RefComposite_5x5_90th^2, na.rm = T),
  refcdivrd = mean(RefComposite / radardist_km, na.rm = T),
  refcratio2 = mean((RefComposite_5x5_90th - RefComposite_5x5_10th) / RefComposite, na.rm = T),
  refcsd = sd(RefComposite, na.rm = T),
  refdiff = mean(Ref_5x5_50th - Ref, na.rm = T),
  refdivrd = mean(Ref / radardist_km, na.rm = T),
  refmissratio = sum(is.na(Ref)) / .N,
  refratio2 = mean((Ref_5x5_90th - Ref_5x5_10th) / Ref, na.rm = T),
  refsd = sd(Ref, na.rm = T),
  target = log1p(mean(Expected, na.rm = T)),
  wref = mean(timespans * Ref, na.rm = T),
  wrefc = mean(timespans * RefComposite, na.rm = T),
  zdr5 = mean(Zdr_5x5_50th, na.rm = T)
), by = Id]
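The aggregation above relies on a 'timespans' column. One plausible way to build it, credited each sweep with the time until the next sweep within the hour; this is a sketch of the idea, not necessarily the exact weighting used:

# Fraction of the hour attributed to each sweep (last sweep runs to minute 60).
train[, timespans := (c(minutes_past[-1], 60) - minutes_past) / 60, by = Id]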
I've been spending a lot of time the last couple of days trying to get the right transformation function. It occurs to me that what I really want is bins, and the method I want is multinomial classification.
I might have pointed this out before, but this is the inspiration:
http://fastml.com/regression-as-classification/
That is still one of my favorite competitions, and it also used absolute error, so it might be interesting. That data set had a lot of duplicates, which could be an important reason the approach worked there but might not work here. But looking at how jagged our output values are (i.e., specific values occur often), the data might be favorable to this method, and that should help keep rigid bins reasonable. I also realize this makes the problem solvable with H2O, so I'm going to give it a try. A minimal sketch of the idea follows.
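Here is a rough sketch of the binning idea with xgboost (the number of bins and column names are placeholders, not our actual setup): bin the log1p target into quantile bins, train a multinomial softprob model, and, since the metric is MAE, read off the weighted median of the bin midpoints rather than the weighted mean.

library(xgboost)

nbins  <- 20
breaks <- unique(quantile(feats$target, probs = seq(0, 1, length.out = nbins + 1)))
bins   <- cut(feats$target, breaks = breaks, include.lowest = TRUE, labels = FALSE) - 1
mids   <- (head(breaks, -1) + tail(breaks, -1)) / 2   # representative value per bin

X <- as.matrix(feats[, !c("Id", "target"), with = FALSE])
m <- xgboost(data = X, label = bins, nrounds = 200,
             objective = "multi:softprob", num_class = length(mids))

# For MAE, the optimal point prediction from class probabilities is the
# weighted median of the bin midpoints, not the weighted mean.
p <- matrix(predict(m, X), ncol = length(mids), byrow = TRUE)
pred <- apply(p, 1, function(w) mids[which(cumsum(w) >= 0.5 * sum(w))[1]])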
Tasks
Thread to discuss submission strategy. With 3 members and 2 submissions per day, it won't be obvious how to go about this.
As I mentioned, I use these issues mainly to keep our communication in a format with enhanced features (markdown: code, tables, etc.).
You'll probably get emails whenever I update it, but in case things render poorly, the Issue directly on Github will likely have better formatting.
Could be a wild goose chase here, but maybe you guys have experience that could take this somewhere...
After creating predictions, I ran a few comparisons. Each time, the MAE for the fitted distribution is close but just a little higher. Contrary to what I expected, the values are such that blending does not seem to yield any improvement.
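To make the "blending doesn't help" claim concrete, this is the kind of check I mean (a sketch; a validation table 'valid' with columns 'actual', 'pred_xgb', and 'pred_dist' is an assumption):

# Sweep blend weights and report validation MAE for each.
for (w in seq(0, 1, by = 0.1)) {
  blend <- w * valid$pred_xgb + (1 - w) * valid$pred_dist
  cat(sprintf("w = %.1f  MAE = %.5f\n", w, mean(abs(valid$actual - blend))))
}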
Maybe this was known/obvious to you already, but I came across this script today:
https://www.kaggle.com/captcalculator/how-much-did-it-rain-ii/marshall-palmer-in-r
Previously I was blending with the sample submission in the dark, thinking that it helped me with the large values but not really knowing. Now I can run it on my validation set and get an idea of what's going on. Happiness!
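The relation behind that script is the classic Marshall-Palmer Z-R: Z = 200 R^1.6. A sketch of applying it per Id on a validation split (the time weighting mirrors the 'timespans' idea above; column names are assumptions, and the linked script's exact weighting may differ):

# Marshall-Palmer: Z = 200 * R^1.6, with Z = 10^(dBZ / 10)
mp_rate <- function(dbz) (10^(dbz / 10) / 200)^(1 / 1.6)   # mm/hr

# Time-weighted hourly total per Id, ignoring missing reflectivity.
mp <- valid[, .(pred_mp = sum(timespans * mp_rate(Ref), na.rm = TRUE)), by = Id]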
By the way, much of this is new to me so if you see me doing anything else crazy, definitely call me on it!
Since 'precip', as I call it, is my most significant factor, I dove into it a bit today to see what's going on. No significant findings yet, but you guys were in Rain 1 and may have insight.
Apparently a ton of work has been done to establish the relationship between Z (reflectivity) and R (rainfall rate). There are many variations depending on geography, precipitation intensity, and other factors.
The short version is that I experimented with many equations (i.e., variations of Marshall-Palmer) but did not see any real improvement. I even tried using different constants for light rain vs. heavy rain, but the results were always similar. I'll probably try again after thinking about it a bit. Any insight is appreciated.
Here's a comparison of three widely used Z-R relationships.
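In code form, three commonly cited parameterizations of Z = a * R^b (the constants are the standard textbook values, included here for illustration):

# Invert Z = a * R^b for rain rate, with reflectivity given in dBZ.
zr_rate <- function(dbz, a, b) (10^(dbz / 10) / a)^(1 / b)

marshall_palmer <- function(dbz) zr_rate(dbz, 200, 1.6)   # stratiform rain
nexrad_default  <- function(dbz) zr_rate(dbz, 300, 1.4)   # NEXRAD convective
rosenfeld_trop  <- function(dbz) zr_rate(dbz, 250, 1.2)   # tropical rain

# mm/hr at a few reflectivities:
sapply(c(20, 30, 40), function(z)
  c(mp = marshall_palmer(z), nexrad = nexrad_default(z), tropical = rosenfeld_trop(z)))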
So I've started looking at the 'problem within a problem', which is to identify IDs in the test set likely to have unreasonably high Expected (rainfall) values. As pointed out in the forums, these outliers are largely noise and are likely responsible for most of the MAE. Here is the contribution from a validation set using our best XGB model.
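For reference, a sketch of how that contribution can be measured on a validation split (the 70 mm threshold is an arbitrary illustration, and 'valid' with 'actual' and 'pred' columns is an assumed table):

# Share of rows vs. share of total absolute error for suspected outliers.
thresh <- 70   # mm; hourly gauge totals above this are usually bad data
valid[, `:=`(outlier = actual > thresh, ae = abs(actual - pred))]
valid[, .(share_of_ids = .N / nrow(valid),
          share_of_mae = sum(ae) / sum(valid$ae)), by = outlier]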
Identifying even a subset of the outliers could significantly reduce MAE. I've had good results in the lab, but no progress at all on the public LB, and I can't say why. I'll include details of my approach so far in separate comments.
Welcome John and Thakur! Let's hope the three of us can do some fun stuff with this one. I wouldn't be surprised if some small ideas plus a lot of blending go a long way. Hopefully we can find those small ideas, and somebody can be working on this while the others are busy with other things (that's how these things usually go).
Years ago, I told Thakur I really wanted to get him his master's status. We were so, so close in the credit one, just losing out in the last couple days of what turned into a sprint to the finish. Maybe this will be a top-10 finish for us.
There's no right/wrong organization, but I figure we can use this thread for any team-related stuff.
For me, the main purpose here is to develop robust methods that make it easy to do competitions in a scalable way. Having finished fifth in Rain Part 1, I'd like to do well here too. But they've simplified the problem in ways that make the parts I did best on less of a factor here. We'll see. Again, hopefully we will each provide a little value in our own ways.