
Comments (5)

mlandry22 commented on September 22, 2024

A few ways I was playing around with to cut up Expected based on its most prevalent values.

library(data.table)  # fread() and the data.table syntax below

train<-fread("input/train.csv")
# roll up to one row per Id: mean Expected, number of readings, and count of missing Ref values
rollup<-train[,.(Expected=mean(Expected),.N,naCount=sum(is.na(Ref))),Id]
# Ids where every Ref reading is NA carry no radar signal, so set them aside
remove<-rollup[N==naCount,]
keep<-rollup[N>naCount,]
# frequency of Expected at three granularities: exact value, 0.1mm, and 1mm
common0<-keep[,.N,Expected]
common1<-keep[,.N,.(Expected=round(Expected,1))]
common2<-keep[,.N,.(Expected=round(Expected))]
# take the 50 most common values at each granularity, sorted ascending for use as breaks
vals0<-common0[order(-N),][1:50,][order(Expected),]
vals1<-common1[order(-N),][1:50,][order(Expected),]
vals2<-common2[order(-N),][1:50,][order(Expected),]
# cut Expected at the common values; each label names the minimum of its bucket
cuts0<-cut(keep[,Expected],breaks=c(0,vals0[,Expected],Inf),labels=round(c(0,vals0[,Expected]),2))
cuts1<-cut(keep[,Expected],breaks=c(-1,vals1[,Expected],Inf),labels=c(-1,vals1[,Expected]))
cuts2<-cut(keep[,Expected],breaks=c(-1,vals2[,Expected],Inf),labels=c(-1,vals2[,Expected]))

# coarser variant: only the top 20 values at 0.1mm granularity, with "x"-prefixed labels
vals3<-common1[order(-N),][1:20,][order(Expected),]
cuts3<-cut(keep[,Expected],breaks=c(-1,vals3[,Expected],Inf),labels=paste0("x",c(-1,vals3[,Expected])))
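
A quick sanity check on how the buckets come out (a sketch; names follow the code above). Sparsely populated buckets will be hard for any classifier to learn:

# how many rows land in each bucket, as counts and proportions
table(cuts3)
round(prop.table(table(cuts3)),3)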


JohnM-TX commented on September 22, 2024

This sounds interesting. If I follow (which may not be the case), then it's similar to what I've been reading about precipitation estimates. Depending on the level of precipitation, the best way to estimate can be quite different: for light precip you might use Ref, for medium precip you might use a different function with Kdp, and so on (realizing that this is an example and these may not be the real functions). Anyway, one thing I haven't tried yet is breaking the data into two or more sets and modeling each separately, but I think it's promising.
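
For what it's worth, a minimal sketch of that split-and-model idea, assuming the competition's Ref and Kdp columns; the 20 dBZ threshold is made up for illustration, and lm() is just a stand-in for whichever model each regime would actually get:

library(data.table)
train<-fread("input/train.csv")
# one row per Id; Expected is constant within an Id, so take the first value
byId<-train[,.(Ref=mean(Ref,na.rm=TRUE),Kdp=mean(Kdp,na.rm=TRUE),Expected=Expected[1]),Id]
byId<-byId[!is.nan(Ref),]    # drop Ids with no Ref readings at all
light<-byId[Ref<20,]         # hypothetical "light precip" regime
heavy<-byId[Ref>=20,]        # hypothetical "heavier precip" regime
# separate stand-in models per regime: light leans on Ref alone, heavier also brings in Kdp
mLight<-lm(Expected~Ref,data=light)
mHeavy<-lm(Expected~Ref+Kdp,data=heavy)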


mlandry22 commented on September 22, 2024

You're right; technically they would get different models. The first version literally had different models to estimate the chance of Light, Medium, Heavy, and all the 1mm values between them, and they definitely took different variables into account.
This is a similar idea, but I was about to set it up slightly differently, and I'm glad you made that comment because I should try both ways before seeing this through.
The difference is that in Rain 1, each mm level had its own binary classification model estimating that probability independently, and since that setup was directly associated with the loss metric, you used each of those probabilities directly. This one I am setting up as multinomial classification rather than binary, so the model will try to learn probabilities for each specific bucket relative to the others.
Launching the first one....now.
Will try to get quick feedback for us so we know whether it will be worth adding to the overall last-week strategy or not.
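
A minimal sketch of what a multinomial setup like this could look like; the thread doesn't show the actual training call, so this assumes h2o's GBM, and dt, feats, and bucket are hypothetical stand-ins for the engineered feature table:

library(h2o)
h2o.init()
# dt: hypothetical data.table with one row per Id, predictor columns named in
# `feats`, and a factor target `bucket` built from the cuts above
hex<-as.h2o(dt)
hex$bucket<-h2o.asfactor(hex$bucket)  # a factor response makes the GBM multinomial
fit<-h2o.gbm(x=feats,y="bucket",training_frame=hex,ntrees=500)
# predictions carry one probability column per bucket, plus the most likely class
probs<-h2o.predict(fit,hex)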


mlandry22 commented on September 22, 2024

It's early, but it's interesting to see how the GBM is solving the problem.
First, it seems my way of setting up buckets missed, as some of them are not populated enough. I'm not too concerned yet, though; 20 buckets is a bit high for a first pass.

Aside from x14, this is in sorted order. Read each bucket label as the minimum of that bucket: x0 contains the readings between 0.0mm and 0.1mm, and x4.3 the readings between 4.3mm and 14mm.

So what it appears to be doing is guessing the mode (x0.1) as the default and then finding ways to guess the second most popular bucket, which is x4.3. Error in all the other buckets is around 99%. Truthfully, I'm far more interested in the probabilities, in hopes that a mini-stacking step can figure out the best absolute-error guess given the suite of 20 probabilities.
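
For that last step, one way to turn the bucket probabilities into a single absolute-error guess: under MAE the optimal point estimate for a discrete distribution is its weighted median. A sketch (the bucket values and probabilities here are made up):

# smallest bucket value at which the cumulative probability reaches 0.5
best_mae_guess<-function(vals,probs){
  o<-order(vals)
  vals<-vals[o]; probs<-probs[o]/sum(probs)
  vals[which(cumsum(probs)>=0.5)[1]]
}
best_mae_guess(c(0.1,1.0,4.3),c(0.6,0.3,0.1))  # returns 0.1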

This was after only 33 trees, and the error is still well within the steep-descent part of the curve, so there is plenty of room to go. Validation error is still nearly identical to training error (shown).

[image: per-bucket error table from the GBM after 33 trees]


JohnM-TX commented on September 22, 2024

I don't know if domain knowledge is useful for you, but I found these articles informative:
http://www.nwas.org/jom/articles/2013/2013-JOM19/2013-JOM19.pdf
http://www.nwas.org/jom/articles/2013/2013-JOM20/2013-JOM20.pdf
http://www.nwas.org/jom/articles/2013/2013-JOM21/2013-JOM21.pdf

There is a chart in the first article showing viable ranges of variables:
[image: chart from the first article showing viable ranges of the variables]

