Comments (11)
Getting my driver script ready. After thinking through versions that would make sense for the future, I'm just trying to get it done simply using a CSV.
On one hand, I don't expect this competition to require much hyperparameter tuning. On the other, all of the models will probably be fairly weak individually, so maybe a big ensemble will do a good job. Either way, here are some initial thoughts on things we'd want to experiment with. Feel free to add to or update this list directly.
- Models
- R GBM: optimizes MAE directly
- Scikit GBM: optimizes MAE directly
- XGBoost: only optimizes MAE directly if you write the objective function yourself, and not just as an evaluation metric (that only tracks the best iteration). To use it inside the algorithm, you have to supply functions for the gradient and Hessian, and the Hessian of absolute error is zero almost everywhere, which gives the Newton step nothing to work with. So...XGBoost may not be an option without some work.
- R QuantReg: MAE random forest package for R. It's quite slow, but it kinda works.
- Binned multinomial classification: might make sense here; I even have the code to do so from the last competition; Adzuna 3rd place solution has always been intriguing because it did so with salaries.
- Others? Solving MAE (quantile regression) is a pretty big deal for these, unless we can find sufficient translations to remove this need (sometimes log will do it, but this time that doesn't seem to be the case)
- Preprocessing: target
- Removing outliers (including various cutpoints: 50, 60, 70, 80, 90, 100)
- Capping outliers (same cutpoint discussion)
- Discarding values not in 0.01" increments (may indicate gauge error)
- Exclude IDs with no value for Ref as these will not be scored for the test set
- Feature generation - nearly unlimited, but still useful to think of general patterns and then try applying across the board
- Estimated rainfall rate based on reflectivity (Marshall-Palmer, NWS, or similar)
- Ratio or difference of 5x5 values to point value
- Mean
- Squared Mean (this was very useful last time and shows promise here; it allows the strong points to have more weight in the average, and apparently that's a good thing)
- Weighted mean by time interval (using amount of time since last measurement as the weight)
- Max
- Sd or variance
- Count
- Count NAs
- Hyperparameters
- Try everything available in each algorithm, more or less
- XGBoost's early stopping is the ideal way to go; R and H2O can add more trees incrementally, which is decent but requires some sort of loop. That loop is very worthwhile, though, since the number of trees then doesn't have to be a tuned parameter.
Surely more.
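Picking up the XGBoost point above: since the Hessian of absolute error is zero almost everywhere, a common workaround is a smooth surrogate such as the pseudo-Huber loss, which behaves like MAE for large residuals but stays twice differentiable. A minimal sketch (the function name and delta default are my own; `xgb.train`'s `obj=` callback expects exactly this `(grad, hess)` return shape):

```python
import numpy as np

def pseudo_huber_obj(delta=1.0):
    """Return an objective callable compatible with xgb.train(obj=...).

    Pseudo-Huber approximates MAE for large residuals but is twice
    differentiable, so XGBoost's Newton step gets a nonzero Hessian.
    """
    def objective(preds, dtrain):
        # dtrain is an xgboost.DMatrix in real use; here, anything
        # exposing .get_label() works.
        residual = preds - dtrain.get_label()
        scaled = residual / delta
        sqrt_term = np.sqrt(1.0 + scaled ** 2)
        grad = residual / sqrt_term    # saturates toward delta*sign(residual)
        hess = 1.0 / sqrt_term ** 3    # always positive; equals 1 at residual 0
        return grad, hess
    return objective
```

Smaller `delta` hugs MAE more closely but makes the Hessian vanish faster away from zero, so it's itself a hyperparameter worth a CSV row or two.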
Also we can try predicting the outliers themselves. I don't have a lot of hope there, but it seems reasonable to give it a shot.
- If estimated rates based on reflectivity are useful, we might try modifying the function for outliers. Research indicates different conversion factors are appropriate for different levels of rainfall; i.e., mist vs. hurricane.
- Might try creating a best-fit gamma distribution and then force outliers to match the distribution?
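As a concrete version of the reflectivity-to-rate feature (and of the per-regime tweak in the bullet above), here is the classic Marshall-Palmer Z-R power law, Z = a * R^b with a=200, b=1.6. A small sketch; the function name is mine, and refitting `a`/`b` per rainfall regime is the speculative part:

```python
def marshall_palmer_rate(dbz, a=200.0, b=1.6):
    """Convert radar reflectivity (dBZ) to an estimated rain rate (mm/hr).

    Uses the Z-R power law Z = a * R**b with Marshall-Palmer defaults
    a=200, b=1.6. For the outlier idea above, a and b could be refit
    per regime (drizzle vs. convective storm) instead of held fixed.
    """
    z = 10.0 ** (dbz / 10.0)       # dBZ is 10*log10(Z)
    return (z / a) ** (1.0 / b)
```

For example, 40 dBZ works out to roughly 11.5 mm/hr under the default constants.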
from rain-part2.
Also, I haven't put in code to run these in parallel yet, but it's possible in R.
Here is an example:
library(doSNOW)  # doSNOW attaches foreach, which provides %dopar%
cl <- makeCluster(rep("localhost", 4))  # increase 4 for more worker processes
registerDoSNOW(cl)
## runZone_andOutputParallel() and taskNum come from the driver script
a <- foreach(i = 1:10, .combine = rbind) %dopar% runZone_andOutputParallel(i, taskNum)
stopCluster(cl)  # release the workers when done
from rain-part2.
Added my thoughts to the main comment here.
from rain-part2.
Cool. Yes, some good additions.
The initial test of my driver seems to be working the way I want it to, so I am adding the parallel piece and then will post that code. Nothing too extreme, but it should allow a CSV to drive iterations. It won't look much different from a grid search (e.g., caret), but I will spend the next several days connecting most/all of the options above so we can try the various preprocessing measures. It won't be pretty, but hopefully it will be simple enough.
from rain-part2.
Driver thing sounds cool. I would like to share my experience of using Spearmint for parameter search. It works really well for Random Forest and Extra Trees. I can share how to use it in case any of you are interested.
from rain-part2.
Oh, great. Yes, bayesian optimization is often where people go. That's a great way to let the computer stay busy, too.
I'm familiar with it from here: http://fastml.com/tuning-hyperparams-automatically-with-spearmint/
But have never used it. We discuss it quite often at H2O.
Related is this paper, that we've been looking over at H2O:
http://www.jmaxkanter.com/static/papers/DSAA_DSM_2015.pdf
Cool stuff, Thakur. It would be great if you could use scikit's GBM (or anything else scikit) since they support MAE.
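Since so much here hinges on optimizing MAE directly, it's worth spelling out why: the constant that minimizes MAE is the median, not the mean, so squared-loss models are systematically biased for this metric on a skewed target like rainfall. A quick numpy check (the simulated target is my own stand-in, not competition data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed, outlier-heavy sample, loosely shaped like a rain gauge target
y = np.concatenate([rng.exponential(1.0, 1000), [50.0, 80.0, 120.0]])

# MAE of the two constant predictions: the mean (what MSE optimizes for)
# vs. the median (what MAE optimizes for)
mae_of_mean = np.mean(np.abs(y - y.mean()))
mae_of_median = np.mean(np.abs(y - np.median(y)))
```

The median always scores at least as well on MAE, and on skewed data the gap is large, which is why a log transform sometimes substitutes for a true MAE objective but, as noted above, doesn't seem sufficient here.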
from rain-part2.
Also related: somebody interviewing at H2O used this, which allows you to drive scikit learn with a config file, similar to what I'm trying.
https://skll.readthedocs.org/en/latest/
It's surely worth trying these sorts of things. I'll probably stick with mine for the moment: I'm running 6 CSV entries in parallel right now, so it is looking good. But alternatives might prove very useful.
from rain-part2.
Actually, you have to use two files by default for Spearmint. Attached are the example files for Random Forest from this competition.
config.txt --- config.json
rf_spearmint.txt --- rf_spearmint.py
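For anyone who hasn't seen Spearmint's format: a minimal config.json, sketched from memory of the HIPS Spearmint examples (the experiment name and variable ranges are made up here, and the key names should be checked against the attached files), looks roughly like:

```json
{
    "language"        : "PYTHON",
    "main-file"       : "rf_spearmint.py",
    "experiment-name" : "rf-mae-search",
    "variables" : {
        "n_estimators" : {"type" : "INT",   "size" : 1, "min" : 100, "max" : 2000},
        "max_features" : {"type" : "FLOAT", "size" : 1, "min" : 0.1, "max" : 1.0}
    }
}
```

The main file then exposes a function that takes those variables and returns the validation score Spearmint should minimize.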
from rain-part2.
You need to have mongoDB installed to use Spearmint. I will put together a small step-by-step document for Spearmint and post it tomorrow.
from rain-part2.
Thanks for sharing. This is great to see while I'm trying to create my own; a working example brings it to life a lot more.
So you could run Spearmint for more decisions than just the hyperparameters, since they're just being connected in your RF.py; you'd simply expose more things on both sides. That's effectively how mine is working.
And I get the MongoDB part, too; I'm just shooting out text files. I originally had it continually updating the same CSV, but once I went parallel that was no longer a good idea. What I want from having it all together, rather than separated into different files, is the ability to analyze results efficiently. For me a CSV is easy. A Mongo query is surely easy too, but it carries a fairly large overhead, unfortunately.
As I get closer to the final vision of my simple thing, I'll talk more about why I want what I want, as far as how that might differ from Spearmint.
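To make the parallel-safe CSV idea concrete: one pattern (my sketch, not the actual driver code) is to have each worker write its own small result file and concatenate them only at analysis time, which avoids concurrent writes to a single CSV:

```python
import csv
import glob
import os

def write_result(run_id, row, outdir="results"):
    """Each parallel worker writes its own one-row CSV; no shared file."""
    os.makedirs(outdir, exist_ok=True)
    path = os.path.join(outdir, f"result_{run_id}.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        writer.writeheader()
        writer.writerow(row)

def collect_results(outdir="results"):
    """Concatenate the per-run files into one list of dicts for analysis."""
    rows = []
    for path in sorted(glob.glob(os.path.join(outdir, "result_*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.DictReader(f))
    return rows
```

The collected rows can then be dumped back out as the single analysis CSV, so you keep the "everything in one place" property without the write contention.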
from rain-part2.
My first run's results:
id  val       train  model  trees  learn  depth  minObs  rowSample  colSample  distribution
1   2.331730  NA     r-gbm  200    0.05    5     10      0.7        NA         laplace
2   2.319348  NA     r-gbm  200    0.05   10     10      0.7        NA         laplace
3   2.336913  NA     r-gbm  200    0.05    5      1      0.7        NA         laplace
4   2.318312  NA     r-gbm  200    0.05   10      1      0.7        NA         laplace
5   2.308039  NA     r-gbm  200    0.05   15     10      0.7        NA         laplace
6   2.305827  NA     r-gbm  200    0.05   15      5      0.7        NA         laplace
from rain-part2.