Comments (4)
Wendy Wong commented:
{noformat}
library(h2o) # version 3.36.0.3
library(dplyr)

data <- read.csv(system.file("extdata", "prostate.csv", package = "h2o"))

# creating fake factor variable for repr. example
data <- data %>%
  mutate(state = as.factor(state.abb[DPROS]),
         weight = 1) %>%
  select(dependent_var = AGE,
         numeric_var1 = GLEASON,
         numeric_var2 = PSA,
         state,
         weight)

# initialize h2o
h2o.init()

# generating training frame
data_train <- data[1:300,] %>% as.h2o()

# specify model vars and interactions
model_interaction_pairs <- list(c('numeric_var1', 'state'))
model_vars <- c('numeric_var1', 'state', 'numeric_var2')

interaction_glm <- h2o.glm(y = "dependent_var",
                           x = model_vars,
                           training_frame = data_train,
                           interaction_pairs = model_interaction_pairs,
                           offset_column = "weight")

# taking the first 6 observations in the data set
bad <- head(data)
bad_h2o <- bad %>% as.h2o()

# generates out of bounds error since AR isn't present in the data set.
bad <- bad %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm, bad_h2o)))

# taking the first 7 observations in the data set, now AR is included.
good <- head(data, 7)
good_h2o <- good %>% as.h2o()

# no error since all states in train are present.
good <- good %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm, good_h2o)))

# to further illustrate:
unique(data[1:300,]$state) # states present in train
unique(good$state)         # states present in first 7 observations
unique(bad$state)          # states present in first 6 observations, AR left out.
{noformat}
from h2o-3.
Wendy Wong commented:
To preface, this question is specific to h2o package version {{3.36.0.3}}. I have not yet tested it on other versions, but for my purposes {{3.36.0.3}} is unfortunately mandatory.
In this reproducible example, every level of a categorical variable used in an interaction (here {{state}}) must appear at least once in the new data for predictions to be generated without an error. For convenience, we can pretend that the data sets {{good}} and {{bad}} are entirely new observations that the model has not seen.
In the training set, {{state}} takes the values AK, AL, AZ, or AR. For some reason, if any one of these states is not present in the new data, {{h2o.predict()}} generates the error:
{noformat}java.lang.RuntimeException: DistributedException from localhost: 'Index 6 out of bounds for length 3', caused by java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds for length 3
{noformat}
Predicting on this data set does work:
{noformat}
> good
  dependent_var numeric_var1 numeric_var2 state weight prediction
1            65            6          1.4    AK      1   65.35646
2            72            7          6.7    AZ      1   65.45639
3            70            6          4.9    AL      1   67.21862
4            76            7         51.2    AK      1   66.11350
5            69            6         12.3    AL      1   67.21862
6            71            8          3.3    AZ      1   65.45639
7            68            7         31.9    AR      1   66.34384
{noformat}
Predicting on this data set does not work. It returns the {{java.lang.ArrayIndexOutOfBoundsException}} error, since AR is not present in the data:
{noformat}
> bad
  dependent_var numeric_var1 numeric_var2 state weight
1            65            6          1.4    AK      1
2            72            7          6.7    AZ      1
3            70            6          4.9    AL      1
4            76            7         51.2    AK      1
5            69            6         12.3    AL      1
6            71            8          3.3    AZ      1
{noformat}
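A quick pre-flight check can tell you whether a new data set is missing any training level before {{h2o.predict()}} is called. The sketch below is base R only; {{missing_levels}} is a hypothetical helper name, not an h2o function, and the toy frames merely mirror the {{state}} values from the example above.

```r
# Hypothetical helper (not part of h2o): for each categorical column,
# list the levels seen in training that are absent from the new data.
missing_levels <- function(train, new, cols) {
  out <- lapply(cols, function(col) {
    setdiff(as.character(unique(train[[col]])),
            as.character(unique(new[[col]])))
  })
  names(out) <- cols
  out
}

# Toy frames mirroring the state values in the example above
train_toy <- data.frame(state = factor(c("AK", "AL", "AZ", "AR")))
bad_toy   <- data.frame(state = factor(c("AK", "AZ", "AL", "AK", "AL", "AZ")))

absent <- missing_levels(train_toy, bad_toy, "state")
print(absent)
```

An empty result for every column would suggest the frame can be scored as-is; a non-empty one flags the case that triggers the error here.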
I have some possible solutions to this, but they definitely aren't as convenient as I'd like.
- Add rows to the new data set for every unique state in the training frame, then remove those rows after predicting. This works, but outside of this example I'd need to implement quite a few steps and checks to make it dynamic (i.e. 0 categorical interactions, >1 categorical interactions, and the corresponding unique values present in the training frame).
- Modify the h2o model object to remove the variable and interaction variable that are not in use for the new data. (I can't seem to get this to work, and it might be hard to make dynamic outside of this example. It's also probably not best practice to modify a model object.)
- Add the new data to the original set and predict on everything, then filter out by some indicator. This is also not ideal since the data outside of this example is pretty big.
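The first workaround (pad, predict, drop the padding) could be made dynamic roughly as follows. This is only a sketch under assumptions: {{predict_with_padding}} and its arguments are hypothetical names, the training data is assumed to be available as a data.frame with the same columns as the new data, and {{predict_fn}} stands in for whatever scoring call you use. It has only been reasoned through against a stub, not against h2o itself.

```r
# Hypothetical helper (not part of h2o). For every training level missing
# from `new`, borrow one training row carrying that level, score the padded
# frame, then keep only the predictions for the original rows.
predict_with_padding <- function(train, new, cat_cols, predict_fn) {
  n_new <- nrow(new)
  pad <- list()
  for (col in cat_cols) {
    absent <- setdiff(as.character(unique(train[[col]])),
                      as.character(unique(new[[col]])))
    for (lvl in absent) {
      idx <- match(lvl, as.character(train[[col]]))
      pad[[length(pad) + 1]] <- train[idx, names(new), drop = FALSE]
    }
  }
  padded <- do.call(rbind, c(list(new), pad))
  preds <- predict_fn(padded)
  new$prediction <- preds[seq_len(n_new)]  # padding rows come last; drop them
  new
}

# Toy check with a stub scorer that fails unless all four states are present.
train_toy <- data.frame(x = c(1, 2, 3, 4),
                        state = factor(c("AK", "AL", "AZ", "AR")))
new_toy <- data.frame(x = c(10, 20), state = factor(c("AK", "AZ")))
stub_fn <- function(df) {
  stopifnot(all(c("AK", "AL", "AZ", "AR") %in% as.character(df$state)))
  df$x * 2
}
res <- predict_with_padding(train_toy, new_toy, "state", stub_fn)
print(res)

# With h2o, predict_fn might be (untested assumption):
# function(df) as.vector(h2o.predict(interaction_glm, as.h2o(df)))
```

Borrowing whole training rows keeps the numeric columns valid in the padding rows, and with zero categorical interactions ({{cat_cols}} empty) the frame is scored unchanged.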
I don't quite understand why every level of the categorical variable must be present in the new data being predicted, since each prediction should depend only on its own row. Is this just a limitation of h2o, or am I missing some additional argument or alternative function? Are there any other ways to use categorical interactions in an h2o GLM when the new data doesn't cover every category?
Wendy Wong commented:
{noformat}
library(data.table) # needed for data.table() below

# ugly fix 1 (works): append one row per training state, predict, then drop the padding rows
bad1 <- bind_rows(bad, data.table(unique(data[1:300,]$state)) %>% select(state = V1))
bad1_h2o <- bad1 %>% as.h2o()
bad1 %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm, bad1_h2o))) %>%
  filter(!is.na(dependent_var))

# ugly fix 2 (failed): strip AR from a copy of the model object
interaction_glm2 <- interaction_glm
interaction_glm2@model[["domains"]][[1]] <- paste0(unique(bad$state))
interaction_glm2@model[["domains"]][[2]] <- paste0(unique(bad$state))
interaction_glm2@model[["coefficients_table"]] <- interaction_glm@model[["coefficients_table"]] %>% filter(!grepl("AR", names))
interaction_glm2@model[["standardized_coefficient_magnitudes"]] <- interaction_glm@model[["standardized_coefficient_magnitudes"]] %>% filter(!grepl("AR", names))
interaction_glm2@model[["coefficients"]] <- interaction_glm@model[["coefficients"]][c(-4, -8)]
interaction_glm2@model[["model_summary"]][["number_of_predictors_total"]] <- interaction_glm@model[["model_summary"]][["number_of_predictors_total"]] - 2
interaction_glm2@model[["model_summary"]][["number_of_active_predictors"]] <- interaction_glm@model[["model_summary"]][["number_of_active_predictors"]] - 2

# doesn't work
bad2 <- bad %>%
  mutate(prediction = as.vector(h2o.predict(interaction_glm2, bad_h2o)))
{noformat}
JIRA Issue Details
Jira Issue: PUBDEV-8949
Assignee: Yuliia Syzon
Reporter: Wendy Wong
State: Open
Fix Version: N/A
Attachments: N/A
Development PRs: N/A