Comments (7)
You want to be able to set the domain of the categorical columns that you want to parse, right? If the dataset you ingest does not contain some of the categorical values, the resulting domain will be missing those values.
The team is very busy right now. I am trying to find a workaround for you before the real implementation, and it seems like I have failed to do so.
I will try my best to check it out.
from h2o-3.
Hi @monil1334:
You can use relevel to set the base level of a categorical column.
Say you have a categorical column 2 with levels 'a', 'b', 'c'. However, you want the ordering to be 'c','b','a'. This is what you can do:
data[2] = data[2].relevel('b') # new domain is 'b', 'a', 'c'
data[2] = data[2].relevel('c') # now you have the correct level 'c', 'b', 'a'.
If you like, you can also check out relevel_by_frequency, which orders the domain by how frequently each categorical level appears.
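For readers working in Java, here is a minimal sketch of the reordering described above, written on a plain String[] rather than an H2O frame. The class and method names are hypothetical and are not part of the H2O API; the sketch only assumes that relevel moves the chosen level to the front and leaves the other levels in their existing order, which matches the two-step example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT an H2O class: mimics the domain reordering of relevel.
class RelevelSketch {

    // Moves baseLevel to position 0 and keeps the remaining levels in their current order.
    static String[] relevel(String[] domain, String baseLevel) {
        if (!Arrays.asList(domain).contains(baseLevel)) {
            throw new IllegalArgumentException("Unknown level: " + baseLevel);
        }
        List<String> reordered = new ArrayList<>();
        reordered.add(baseLevel);
        for (String level : domain) {
            if (!level.equals(baseLevel)) {
                reordered.add(level);
            }
        }
        return reordered.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] domain = {"a", "b", "c"};
        domain = relevel(domain, "b");
        System.out.println(Arrays.toString(domain)); // [b, a, c]
        domain = relevel(domain, "c");
        System.out.println(Arrays.toString(domain)); // [c, b, a]
    }
}
```

Each call only changes which level comes first, which is why reversing three levels takes two calls.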
from h2o-3.
@wendycwong This is in the R and Python reference, if I am understanding it correctly. My code base is in Java, so here is a sample setup of what I am trying to do.
public static final int ACTION = 6;
public static final String[] DOMAIN = new String[]{"A", "B", "C", "D", "E", "F"};
PFile file = new PFile("/Users/jaeger/Downloads/part-00000-ca8e6d00-0b91-4f42-8987-f12ac891077e.c0001.csv.gz");
Key[] fvKeys = new Key[1];
FileVec[] fvs = new FileVec[1];
File[] tempFiles = new File[1];
try (FileVecManager fvm = new FileVecManager(file, false)) {
    fvs[0] = fvm._fv;
    fvKeys[0] = fvm._fv._key;
    tempFiles[0] = fvm._tempFile;
}
ParseSetup ps = ParseSetup.guessSetup(fvKeys, false, 1);
String[][] updateDomains = Bootstrap.setDomain(ps, file.getName()); // which basically sets the domain values for the columns in the ParseSetup
Bootstrap.setWFETypes(ps.getColumnTypes(), ps.getColumnNames());
Key<Frame> frameKey = Key.make(file.getName());
Frame frame = ParseDataset.parse(frameKey, fvKeys, true, ps);
So when I do
String[] domain = frame.vec(ACTION).domain();
I get the domain shown below, because the CSV does not contain all the possible values from DOMAIN. This varies from file to file:
domain = new String []{ "B", "D", "E", "A"};
But what I want is to have the same DOMAIN as declared above so it is consistent.
from h2o-3.
I thought about this and decided that it is difficult to do. The main reason we have datasets is to use them to build machine learning models. However, if you add extra domain levels that are not in the dataset and then try to build a model using GLM, there will be a problem. GLM has one coefficient for each categorical level. For the extra domain levels that are not found in the dataset, there is no way to determine the coefficient. Hence, the Gram matrix will not be invertible and the model building process will fail.
I think this is the main reason that we did not allow more domain levels than the ones found in the dataset in the first place.
Wendy
from h2o-3.
@wendycwong thank you for trying. I have a temporary workaround, which is a manual process of updating the vecs once the frame is built, for the columns whose domains I want to keep as is. I was mostly curious, since there has been a TODO in the code since 2016 or 2017 to use the domains from the ParseSetup.
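The core of such a manual remapping can be sketched on plain arrays. This is only an illustration under stated assumptions: the class and method names are hypothetical, the parsed domain and the fixed DOMAIN are the ones from the example earlier in this thread, and writing the remapped codes back into an H2O Vec is deliberately not shown.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, NOT an H2O class: remaps categorical codes from the
// domain found in one file to a fixed, file-independent domain.
class DomainRemapSketch {

    // mapping[i] is the index in fixedDomain of the level parsedDomain[i].
    static int[] buildMapping(String[] parsedDomain, String[] fixedDomain) {
        Map<String, Integer> position = new HashMap<>();
        for (int i = 0; i < fixedDomain.length; i++) {
            position.put(fixedDomain[i], i);
        }
        int[] mapping = new int[parsedDomain.length];
        for (int i = 0; i < parsedDomain.length; i++) {
            Integer target = position.get(parsedDomain[i]);
            if (target == null) {
                throw new IllegalArgumentException("Level not in fixed domain: " + parsedDomain[i]);
            }
            mapping[i] = target;
        }
        return mapping;
    }

    // Rewrites each stored code so that it indexes into the fixed domain.
    static int[] remapCodes(int[] codes, int[] mapping) {
        int[] remapped = new int[codes.length];
        for (int row = 0; row < codes.length; row++) {
            remapped[row] = mapping[codes[row]];
        }
        return remapped;
    }

    public static void main(String[] args) {
        String[] parsedDomain = {"B", "D", "E", "A"};
        String[] fixedDomain = {"A", "B", "C", "D", "E", "F"};
        int[] codes = {0, 1, 2, 3, 0}; // rows hold B, D, E, A, B under parsedDomain

        int[] mapping = buildMapping(parsedDomain, fixedDomain);
        System.out.println(Arrays.toString(mapping));                    // [1, 3, 4, 0]
        System.out.println(Arrays.toString(remapCodes(codes, mapping))); // [1, 3, 4, 0, 1]
    }
}
```

After the remap, every file uses the same code for the same level, regardless of which levels happened to appear in it.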
from h2o-3.
@monil1334: From a pure parsing point of view, this seems like a logical thing to do. However, from a machine learning perspective, it can cause the code to crash. For example, in GLM we need to perform operations similar to a matrix inverse. If we include domain levels that are not found in the dataset, the matrix will not be invertible. Having said that, we can probably add a check before model building that removes the predictors associated with the extra domain values.
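To make the invertibility argument concrete, here is a simplified sketch. It is not how H2O's GLM is implemented: it one-hot encodes a single categorical column against the full six-level DOMAIN from earlier in the thread, with no intercept and no regularization, and all names are hypothetical. A level that never occurs produces an all-zero column, so the matching row and column of the Gram matrix X^T X are zero and the matrix is singular. The last method shows the kind of check that could flag those levels before model building.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, NOT H2O code: shows why unused levels make the Gram matrix singular.
class UnusedLevelGramSketch {

    // One indicator column per domain level; a level absent from the data gets an all-zero column.
    static double[][] oneHot(String[] values, String[] domain) {
        List<String> levels = Arrays.asList(domain);
        double[][] x = new double[values.length][domain.length];
        for (int row = 0; row < values.length; row++) {
            int col = levels.indexOf(values[row]);
            if (col < 0) {
                throw new IllegalArgumentException("Value not in domain: " + values[row]);
            }
            x[row][col] = 1.0;
        }
        return x;
    }

    // Computes the Gram matrix X^T X.
    static double[][] gram(double[][] x) {
        int p = x[0].length;
        double[][] g = new double[p][p];
        for (double[] row : x) {
            for (int i = 0; i < p; i++) {
                for (int j = 0; j < p; j++) {
                    g[i][j] += row[i] * row[j];
                }
            }
        }
        return g;
    }

    // A zero diagonal entry means the whole column was zero, i.e. the level never occurred.
    static List<String> unusedLevels(double[][] g, String[] domain) {
        List<String> unused = new ArrayList<>();
        for (int i = 0; i < domain.length; i++) {
            if (g[i][i] == 0.0) {
                unused.add(domain[i]);
            }
        }
        return unused;
    }

    public static void main(String[] args) {
        String[] domain = {"A", "B", "C", "D", "E", "F"};
        String[] values = {"B", "D", "E", "A", "B"};
        double[][] g = gram(oneHot(values, domain));
        System.out.println(unusedLevels(g, domain)); // [C, F]
    }
}
```

Dropping the columns for the flagged levels leaves a Gram matrix with no zero rows, which is the effect the proposed pre-model-building check is after.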
from h2o-3.
@wendycwong if that could be incorporated somehow, it would be great!
from h2o-3.