Comments (3)
It may be helpful to add some attributes to this data frame recording relevant metadata:

- `task_id_cols` or `task_ids`: character vector of columns in the data frame containing task id variables
- `output_type_col` or `output_type`: string naming the column of the data frame containing the output type. For modern hubs, this would be `"output_type"`, but having this information accessible via metadata would help with backward compatibility
- `output_type_id_col` or `output_type_id`: string naming the column of the data frame containing the output type id. For modern hubs this would be `"output_type_id"`.
- `value_col` or `value`: string naming the column of the data frame containing the value. For modern and historical hubs this would be `"value"`.
We could eliminate the last three of these by assuming the "modern" values of those column names.
My motivation for suggesting this is that it seems like this is information that is known about the data once it is loaded in. It makes sense to track it as attributes of the data that are closely associated with the data so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are a part of the hub metadata, and could easily be tracked without any need for extra effort by the user.
In R, we might define this as an S3 class (named something like `hub_df` or `hub_prediction_df` or `hub_pred_df`?), or in Python as a class. In either case, a constructor would accept the data frame as well as these attributes.
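To make the idea concrete, here is a minimal sketch of the Python side of this proposal. The class name `ModelOutDF` and the use of a list of dicts in place of a real data frame are illustrative assumptions, not an existing hubutils API:

```python
from dataclasses import dataclass

@dataclass
class ModelOutDF:
    """Model-output rows plus the hub metadata describing them (hypothetical sketch)."""
    rows: list                                  # list of dicts, one per prediction row
    task_id_cols: list                          # columns holding task id variables
    output_type_col: str = "output_type"        # "modern" hub defaults
    output_type_id_col: str = "output_type_id"
    value_col: str = "value"

    def __post_init__(self):
        # The constructor validates that every declared metadata column actually
        # exists in the data, so downstream code can trust the attributes.
        if self.rows:
            cols = set(self.rows[0])
            declared = set(self.task_id_cols) | {
                self.output_type_col, self.output_type_id_col, self.value_col
            }
            missing = declared - cols
            if missing:
                raise ValueError(f"metadata names columns not in data: {sorted(missing)}")
```

With this shape, downstream functionality (e.g. ensembling) could read the column names from the object's attributes rather than asking the user to supply them.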
from hubutils.
If we did the proposal I made just above, we would probably have to provide methods for operations like `rbind` and `bind_rows` that did validations and updates to the metadata of the input data frames, e.g. concatenating and taking the unique `task_ids` and checking that the other columns had the same names.
Are there any other operations on `hub_df` objects that we'd have to be careful of?
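As a sketch of what such a method might need to do, using plain dicts to stand in for the data frame and its attributes (`bind_rows_hub` is a hypothetical name, not an existing function):

```python
def bind_rows_hub(a, b):
    """Concatenate two hub data frames (here: dicts holding 'rows' plus metadata),
    validating and merging their metadata as an rbind/bind_rows method would."""
    # The non-task-id metadata column names must agree exactly between inputs.
    for attr in ("output_type_col", "output_type_id_col", "value_col"):
        if a[attr] != b[attr]:
            raise ValueError(f"mismatched {attr}: {a[attr]!r} vs {b[attr]!r}")
    # task_id_cols are concatenated and deduplicated, preserving first-seen order.
    merged_ids = list(dict.fromkeys(a["task_id_cols"] + b["task_id_cols"]))
    return {**a, "rows": a["rows"] + b["rows"], "task_id_cols": merged_ids}
```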
Hey @elray1,
Your suggestion is certainly possible, although I can foresee some edge cases, e.g. someone renames a column (for example using `dplyr::rename` or `mutate`) in order to harmonise it with an equivalent column in another forecast object, so that the two can eventually be combined with `rbind`. In this situation, the metadata would also need to be updated in the `dplyr::rename` step. Given this, providing enough methods to ensure any attributes are appropriately updated would likely be too difficult/time-consuming.
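One mitigation for this edge case might be a cheap consistency check that downstream functions run before trusting the attributes, rather than trying to intercept every renaming operation. A sketch, with metadata as a plain dict and `stale_metadata_cols` as a hypothetical helper:

```python
def stale_metadata_cols(rows, meta):
    """Return metadata-declared column names that no longer exist in the data,
    e.g. after a rename that bypassed the class's own methods."""
    cols = set(rows[0]) if rows else set()
    declared = set(meta["task_id_cols"]) | {meta["output_type_col"], meta["value_col"]}
    return sorted(declared - cols)
```

A downstream function could error (or re-derive metadata) whenever this returns a non-empty list.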
Following in-person discussions between @elray1 and myself.
Desirables:
- Any columns going in need to come out of a predict function
- Any added columns (i.e. not original task ids) must not break the `group_by` functionality
`as_model_out_df()` function

Function that will take hub model-output data extracted through a hub connection query (with potentially additional information appended to it by users to aid in ensembling or forecasting) and convert it to a standardised `model_out_df` S3 class object, ready to be input to `ensemble()` or `plot()` methods. Alternative names considered: `as_prediction_df()`, `as_hub_df()`.
Arguments:

- `df`: a data frame returned from a `hub_connection` query.
- `task_id_cols`: default NULL. Character vector used to define pure task id columns. Overrides any metadata contained in `hub_con` if supplied.
- `output_type_col`: default NULL or "output_type"? Name of the column in `df` that contains `output_type` information.
- `output_type_id_col`: default NULL or "output_type_id"? Name of the column in `df` that contains `output_type_id` information.
- `value_col`: default NULL or "value". Name of the column in `df` that contains `value` information.
- `hub_con`: default NULL.
- `trim_to_config`: default FALSE.
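The argument-resolution logic implied by these defaults could be sketched as follows. The shape of `hub_con` (a dict with a `config_tasks` entry holding `task_ids`) is an assumption for illustration only, and the `trim_to_config` behaviour is omitted:

```python
def as_model_out_df(df, task_id_cols=None, output_type_col=None,
                    output_type_id_col=None, value_col=None,
                    hub_con=None, trim_to_config=False):
    """Resolve column-name metadata: explicit arguments win, then hub_con,
    then the 'modern' defaults. Returns the data plus resolved metadata."""
    output_type_col = output_type_col or "output_type"
    output_type_id_col = output_type_id_col or "output_type_id"
    value_col = value_col or "value"
    if task_id_cols is None and hub_con is not None:
        # task_id_cols falls back to the hub's config when not supplied directly.
        task_id_cols = list(hub_con.get("config_tasks", {}).get("task_ids", []))
    # (trimming to config and the grouping checks described below are omitted here)
    return {"df": df, "task_id_cols": task_id_cols,
            "output_type_col": output_type_col,
            "output_type_id_col": output_type_id_col,
            "value_col": value_col}
```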
Functionality

- Rename any columns to match the supplied arguments if necessary.
- Split a single `team_model_abbr` column into `model_abbr` and `team_abbr` if necessary (i.e. if the old-style hub structure where the team & model label is in a single directory). This is related to and takes care of #63 !!!!! Actually we may be back-pedalling!!!!
- Check that any additional columns have not erroneously introduced additional groupings, by comparing the number of groups generated by a `group_by()` (or even just `unique()`) call on all `df` columns apart from the value column to the number of groups generated by the same call on columns excluding any extra columns contained in the supplied `df`. These extra columns are determined as the set difference between all `df` column names (apart from value) and the set `model_abbr`, `team_abbr`, `output_type`, `output_type_id` plus task ids. Task ids are determined either from the `config_tasks` attribute of the `hub_con` object or from the `task_id_cols` vector (if not NULL). `task_id_cols` overrides information in the `hub_con` object if supplied. If using `task_id_cols` for this check, issue a message.
- `trim_to_config` will trim task id columns to only those in the `config_tasks` attribute of the `hub_con` object (or the `task_id_cols` vector if provided?).
- Many of these checks will be encompassed in a `validate_hub_prediction()` function which can be used by other functions also. This addresses the questions in #64.
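The grouping check in the third bullet amounts to a set-cardinality comparison. A minimal sketch, with rows as dicts and `extra_cols_add_groups` as a hypothetical name:

```python
def extra_cols_add_groups(rows, core_cols, value_col="value"):
    """True if grouping on all non-value columns yields more groups than
    grouping on the core (task id / model / output type) columns alone,
    i.e. the extra columns erroneously introduce additional groupings."""
    all_cols = [c for c in rows[0] if c != value_col]

    def n_groups(cols):
        # Number of distinct combinations of the given columns.
        return len({tuple(r[c] for c in cols) for r in rows})

    return n_groups(all_cols) > n_groups(core_cols)
```

If this returns True, the extra columns carry more than one value within some core group, which would make ensembling ambiguous; this is the situation `validate_hub_prediction()` would flag.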
from hubutils.