Comments (3)
It may be helpful to add some attributes to this data frame recording relevant metadata:

- `task_id_cols` or `task_ids`: character vector of columns in the data frame containing task id variables
- `output_type_col` or `output_type`: string naming the column of the data frame containing the output type. For modern hubs, this would be `"output_type"`, but having this information accessible via metadata would help with backward compatibility
- `output_type_id_col` or `output_type_id`: string naming the column of the data frame containing the output type id. For modern hubs this would be `"output_type_id"`.
- `value_col` or `value`: string naming the column of the data frame containing the value. For modern and historical hubs this would be `"value"`.
We could eliminate the last three of these by assuming the "modern" values of those column names.
My motivation for suggesting this is that it seems like this is information that is known about the data once it is loaded in. It makes sense to track it as attributes of the data that are closely associated with the data so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are a part of the hub metadata, and could easily be tracked without any need for extra effort by the user.
In R, we might define this as an S3 class (named something like `hub_df` or `hub_prediction_df` or `hub_pred_df`?), or in Python as a class. In either case, a constructor would accept the data frame as well as these attributes.
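To make the idea concrete, here is a minimal sketch of the Python side of this proposal. The class name `ModelOutDF` and the use of a list of dicts in place of a real data frame are illustrative assumptions, not an existing hubutils API:

```python
from dataclasses import dataclass

@dataclass
class ModelOutDF:
    """Model-output rows plus the hub metadata describing them (hypothetical sketch)."""
    rows: list                                  # list of dicts, one per prediction row
    task_id_cols: list                          # columns holding task id variables
    output_type_col: str = "output_type"        # "modern" hub defaults
    output_type_id_col: str = "output_type_id"
    value_col: str = "value"

    def __post_init__(self):
        # The constructor validates that every declared metadata column actually
        # exists in the data, so downstream code can trust the attributes.
        if self.rows:
            cols = set(self.rows[0])
            declared = set(self.task_id_cols) | {
                self.output_type_col, self.output_type_id_col, self.value_col
            }
            missing = declared - cols
            if missing:
                raise ValueError(f"metadata names columns not in data: {sorted(missing)}")
```

With this shape, downstream functionality (e.g. ensembling) could read the column names from the object's attributes rather than asking the user to supply them.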
from hubutils.
If we did the proposal I made just above, we would probably have to provide methods for operations like `rbind` and `bind_rows` that did validations and updates to the metadata of the input data frames, e.g. concatenating and taking the unique `task_ids` and checking that the other columns had the same names.
Are there any other operations on `hub_df` objects that we'd have to be careful of?
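As a sketch of what such a method might need to do, using plain dicts to stand in for the data frame and its attributes (`bind_rows_hub` is a hypothetical name, not an existing function):

```python
def bind_rows_hub(a, b):
    """Concatenate two hub data frames (here: dicts holding 'rows' plus metadata),
    validating and merging their metadata as an rbind/bind_rows method would."""
    # The non-task-id metadata column names must agree exactly between inputs.
    for attr in ("output_type_col", "output_type_id_col", "value_col"):
        if a[attr] != b[attr]:
            raise ValueError(f"mismatched {attr}: {a[attr]!r} vs {b[attr]!r}")
    # task_id_cols are concatenated and deduplicated, preserving first-seen order.
    merged_ids = list(dict.fromkeys(a["task_id_cols"] + b["task_id_cols"]))
    return {**a, "rows": a["rows"] + b["rows"], "task_id_cols": merged_ids}
```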
Hey @elray1,
Your suggestion is certainly possible, although I can foresee some edge cases, e.g. someone renames a column (for example using `dplyr::rename` or `mutate`) in order to harmonise it with an equivalent column in another forecast object, so that the two can eventually be combined with `rbind`. In this situation, the metadata would also need to be updated in the `dplyr::rename` step. Given this, providing enough methods to ensure any attributes are appropriately updated would likely be too difficult/time-consuming.
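One mitigation for this edge case might be a cheap consistency check that downstream functions run before trusting the attributes, rather than trying to intercept every renaming operation. A sketch, with metadata as a plain dict and `stale_metadata_cols` as a hypothetical helper:

```python
def stale_metadata_cols(rows, meta):
    """Return metadata-declared column names that no longer exist in the data,
    e.g. after a rename that bypassed the class's own methods."""
    cols = set(rows[0]) if rows else set()
    declared = set(meta["task_id_cols"]) | {meta["output_type_col"], meta["value_col"]}
    return sorted(declared - cols)
```

A downstream function could error (or re-derive metadata) whenever this returns a non-empty list.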
Following in-person discussions between @elray1 and myself.
Desirables:
- Any columns going in need to come out of a predict function
- Any added columns (i.e. not original task ids) must not break the `group_by` functionality
`as_model_out_df()` function

Function that will take hub model-output data extracted through a hub connection query (with potentially additional information appended to it by users to aid in ensembling or forecasting) and convert it to a standardised `model_out_df` S3 class object, ready to be input to `ensemble()` or `plot()` methods. Alternative names considered: `as_prediction_df()`, `as_hub_df()`.
Arguments:

- `df`: a data frame returned from a `hub_connection` query.
- `task_id_cols`: default NULL. Character vector used to define pure task id columns. Overrides any metadata contained in `hub_con` if supplied.
- `output_type_col`: default NULL or "output_type"? Name of the column in `df` that contains `output_type` information.
- `output_type_id_col`: default NULL or "output_type_id"? Name of the column in `df` that contains `output_type_id` information.
- `value_col`: default NULL or "value". Name of the column in `df` that contains `value` information.
- `hub_con`: default NULL.
- `trim_to_config`: default FALSE.
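The argument-resolution logic implied by these defaults could be sketched as follows. The shape of `hub_con` (a dict with a `config_tasks` entry holding `task_ids`) is an assumption for illustration only, and the `trim_to_config` behaviour is omitted:

```python
def as_model_out_df(df, task_id_cols=None, output_type_col=None,
                    output_type_id_col=None, value_col=None,
                    hub_con=None, trim_to_config=False):
    """Resolve column-name metadata: explicit arguments win, then hub_con,
    then the 'modern' defaults. Returns the data plus resolved metadata."""
    output_type_col = output_type_col or "output_type"
    output_type_id_col = output_type_id_col or "output_type_id"
    value_col = value_col or "value"
    if task_id_cols is None and hub_con is not None:
        # task_id_cols falls back to the hub's config when not supplied directly.
        task_id_cols = list(hub_con.get("config_tasks", {}).get("task_ids", []))
    # (trimming to config and the grouping checks described below are omitted here)
    return {"df": df, "task_id_cols": task_id_cols,
            "output_type_col": output_type_col,
            "output_type_id_col": output_type_id_col,
            "value_col": value_col}
```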
Functionality

- Rename any columns to match the supplied arguments if necessary.
- Split a single `team_model_abbr` column into `model_abbr` and `team_abbr` if necessary (i.e. if the old-style hub structure where the team & model label is in a single directory). This is related to and takes care of #63 !!!!! Actually we may be back-pedalling!!!!
- Check that any additional columns have not erroneously introduced additional groupings, by comparing the number of groups generated by a `group_by()` (or even just `unique()`) call on all `df` columns apart from the value column to the number of groups generated by the same call on columns excluding any extra columns contained in the supplied `df`. These extra columns are determined as the set difference between all `df` column names (apart from value) and the set `model_abbr`, `team_abbr`, `output_type`, `output_type_id` plus task ids. Task ids are determined either from the `config_tasks` attribute of the `hub_con` object or from the `task_id_cols` vector (if not NULL). `task_id_cols` overrides information in the `hub_con` object if supplied. If using `task_id_cols` for this check, issue a message.
- `trim_to_config` will trim task id columns to only those in the `config_tasks` attribute of the `hub_con` object (or the `task_id_cols` vector if provided?).
- Many of these checks will be encompassed in a `validate_hub_prediction()` function which can be used by other functions also. This addresses the questions in #64.
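The grouping check in the third bullet amounts to a set-cardinality comparison. A minimal sketch, with rows as dicts and `extra_cols_add_groups` as a hypothetical name:

```python
def extra_cols_add_groups(rows, core_cols, value_col="value"):
    """True if grouping on all non-value columns yields more groups than
    grouping on the core (task id / model / output type) columns alone,
    i.e. the extra columns erroneously introduce additional groupings."""
    all_cols = [c for c in rows[0] if c != value_col]

    def n_groups(cols):
        # Number of distinct combinations of the given columns.
        return len({tuple(r[c] for c in cols) for r in rows})

    return n_groups(all_cols) > n_groups(core_cols)
```

If this returns True, the extra columns carry more than one value within some core group, which would make ensembling ambiguous; this is the situation `validate_hub_prediction()` would flag.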
from hubutils.