
Comments (3)

elray1 commented on July 30, 2024

It may be helpful to add some attributes to this data frame recording relevant metadata:

  • task_id_cols or task_ids: character vector of columns in the data frame containing task id variables
  • output_type_col or output_type: string naming column of the data frame containing the output type. For modern hubs, this would be "output_type", but having this information accessible via metadata would help with backward compatibility.
  • output_type_id_col or output_type_id: string naming column of the data frame containing the output type id. For modern hubs this would be "output_type_id".
  • value_col or value: string naming column of the data frame containing the value. For modern and historical hubs this would be "value".

We could eliminate the last three of these by assuming the "modern" values of those column names.

My motivation for suggesting this is that it seems like this is information that is known about the data once it is loaded in. It makes sense to track it as attributes of the data that are closely associated with the data so that these pieces of metadata can be easily accessed in downstream functionality. For example, our current draft of initial hubEnsembles functionality takes all of these things as arguments that the user would need to specify (albeit with reasonable/common defaults), but in reality there should be no need for a user to track and specify these things -- they are a part of the hub metadata, and could easily be tracked without any need for extra effort by the user.

In R, we might define this as an S3 class (named something like hub_df, hub_prediction_df, or hub_pred_df?), or in Python as a class. In either case, a constructor would accept the data frame as well as these attributes.
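To make the constructor idea concrete, here is a minimal Python sketch. The class name HubDF is hypothetical, and a list of dicts stands in for a real data frame so the example needs only the standard library. The defaults assume the "modern" column names described above.

```python
from dataclasses import dataclass


@dataclass
class HubDF:
    """Hypothetical wrapper pairing tabular rows with hub column metadata."""

    rows: list          # list of dicts, one per row (stand-in for a data frame)
    task_id_cols: list  # names of the task id columns
    output_type_col: str = "output_type"
    output_type_id_col: str = "output_type_id"
    value_col: str = "value"

    def __post_init__(self):
        # Every column named in the metadata must be present in every row.
        required = set(self.task_id_cols) | {
            self.output_type_col,
            self.output_type_id_col,
            self.value_col,
        }
        for i, row in enumerate(self.rows):
            missing = required - set(row)
            if missing:
                raise ValueError(f"row {i} is missing columns: {sorted(missing)}")


preds = HubDF(
    rows=[
        {"location": "US", "horizon": 1, "output_type": "quantile",
         "output_type_id": 0.5, "value": 120.0},
    ],
    task_id_cols=["location", "horizon"],
)
```

Downstream functions could then read preds.task_id_cols or preds.value_col instead of asking the user to pass them in again.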

from hubutils.

elray1 commented on July 30, 2024

If we adopted the proposal I made just above, we would probably have to provide methods for operations like rbind and bind_rows that validate and update the metadata of the input data frames, e.g. concatenating the task_ids and taking the unique values, and checking that the other columns have the same names.
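A rough Python sketch of what such a binding method might do. The function name bind_hub_rows and the dict layout (rows plus metadata keys) are assumptions made for illustration, not an existing API.

```python
# Metadata keys whose column names must match across inputs (assumed layout).
FIXED_COLS = ("output_type_col", "output_type_id_col", "value_col")


def bind_hub_rows(a, b):
    """Hypothetical rbind-like helper for two dicts holding rows and metadata."""
    # The non-task-id column names must agree before rows can be stacked.
    for key in FIXED_COLS:
        if a[key] != b[key]:
            raise ValueError(f"{key} differs: {a[key]!r} vs {b[key]!r}")
    # Concatenate the task ids and keep unique names in first-seen order.
    task_ids = list(dict.fromkeys(a["task_id_cols"] + b["task_id_cols"]))
    # Task ids absent from a row are filled with None so all rows share columns.
    rows = [{**dict.fromkeys(task_ids), **row} for row in a["rows"] + b["rows"]]
    return {**a, "rows": rows, "task_id_cols": task_ids}


meta = {"output_type_col": "output_type",
        "output_type_id_col": "output_type_id",
        "value_col": "value"}
a = {**meta, "task_id_cols": ["location"],
     "rows": [{"location": "US", "output_type": "mean",
               "output_type_id": None, "value": 1.0}]}
b = {**meta, "task_id_cols": ["location", "horizon"],
     "rows": [{"location": "US", "horizon": 2, "output_type": "mean",
               "output_type_id": None, "value": 3.0}]}
combined = bind_hub_rows(a, b)
```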

Are there any other operations on hub_df objects that we'd have to be careful of?


annakrystalli commented on July 30, 2024

Hey @elray1,

Your suggestion is certainly possible, although I can foresee some edge cases. For example, someone might rename a column using dplyr::rename or mutate in order to harmonise its name with an equivalent column in another forecast object, so that the two can eventually be combined with rbind. In this situation, the metadata would also need to be updated in the dplyr::rename step. Given this, providing enough methods to ensure that any attributes are appropriately updated would likely be too difficult and time-consuming.
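The rename edge case can be illustrated with a small Python sketch. The rename_column helper is hypothetical and stands in for any generic rename that knows nothing about the attached metadata.

```python
def rename_column(rows, old, new):
    """Generic rename over a list of dict rows; unaware of any hub metadata."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]


# Metadata recorded when the data was loaded (assumed layout).
meta = {"task_id_cols": ["location", "horizon"]}
rows = [{"location": "US", "horizon": 1, "value": 3.0}]

renamed = rename_column(rows, "location", "geo")

# The metadata still names "location", which no longer exists in the data.
stale = [c for c in meta["task_id_cols"] if c not in renamed[0]]
```

Every operation of this kind would need a metadata-aware counterpart to keep the attributes in sync.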

After in-person discussions between @elray1 and myself:

Desirables:

  • Any columns going in need to come out of a predict function
  • Any added columns (i.e. not original task ids) must not break the group_by functionality

as_model_out_df() function

Function that will take hub model-output data extracted through a hub connection query (with potentially additional information to aid in ensembling or forecasting appended to it by users) and convert to a standardised model_out_df S3 class object, ready to be input to ensemble() or plot() methods. Alternative names considered: as_prediction_df(), as_hub_df().

  • df: a data frame returned from a hub_connection query.
  • task_id_cols = default NULL. Character vector used to define pure task id columns. Overrides any metadata contained in hub_con if supplied.
  • output_type_col = default NULL or "output_type"? Name of column in df that contains output_type information.
  • output_type_id_col = default NULL or "output_type_id"? Name of column in df that contains output_type_id information.
  • value_col = default NULL or "value". Name of column in df that contains value information.
  • hub_con = NULL
  • trim_to_config = FALSE

Functionality

  • Rename any column names with supplied arguments if necessary.
  • Split single team_model_abbr column into model_abbr and team_abbr if necessary (i.e. if old style hub structure where team & model label is in a single directory). This is related to, and takes care of, #63. Actually, we may be backpedalling on this!
  • Check that any additional columns have not erroneously introduced additional groupings. To do so, compare the number of groups generated by a group_by (or even just unique()) call on all df columns apart from the value column with the number of groups generated by the same call on the columns excluding any extra columns contained in the supplied df. These extra columns are determined as the set difference between all df column names (apart from value) and the set model_abbr, team_abbr, output_type, output_type_id + task ids. Task ids are determined either from the config_tasks attribute of the hub_con object or from the task_id_cols vector (if not NULL). task_id_cols overrides the information in the hub_con object if supplied. If task_id_cols is used for this check, issue a message.
  • trim_to_config will trim task_id columns to only those in the config_tasks attribute of the hub_con object (or the task_id_cols vector, if provided?)
  • Many of these checks will be encompassed in a validate_hub_prediction() which can be used by other functions also. This addresses the questions in #64.
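The grouping check described above could look roughly like this in Python. The function names are hypothetical, rows are a list of dicts standing in for df, and task ids are passed in directly rather than read from a hub_con object.

```python
STANDARD_COLS = ("model_abbr", "team_abbr", "output_type", "output_type_id")


def count_groups(rows, cols):
    """Number of distinct value combinations across the given columns."""
    return len({tuple(row[c] for c in cols) for row in rows})


def extra_cols_add_groupings(rows, task_id_cols, value_col="value"):
    """Return (extra columns, whether they introduce additional groupings)."""
    all_cols = [c for c in rows[0] if c != value_col]
    expected = set(STANDARD_COLS) | set(task_id_cols)
    # Extra columns: everything except value, standard columns and task ids.
    extra = [c for c in all_cols if c not in expected]
    kept = [c for c in all_cols if c in expected]
    # More groups with the extra columns included means they split a group.
    return extra, count_groups(rows, all_cols) != count_groups(rows, kept)


base = {"model_abbr": "m1", "team_abbr": "t1", "output_type": "mean",
        "output_type_id": None}
# "population" is constant within each location, so it adds no groups.
rows_ok = [
    {**base, "location": "US", "value": 1.0, "population": 330},
    {**base, "location": "CA", "value": 2.0, "population": 38},
]
# "run" varies within a single location, so it splits one group into two.
rows_bad = [
    {**base, "location": "US", "value": 1.0, "run": 1},
    {**base, "location": "US", "value": 1.5, "run": 2},
]

ok_result = extra_cols_add_groupings(rows_ok, task_id_cols=["location"])
bad_result = extra_cols_add_groupings(rows_bad, task_id_cols=["location"])
```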
