onsdigital.es-aggregation-sg's Introduction

es-aggregation-sg

Wranglers

Calculate Column (EntRef/County/..) Wrangler

The wrangler is responsible for preparing the data, invoking the method lambda and then sending the data downstream along with the respective notification messages (SNS).

Steps performed:

- Retrieves data from an S3 bucket
- Invokes method lambda
- Puts the aggregated data in an S3 bucket
- Sends SNS message
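
The four steps above can be sketched as a small orchestration function. This is only an illustration: the function name, the injected callables (which stand in for the S3, Lambda and SNS calls) and the "data" key in the payload and response are assumptions, not taken from the repository.

```python
def run_column_wrangler(read_from_s3, invoke_method, write_to_s3, send_sns,
                        runtime_variables):
    """Hypothetical orchestration of the wrangler steps listed above."""
    # 1. Retrieve the input data from S3 (any callable returning a JSON string).
    input_json = read_from_s3()

    # 2. Invoke the method lambda with the data and its runtime variables.
    #    The "data" key is an assumption made for this sketch.
    payload = {"RuntimeVariables": {**runtime_variables, "data": input_json}}
    response = invoke_method(payload)
    if not response.get("success"):
        # On failure, report the error without writing or notifying.
        return {"success": False, "error": response.get("error", "Unknown error")}

    # 3. Put the aggregated data in an S3 bucket.
    write_to_s3(response["data"])

    # 4. Send the SNS message.
    send_sns("Aggregation (column) complete")
    return {"success": True}
```

Passing the AWS calls in as arguments keeps the sketch runnable without credentials; in a deployed lambda they would presumably wrap the real S3, Lambda and SNS clients.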

Calculate Top 2 Wrangler

The wrangler is responsible for preparing the data, invoking the method lambda and sending the data downstream along with the respective notification messages (SNS).

Steps performed:

- Retrieves data from S3 bucket
- Converts the data from json to dataframe
- Ensures the mandatory columns are present and correctly typed
- Appends the new output columns in zero state
- Sends the dataframe to the method
- Ensures the new columns are still present and correctly typed in the returned dataframe
- Serialises the dataframe back to json
- Saves the data in an S3 bucket 
- Notifies via SNS   
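
The validation steps in this list (checking columns, appending the new columns in zero state, and re-checking after the method returns) might look roughly like the sketch below. It uses a plain list of dicts in place of a dataframe so that it has no dependencies; the helper names and the example input columns are hypothetical, while "largest_contributor" and "second_largest_contributor" are the output columns named in this README.

```python
import json


def check_columns(records, expected_types):
    """Raise if any record lacks a column or holds a value of the wrong type."""
    for row in records:
        for column, expected_type in expected_types.items():
            if column not in row:
                raise ValueError(f"Missing column: {column}")
            if not isinstance(row[column], expected_type):
                raise TypeError(f"Column {column} is not {expected_type.__name__}")


def prepare_for_top2(input_json, mandatory_types, new_columns):
    """Parse the JSON, validate mandatory columns, add new columns as zeros."""
    records = json.loads(input_json)
    check_columns(records, mandatory_types)
    for row in records:
        for column in new_columns:
            row[column] = 0  # "zero state" before the method fills them in
    return records


def finalise_top2(records, new_columns):
    """Re-check the new columns after the method ran, then serialise to JSON."""
    check_columns(records, {column: int for column in new_columns})
    return json.dumps(records)


NEW_COLUMNS = ["largest_contributor", "second_largest_contributor"]
prepared = prepare_for_top2(
    '[{"region": 1, "total": 5}]',      # hypothetical input data
    {"region": int, "total": int},      # hypothetical mandatory columns
    NEW_COLUMNS,
)
output_json = finalise_top2(prepared, NEW_COLUMNS)
```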

Methods

Calculate Enterprise Reference Count Method

Name of Lambda: aggregation_column_method

Summary: This method groups the data by a given column and by region. It then aggregates the specified column (e.g. enterprise_ref) to create a total (e.g. ent_ref_count) and renames the resulting column accordingly.

Inputs: event: {"RuntimeVariables":{
aggregated_column - A column to aggregate by, e.g. Enterprise_Reference.
additional_aggregated_column - A second column to aggregate by, e.g. Region.
aggregation_type - The aggregation to apply, e.g. sum, count, nunique.
total_columns - The names of the columns to produce aggregations for.
cell_total_column - The new name to give the aggregated total column.
}}

Outputs: A JSON dict containing a success marker and the aggregated data with the count/sum column.
e.g. {"success": True} together with the aggregated data on success, or {"success": False, "error": "Message"} on failure.
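
As a rough illustration of the grouping and renaming described above, the sketch below groups rows by the two columns, applies the chosen aggregation and stores the result under the new column name. It is a dependency-free stand-in, not the lambda's actual code: the function name is hypothetical, it handles a single total column rather than a list, and the sample data is invented.

```python
from collections import defaultdict

# Supported values of aggregation_type, as listed in the inputs above.
AGGREGATIONS = {
    "sum": sum,
    "count": len,
    "nunique": lambda values: len(set(values)),
}


def aggregate_column(records, aggregated_column, additional_aggregated_column,
                     aggregation_type, total_column, cell_total_column):
    """Group by two columns and aggregate total_column into cell_total_column."""
    groups = defaultdict(list)
    for row in records:
        key = (row[aggregated_column], row[additional_aggregated_column])
        groups[key].append(row[total_column])

    aggregate = AGGREGATIONS[aggregation_type]
    return [
        {
            aggregated_column: first,
            additional_aggregated_column: second,
            cell_total_column: aggregate(values),
        }
        for (first, second), values in groups.items()
    ]


# Invented sample data: count distinct enterprise references per county/region.
sample = [
    {"county": 1, "region": 10, "enterprise_ref": "A"},
    {"county": 1, "region": 10, "enterprise_ref": "A"},
    {"county": 1, "region": 10, "enterprise_ref": "B"},
    {"county": 2, "region": 10, "enterprise_ref": "C"},
]
ent_ref_counts = aggregate_column(
    sample, "county", "region", "nunique", "enterprise_ref", "ent_ref_count"
)
# ent_ref_counts holds one row per (county, region) pair:
# county 1 has 2 distinct references, county 2 has 1.
```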


Calculate Top Two Method

Name of Lambda: aggregation_top2_wrangler

Summary: Takes a DataFrame in JSON format and calculates the highest and second-highest totals within each unique combination of aggregated_column and additional_aggregated_column (the column names are set in the runtime variables). These are then appended as two new columns. The DataFrame is saved to S3 as JSON and a notification is sent on to the next module via SNS.

Inputs: event: {"RuntimeVariables":{
aggregated_column - A column to aggregate by, e.g. Enterprise_Reference.
additional_aggregated_column - A second column to aggregate by, e.g. Region.
total_columns - The names of the columns to produce aggregations for.
}}

Outputs: A JSON dict containing a success marker and the input DataFrame with the following two columns appended: "largest_contributor" and "second_largest_contributor".
e.g. {"success": True} together with the data on success, or {"success": False, "error": "Message"} on failure.
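
The top-two calculation can be illustrated with the sketch below, which finds the two largest totals in each group and appends them to every row of that group. The function name and sample data are hypothetical, a list of dicts stands in for the DataFrame, and defaulting the second value to 0 when a group has only one row is an assumption.

```python
import heapq
from collections import defaultdict


def calculate_top_two(records, aggregated_column, additional_aggregated_column,
                      total_column):
    """Append the largest and second-largest total of each group to its rows."""
    totals = defaultdict(list)
    for row in records:
        key = (row[aggregated_column], row[additional_aggregated_column])
        totals[key].append(row[total_column])

    # Two largest totals per group, in descending order.
    top_two = {key: heapq.nlargest(2, values) for key, values in totals.items()}

    result = []
    for row in records:
        ranked = top_two[(row[aggregated_column], row[additional_aggregated_column])]
        result.append({
            **row,
            "largest_contributor": ranked[0],
            # Assumption: fall back to 0 when the group has a single row.
            "second_largest_contributor": ranked[1] if len(ranked) > 1 else 0,
        })
    return result


# Invented sample data: three rows in one group, one row in another.
sample_totals = [
    {"county": 1, "region": 10, "total": 5},
    {"county": 1, "region": 10, "total": 9},
    {"county": 1, "region": 10, "total": 3},
    {"county": 2, "region": 10, "total": 4},
]
with_top_two = calculate_top_two(sample_totals, "county", "region", "total")
```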


Combiner

The combiner is used to join the outputs from the 3 aggregations back onto the original data. It is assumed that the imputed data (or the original data, if it did not need imputing) is stored in an S3 bucket by the imputation module, and that each of the 3 aggregation processes writes its output to S3.
The combiner simply picks up the imputation data and the 3 files from the other aggregation stages from S3, joins them all together and sends the result onwards. As a result, the next module (disclosure) has the granular input data with the aggregations merged on.

*The exact column can be provided as a runtime variable.
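
The join performed by the combiner could be sketched as a series of left joins of each aggregation output onto the granular data. This is an assumption about the join type and key handling, shown with plain lists of dicts; the function names, join columns and sample values are hypothetical.

```python
def merge_onto(records, aggregation, join_columns):
    """Left-join one aggregation output onto the granular records."""
    lookup = {
        tuple(row[column] for column in join_columns): row
        for row in aggregation
    }
    merged = []
    for row in records:
        key = tuple(row[column] for column in join_columns)
        # Rows with no match keep their original columns only.
        merged.append({**row, **lookup.get(key, {})})
    return merged


def combine(imputed_records, aggregations, join_columns):
    """Merge every aggregation output onto the imputed (or original) data."""
    combined = imputed_records
    for aggregation in aggregations:
        combined = merge_onto(combined, aggregation, join_columns)
    return combined


# Invented sample: one granular row and the three aggregation outputs.
combined_rows = combine(
    [{"county": 1, "region": 10, "total": 5}],
    [
        [{"county": 1, "region": 10, "ent_ref_count": 2}],
        [{"county": 1, "region": 10, "cell_total": 14}],
        [{"county": 1, "region": 10, "largest_contributor": 9,
          "second_largest_contributor": 5}],
    ],
    join_columns=["county", "region"],
)
```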

Contributors

dependabot[bot], dom-ford, glanvl, jordancooke, kingmushroom, krisrogos, lukeglanville, mkeating, piwington, thomashenson
