es-disclosure

The disclosure module allows the user to specify which stages they wish to run against the data, e.g. 1, 2, 5. Stages 1, 2 and 5 are currently implemented; stages 3 and 4 are in development and are present as mocks.

The disclosure methods rely on there being several aggregations produced by the previous step. Refer to aggregation for more information.

Wrangler

Disclosure utilises a single wrangler to orchestrate which method stages are triggered. The stages are specified via the disclosure_stages runtime variable.
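
A minimal sketch of how the wrangler might turn disclosure_stages into the list of methods to trigger. It assumes the variable arrives as a comma-separated string such as "1, 2, 5", and that every stage follows the stageN_method naming used for stages 1, 2 and 5 below. Both function names are hypothetical.

```python
def parse_disclosure_stages(raw):
    """Turn a string such as "1, 2, 5" into [1, 2, 5].

    Assumption: stages are comma separated and numbered 1 to 5.
    """
    stages = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate a trailing comma
        stage = int(token)  # raises ValueError for non-numeric input
        if stage < 1 or stage > 5:
            raise ValueError(f"Unknown disclosure stage: {stage}")
        if stage not in stages:
            stages.append(stage)  # keep the requested order, drop repeats
    return stages


def method_names(stages):
    """Map stage numbers to lambda names (assumed naming pattern)."""
    return [f"stage{stage}_method" for stage in stages]
```

For example, method_names(parse_disclosure_stages("1, 2, 5")) gives the three lambda names listed under Methods.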

Common Environment Variables

Each wrangler has these variables:
bucket_name: - The name of the bucket used to store data.
method_name: - The method that this wrangler calls.

Runtime variables

These are the runtime variables that need to be present for the module to work correctly.
disclosivity_marker: - Marks if the data is disclosive or not.
publishable_indicator: - Marks if the data should be published or not.
explanation: - The reason why something has been marked as disclosive.
total_column: - The name which is used for the total column in aggregation.
parent_column: - The name of the column holding the parent company reference.
threshold: - The threshold used in the Stage 2 disclosure calculation.
cell_total_column: - The name given to the cell total column.
top1_column: - The name of the column that holds the largest contributor cell.
top2_column: - The name of the column that holds the second largest contributor cell.
stage5_threshold: - The threshold used in the Stage 5 disclosure calculation.
disclosure_stages: - The stages of disclosure you wish to run e.g. 1, 2, 5.
in_file_name: - The default input file name to get from s3 (this is the previous methods out_file_name).
out_file_name: - The path and name of the file you wish to save the csv as.
sns_topic_arn: - The sns topic to send summary information to.

General process:

Collect the data from s3
Turn input data into dataframe
Pass input dataframe to the appropriate method
Send returned data from method to s3
Send summary info to sns.
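
The steps above can be sketched as follows. This is an illustration only: the S3, method and SNS calls are passed in as plain callables (read_input, invoke_method, write_output and send_summary are all hypothetical names), and a list of records parsed from JSON stands in for the dataframe so that the sketch runs without any AWS or pandas dependency.

```python
import json


def run_wrangler(read_input, invoke_method, write_output, send_summary,
                 runtime_variables):
    """Walk through the five wrangler steps using injected I/O callables."""
    # 1. Collect the data from S3 (read_input stands in for the S3 read).
    raw = read_input(runtime_variables["in_file_name"])

    # 2. Turn the input into tabular data (a list of records stands in
    #    for the dataframe here).
    records = json.loads(raw)

    # 3. Pass the input to the appropriate method.
    result = invoke_method(records, runtime_variables)
    if not result["success"]:
        return result  # surface the method's error unchanged

    # 4. Send the data returned by the method to S3.
    write_output(runtime_variables["out_file_name"], result["data"])

    # 5. Send summary info to SNS (message text is an assumption).
    send_summary(runtime_variables["sns_topic_arn"], "Disclosure complete")
    return {"success": True}
```

Injecting the I/O functions keeps the orchestration testable with in-memory fakes; a real wrapper would supply functions that call S3, Lambda and SNS.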

Methods

The methods perform the actual disclosure calculation. Each contains a function called disclosure, which uses apply() to run a given test against each row of the dataframe. Once the test has been applied, the dataframe is returned.
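
As a rough sketch of that shape, the example below applies a row test to every row and wraps the result in the success/error dict described under each stage's Outputs. A list of dicts stands in for the dataframe, so the list comprehension plays the role of apply(); method_handler and row_test are hypothetical names.

```python
import json


def disclosure(records, row_test):
    """Apply row_test to each row and return the tested rows."""
    # Each row is copied so the caller's data is not modified in place.
    return [row_test(dict(row)) for row in records]


def method_handler(records, row_test):
    """Run the disclosure test and build the final_output dict."""
    try:
        output = disclosure(records, row_test)
        return {"success": True, "data": json.dumps(output)}
    except Exception as error:
        # Any failure is reported as an error message string.
        return {"success": False, "error": str(error)}
```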

Stage 1

Name of Lambda:

stage1_method

Intro:

Checks whether the total for the cell is 0 or rounds to 0.

Inputs:

Outputs:

final_output: Dict containing either:
{"success": True, "data": < stage 1 output - json >}
{"success": False, "error": < error message - string >}
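
The Stage 1 check itself might look something like the sketch below. It assumes "rounded to 0" means rounding to the nearest whole number; the actual precision is not stated here, and total_is_zero is a hypothetical name. How a zero cell is then marked is left out because it is not described in this document.

```python
def total_is_zero(total):
    """Return True when the cell total is 0 or rounds to 0.

    Assumption: rounding is to the nearest whole number.
    """
    return round(total) == 0


# Example: flag each cell total in a small set of rows.
rows = [{"total": 0}, {"total": 0.2}, {"total": 12.7}]
flags = [total_is_zero(row["total"]) for row in rows]
```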

Stage 2

Name of Lambda:

stage2_method

Intro:

Checks whether the number of distinct enterprise references (ent refs) in a cell is at least as large as a given threshold.

Inputs:

data: input data.
disclosivity_marker: The name of the column to put 'disclosive' marker.
publishable_indicator: The name of the column to put 'publish' marker.
explanation: The name of the column to put reason for pass/fail.
parent_column: The name of the column holding the count of parent companies.
threshold: The threshold above which a row is not disclosive.
total_columns: The names of the columns holding the cell totals. Included so that the correct disclosure columns are used.
contributor_reference: The name of the column holding the contributor id.

Outputs:

final_output: Dict containing either:
{"success": True, "data": < stage 2 output - json >}
{"success": False, "error": < error message - string >}
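
A minimal sketch of a Stage 2 row test, using the input names listed above. The marker value "Yes" and the explanation text are assumptions, as is the choice to leave passing rows untouched; the comparison treats a count equal to the threshold as a pass, following the intro.

```python
def stage2_row_test(row, disclosivity_marker, explanation,
                    parent_column, threshold):
    """Mark a row as disclosive when it has too few parent companies."""
    row = dict(row)  # work on a copy of the row
    if row[parent_column] < threshold:
        row[disclosivity_marker] = "Yes"  # assumed marker value
        row[explanation] = "Stage 2 - too few contributors"  # assumed text
    return row


# Example with hypothetical column names and a threshold of 3.
tested = stage2_row_test(
    {"cell": "A", "parent_count": 2},
    disclosivity_marker="disclosive",
    explanation="reason",
    parent_column="parent_count",
    threshold=3,
)
```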

Stage 3

N/A

Stage 4

N/A

Stage 5

Name of Lambda:

stage5_method

Intro:

The specifics of this test are not included here, as they may be considered sensitive.

Inputs:

data: input data.
disclosivity_marker: The name of the column to put 'disclosive' marker.
publishable_indicator: The name of the column to put 'publish' marker.
explanation: The name of the column to put reason for pass/fail.
cell_total_column: The name of the column holding the cell total.
top1_column: The name of the column holding the largest contributor to the cell.
top2_column: The name of the column holding the second largest contributor to the cell.
total_columns: The names of the columns holding the cell totals. Included so that the correct disclosure columns are used.
contributor_reference: The name of the column holding the contributor id.

Outputs:

final_output: Dict containing either:
{"success": True, "data": < stage 5 output - json >}
{"success": False, "error": < error message - string >}
