es-disclosure
The disclosure module lets the user specify which stages to run against the data, e.g. 1, 2, 5. Stages 1, 2 and 5 are implemented; stages 3 and 4 are still in development and are present only as mocks.
The disclosure methods rely on several aggregations having been produced by the previous step. Refer to aggregation for more information.
Wrangler
Disclosure uses a single wrangler to orchestrate which method stages are triggered. The stages to run are specified via the disclosure_stages runtime variable.
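As a rough illustration of how a single wrangler might decide which stages to trigger, the sketch below parses a disclosure_stages value such as "1, 2, 5" and maps each stage to a method name. The names `parse_disclosure_stages`, `methods_to_trigger` and `STAGE_METHODS` are hypothetical, and the assumption that the value arrives as a comma- or space-separated string is not confirmed by this README.

```python
import re

# Hypothetical lookup from stage number to Lambda method name. Stages 3 and 4
# are described as mocks in this README, so this sketch leaves them out.
STAGE_METHODS = {
    1: "stage1_method",
    2: "stage2_method",
    5: "stage5_method",
}


def parse_disclosure_stages(disclosure_stages):
    """Turn a value such as "1, 2, 5" into a list of stage numbers.

    Assumes the runtime variable is a string of digits separated by commas
    and/or whitespace; the real wrangler may use a different format.
    """
    return [int(token) for token in re.findall(r"\d+", disclosure_stages)]


def methods_to_trigger(disclosure_stages):
    """Return the method names for the requested stages, in the order given."""
    methods = []
    for stage in parse_disclosure_stages(disclosure_stages):
        if stage not in STAGE_METHODS:
            raise ValueError(f"Stage {stage} is not available in this sketch")
        methods.append(STAGE_METHODS[stage])
    return methods


print(methods_to_trigger("1, 2, 5"))
```

Rejecting unknown stages up front is a design choice of this sketch; it keeps a mistyped stage number from being silently skipped.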
Common Environment Variables
Each wrangler has these variables:
bucket_name: - The name of the bucket used to store data.
method_name: - The method that this wrangler calls.
Runtime variables
These are the runtime variables that need to be present for the module to work correctly.
disclosivity_marker: - Marks if the data is disclosive or not.
publishable_indicator: - Marks if the data should be published or not.
explanation: - The reason why something has been marked as disclosive.
total_column: - The name which is used for the total column in aggregation.
parent_column: - The name of the column holding the parent company reference.
threshold: - The threshold used in the Stage 2 disclosure calculation.
cell_total_column: - The name given to the cell total column.
top1_column: - The name of the column that holds the largest contributor cell.
top2_column: - The name of the column that holds the second largest contributor cell.
stage5_threshold: - The threshold used in the Stage 5 disclosure calculation.
disclosure_stages: - The stages of disclosure you wish to run e.g. 1, 2, 5.
in_file_name: - The default input file name to get from S3 (this is the previous method's out_file_name).
out_file_name: - The path and name of the file you wish to save the CSV as.
sns_topic_arn: - The SNS topic to send summary information to.
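To make the list above concrete, here is a minimal sketch of what a set of runtime variables could look like, together with a small check for missing names. The variable names come from this README; every value is an illustrative placeholder, and the helper `missing_runtime_variables` is hypothetical rather than part of the module.

```python
# Variable names come from the list above. Every value is an illustrative
# placeholder, not a real column name, threshold, file path, or ARN.
example_runtime_variables = {
    "disclosivity_marker": "disclosive",
    "publishable_indicator": "publish",
    "explanation": "reason",
    "total_column": "total",
    "parent_column": "parent_count",
    "threshold": 3,
    "cell_total_column": "cell_total",
    "top1_column": "largest_contributor",
    "top2_column": "second_largest_contributor",
    "stage5_threshold": 0.1,
    "disclosure_stages": "1, 2, 5",
    "in_file_name": "aggregation_out.json",
    "out_file_name": "disclosure_out.csv",
    "sns_topic_arn": "arn:aws:sns:<region>:<account-id>:<topic-name>",
}

REQUIRED_RUNTIME_VARIABLES = tuple(example_runtime_variables)


def missing_runtime_variables(runtime_variables):
    """Return the required names absent from the supplied mapping, sorted."""
    return sorted(set(REQUIRED_RUNTIME_VARIABLES) - set(runtime_variables))


print(missing_runtime_variables({"threshold": 3}))
```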
General process:
- Collect the data from S3
- Turn the input data into a dataframe
- Pass the input dataframe to the appropriate method
- Send the data returned by the method to S3
- Send summary info to SNS
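The five steps above can be sketched as one function. This is only an outline under stated assumptions: the real wrangler presumably talks to S3 and SNS through an AWS client, which is replaced here by plain callables so the example runs without AWS, and a list of dicts parsed from JSON stands in for the dataframe. The name `run_wrangler` and its parameters are hypothetical.

```python
import json


def run_wrangler(read_input, method, write_output, send_summary):
    """Run the five steps in order using caller-supplied callables.

    read_input() returns JSON text, method(records) returns records, and
    write_output(text) / send_summary(message) perform the side effects.
    """
    raw = read_input()                    # 1. collect the data
    records = json.loads(raw)             # 2. parse (stand-in for a dataframe)
    result = method(records)              # 3. pass the data to the method
    write_output(json.dumps(result))      # 4. store the returned data
    send_summary(f"Disclosure processed {len(result)} rows")  # 5. summary
    return result


# In-memory stand-ins for S3 and SNS.
stored = {}
sent = []
result = run_wrangler(
    read_input=lambda: '[{"cell_total": 0}, {"cell_total": 12}]',
    method=lambda records: records,  # stand-in that returns its input unchanged
    write_output=lambda text: stored.update(out=text),
    send_summary=sent.append,
)
print(sent)
```

Passing the storage and notification steps in as callables is a choice made for this sketch so the flow can be exercised locally.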
Methods
The methods perform the actual disclosure calculation. Each contains a function called disclosure, which uses apply() to run a given test against each row of the dataframe. Once the test has been applied, the dataframe is returned.
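The row-by-row pattern described above might look roughly like the following with pandas. The test used here (flagging negative totals) is a deliberate placeholder and is not one of the real stages; the function signatures, the marker values "Yes"/"No" and the example column names are likewise assumptions.

```python
import pandas as pd


def placeholder_test(row, cell_total_column, disclosivity_marker, explanation):
    """Placeholder row-level test, not one of the real stages.

    Flags a row when its cell total is negative.
    """
    row = row.copy()  # avoid mutating the object pandas passes in
    if row[cell_total_column] < 0:
        row[disclosivity_marker] = "Yes"
        row[explanation] = "Placeholder: negative total"
    else:
        row[disclosivity_marker] = "No"
        row[explanation] = "Placeholder: passed"
    return row


def disclosure(dataframe, cell_total_column, disclosivity_marker, explanation):
    """Apply the test to every row and return the resulting dataframe."""
    return dataframe.apply(
        placeholder_test,
        axis=1,  # call the function once per row
        args=(cell_total_column, disclosivity_marker, explanation),
    )


example = pd.DataFrame({"region": ["a", "b"], "cell_total": [10, -1]})
print(disclosure(example, "cell_total", "disclosive", "explanation"))
```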
Stage 1
Name of Lambda:
stage1_method
Intro:
Checks whether the total for the cell is 0 or rounds to 0.
Inputs:
data: input data.
disclosivity_marker: The name of the column to hold the 'disclosive' marker.
publishable_indicator: The name of the column to hold the 'publish' marker.
explanation: The name of the column to hold the reason for pass/fail.
cell_total_column: The name of the column holding the cell total.
total_columns: The names of the columns holding the cell totals.
contributor_reference: The name of the column holding the contributor id.
Outputs:
final_output: Dict containing either:
{"success": True, "data": < stage 1 output - json >}
{"success": False, "error": < error message - string >}
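A minimal sketch of the Stage 1 check and its success/error envelope follows. It assumes "rounded to 0" means rounding to the nearest whole number, and it only records the outcome in the explanation column, because this README does not say which marker values a zero total should receive. The function names and explanation strings are hypothetical, and a list of dicts stands in for the dataframe.

```python
import json


def total_is_zero(cell_total):
    """Return True when the total is 0 or rounds to 0.

    Uses Python's built-in round(), which rounds halves to the nearest even
    number, so 0.5 also counts as rounding to 0 in this sketch.
    """
    return round(cell_total) == 0


def stage1_sketch(records, cell_total_column, explanation):
    """Run the check on each record and wrap the result in the output dict."""
    try:
        output = []
        for record in records:
            record = dict(record)  # work on a copy of each row
            if total_is_zero(record[cell_total_column]):
                record[explanation] = "Total is zero"
            else:
                record[explanation] = "Total is non-zero"
            output.append(record)
        return {"success": True, "data": json.dumps(output)}
    except Exception as exc:  # report any failure through the envelope
        return {"success": False, "error": str(exc)}


print(stage1_sketch([{"cell_total": 0.2}, {"cell_total": 40}], "cell_total", "reason"))
```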
Stage 2
Name of Lambda:
stage2_method
Intro:
Checks whether the number of distinct ent refs in a cell is at least a given threshold.
Inputs:
data: input data.
disclosivity_marker: The name of the column to hold the 'disclosive' marker.
publishable_indicator: The name of the column to hold the 'publish' marker.
explanation: The name of the column to hold the reason for pass/fail.
parent_column: The name of the column holding the count of parent companies.
threshold: The threshold above which a row is not disclosive.
total_columns: The names of the columns holding the cell totals. Included so that the correct disclosure columns are used.
contributor_reference: The name of the column holding the contributor id.
Outputs:
final_output: Dict containing either:
{"success": True, "data": < stage 2 output - json >}
{"success": False, "error": < error message - string >}
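The Stage 2 comparison can be sketched in a few lines. The boundary treatment (a count equal to the threshold passes) follows the "at least" wording of the Intro and is otherwise an assumption; both function names are hypothetical, and the counting helper only illustrates what "number of different ent refs" means, since the count itself is described as arriving in parent_column.

```python
def count_distinct_references(references):
    """Count the different references contributing to one cell."""
    return len(set(references))


def meets_threshold(parent_count, threshold):
    """Return True when the count is at least the threshold."""
    return parent_count >= threshold


# Three contributions, but only two different references.
count = count_distinct_references(["ref_a", "ref_b", "ref_a"])
print(count, meets_threshold(count, 3))
```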
Stage 3
N/A
Stage 4
N/A
Stage 5
Name of Lambda:
stage5_method
Intro:
The specifics of this test are not included here, as they may be considered sensitive.
Inputs:
data: input data.
disclosivity_marker: The name of the column to hold the 'disclosive' marker.
publishable_indicator: The name of the column to hold the 'publish' marker.
explanation: The name of the column to hold the reason for pass/fail.
cell_total_column: The name of the column holding the cell total.
top1_column: The name of the column holding the largest contributor to the cell.
top2_column: The name of the column holding the second largest contributor to the cell.
total_columns: The names of the columns holding the cell totals. Included so that the correct disclosure columns are used.
contributor_reference: The name of the column holding the contributor id.
Outputs:
final_output: Dict containing either:
{"success": True, "data": < stage 5 output - json >}
{"success": False, "error": < error message - string >}
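All three implemented stages return the same success/error dict, so a caller could handle them with one helper. The sketch below assumes "data" is a JSON string (the README only labels it "json") and passes anything else through unchanged; `unwrap_final_output` and `DisclosureMethodError` are hypothetical names, not part of the module.

```python
import json


class DisclosureMethodError(Exception):
    """Hypothetical exception raised when a method reports failure."""


def unwrap_final_output(final_output):
    """Return the stage data on success, or raise with the reported error."""
    if final_output.get("success"):
        data = final_output["data"]
        # Assumes "data" is a JSON string; other types are returned as-is.
        return json.loads(data) if isinstance(data, str) else data
    raise DisclosureMethodError(final_output.get("error", "Unknown error"))


print(unwrap_final_output({"success": True, "data": '[{"cell_total": 12}]'}))
```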