README: wlf-sensitivity-analysis

Nikolai Priser 11/15/2021

Summary

Sets the foundation for sensitivity analysis; see downstream comments regarding further development opportunities. Additionally, items identified as “TODO” are pending development and are not currently available.

Key Operations:

Generate degradation samples by scenario for testing.
TODO; Create Input file(s) for client system for screening.
TODO; Parsing of client system output and adjudication of true-positive / false-positive results.
Evaluation of Scenario performance as “sufficient” or “deficient” based on overall performances and client-performance expectations.
TODO; Generation of Model Sensitivity Analysis markdown report summarizing scenario performance results and calibration opportunity results.

Overview

Project Directories

‘proj/r-scripts/’: executable scripts for the project; write transient output to ‘proj/run-files/’.
‘proj/ref-files/’: reference files that must be manually maintained unless explicitly specified.
‘proj/run-files/’: landing location for transient output during script execution.

Execution Order

All scripts are located in ‘proj/run-files/’ and are run through production-mode, unless explicitly noted. Non-production files are run only if param.demo_mode = TRUE.

All scripts are called and executed in sequence by **’proj/run-files/main.R’**. Scripts inherent global variables; see in-line comments for global variables, inherited from ’_main.R’ or predecessor scripts, for each script.

‘_main.R’ sets execution parameters, see Parameter Definitions section for additional details.

pull-parse-enrich_sdn-files.R Pulls OFAC SDN data, enriches geographic data and writes output to ‘proj/run-files/’
pull-parse-enrich_cons-files.R Pulls OFAC CONS data, enriches geographic data and writes output to ‘proj/run-files/’
degrade_basics.R
Executes degradation scenarios applicable for all types of names, as defined in [raw].[SDN_TYPE] field.
**degrade_entities.R
**Executes degradation scenarios for only “entity” (e.g. businesses, organizations) type names, as defined in [raw].[SDN_TYPE] field. See Degradation Scenarios section, below.
degrade_individuals.R
Executes degradation scenarios for only “individual” type names, as defined in [raw].[SDN_TYPE] field. See Degradation Scenarios section, below.
degrade_vessels.R
Executes degradation scenarios for only “vessel” type names, as defined in [raw].[SDN_TYPE] field. See Degradation Scenarios section, below.
degrade_aircraft.R
Executes degradation scenarios for only “aircraft” type names, as defined in [raw].[SDN_TYPE] field. See Degradation Scenarios section, below.
demo_match-names.R [demo script]
Demo script to emulate a client watchlist screening model. Loads screening algorithms from ‘r-scripts/testing_comparsion_algos.R’ - all algorithm functions should follow the following naming syntax: “comparison_algo_##” ensure that local functions are cleared following screening execution.

Compares degtes**t_namevaluesagainstwatchlistwl_dataprepd_name values. Watchlist raw [SDN_NAME] values are ignored because proper comparison requires name-format matching between the two sets of names; accounting for user-defined degradation parameters such as first-middle-last individual name formatting and dropping of middle-name details.
- Output Global Variables:
  - model_output: Dataframe of matching watchlist names; relates to [[deg]] tbl via common [deg_index] field. Many-to-one relationship between deg$deg_index and model_output matches.
(… TODO …)
Script(s) for integrating degradation samples with model output. Pending development - see code-block “Input/Output Data Integration”. Script(s) should be client system specific (e.g. Actimize, Prime, etc.).
model_sensitivity.R
Evaluates model sensitivity by evaluating scenario true-positive match performance for each scenario. Scenarios where the true-positive match rate is above the corresponding parameter threshold are classified as “sufficient”. Similarly, scenarios where the true-positive match rate is below the corresponding parameter threshold are classified as “deficient”.

Scenarios whose performance is between these thresholds are classified as “To Be Determined” and undergo further modeling. Modeling is conducted in through two-phases: (1) true-positive match boundary analysis and (2) modeling of client expectations.

This approach is under the theory that if a decision-boundary model can be developed for model’s performance for a given scenario, that decision-boundary can then be evaluated to determine whether it generally aligns with management expectations. If the decision-boundary aligns with management expectations then the status of the scenario can be updated to “sufficient”; if it does not, it can be updated to “deficient”.

Under the current methodology, distance-metrics calculated by the degDistance() helper function undergo PCA dimensional reduction to address predictor correlation issues and then are modeled using linear discriminant analysis. The use of LDA is purely judgemental and based on visualization of class-separation achieved through PC1 + PC2 plotting; see corresponding code-block titled “Visualizations of Scenario Performance”. LDA modeling under demo-mode conditions produced high AUC models with accuracy rates > 90% on test data with a cut-off threshold of 50%.

Further development is encouraged , specifically around the evaluation of various modeling methodologies (not just LDA) for each scenario to identify models which produce the best results under k-fold cross validation. The use of workflows::workflowsets() is encouraged. Since models are created at the scenario-level, optimal performance may result in instance where scenarios are associated with different model types.

Furthermore, only distance metrics are currently being used to build models. However, “degrade_*.R” scripts contain placeholders for the quantification of pre-degradation names. Once a corresponding helper function is developed and integrated into “degrade_*.R” scripts, corresponding features can and should be integrated into modeling procedures.

Once models are developed, a segmented sample of degradations are written to “path-file”. This contains pre- and post-degradation names for each sample and a “client_feedback” field. This field should be populated with a “yes” / “no” response to indicate whether a the client believes that the degraded name should generate a match against the original name under in a properly calibrate system.

Screening model results are compared to client-feedback using the McNemar paired-sample test to determine whether there is a statistical difference between screening model results and client expectations. If there is no statistical difference then scenario performance is update to “sufficient”; if there is a statistical difference, then scenario performance is updated to “deficient”.

Under demo-mode conditions, scenario-models are utilize a simulate client feedback. Models are refitted with a higher cut-off threshold (e.g. 80%, such that probability of expecting a true-positive match should be greater than >80% for a “yes” response to be generated), resulting in models with a lower AUC score to produce “deficient” results for some scenarios.

Further development opportunities includes classification probability weighted sampling to better tailor the sample to select samples near the decision boundary.
**(… TODO …)
**A script to generate a file performance markdown report that summarizes scenario performance, modeling results, etc.

Parameter Definitions

**TODO** Parameters currently reside in ‘_main.R’; once finalized, parameters should be relocated to the project’s R Environment File.

param.demo_mode
TRUE/FALSE: Run in demo-mode?
**param.refresh_watchlists
**TRUE/FALSE: Should watchlists be refreshed? FALSE-response is ignored if no watchlist data-files are found in the “ref-files/” directory. Additionally, ignored if demo-mode is being run, unless no watchlist data-files are found.
**param.apply_fml
**TRUE/FALSE: Should individual names be reformatted in the “First Middle Last” format? Assumes original names are in “LAST, First Middle” format and individual names are identified via [SDN_TYPE] field; corresponding “_main.R” reformatting logic may need to be updated if non-OFAC lists are integrated.
**param.drop_middle_names
**TRUE/FALSE: Should middle names be dropped?
**param.ignore_alt_names
**TRUE/FALSE: Should AKA entries be ignored for degradation-creation purposes? Corresponding “_main.R” logic may need to be updated if non-OFAC lists are integrated.
**param.perf_thld_high
**[0,1]: Decimal from zero (0) to one (1) representing the cut-off percentage for judging a scenario’s performance to be “sufficient”. Recommended value in the [0.85, 0.95] range.
**param.perf_thld_low
**[0,1]: Decimal from zero (0) to one (1) representing the cut-off percentage for judging a scenario’s performance to be “deficient”. Recommended value in the [0.60, 0.70] range.

Reference Files

**synonyms_business-designations.txt
**Pipe-delimited file of business designation equivalents. All entries on the same line are judged to be equivalent. New entries should added with all punctuation variants (e.g. “LLC”, “L.L.C.”) to comply with ‘degrade_entities.R’’ fixed()-matching logic requirements.
**stripping_map.R
**Returns list-var “strip_map” list with mapping between ASCII alphabetic characters and symbolic equivalents.
**country_name_fix.csv
**Sets equivalencies between OFAC list entry country names and ISO3166 country names to facilitate ‘pull-parse-enrich_sdn-files.R’ and ‘pull-parse-enrich_cons-files.R’ enrichment of geographic data using 3rd party sources.

Degradation Scenarios

All scripts write output to named-list of dataframes called “deg”, which is preallocated with zero-length in ‘_main.R’; Each scenario gets a dedicated code-block and follows the following structure:

Copy prepared watchlist data to [[dat]], apply [SDN_TYPE] filtering if necessary.
E.g. “dat = raw; dat = dat %>% filter(SDN_TYPE = ‘type’)”
Application degradation degraded logic on [prepd_name] to generate [test_name]
Comparison between [prepd_name] and [test_name] to remove any un-altered [test_name] values.
Calculation of degradation benchmark metrics (e.g. levenshtein string distance) between [prepd_name] and [test_name] values using degDistance() function; defined in ‘helper-funcs.R’, see Helper Function section for more details.
Sampling of observations based on user-defined parameters via ‘code’; trims outliers and samples based on reduced [[dat]].[dist_*] metrics. using sample_degradations_simple() function; defined in ‘helper-funcs.R’, see Helper Function section for more details.
Appending of the sampled dataframe to ‘deg’ via “deg = c(deg, "scenario-name" = list(dat))”; each list element must be named with the corresponding scenario name to comply with bind_rows() requirements in ‘_main.R’.

Basics Degradations: ‘degrade_basics.R’
Degradations applicable for all types of names, as defined in [raw].[SDN_TYPE] field.
- **Concatenations
  **Removes all white-space characters from names. (e.g. “John Howard Smith” to “JohnHowardSmith”).
- **Sticky Tokens
  **Randomly removes one or more white-spaces from names (e.g. “John Howard Smith” to “JohnHoward Smith”).
- **Symbol Stripping
  **Randomly replaces one or more characters with symbolic near equivalents, as defined by the “…path-file…” reference file (e.g. “John Howard Smith” to “John H0w@rd $mith”)
- **Long Name Truncation
  **Truncates names over a specified character length to that length; currently set to 35-characters, per FedWire field-length limit.
- **Random Word Truncation
  **Randomly truncates one or more tokens in a name to a random length (e.g. “John Howard Smith” to “Jo Howard Smit”).
- **Random Token Drop
  **Randomly removes one or more tokens in name (e.g. “John Howard Smith” to “John Howard”).
Entity Degradations: ’ degrade_entities.R’
Degradations applicable for only “entity” (e.g. businesses, organizations) type names, as defined in [raw].[SDN_TYPE] field.

Scenarios are known to produce problematic output occationally and requires further logic development; for example, of “CO” is detected in the middle of a name, it may removed by the Remove Business Designation scenario (e.g. “Coca Cola Company” to “ca Cola Company”).
- **Remove Business Designations
  **Removes business designations (e.g. “John Smith Enterprises Limited Liability Company” to “John Smith Enterprises”).
- **Business Designation Synonyms
  **Replaces business designation with synonyms, as defined in the “path-file” reference file. (e.g. “John Smith Enterprises Limited Liability Company” to “John Smith Enterprises LLC”).
- **Business Designation Replacement
  **Replaces business designation with any other except synonyms, as defined in the “path-file” reference file. (e.g. “John Smith Enterprises Limited Liability Company” to “John Smith Enterprises INC”).
Individual Degradations: file-name.R
Degradations applicable for only “individual” type names, as defined in [raw].[SDN_TYPE] field.
- Scenario Name: NONE
  **TODO** No scenarios developed.
Vessels Degradations: file-name.R
Degradations applicable for only “vessel” type names, as defined in [raw].[SDN_TYPE] field.
- Scenario Name: NONE
  **TODO** No scenarios developed.
Aircrafts Degradations: file-name.R
Degradations applicable for only “aircraft” type names, as defined in [raw].[SDN_TYPE] field.
- Scenario Name: NONE
  **TODO** No scenarios developed.

Helper Functions

Global environment functions defined in ‘script/path/’. Utilized throughout the execution pipeline.

**splitName()
**Splits a “LAST, First Middle” formatted name into corresponding components. Assumes that first token in “First Middle” is the first-name and all other tokens correspond to the middle name. This is not accurate for multi-name first-names (e.g. "LAST, First1 First2 Middle Middle), but judged to be a rare-occurrence; Logic can be refined if onomastic data and analysis is integrated.
**prettyName()
**Removes extraneous white-space and applies camel-case capitalization (e.g. “john HoWaRd SMITH” to “John Howard Smith”).
**gramify()
**Generates rolling n-grams for degradation distance calculations (e.g. “Johnathan” to {“Joh”, “ohn”, “hna”, “nat”, … } ).
**degDistance()
**Calculates degradation metrics for model performance modeling purposes.
**sample_degradations_simple()
**Used to filter and sample degradation logic output to generate an samples for screening.

nikolkj / wlf-sensitivity-analysis Goto Github PK

wlf-sensitivity-analysis's Introduction

README: wlf-sensitivity-analysis

Summary

Overview

Project Directories

Execution Order

Parameter Definitions

Reference Files

Degradation Scenarios

Helper Functions

wlf-sensitivity-analysis's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent