PREDICT LAKE

A sandbox datalake to play with in order to figure out:

  1. Elasticsearch indexing
    1. What to index
    2. How to format indices
    3. How to query indices
    4. Etc.
  2. Datalake structure, data flows, etc.

The fakelake folder contains mock data that mimics the kind of data expected to be held within the actual lake.

File structure

.
├── dev
│   ├── data_templates       # mock templates used
│   ├── es_mapping_templates # json templates for mapping of ES indices
│   ├── meta_templates       # yaml settings to index
│   │   ├── nshds
│   │   ├── x1
│   │   ├── x2
│   │   └── x3
│   └── utilities
├── elk                      # Docker-compose file to setup env.
├── es_play
├── fakelake                 # data, generated by `dev/spinnup_lake.R`
│   ├── 1_bronze             # fake raw data
│   │   ├── nshds
│   │   ├── x1
│   │   │   ├── data_1
│   │   │   └── data_2
│   │   └── x2
│   │       └── data_1
│   ├── 2_silver             # fake metadata to index
│   │   ├── nshds
│   │   ├── x1
│   │   │   ├── data_1
│   │   │   └── data_2
│   │   └── x2
│   │       └── data_1
│   └── 3_gold               # fake ES indices (.ndjson)
│       ├── nshds
│       ├── x1
│       │   ├── data_1
│       │   └── data_2
│       └── x2
│           └── data_1
└── ref                      # reference used for creating mock data

As a first attempt I've simply structured the data as:

  1. 1_bronze: raw data
  2. 2_silver: meta data (generated/curated by PREDICT stewards)
  3. 3_gold: Elasticsearch indices (e.g. metadata formatted to .ndjson)

(Re-)generating fakelake

The mock data was generated by:

  1. Running dev/spinnup_lake.R
    • some lake settings are read from dev/lake_settings.yaml, as well as from the yaml files under dev/meta_templates
    • other settings are hardcoded in dev/spinnup_lake.R and dev/utilities/utilities.R
      • initially all settings were hardcoded, and not all have been migrated out to yaml files...

The same code, with appropriate changes, may be useful for re-creating the fakelake with more useful settings.
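
A minimal sketch of what such a re-run could look like from the repository root (this assumes the yaml package is installed and that spinnup_lake.R can simply be source()'d; the actual call pattern may differ):

# Inspect the current lake settings before a re-run
# (assumes the 'yaml' package is installed)
lake_settings <- yaml::read_yaml("dev/lake_settings.yaml")
str(lake_settings, max.level = 1)

# Re-generate the fakelake
# (assumes the script can simply be source()'d from the repository root)
source("dev/spinnup_lake.R")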

Iterate > update > find useful structure of both data and indices.

Docker image with: Elasticsearch / Kibana / RStudio

Set up a working environment including Elasticsearch, Kibana and RStudio.

This can, for example, be done by:

  1. Modify elk/docker-compose-predict.yaml & run docker compose up

    • modify the volumes: value of the r service to map a local path
  2. Connect to Elasticsearch from R

    • this requires giving the default 'rstudio' user read permissions on the Elasticsearch certificate files

Changing file permissions on the Elasticsearch certificates within the RStudio image can, for example, be done from the RStudio terminal:

# Navigate to the Elasticsearch folder:
cd /usr/share/elasticsearch/
# chmod the config folder (it contains the certs):
sudo chmod -R 755 config/

Test that the connection is working from R:

# Initialize ----
x <- elastic::connect(
  host = "es01", 
  transport_schema = "https", 
  user = "elastic", 
  pwd = "qqpp11--",
  cainfo = "/usr/share/elasticsearch/config/certs/ca/ca.crt"
)
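# ping() returns basic cluster info (name, version, etc.) if the connection works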
x$ping()

Import indices to Elasticsearch

  • The suggested index files (.ndjson) are under fakelake/3_gold/... (they are created by the dev/spinnup_lake.R script)

  • They can be bulk loaded into Elasticsearch (using dynamic mappings) by running the script dev/import_indices_to_es.R (a rough sketch of the bulk load is shown after this list)

    • The script uses the Elasticsearch dynamic mapping templates listed in dev/lake_settings.yaml (the templates themselves are in dev/es_mapping_templates)
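
A minimal sketch of what the bulk load boils down to, using the elastic R package (the file path and index name below are illustrative, and this assumes the .ndjson files are already in bulk-API format; the real logic lives in dev/import_indices_to_es.R):

# Connect (same settings as in the connection test above)
conn <- elastic::connect(
  host = "es01", transport_schema = "https",
  user = "elastic", pwd = "qqpp11--",
  cainfo = "/usr/share/elasticsearch/config/certs/ca/ca.crt"
)

# Bulk load one of the .ndjson index files; the file path and index name
# below are illustrative (the real files are under fakelake/3_gold/)
elastic::docs_bulk(
  conn,
  "fakelake/3_gold/nshds/nshds_mock.ndjson",
  index = "nshds--events"
)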

Kibana dataviews

As an initial setting, I am playing with Kibana dataviews (a.k.a. "index patterns" in Kibana < 8.x), based on the index naming scheme used (see: dev/lake_settings.yaml):

  • *events
  • *--sample--events
  • *--questionnaire--events
  • datasets
  • *--analytes
  • *--vardict
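
To check which existing indices a given dataview pattern would match, they can be listed from R (a sketch, reusing the connection from above; the pattern below is one of those listed):

# List indices matching one of the naming-scheme patterns (here "*--vardict")
elastic::cat_indices(conn, index = "*--vardict", parse = TRUE)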

Play around

Pretending to query the fakelake:

  1. es_play/search_play1.Rmd
    output: es_play/search_play1.html



Investigations

Completed

1. Reuse Kibana dashboards when underlying indices are deleted and replaced


NOTE: I investigated this topic in the belief that Kibana dashboards broke when the underlying indices were deleted and reloaded. It turns out that is incorrect. Dashboards depend on Kibana dataviews (previously known as index patterns), and as long as the dataviews persist, so do the dashboards. The dataviews, in turn, work as expected as long as the indices' name patterns persist.


More on this topic: A Kibana dashboard is a separate object that has a parent-relationship with a Kibana dataview, see: [Kibana: Stack Management > Saved objects]. A dashboard object can be exported and imported as a .ndjson file. The .ndjson file may be modified outside Kibana.

Likewise, Kibana dataviews are separate objects (they may have no relationships, or a child-relationship with a given Kibana dashboard). They represent Elasticsearch indices through a pattern match on the indices' names.

The relationship (parent/child) between a Kibana dashboard object and a Kibana dataview object is defined through the dataview identifier. By default, when creating a new dataview in Kibana, it is automatically assigned a new uuid identifier. If the dataview is deleted, it is essentially impossible to re-create it with the same uuid identifier. However, when creating a dataview in the Kibana GUI, one has the option to set the dataview identifier manually (i.e. without using the default uuid). It is thus possible to set a persistent dataview identifier, allowing the dataview to be deleted and replaced (with the same identifier). The parent-relationship with dashboards is then kept intact.

Initial lookups to make dashboards persistent:

  1. Solution 1: It is possible to open each dashboard panel, select Edit lens, and from there select which dataview to use

    • Must be done manually for each panel

  2. Solution 2: It is possible to delete all documents in an index, without deleting the actual index (uuid), through "delete by query" (a sketch is given after this list)

  3. Solution 3: It is possible to export the dashboard object as .ndjson, modify the .ndjson file to apply it to new dataviews, then import it back into Kibana.
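
A sketch of solution 2 from R; this assumes the installed version of the elastic package exposes delete-by-query as docs_delete_by_query(), and the index name is illustrative:

# Delete all documents in an index without deleting the index itself
# (index name illustrative; docs_delete_by_query() availability depends
# on the installed version of the elastic package)
elastic::docs_delete_by_query(
  conn,
  index = "nshds--events",
  body  = '{"query": {"match_all": {}}}'
)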

2. Use a dynamic mapping template when creating new indices?

Is it worthwhile to set up a dynamic mapping template for each index?

  • Pros:
    • This way it is possible to map, for example, predict_id as a keyword and use it as such in queries, without specifying predict_id.keyword (which is what the default dynamic mapping would require)
  • Cons:
    • There are likely to exist variables that we have not mapped manually, nor considered when creating the dynamic template. These will be mapped dynamically by Elasticsearch's default settings. Whenever we wish to use one of these variables for "keyword queries", we must append the .keyword suffix.
      Thus, perhaps it is better to always append the .keyword suffix, even for the variables that we know beforehand we only ever want to use as keywords (e.g. predict_id)?
  • Or:
    • map every "string" as keyword (except, for example, abstract and possibly some additional fields). When it is necessary to query a keyword in a non-exact way, one can use a fuzzy query (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html):

      GET /_search
      {
        "query": {
          "fuzzy": {
            "PI.1.name.keyword": {
              "value": "Beatrice elin"
            }
          }
        }
      }
    • This is currently implemented through separate settings in dev/lake_settings.yaml, which refer to the mapping templates under dev/es_mapping_templates (a sketch of such a template is shown after this list)
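
A minimal sketch of the "map every string as keyword" idea, creating an index with such a dynamic template via the elastic R package (the index name and the abstract field are illustrative; the actual templates live in dev/es_mapping_templates):

# Create an index whose dynamic template maps all new string fields to
# keyword, while abstract remains full-text (names are illustrative)
elastic::index_create(
  conn,
  index = "x1--data_1--events",
  body  = '{
    "mappings": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {"type": "keyword"}
          }
        }
      ],
      "properties": {
        "abstract": {"type": "text"}
      }
    }
  }'
)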

Old issues

Old issues based on misconceptions of how Elasticsearch queries work.

Querying for filepaths

Using the Kibana Dev Tools console:

This works:

GET /_search
{
  "size": 0, 
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "/fakelake/1_bronze/nshds/nshds_mock.tsv"
    }
  },
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

This works:

GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

This does not work:

GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

# Further tests:
# query: "/fakelake/1_bronze/x2/data_1"      -- works
# query: "\/fakelake\/1_bronze\/x2\/data_1*" -- works
# query: "/fakelake/1_bronze/x2/data_1*"     -- works
# query: "/fakelake/1_bronze/x2/data_1/*"    -- error

# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX"          -- works
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# ==============================================================================
# More structured tests:
# query: "/fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error (original query)
# query: "fakelake/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# query: "/1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "1_bronze/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# query: "/x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "x2/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# query: "/data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok
# query: "data_1/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error

# query: "/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok

# Conclusion: 
# - works with even number of '/'
# - errors with odd number of '/'

# query: "\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- error
# query: "\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- ok  ==> need double escape of '/'

# original query with double escape:
# query: "\\/fakelake\\/1_bronze\\/x2\\/data_1\\/mock_UMEA-01-19ML+CDT_EDTA_PLASMA_02FEB1792.XLSX" -- OK!

Solution

  • Forward slashes ('/') must be double-escaped when used in a query_string query. A likely explanation: an unescaped / starts a regular expression in the query_string syntax, and JSON consumes one level of backslash escaping, hence the double escape (and the odd/even pattern observed above)
  • Also, query_string might not be the best option; it is probably better to use bool instead (see: [Filter and aggregate])

Filter and aggregate

Issue

# --------------------------
# I was hoping this query would only return the individuals included in 
# the file "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
#
# Instead all included individuals in the lake are returned... 
# How do I reformulate the query to return only individuals in file?
GET /_search
{
  "size": 0,
  "query": {
    "query_string": {
      "default_field": "dataset.filename",
      "query": "\\/fakelake\\/1_bronze\\/x1\\/data_2\\/mock_Olink_NPX_1791-02-02.csv"
    }
  },
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }
  }
}

Solution

Apparently query_string performs an analyzed text search (matching on tokens of the path) rather than an exact match on the full string.

Using a bool query with a term filter does what I want:

# --------------------------
GET /_search
{
  "query": {
    "bool": {
      "filter": {
        "term": {
          "dataset.filename.keyword": "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
        }
      }
    }
  }, 
  "aggs": {
    "individuals": {
      "terms": {"field": "predict_id.keyword", "size": 500}
    }    
  }
}

NOTE: this way it's not necessary to escape forward slash characters.
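
For reference, roughly the same filtered aggregation can be run from R via the elastic package (a sketch; as in the console example, no index is specified, so all indices are searched):

# Same bool/term filter + terms aggregation as above, run from R
res <- elastic::Search(
  conn,
  body = '{
    "size": 0,
    "query": {
      "bool": {
        "filter": {
          "term": {
            "dataset.filename.keyword": "/fakelake/1_bronze/x1/data_2/mock_Olink_NPX_1791-02-02.csv"
          }
        }
      }
    },
    "aggs": {
      "individuals": {
        "terms": {"field": "predict_id.keyword", "size": 500}
      }
    }
  }'
)
res$aggregations$individuals$buckets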
