TRAC Data & Analytics Platform

A next-generation data and analytics platform for use in highly regulated environments

FINOS - Incubating

TRAC D.A.P. brings a step change in performance, insight, flexibility and control compared to conventional analytics platforms. By redrawing the boundary between business and technology, modellers and business users are given easy access to modern, open source tools that can execute at scale, while technology integrations and operational concerns are cleanly separated and consolidated across use cases.

At the core of the platform, a flexible metadata model allows data and models to be catalogued, plugged together and shared across the business. Using the principle of immutability, TRAC allows new data structures and model pipelines to be created, updated and executed at any time without change risk to production workflows, guaranteeing total repeatability, audit and control (TRAC).

Documentation and Packages

Documentation for the TRAC platform is available on our website at tracdap.finos.org.

The following packages are available:

  • Model runtime for Python: build models and test them in a sandbox, ready to deploy to the platform
  • Web API package: build client apps in JavaScript or TypeScript using the TRAC platform APIs
  • Platform releases: packages for the platform services and a standalone sandbox are published with each release on GitHub

Development Status

The current release series (0.4.x) is intended for model development and prototyping. It provides an end-to-end workflow to build and run individual models in a local environment. It also provides the platform APIs needed to build client applications such as web tools or system integrations.

The TRAC metadata structures and API calls are mostly complete. Metadata compatibility is ensured within a release series starting from version 0.4.0 - the 0.4.x series will be compatible with 0.4.0 but changes may be introduced in 0.5.0. The metadata model will continue to stabilise before eventually being frozen for version 1.0.0, after which it may be added to but no fields will be removed or changed.

For more information see the development roadmap.

Building models

With TRAC D.A.P. you can build and run production-ready models right on your desktop! All you need is an IDE, Python and the tracdap-runtime Python package. TRAC D.A.P. requires Python 3.8 or later.

The modelling tutorial shows you how to get set up and write your first models. You can write models locally using an IDE or notebook; once a model is working, it can be loaded to the platform without modification. TRAC D.A.P. will validate the model and ensure it behaves the same on-platform as it does locally. Of course, the production platform will allow for significantly greater data volumes and compute power!
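
For orientation, here is a minimal sketch of what a model looks like, based on the tracdap-runtime API as it appears in the tutorials and in the issue snippets further down this page. Field, parameter and config file names are placeholders; the modelling tutorial remains the definitive reference.

    import typing as tp
    import tracdap.rt.api as trac


    class SimpleModel(trac.TracModel):

        def define_parameters(self) -> tp.Dict[str, trac.ModelParameter]:
            return trac.define_parameters(
                trac.P("multiplier", trac.FLOAT, label="Balance multiplier", default_value=1.0))

        def define_inputs(self) -> tp.Dict[str, trac.ModelInputSchema]:
            accounts = trac.define_input_table(
                trac.F("account_id", trac.STRING, label="Account ID", business_key=True),
                trac.F("total_balance", trac.FLOAT, label="Total drawn balance"))
            return {"accounts": accounts}

        def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:
            scaled = trac.define_output_table(
                trac.F("account_id", trac.STRING, label="Account ID", business_key=True),
                trac.F("scaled_balance", trac.FLOAT, label="Scaled balance"))
            return {"scaled_accounts": scaled}

        def run_model(self, ctx: trac.TracContext):
            multiplier = ctx.get_parameter("multiplier")
            accounts = ctx.get_pandas_table("accounts")

            accounts["scaled_balance"] = accounts["total_balance"] * multiplier

            ctx.put_pandas_table("scaled_accounts", accounts[["account_id", "scaled_balance"]])


    if __name__ == "__main__":
        import tracdap.rt.launch as launch
        # Job and system config files are placeholders - see the modelling tutorial
        launch.launch_model(SimpleModel, "config/simple_model.yaml", "config/sys_config.yaml")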

A full listing of the modelling API is available in the model API reference.

Running the platform

TRAC D.A.P. is designed for easy installation in complex and controlled enterprise environments. The tracdap-platform package is available with each release on our release page and includes a pre-built distribution of each of the platform services and supporting tools, suitable for deploying in a container or on physical or virtual infrastructure. All the packages are platform-agnostic.

A sandbox version of the platform is also available for quick setup in development, testing and demo scenarios. The tracdap-sandbox package is available with each release on our release page and instructions are available in the sandbox quick start guide in our documentation.

Development

We have used the excellent tools from JetBrains to build TRAC D.A.P. After you fork and clone the repository you can open the project in IntelliJ IDEA and use the script dev/ide/copy_settings.sh (Linux/macOS) or dev\ide\copy_settings.bat (Windows) to set up some helpful IDE config, including modules for the non-Java components, run configurations, license templates etc. If you prefer another IDE that is also fine; if you put together a similar set of config for it, we would welcome a PR.

If you need help getting set up to develop features for TRAC D.A.P., please get in touch!

Contributing

  1. Fork it (https://github.com/finos/tracdap/fork)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Read our contribution guidelines and Community Code of Conduct
  4. Commit your changes (git commit -am 'Add some fooBar')
  5. Push to the branch (git push origin feature/fooBar)
  6. Create a new Pull Request

NOTE: Commits and pull requests to FINOS repositories will only be accepted from those contributors with an active, executed Individual Contributor License Agreement (ICLA) with FINOS OR who are covered under an existing and active Corporate Contribution License Agreement (CCLA) executed with FINOS. Commits from individuals not covered under an ICLA or CCLA will be flagged and blocked by the FINOS Clabot tool (or EasyCLA). Please note that some CCLAs require individuals/employees to be explicitly named on the CCLA.

Need an ICLA? Unsure if you are covered under an existing CCLA? Email [email protected]

License

Copyright 2022 Accenture Global Solutions Limited

Distributed under the Apache License, Version 2.0.

SPDX-License-Identifier: Apache-2.0

tracdap's Issues

SECRET_KEY from environment in the start scripts

Feature Request

Description of Problem:

TRAC services / tools use the --secret-key option to pass in the master key for the secrets store. We want to set this in the environment as SECRET_KEY. It should be possible to specify it in env.sh / env.bat for local setups, or pass it in as an environment variable in scheduling tools or container jobs.

Potential Solutions:

The start script templates for both Windows and Linux/macOS need updating to respect the SECRET_KEY environment variable. Both the run and start tasks will need updating. They should check for the env var and, if it exists, add --secret-key "${SECRET_KEY}" (or "%SECRET_KEY%" on Windows) to the start command before the other application args.

We should also update env.sh and env.bat in the sample config in the dist template, to include SECRET_KEY along with the other commented out variables, to make it clear the variable is available to set.
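
The actual change is to the shell/batch start script templates; the Python sketch below only illustrates the intended behaviour, with hypothetical function and argument names.

    import os

    def build_start_command(base_cmd, app_args):
        """Prepend --secret-key from the environment, if SECRET_KEY is set."""
        secret_key = os.environ.get("SECRET_KEY")
        extra_args = ["--secret-key", secret_key] if secret_key else []
        # Secret key args go before the other application args
        return [*base_cmd, *extra_args, *app_args]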

Encoding of schema strings in metadata API

Bug Report

Steps to Reproduce:

  1. Define a CSV schema in a model repository using tracdap-runtime==0.4.9:

     field_name, field_type, label, categorical, business_key, format_code
     total_balance, FLOAT, Total drawn balance, false, false, ",|.|0|£|"

  2. Pass the CSV schema as the output schema definition of a model:

     def define_outputs(self) -> tp.Dict[str, trac.ModelOutputSchema]:
         data_schema = trac.load_schema(schemas, "data_schema.csv")

         return {"data": trac.ModelOutputSchema(data_schema)}

  3. Retrieve the tag for the dataset after the model has run using the metadata API.

The schema format code comes back with a different character encoding (the schema in the repository is stored as UTF-8).

Expected Result:

The format code is returned exactly as it appears in the CSV, i.e. "£".

Actual Result:

Schemas come back with a mis-encoded character in place of the £ sign.

Environment:

tracdap-runtime==0.4.9
@finos/tracdap-web-api: "0.4.7"
tracdap-sandbox-0.5.0-rc.1

Documentation for platform developers (python runtime)

Should include an overview of the key components of the runtime, the execution flow and design principles, in wiki format, plus at least a paragraph of inline documentation at the module and class level.

Particular focus should be given to extending the runtime with storage and model repo plugins. A dedicated wiki page should provide more "public-facing" documentation for the key interfaces and the bits of framework around these plugin mechanisms.

Config plugin for AWS

Provide an implementation of IConfigPlugin, adding support for a protocol named "s3" that loads configuration from AWS S3 buckets. The plugin should support argument-based key authentication.

Simplify metadata API

Make the metadata API a single interface (or possibly two, MetadataApi and TrustedMetadataApi). Standardise naming of request / response message types.

Documentation for model developers

This is "user facing" documentation of the model API, runtime configuration and error conditions. It should be both clear and comprehensive, and include quick-start guides and examples as well as comprehensive API/option listings.

Config: Move runtime config into protobuf

Equivalent of #77 in the Python runtime.

The top-level config objects are slightly different between the runtime and the core platform. The runtime doesn't know about platform services or the gateway. Many elements are shared, e.g. storage and repository configuration. Job config is a top level config object that is generated by the platform and consumed by the runtime.

For Python, we generate domain classes (Python dataclasses) based on the .proto files. Config should be converted into these domain classes (Protobuf DTOs generated by the regular Python Protoc plugin do not give a good developer experience when used as domain objects).

As for the core platform config, YAML, JSON and proto binary format must be supported. There is an existing module that converts YAML and JSON to the required domain objects, which should be sufficient for those formats. To handle the proto binary protocol, a module is needed that maps in both directions between the Protobuf DTOs and the domain objects. The reverse mapping is needed when the runtime reports job results back to the platform.
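
As a simplified sketch of that two-way mapping, using a hypothetical StorageSettings message and a matching domain dataclass (the real tracdap code generation and config mapping are more involved):

    import dataclasses
    from google.protobuf import json_format

    @dataclasses.dataclass
    class StorageSettings:            # stand-in for a generated domain class
        protocol: str = ""
        properties: dict = dataclasses.field(default_factory=dict)

    def proto_to_domain(msg) -> StorageSettings:
        # Protobuf DTO -> plain dict -> domain dataclass
        raw = json_format.MessageToDict(msg, preserving_proto_field_name=True)
        return StorageSettings(**raw)

    def domain_to_proto(obj: StorageSettings, msg_cls):
        # Domain dataclass -> dict -> Protobuf DTO (the reverse mapping,
        # needed when the runtime reports job results back to the platform)
        return json_format.ParseDict(dataclasses.asdict(obj), msg_cls())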

searchQuery flow property

Add into the flow definition a property to allow users to define the search query to retrieve the list of schemas/models/data objects that are eligible.

Description of Problem:

By design TRAC is agnostic on how the objects required to execute a flow as a job are found by the user. There needs to be a way in which a user can specify how a list of the candidate objects can be found via the search API.

Without this in place a user interface will typically only have a single method for finding items (normally searching the meta-database for the key of the object in the flow). This approach is inflexible and restrictive; for example, it makes it difficult to link processes together where keys have not been aligned (I want to use the dataset from process X or process Y as candidate inputs, but I can't because they are output with different keys).

Potential Solutions:

Add a searchQuery parameter to each defined model, data and schema item needed to run a flow. This can replicate the current searchExpression or logical search definition.

Vulnerability and license scanning - JavaScript

These scans will relate to the web API package:

  • Automated vulnerability scanning
  • License scanning, to check for license conflicts in dependencies

Both scans should run in CI and publish their results as build artifacts. It is probably sufficient to run them on PR, merge and tag events.

Determining the repository path from a model repository

Currently it is possible to look at the Python code repository file tree to determine the location of the root __init__.py file and use this to set the repository path and entry point of a model.

However, according to this post, it is no longer needed in Python 3.3+. This means that there will be no method to extract the repository path and entry point of a model.

The issue is to ensure that the repository path and entry point will work both for Python 3.3+ and earlier versions. For example, if an installation is using Python 3.3+ and no __init__.py file exists to denote the root Python package directory, can the repository path be set to null and the entry point set to the full path of the .py file (with the class)?

out_attrs property in flow definition

Feature Request

Description of Problem:

It is not possible to add key attributes or titles (or any attribute) to the outputs of a job. This makes understanding and presenting the outputs of a calculation difficult.

Potential Solutions:

  • An out_attrs property added to the flow definition, allowing attributes to be added to the data outputs from a job.
  • This is in addition to the job attributes already in the API that add these attributes to all outputs from a job.

Extend title definitions beyond parameters in a flow definition

Description of Problem:

Currently in a flow definition the title property of a parameter can be set. Users may want to add bespoke titles in a flow (overriding any native title attribute of the object) for all objects in the flow.

Potential Solutions:

  • Add the ability to set titles to all object types used as inputs in a flow by extending the current definitions.
  • Alternatively this could be restricted to just data objects which is where I see the greatest need.

Metadata version flag for forward compatibility

The current metadata schema should be assigned a version number, such as v1 (or v2 to deal with v1 metadata from early prototypes). This version should be recorded against every object definition in the metadata store and included in all public APIs. Additionally, it may be helpful to move the metadata classes under a versioned namespace (e.g. trac.metadata.v1) to avoid conflicts in model and application code in the event of a major version update to the metadata schema.

Major updates to the metadata schema should be very rare. The most likely cause of a major schema update is streaming support, which would probably be a major version update for the whole platform. In this example, old models would still be supported using the v1 metadata and APIs for batch processing, while new models could use the v2 metadata and APIs for either batch or streaming workflows.

Vulnerability and license scanning - Java

These scans will cover the core platform services:

  • Automated vulnerability scanning
  • License scanning, to check for license conflicts in dependencies

Both scans should run in CI and publish their results as build artifacts. It is probably sufficient to run them on PR, merge and tag events.

(The same scanning should cover the Java/Scala model runtime in future).

TRAC for healthcare

I'd like to collaborate and see if it's possible to apply TRAC for healthcare financial transactions. Happy to discuss!

It is not possible to search without a search expression

Bug Report

Steps to Reproduce:

Write a search request without a search expression:

const searchRequest = trac.api.MetadataSearchRequest.create({
        tenant: tenant,
        searchParams: {
            searchAsOf: searchAsOf,
            objectType: trac.ObjectType.DATA
        }
    })

or

const searchRequest = trac.api.MetadataSearchRequest.create({
        tenant: tenant,
        searchParams: {
            searchAsOf: searchAsOf,
            objectType: trac.ObjectType.DATA,
            search: undefined
        }
    })

Expected Result:

Without a search term the request should bring back a full list of objects without filtering applied.

Actual Result:

The request errors.

Environment:

"trac-web-api": "file:../../../../dev/trac/trac-web-api-0.3.1.tgz"

Search API can not handle no search criteria

When submitting a search API call with

{
    "objectType": "FLOW",

    "search": {}
}

A 500 error is returned. The expected behaviour is that this (or some variant of this) would provide results for all flow objects in TRAC. Without this there is no way to discover what is in TRAC without first knowing the attributes and the values to search for.

Abstract key store mechanism in config manager

Provide an IKeyStore interface in the config manager framework for accessing credentials, certificates and other pieces of sensitive configuration. Create an implementation in the file-based config plugin that wraps Java key stores. Provide a mechanism for referencing keys in an IKeyStore from the main config file (perhaps using URL schemes).

Config plugins for cloud platforms will wrap cloud-native services for key management.

Platform <-> runtime communication

Ability for the runtime to pick up and convert the base proto metadata generated by the platform (domain metadata has slightly compacted structure). Generation of metadata outputs needed to pass job results back to the platform.

API call to get list of tenants user is authorised to access

An API endpoint is required that returns an array of the tenant names that a user has authorization to access. These could be split by read and read/write access but I don't believe that this split is needed.

The benefit of this is twofold:

  1. It allows the user to see their profile information, including their tenants, in the user interface.
  2. The list of tenant options to let a user pick from can be limited to only those that they are authorized to access rather than presenting them with the full list and having to return API error messages should they select the wrong option.

Categorical attributes and automated indexing

Feature Request

Description of Problem:

Once information is added to objects in TRAC, there is no way to retrieve a comprehensive list of the values stored for a given attribute within a tenant.

This places restrictions on user interfaces. For example, if you want to show a list of users that have run jobs, so that one user can be picked and all of their jobs listed, this cannot be done: it is not possible to get a list of all users from the TRAC API.

Potential Solutions:

  • Extend the attribute API to enable attributes to be classed as categorical.
  • Auto index these attribute values.
  • Expose an API endpoint to enable retrieval of the full list of attribute values.
  • Since attributes could be quite large and complex, it would seem sensible to only enable this for BasicType attributes (string, integer etc.) and only for strings below a length limit.

Uniform definition of time

Add a timestamp to every object version and tag version, so object and tag versions can be selected using "as-of" times as well as explicit version numbers.

In tag selectors, allow use of explicit version numbers, as-of times or "latest" to select both object and tag versions, in any combination (e.g. object as-of a certain time with the latest tag). With this facility in tag selectors, reduce the read API to a single call which accepts a tag selector.
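
As an indicative example, the combination described above might be expressed like this in the JSON shape of the API (field names here are illustrative rather than definitive):

    selector = {
        "objectType": "DATA",
        "objectId": "0d1e8c2a-0000-0000-0000-000000000000",   # hypothetical ID
        "objectAsOf": "2022-06-30T17:00:00Z",  # object version as-of this time
        "latestTag": True,                     # combined with the latest tag
    }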

Once the APIs are simplified the whole metadata API can be a single interface (or possibly two, one trusted and one un-trusted).

Support alternate character encodings for text data formats

Feature Request

Description of Problem:

  1. When storing/retrieving data via the platform API, support alternate encodings for text formats. Currently only UTF-8 / UTF-16 are supported for storing, and only UTF-8 for retrieving.

  2. Also in the runtime, it should be possible to specify encoding for input/output datasets. This is relevant when running in dev-mode, i.e. input/output datasets are being accessed directly by developers rather than passed back into the platform.

  3. Nice-to-have - setting encoding on datasets stored in the TRAC platform. These are not normally visible to users, as format translation happens when data is presented through the platform or model APIs, but it could be useful for integration, e.g. if direct read access is granted to reporting systems. Since 1 & 2 provide all the required encoding translations, it is really a choice whether to enable configurable encodings in the storage layer or not.

Potential Solutions:

On the platform side, encoding should be available as a format option in data read/write/query requests and passed into the data codecs.

On the runtime, for dev mode, encoding should be passed in as a config option. This could be part of the storage config, or a separate config item under dev mode settings.

To set encoding in internal storage, the encoding would need to be set as part of the storage config, which gets passed to data codecs in both the platform and runtime.

Simplify object IDs

Use string for object ID instead of a structured type, for application developers this will be easier to work with. We can validate UUID format as part of our normal input validation.

Use tag headers/selectors everywhere to refer to objects. These are the only metadata types that should hold an object ID directly, so we avoid having string ID fields sprinkled through the metadata model.

Data partitioning

Metadata already understands partitions; however, only a single root partition can be used.

  • TRAC model API for partitions
  • PartKey utilities (for generating and comparing keys, e.g. for overlapping ranges)
  • Lazy-load partitioned data sets
  • Map data views for partitioned data sets (inbound) and extract items (outbound)

Flow link nomenclature is confusing

Feature Request

Description of Problem:

The current link naming convention in a flow is confusing. The use of head and tail, although technically correct, is misleading, as links are from the tail and to the head.

Potential Solutions:

Rename 'tail' to 'start' or 'from'
Rename 'head' to 'end' or 'to'

Validation: Data conformity

Validate data against schema:

  • Enforce schema on load
  • Trim columns for model inputs (do not supply undeclared columns)
  • Strict conformity for model outputs
  • Restrict columns on save to match job definition

Tag update API

Express tag updates in calls to the metadata API as a list of operations or "deltas" to be applied to a tag. This replaces the put-style API for tags as resources in version 0.1 and is much more natural for expressing tagging updates, e.g. "add this classification" or "mark this item as reviewed" which modify individual attributes.

Tag operations are:

  • create attribute
  • replace attribute
  • append attribute (attributes are multi-valued)
  • delete attribute
  • clear all attributes
  • create or replace
  • create or append

The "clear all" operation can be used to replicate the old behavior, by sending clear all followed by a new set of attributes. "Create or replace" can be used to guarantee setting an attribute to a particular value. "Create or append" can be used to add a classification when other classifications might already be in force.
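
An illustrative delta-style update, expressed as plain dicts in the JSON shape of the API; operation and field names are indicative, not definitive:

    tag_updates = [
        {"operation": "CREATE_OR_REPLACE_ATTR",
         "attrName": "review_status", "value": {"stringValue": "reviewed"}},
        {"operation": "CREATE_OR_APPEND_ATTR",
         "attrName": "classification", "value": {"stringValue": "confidential"}},
        {"operation": "DELETE_ATTR", "attrName": "draft_note"},
    ]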

Integer values not valid as default parameter values

Bug Report

Steps to Reproduce:

Define a python model parameter as:

        return trac.define_parameters(
            trac.P("test_parameter", trac.FLOAT, label="Test parameter", default_value=100)
        )

Expected Result:

When loading the model via the orch service this model should load without error.

Actual Result:

2022-12-08 00:21:22.459 INFO  [orch-svc-4-22   ] o.f.t.c.v.Validator - VALIDATION START: [ObjectDefinition STATIC]
2022-12-08 00:21:22.468 ERROR [orch-svc-4-22   ] o.f.t.c.v.Validator - model.parameters[test_parameter].defaultValue.type: Value does not match the expected type
2022-12-08 00:21:22.468 ERROR [orch-svc-4-22   ] o.f.t.c.v.Validator - model.parameters[test_parameter].defaultValue: Value does not match the expected type
2022-12-08 00:21:22.468 ERROR [orch-svc-4-22   ] o.f.t.c.v.Validator - VALIDATION FAILED: [ObjectDefinition STATIC]
2022-12-08 00:21:22.471 ERROR [orch-svc-4-22   ] o.f.t.s.o.s.JobManagementService - Job [JOB-be996614-27df-47db-a026-9a58b32724f6-v1] succeeded but the response could not be processed
org.finos.tracdap.common.exception.EInputValidation: There were multiple validation errors
model.parameters[test_parameter].defaultValue.type: Value does not match the expected type
model.parameters[test_parameter].defaultValue: Value does not match the expected type
	at org.finos.tracdap.common.validation.Validator.doValidation(Validator.java:78) ~[tracdap-lib-validation-0.5.0-rc.4.jar:?]
	at org.finos.tracdap.common.validation.Validator.validateFixedObject(Validator.java:50) ~[tracdap-lib-validation-0.5.0-rc.4.jar:?]
	at org.finos.tracdap.svc.orch.service.JobManagementService.recordJobResult(JobManagementService.java:222) ~[tracdap-svc-orch-0.5.0-rc.4.jar:0.5.0-rc.4]
	at org.finos.tracdap.svc.orch.service.JobManagementService.jobOperation(JobManagementService.java:283) ~[tracdap-svc-orch-0.5.0-rc.4.jar:0.5.0-rc.4]
	at org.finos.tracdap.svc.orch.service.JobManagementService.lambda$pollJobCache$2(JobManagementService.java:113) ~[tracdap-svc-orch-0.5.0-rc.4.jar:0.5.0-rc.4]
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569) ~[netty-transport-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.80.Final.jar:4.1.80.Final]
	at java.lang.Thread.run(Thread.java:831) ~[?:?]

Environment:

tracdap-runtime==0.5.0
"@finos/tracdap-web-api": "^0.5.0"
tracdap-sandbox-0.5.0-rc.4

Additional Context:

Changing the default value to 100.0 fixes this issue, but I think 100 should be a valid default value for a FLOAT parameter.

Validation: Static metadata validation

Validate each item of metadata independently using a recursive framework.

E.g. field definitions can be independently validated. A table definition must contain only valid field definitions and must then pass any validation at the table level. A dataset definition must have a valid schema definition, which for tabular datasets must be a valid table definition etc.

Static validation can apply to the core metadata model and to API request / response messages. It does not include any validation of versioning (e.g. dataset version n+1 must be compatible with version n) or consistency (e.g. a calculation job must reference data/models that are compatible with the calculation flow).
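
As a rough illustration of the recursive pattern (this is a toy sketch, not the tracdap validation framework; type names are indicative only):

    def validate_field(field: dict) -> list:
        errors = []
        if not field.get("fieldName"):
            errors.append("field: fieldName is required")
        if field.get("fieldType") not in {"BOOLEAN", "INTEGER", "FLOAT", "DECIMAL", "STRING", "DATE", "DATETIME"}:
            errors.append(f"field [{field.get('fieldName')}]: invalid fieldType")
        return errors

    def validate_table(table: dict) -> list:
        # A table must contain only valid fields, then pass table-level checks
        errors = [e for f in table.get("fields", []) for e in validate_field(f)]
        names = [f.get("fieldName", "").lower() for f in table.get("fields", [])]
        if len(names) != len(set(names)):
            errors.append("table: duplicate field names")
        return errors

    def validate_dataset(dataset: dict) -> list:
        # A dataset must have a valid schema, which for tabular data is a table
        schema = dataset.get("schema", {})
        if "table" not in schema:
            return ["dataset: missing table schema"]
        return validate_table(schema["table"])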

Review metadata layout for data/storage and file objects

Core structures are already in place in 0.2 beta 1 (part/snap/delta for data, incarnation/copy for storage).

  • Review structures and naming.
  • Confirm physical storage definition also works for FILE objects
  • Review Python domain code gen to make sure nothing else is needed for these structures

Enabling optional inputs and outputs

At the moment, declaring x inputs or outputs in the config means all x need to be returned in the define functions, and all outputs are expected after the model run.
Enabling optional outputs will help with cases where the outputs (with different schemas) depend on model parameters and only one (or a subset) needs to be returned (e.g. scenario runs). Currently this has to be handled inside the model code, with conditional outputting based on parameters and empty dataframes returned for the outputs not produced by that run (see the sketch below).
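
A sketch of that current workaround, inside a model's run_model method, assuming pandas is imported as pd and using the standard TracContext methods; parameter, input and output names are hypothetical:

    def run_model(self, ctx: trac.TracContext):

        run_scenarios = ctx.get_parameter("run_scenarios")   # hypothetical parameter
        base_data = ctx.get_pandas_table("base_data")        # hypothetical input

        ctx.put_pandas_table("base_output", base_data)

        if run_scenarios:
            scenario_output = base_data.assign(scenario="stress_1")  # placeholder calculation
        else:
            # The declared output must still be supplied, so an empty frame
            # with the expected columns is returned instead
            scenario_output = pd.DataFrame(columns=[*base_data.columns, "scenario"])

        ctx.put_pandas_table("scenario_output", scenario_output)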

Value types in metadata

  • Mapping between metadata value objects and native Python types
  • Special handling for Value objects in code gen for domain objects
  • Special handling for Value types in config parsing and validation

Vulnerability and license scanning - Python

These scans will relate to the Python model runtime:

  • Automated vulnerability scanning
  • License scanning, to check for license conflicts in dependencies

Both scans should run in CI and publish their results as build artifacts. It is probably sufficient to run them on PR, merge and tag events.

Type handling for primary data

  • Map all TRAC primitive types for Python, Pandas and PySpark
  • Conversion functions
  • Null handling
  • Enforce typing on data load and during data conformity

Editable field order in a table schema

Currently, schema objects and data objects that are updated with a new version have limitations on what can change in the schema. One limitation is that the field order cannot be changed. It is not certain that this restriction can be lifted, but can the feasibility be assessed and a change made if it is possible?

Desired outcome - field order is mutable between different schema and data versions.

Search API response when no matches found

The search API currently returns an empty object if there is no match found. I propose that this is modified to be

{searchResult: []}

which will mean that downstream code will not have to check for the property.

Enable searching on the header properties

Feature Request

Description of Problem:

Header properties such as objectVersion, and particularly objectTimeStamp, are not searchable via the API. This means finding items by when they were created is not possible.

Potential Solutions:

  • Make header properties searchable via the attribute search API.

Initial PySpark support

  • Implement read/write storage functions to handle both directory and single file formats
  • Selection logic for deciding when to read inputs as Pandas vs PySpark
  • Hooks for repartitioning / flattening between storage and presenting data to models
  • Implement main context methods for get/put PySpark
  • Implicit conversion - data items loaded as Pandas are automatically converted if requested as PySpark; the reverse is available with a row limit (see the sketch after this list)
  • Run example PySpark model from doc folder as end-to-end validation
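
A minimal sketch of the implicit conversion bullet above, assuming a local SparkSession; this is illustrative only, not the runtime implementation, and the row limit is a made-up default:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("conversion-sketch").getOrCreate()

    def pandas_to_spark(df: pd.DataFrame):
        # Forward conversion: Pandas -> PySpark, no size restriction
        return spark.createDataFrame(df)

    def spark_to_pandas(sdf, row_limit: int = 100_000) -> pd.DataFrame:
        # Reverse conversion is only offered up to a row limit
        return sdf.limit(row_limit).toPandas()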

Validation: Static model validation

  • Validation of model metadata at the point the model definition is generated
  • Validate again when a model is loaded, that the loaded model matches the definition

Config: Move platform config into Protobuf

Config for the gateway and data service is currently represented using statically defined classes. These static classes need to be replaced with Protobuf message definitions in the trac.config namespace. Config structures for the data service and gateway are close enough that merging should not be a problem.

Metadata service currently uses properties, so it needs to be brought into the config model. Database props can be encoded as a map at the appropriate point in the config tree.

The public interface of the config parser should accept YAML or JSON config, as well as binary proto files. The latter is needed because config is passed between services in the same way as metadata (e.g. individual job configs for execution). Protobuf will supply YAML and JSON parsing for free.

One approach that may work for YAML config is to convert it into JSON (i.e. using generic objects, not touching the config structure) and then feed the JSON into Protobuf. Errors could still be reported using their object location (e.g. "Missing required config value [trac.services.meta.port]"). The alternative approach is using reflection and proto descriptors.
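
The platform config parser itself is Java; the Python sketch below only illustrates the YAML -> generic objects -> Protobuf approach described above, using PyYAML and a hypothetical PlatformConfig message class:

    import yaml
    from google.protobuf import json_format

    def parse_platform_config(yaml_text: str, msg_cls):
        # Parse YAML into generic dicts/lists, without touching the config structure
        generic = yaml.safe_load(yaml_text)
        # Feed the generic objects into Protobuf; errors are reported against
        # their object location in the parsed structure
        return json_format.ParseDict(generic, msg_cls())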
