
historian's Issues

Handle "late records" case

Some timestamped records might reach the Data Historian with high latency (a lag of several minutes, days, or years). These records should integrate seamlessly into the indexing workflow: they should eventually be added to the relevant chunk.

For a given timestamp range, should we consider only one chunk, or is it better to handle multiple chunks?

The server REST API should allow the creation of data points

It should be possible to add points through the REST API, with something like:
https://:XXX/historian-server/v1/points/create
{
  "TagName": "openSpaceSensors.Temperature",
  "points": [
    {
      "TimeStamp": "2020-02-17T20:03:10.000Z",
      "Value": "25",
      "Quality": 3
    }
  ]
}
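
A minimal client-side sketch of what calling such an endpoint could look like, assuming the JSON schema above. The endpoint does not exist yet, so the host, port and path here are placeholders, not the actual API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreatePointsExample {
    public static void main(String[] args) throws Exception {
        // Request body following the proposed (not yet implemented) schema.
        String body = "{"
                + "\"TagName\": \"openSpaceSensors.Temperature\","
                + "\"points\": [{"
                + "  \"TimeStamp\": \"2020-02-17T20:03:10.000Z\","
                + "  \"Value\": \"25\","
                + "  \"Quality\": 3"
                + "}]}";

        // "localhost:8080" is a placeholder: the real host and port are not defined yet.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/historian-server/v1/points/create"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}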

Atomic Compaction possibly via newly created index

Re-compacting existing data (possibly including "late records") must have no impact on queries. That is to say, a new index (or part of one) could be created, as is done in open analytics, and once the compaction is done and the data is clean, the newly created index replaces the online one.
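
One possible way to get that switch atomically with Solr, sketched below, is to build the compacted data in a fresh collection and then repoint a collection alias once the new data is verified; readers keep querying the alias the whole time. The collection and alias names are made up for the example, and the issue does not prescribe this exact mechanism:

import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class AliasSwapSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical names: queries always go through the "historian" alias.
        String alias = "historian";
        String newCollection = "historian_compacted_2020_02";

        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:9983"), Optional.empty()).build()) {
            // ... the compaction job writes chunks into newCollection and verifies them ...

            // Atomically repoint the alias to the freshly compacted collection:
            // queries hit the old collection until the swap, then only the new one.
            CollectionAdminRequest.createAlias(alias, newCollection).process(solr);
            // The old collection can be dropped once nothing references it anymore.
        }
    }
}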

Provide tar.gz version of major components

To ease simple deployments we need a simple tar.gz we can just unpack. Docker is fine, but not all customers want Docker images, for security reasons; moreover, Docker requires some knowledge. A simple pre-configured tar.gz with simple start/stop scripts for the services is enough.

Test and document the use of Kura to inject points into the historian

Kura offers most of the connectors we need to integrate the factory with our data historian. We can document the use of Kura and, if needed, extend it with some components we already have. This issue depends on the issue for adding data points through the historian server REST API.

[Gateway] search api could be improved

Current behaviour

The search endpoint currently does a "contains" query on the request input.

Expected behaviour

The search endpoint should be more sophisticated, so that what it returns depends on the input in a more flexible way. For example, for the request:

{ target: 'upper_50' }

the answer should return only metrics that look like upper_50:

["upper_25","upper_50","upper_75","upper_90","upper_95"]

and not only names containing the input:

["upper_501","aupper_50","upper_50"]

Re-engineering of the engine to get a light engine

A re-engineering of the engine should allow it to rely on a direct connection to the historian server to push time series, and allow LogIsland real-time processors to be plugged into the historian server directly. This means changing the class-loading mechanism in LogIsland. As a matter of fact, this issue has to be pushed onto LogIsland, but it is kept here for tracking purposes.

[Gateway] Grafana search api output should depend on input

Current behaviour

The search endpoint response currently does not depend on user input; it returns all existing metrics.

Expected behaviour

The search endpoint response should depend on the input. For example, for the request:
{ target: 'upper_50' }
the answer should return only metrics that look like upper_50:
["upper_25","upper_50","upper_75","upper_90","upper_95"]
and not:
["hello","temp","pression"]

You can look at the documentation of the Simple JSON plugin if you want more examples:
https://grafana.com/grafana/plugins/grafana-simple-json-datasource

The maximum number of targets returned should be configurable in the configuration file of the HTTP verticle.

Which algorithm should be used to filter results?

  • Only names starting with the request?
  • Only names containing the request?
  • Note that the name field on which we run the request is of type string and contains a single word.
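
As a rough illustration of the first two options, here is a gateway-side sketch of prefix and contains filtering over the known metric names, capped by the configurable maximum mentioned above. The class and method names are made up, and a real implementation would more likely push the filtering down to Solr:

import java.util.List;
import java.util.stream.Collectors;

public class SearchFilterSketch {

    /** Option 1: keep only names starting with the requested target. */
    static List<String> startingWith(List<String> names, String target, int maxResults) {
        return names.stream()
                .filter(n -> n.startsWith(target))
                .distinct()
                .sorted()
                .limit(maxResults)
                .collect(Collectors.toList());
    }

    /** Option 2: keep only names containing the requested target. */
    static List<String> containing(List<String> names, String target, int maxResults) {
        return names.stream()
                .filter(n -> n.contains(target))
                .distinct()
                .sorted()
                .limit(maxResults)
                .collect(Collectors.toList());
    }
}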

Conditions to close this issue

  • implement a solution
  • add an integration test

Alerts should be stored in the backend

As of today, alerts are created via Grafana and associated with graphs. We need to store alerts in the back end of the data historian, especially since alerts are also created by the real-time LogIsland part.

We should be able to specify more info than just the metric in graph panels

For example, the user should be able to choose, for each metric:

  • The sampling algorithm to use
  • The bucket size to use (but should we allow this parameter if it conflicts with the graph's maxDataPoints?)

I think that if the user specifies a bucket size that is too small, the historian should ignore it, because a bucket size that is not adapted to the number of points in the specified time range could freeze the historian.

Be able to visualize some tag on graph (annotations)

Annotation configuration for the plugin should be similar to the Grafana built-in plugin, i.e. being able to filter on tags.

On the gateway side, it should support a POST /annotations request with a body like:

{
  "range": {
    "from": "2016-04-15T13:44:39.070Z",
    "to": "2016-04-15T14:44:39.070Z"
  },
  "rangeRaw": {
    "from": "now-1h",
    "to": "now"
  },
  "limit": 100,
  "tags": ["tag1", "tag2"],
  "matchAny": true,
  "type": "tags"
}

And it should respond with something like:

[
  {
    "time": 1581075145188,
    "timeEnd": 1581075145188,
    "text": "bbbb",
    "tags": [
      "tag1"
    ]
  }
]

Here are the specifications:

  • "time" is required
  • "timeEnd" is optional (it is used only for events in a range)
  • "text" is required; it describes the event
  • "tags" can be empty, but all tags should be returned if there are any

So those four fields should be saved in Solr documents, for now in a separate collection "annotations".
The historian would return them in the expected response format when it receives the corresponding request.

Here is a description of the fields of the request:

  • "range" and "rangeRaw" describe the time range in which we want to find annotations. Please use the same method to extract the time as is done for other endpoints like /query, for consistency of the code. So we should only return annotations whose "time" is in the requested range.
  • "limit" limits the maximum number of annotations to return.
  • "tags": if the request "type" is "tags", this is used to filter annotations by tags; otherwise it is not used.
  • "matchAny": if true, we should return any annotation containing at least one of the tags. If false, we should return only annotations containing all the tags.
  • "type": it is either "tags" or "all". The "tags" type means we want to filter by tags; the "all" type means we will return all annotations.

The Solr schema should be:

  • time: long
  • time_end: long
  • tags: string, multivalued
  • description: text
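
A sketch of how the /annotations handler could translate the request fields above into a Solr query over the "annotations" collection, using the proposed schema. The class name, method signature and wiring are illustrative only:

import java.util.List;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrQuery;

public class AnnotationsQuerySketch {

    /** Build a Solr query for the "annotations" collection from a parsed /annotations request. */
    static SolrQuery buildQuery(long from, long to, List<String> tags,
                                boolean matchAny, String type, int limit) {
        SolrQuery query = new SolrQuery("*:*");
        // Only return annotations whose "time" falls inside the requested range.
        query.addFilterQuery("time:[" + from + " TO " + to + "]");

        // Tag filtering only applies when the request type is "tags".
        if ("tags".equals(type) && tags != null && !tags.isEmpty()) {
            String operator = matchAny ? " OR " : " AND ";   // matchAny: at least one tag vs all tags
            query.addFilterQuery(tags.stream()
                    .map(t -> "tags:\"" + t + "\"")
                    .collect(Collectors.joining(operator)));
        }
        query.setRows(limit);   // "limit" caps the number of annotations returned
        return query;
    }
}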

Conditions for closing this issue:

  • Implement the /annotations endpoint according to the specifications
  • Add integration tests
  • Test the solution with Grafana

Minimal installation of Historian should be possible

We should be able to make the data historian available on a single node with the historian server, the Solr back end, and Grafana for visualisation. All other components, like Kafka, Spark, and LogIsland, should only be installed for very large volumes, real time, or very advanced analytics.

Setup grafana dev environments

Follow this tutorial: https://medium.com/@ivanahuckova/how-to-contribute-to-grafana-as-junior-dev-c01fe3064502
I recommend using Visual Studio Code and installing the Go plugin.

Once the setup is done, we will be able to make our own datasource plugin based on the Simple JSON datasource plugin.

For this we forked the Grafana repo (under the Hurence user). So if you have already done the tutorial with the Grafana repo, you can follow this tutorial to change the remote URL to Hurence's one: https://help.github.com/en/github/using-git/changing-a-remotes-url

Importing data using simple CSV or Excel import should be possible

Currently the only way to inject data into the historian is either through big batch Spark processes or through real-time injection via Kafka/LogIsland. A simple mechanism to import CSV or Excel files should be available, and maybe also a REST API to interact with (the current gateway is only for consuming data).
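
A minimal sketch of what a CSV import could look like once a point-creation endpoint exists (see the REST API issue above). The column layout, file name and helper are all assumptions:

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class CsvImportSketch {
    public static void main(String[] args) throws Exception {
        // Assumed layout: tag_name,timestamp,value  (header on the first line).
        List<String> lines = Files.readAllLines(Path.of("points.csv"));
        for (String line : lines.subList(1, lines.size())) {
            String[] cols = line.split(",");
            String json = "{\"TagName\": \"" + cols[0] + "\","
                    + "\"points\": [{\"TimeStamp\": \"" + cols[1] + "\", \"Value\": \"" + cols[2] + "\"}]}";
            // postToHistorian is hypothetical: it would POST the payload to
            // /historian-server/v1/points/create as in the earlier example.
            postToHistorian(json);
        }
    }

    static void postToHistorian(String json) { /* HTTP POST, omitted */ }
}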

Compaction job to handle unitary records in Solr

Real-time indexation of timestamped records stores unitary records in Solr (one value per record). This is useful to give access to these records in real time.
But there is an underlying need to compact these records into chunks on a regular basis for performance and storage concerns.

The compaction job steps:

  • read all the relevant unitary data from Solr
  • process this data into chunks
  • delete the unitary data and inject the corresponding compacted data. This operation must be atomic to prevent any discrepancy in the indexed data

This job should eventually be compatible with other time series backends (other than Solr).
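
A very rough SolrJ sketch of those three steps, assuming hypothetical collection names and a chunkOf() helper that turns unitary documents into one chunk document. Note that plain add-then-delete is not atomic by itself, which is exactly what the "Atomic Compaction" issue above is about:

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class CompactionSketch {

    // Hypothetical helper: groups the unitary records of one metric into a chunk document.
    static SolrInputDocument chunkOf(SolrDocumentList unitaryDocs) { /* ... */ return new SolrInputDocument(); }

    static void compact(SolrClient solr, String metricName) throws Exception {
        // 1. Read all the relevant unitary data from Solr.
        SolrQuery query = new SolrQuery("name:\"" + metricName + "\"").setRows(Integer.MAX_VALUE);
        SolrDocumentList unitary = solr.query("historian-unitary", query).getResults();

        // 2. Process this data into chunks.
        SolrInputDocument chunk = chunkOf(unitary);

        // 3. Inject the compacted data, then delete the unitary data.
        //    (Not atomic as written: a real job would need an alias swap or a similar scheme.)
        solr.add("historian-chunks", chunk);
        solr.deleteByQuery("historian-unitary", "name:\"" + metricName + "\"");
        solr.commit("historian-chunks");
        solr.commit("historian-unitary");
    }
}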

The trigger to run the compaction job should be configurable, based on:

  • time (every X minutes)
  • number of records (every X records)
  • number of records per partition (every X records per partition)
  • a mixture of previous criteria

Search API should never return duplicate names in the response

Current behaviour

The metric names returned contain duplicates.

Expected behaviour

When I query /search, the response should not contain duplicate names.

How to reproduce

Insert several chunks with the same name, then query the search endpoint: the response will contain duplicates.

Conditions

Fix this and add a test.
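
One way to avoid the duplicates, sketched here, is to let Solr return each distinct name once by faceting on the name field instead of collecting names from individual chunk documents. The field and collection names are assumptions:

import java.util.List;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;

public class DistinctNamesSketch {

    /** Return each metric name at most once, using a facet on the "name" field. */
    static List<String> distinctNames(SolrClient solr, String collection, int limit) throws Exception {
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);                 // we only care about the facet values, not the documents
        query.setFacet(true);
        query.addFacetField("name");      // assumed field holding the metric name
        query.setFacetLimit(limit);

        return solr.query(collection, query).getFacetField("name").getValues().stream()
                .map(count -> count.getName())
                .collect(Collectors.toList());
    }
}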
