ggalmazor / lt_downsampling_java8

Largest Triangle Three Buckets downsampling algorithm implementation for Java8

License: Other

Java 100.00%

lt_downsampling_java8's Introduction

Largest-Triangle downsampling algorithm implementations for Java8

These implementations are based on the paper "Downsampling Time Series for Visual Representation" by Sveinn Steinarsson, from the Faculty of Industrial Engineering, Mechanical Engineering and Computer Science at the University of Iceland (2013). You can read the paper here.

The goal of Largest-Triangle downsampling algorithms for data visualization is to reduce the number of points in a number series without losing important visual features of the resulting graph. Be aware that these algorithms are not numerically correct: they select points for visual fidelity, so the output should not be treated as a statistically faithful summary of the data.

Download

Latest version: 0.1.0

You can add this library into your Maven/Gradle/SBT/Leiningen project thanks to JitPack.io. Follow the instructions here.

Example Gradle instructions

Add this into your build.gradle file:

allprojects {
  repositories {
    maven { url 'https://jitpack.io' }
  }
}

dependencies {
  implementation 'com.github.ggalmazor:lt_downsampling_java8:0.1.0'
}

Largest-Triangle Three-Buckets

This version of the algorithm groups points into equally sized buckets and then selects from each bucket the point that forms the triangle with the largest area together with points in the neighbouring buckets.

You can produce a downsampled version of an input series with:

List<Point> input = Arrays.asList(...);
int numberOfBuckets = 200;

List<Point> output = LTThreeBuckets.ofSorted(input, numberOfBuckets);

The first and last points of the original series are always included in the output. The remaining points are grouped into the requested number of buckets and the algorithm chooses the best point from each bucket, so the example above produces a list of 202 elements (200 bucket points plus the two endpoints).
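To make the selection rule concrete, here is a minimal, self-contained sketch of the Largest-Triangle Three-Buckets idea on plain double arrays. It is not this library's implementation: the class LttbSketch and its methods are hypothetical names, it works on doubles rather than Point instances, and it spreads points across buckets as evenly as possible, which may differ from how the library sizes its buckets.

```java
import java.util.ArrayList;
import java.util.List;

final class LttbSketch {

    // Twice the area of triangle (a, b, c); the constant factor does not change which point wins.
    static double doubledArea(double ax, double ay, double bx, double by, double cx, double cy) {
        return Math.abs((ax - cx) * (by - ay) - (ax - bx) * (cy - ay));
    }

    // Returns the indices of the selected points: first point, one per bucket, last point.
    static List<Integer> select(double[] xs, double[] ys, int buckets) {
        int n = xs.length;
        List<Integer> picked = new ArrayList<>();
        if (buckets < 1 || n <= buckets + 2) {
            for (int i = 0; i < n; i++) picked.add(i); // nothing to downsample
            return picked;
        }
        int inner = n - 2; // points between the two fixed endpoints
        picked.add(0);
        int prev = 0; // index of the point chosen in the previous bucket
        for (int b = 0; b < buckets; b++) {
            int start = 1 + (int) ((long) b * inner / buckets);
            int end = 1 + (int) ((long) (b + 1) * inner / buckets); // exclusive
            // Third vertex: average of the next bucket, or the last point after the final bucket.
            int nextEnd = (b == buckets - 1) ? n : 1 + (int) ((long) (b + 2) * inner / buckets);
            double cx = 0, cy = 0;
            for (int i = end; i < nextEnd; i++) {
                cx += xs[i];
                cy += ys[i];
            }
            cx /= (nextEnd - end);
            cy /= (nextEnd - end);
            // Keep the candidate that forms the largest triangle with its two neighbours.
            int best = start;
            double bestArea = -1;
            for (int i = start; i < end; i++) {
                double area = doubledArea(xs[prev], ys[prev], xs[i], ys[i], cx, cy);
                if (area > bestArea) {
                    bestArea = area;
                    best = i;
                }
            }
            picked.add(best);
            prev = best;
        }
        picked.add(n - 1);
        return picked;
    }

    public static void main(String[] args) {
        double[] xs = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        double[] ys = {0, 0, 0, 10, 0, 0, 0, 0, -10, 0};
        System.out.println(select(xs, ys, 2)); // prints [0, 3, 8, 9]
    }
}
```

With two buckets the sketch keeps both endpoints plus the spike at index 3 and the dip at index 8, which are the visually important features of this small series.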

Notes on Point types

This library requires you to provide lists of instances of the Point supertype.
It also provides, and uses internally, the DoublePoint subtype, which can also be used to feed data to the library.
However, users are free to create implementations of Point that best fit their domain.

Largest-Triangle Dynamic

Not yet implemented

Example

This is what a raw timeseries with ~5000 data points and its downsampled versions (2000, 500, and 250 buckets) look like (graphed by AirTable):

These are closeups for 250, 500, 1000, and 2000 buckets with the raw data in the background:

Other java implementations you might want to check


lt_downsampling_java8's Issues

Memory consumption issue

Hi @ggalmazor!
I submitted a PR/opened an issue with a bug in your library about a year ago. Thanks again for your library; it's been incredibly useful in my project. Now, I'm reaching out for a feature request I'd like to implement in your library 😊

Context

In my project, we use the down-sampling library to visualize large datasets (approximately 1,000,000 points) on a web interface that can handle only a few thousand points. Each point in these timeseries has two attributes: timestamp (Date) and measure (long). To integrate with your library, I need to make this class extend com.ggalmazor.ltdownsampling.Point, which holds two BigDecimal fields. These two additions cost at least 32b x 2 per point.
About a month ago, we encountered a memory issue when multiple requests were made simultaneously. As a quick fix, we increased the application's memory allocation. However, we're now seeking a long-term solution that would be more memory-efficient.

Proposed Implementation

I'm considering creating a new class, lighter than Point, without BigDecimal attributes x and y. Instead, it would have only getters and setters performing on-the-fly conversion between the attributes and the BigDecimal value used by the algorithm. This implementation would introduce a new class without altering the core of the library. The Point class would still be available for backward compatibility.
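A minimal sketch of what such a lighter class could look like, under stated assumptions: BigDecimalPoint is a hypothetical stand-in interface defined only for this example (it is not the library's Point type, whose exact shape is not shown here), and TimedMeasure is an illustrative name. Only two primitive fields are stored; the BigDecimal values are created on demand.

```java
import java.math.BigDecimal;

// Hypothetical stand-in for the contract the algorithm needs; NOT the library's Point type.
interface BigDecimalPoint {
    BigDecimal getX();

    BigDecimal getY();
}

// Stores two primitives and converts to BigDecimal only when asked.
final class TimedMeasure implements BigDecimalPoint {
    private final long epochMillis; // timestamp, e.g. taken from Date.getTime()
    private final long measure;

    TimedMeasure(long epochMillis, long measure) {
        this.epochMillis = epochMillis;
        this.measure = measure;
    }

    @Override
    public BigDecimal getX() {
        return BigDecimal.valueOf(epochMillis); // converted on every call, not retained
    }

    @Override
    public BigDecimal getY() {
        return BigDecimal.valueOf(measure); // converted on every call, not retained
    }

    public static void main(String[] args) {
        TimedMeasure p = new TimedMeasure(1_700_000_000_000L, 42L);
        System.out.println(p.getX() + " -> " + p.getY());
    }
}
```

The trade-off is the one raised under the potential blockers: each getter call creates a short-lived BigDecimal, so memory retained per point goes down while conversion work goes up if the algorithm reads the same coordinates many times.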

Potential Blockers

Are there any potential blockers for developing this implementation? I'm thinking about:

Utilizing specific BigDecimal methods
Intensively using BigDecimal values, which might negatively impact algorithm performance when converting values on-the-fly multiple times
Any other considerations?

I'd be thrilled to update the library in this direction and open a PR for these changes. Do you agree with updating the library in this direction?

P.S.: Is there a chance of duplicated arrays during down-sampling (resulting in extra memory usage)?

Last bucket is too wide

Hello @ggalmazor!

Thanks for your implementation of the LTTB algorithm; I am currently using it on a project with large timeseries!
In a recent development, I had to downsample a new type of data, and some of these series have a size of the same order of magnitude as the number of buckets: ~5500 pts for 1000 buckets, for example.
Because the last bucket takes all the remaining points, it is too wide:

int regular_bucket_size = data.size() / numberOfBucket; // 5
int last_bucket_size = data.size() / numberOfBucket + data.size() % numberOfBucket; // 505 -> ~10% of data

This gives results like this one.

In this Python implementation, bucket sizes are split more evenly:
https://git.sr.ht/~javiljoen/lttb-numpy/tree/master/item/src/lttb/lttb.py#L81 (cf: https://numpy.org/doc/stable/reference/generated/numpy.array_split.html)
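The difference between the two strategies can be sketched in a few lines of Java. This is an illustration only, not the library's code: BucketSizes and its methods are hypothetical names. evenSplit follows the rule documented for numpy.array_split, where the first (points % buckets) buckets get one extra element.

```java
final class BucketSizes {

    // Strategy described in this issue: every bucket gets points / buckets,
    // and the last bucket also absorbs the whole remainder.
    static int[] remainderInLast(int points, int buckets) {
        int[] sizes = new int[buckets];
        java.util.Arrays.fill(sizes, points / buckets);
        sizes[buckets - 1] += points % buckets;
        return sizes;
    }

    // Same rule as numpy.array_split: spread the remainder one point at a time
    // over the first (points % buckets) buckets.
    static int[] evenSplit(int points, int buckets) {
        int base = points / buckets;
        int extra = points % buckets;
        int[] sizes = new int[buckets];
        for (int i = 0; i < buckets; i++) {
            sizes[i] = base + (i < extra ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        int[] skewed = remainderInLast(5500, 1000);
        int[] even = evenSplit(5500, 1000);
        System.out.println(skewed[0] + " ... " + skewed[999]); // prints 5 ... 505
        System.out.println(even[0] + " ... " + even[999]);     // prints 6 ... 5
    }
}
```

For 5500 points and 1000 buckets, the even split yields 500 buckets of 6 points followed by 500 buckets of 5, so no bucket is more than one point wider than any other.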

I would be happy to contribute with a fix soon.

Best regards,
Guillaume Vagner