st_dbscan's People

Contributors

eren-ck

st_dbscan's Issues

Provide time series before the spatial attributes

First, thanks for this implementation of the clustering method.
I am trying to use it with spatial attributes (x, y) where each location has a full time series rather than a single value.
I understood from the paper that this was possible.

But from the demo and the comment here

X : 2D numpy array with
.
I understand that only one value can be provided as the temporal feature (and not a complete time series). Am I wrong?
Thanks again
Ronan

Wrong result when using "st_dbscan.fit_frame_split"

Hello, eren-ck!
When I use "st_dbscan.fit_frame_split", I call
st_dbscan.fit_frame_split(data, frame_size = 500)
Sometimes it works, and sometimes it goes wrong:
the length of the labels does not equal the length of the data.
How should I solve this? Thank you!

Usage of squareform

Hello, thanks for a nice and straightforward implementation of the ST-DBSCAN algorithm!

I ended up looking at your source and tried to understand it myself. What you essentially do is build a distance matrix in which every pair that does not meet the temporal eps gets a distance of twice the spatial eps, and then call sklearn's DBSCAN on that matrix.

For this block of code in st_dbscan.py:

time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=self.metric))
euc_dist = squareform(pdist(X[:, 1:], metric=self.metric))

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(dist)

You call squareform twice to form two square matrices, one for the time distances and one for the spatial distances. dist is then a third square matrix with the same dimensions as time_dist and euc_dist. This means three large matrices are held in memory at once; how large depends on the data size. My datasets all have more than 50000 rows, which gives roughly 50000 by 50000 matrices. My data is float64, so each matrix is over 16GB, and the algorithm breaks unless the data is processed in chunks.

What I would do is this:

time_dist = pdist(X[:, 0].reshape(n, 1), metric=self.metric)
euc_dist = pdist(X[:, 1:], metric=self.metric)

# filter the euc_dist matrix using the time_dist
dist = np.where(time_dist <= self.eps2, euc_dist, 2 * self.eps1)

db = DBSCAN(eps=self.eps1, min_samples=self.min_samples, metric='precomputed')
db.fit(squareform(dist))

In this case, only one square matrix is needed.
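
The claim that both variants produce the same matrix can be checked on small random data. This is a minimal sketch using only numpy and scipy; the function names, the eps values and the array size are made up for illustration, and the DBSCAN call is left out because both variants would hand it the same input.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def filtered_dist_square(X, eps1, eps2, metric="euclidean"):
    """Variant quoted from st_dbscan.py: filter on two square matrices."""
    n = X.shape[0]
    time_dist = squareform(pdist(X[:, 0].reshape(n, 1), metric=metric))
    euc_dist = squareform(pdist(X[:, 1:], metric=metric))
    return np.where(time_dist <= eps2, euc_dist, 2 * eps1)


def filtered_dist_condensed(X, eps1, eps2, metric="euclidean"):
    """Proposed variant: filter on condensed vectors, expand once at the end."""
    n = X.shape[0]
    time_dist = pdist(X[:, 0].reshape(n, 1), metric=metric)
    euc_dist = pdist(X[:, 1:], metric=metric)
    dist = np.where(time_dist <= eps2, euc_dist, 2 * eps1)
    return squareform(dist)


# Made-up data: column 0 is time, columns 1 and 2 are x and y.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
X[:, 0] *= 100.0

square = filtered_dist_square(X, eps1=0.1, eps2=10.0)
condensed = filtered_dist_condensed(X, eps1=0.1, eps2=10.0)
same = bool(np.allclose(square, condensed))

# Entries held in the intermediate vectors versus one square matrix.
n = X.shape[0]
condensed_entries = n * (n - 1) // 2
square_entries = n * n
```

Each condensed vector holds n(n-1)/2 entries instead of n squared, so the two intermediates take a bit under half the memory of a square matrix each.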

No issue... Just some questions.

Does it work on higher-dimensional data such as 3D (x, y, z)? I tested with 3D data and everything seemed to be fine. I would just like you to confirm.
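
The block quoted in the "Usage of squareform" issue passes X[:, 1:] to pdist, so every column after the first goes into a single distance computation. A minimal sketch of what that means for 3D input, assuming that layout and using made-up points:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical rows in the layout [time, x, y, z].
X = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 3.0, 4.0, 0.0],
    [2.0, 3.0, 4.0, 12.0],
])

# All columns after the time column are measured together,
# so z contributes to the Euclidean distance like x and y do.
spatial = squareform(pdist(X[:, 1:], metric="euclidean"))

d01 = spatial[0, 1]  # sqrt(3^2 + 4^2)
d02 = spatial[0, 2]  # sqrt(3^2 + 4^2 + 12^2)
d12 = spatial[1, 2]  # only z differs
```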

By the way, I've been looking for a C++ implementation of ST-DBSCAN as well, but with no luck. Do you know of one?

Another distance metric

Hi! Thanks so much for this implementation.
I would like some guidance on using a distance metric other than the default Euclidean.
I have data with multiple features and want to use another metric, such as Mahalanobis.
Would the usage be as follows?
st_dbscan = ST_DBSCAN(eps1 = 0.4, eps2 = 5, min_samples = 5, metric = 'mahalanobis')

I tried the above but got a "Singular matrix" error. However, when I checked the correlation, it seemed to be OK.

Also, if I wanted to weight each feature differently when calculating the distance, how should I go about it?
I would be grateful for any help.

Thanks
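
On the weighting question, one option that needs no change to the library is to rescale the feature columns before clustering: Euclidean distance on columns multiplied by the square root of each weight equals the weighted Euclidean distance. The sketch below is not part of st_dbscan; the helper name, features and weights are hypothetical, and it assumes the default Euclidean metric.

```python
import numpy as np
from scipy.spatial.distance import pdist


def weighted_euclidean_pdist(features, weights):
    """Pairwise sqrt(sum_i w_i * (a_i - b_i)^2), computed by column scaling."""
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scaled = features * np.sqrt(weights)  # broadcast one factor per column
    return pdist(scaled, metric="euclidean")


# Made-up feature columns (the time column would be left out and unscaled).
features = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
])
weights = [1.0, 4.0]  # second feature counts four times as much

d = weighted_euclidean_pdist(features, weights)

# Direct computation for the first pair, for comparison.
diff = features[0] - features[1]
direct = np.sqrt(np.sum(np.asarray(weights) * diff ** 2))
```

The same scaled columns could then be placed after the time column of the input array, with eps1 chosen on the rescaled scale.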

Using the model for multiple features

Hi there,

First, congrats on the great implementation of ST-DBSCAN. I'm evaluating it for research on spatiotemporal clustering of meteorological data (with multiple features).
However, I have a question on your implementation:

  • How should I organize the inputs to the model for multiple features for each data point? As far as I understood from the documentation, it should be:
    [[time_step1, data_point_ID, x, y, feature 1, feature 2, feature n],[time_step2, data_point_ID, x, y, feature 1, feature 2, feature n]]

An example data point: at the first time step (0), 300 is the ID of a specific city, x is latitude, y is longitude, feature 1 is precipitation, feature 2 is temperature, and feature n is solar radiation:
[0, 300, -23.5505, -46.6333, 1271.001, 28.763, 17.971]

Thanks a lot!

Roberto
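
Going by the block quoted in the "Usage of squareform" issue, the first column is treated as time and all remaining columns go into one distance computation, so an ID column left in place would be measured as if it were a coordinate. A minimal sketch under that assumption; the second record and the ID_COLUMN name are made up for illustration.

```python
import numpy as np

# Layout from the question:
# [time_step, data_point_ID, lat, lon, precipitation, temperature, solar_radiation]
records = np.array([
    [0, 300, -23.5505, -46.6333, 1271.001, 28.763, 17.971],
    [1, 300, -23.5505, -46.6333, 1180.250, 27.910, 18.402],  # made-up values
])

ID_COLUMN = 1  # hypothetical position of the identifier column

# Drop the identifier so that only time and measurable columns remain.
X = np.delete(records, ID_COLUMN, axis=1)

time_column = X[:, 0]     # compared against eps2 in the quoted code
other_columns = X[:, 1:]  # all measured together against eps1
```

Because the remaining columns share one distance, features on very different scales (precipitation versus latitude here) would dominate it unless they are rescaled first.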

ValueError: frame_size, frame_overlap not correctly configured

I tested 10000 records of random data with frame_size = 50

    138         if not frame_size > 0.0 or not frame_overlap > 0.0 or frame_size < frame_overlap:
    139             raise ValueError(
--> 140                 'frame_size, frame_overlap not correctly configured.')
    141 
    142         # unique time points

ValueError: frame_size, frame_overlap not correctly configured.
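
The condition in the traceback can be tried in isolation. The sketch below copies only the check shown on line 138 into a standalone function; the function names are hypothetical, and the report does not say which frame_overlap value was in effect, so the values below are examples only.

```python
def check_frame_config(frame_size, frame_overlap):
    """Raise ValueError under the same condition as the traceback above."""
    if not frame_size > 0.0 or not frame_overlap > 0.0 or frame_size < frame_overlap:
        raise ValueError('frame_size, frame_overlap not correctly configured.')


def is_valid(frame_size, frame_overlap):
    """Return True when the check passes, False when it raises."""
    try:
        check_frame_config(frame_size, frame_overlap)
    except ValueError:
        return False
    return True


results = {
    "size 50, overlap 10": is_valid(50, 10),    # passes: 0 < overlap <= size
    "size 50, overlap 0": is_valid(50, 0),      # fails: overlap must be > 0
    "size 50, overlap 100": is_valid(50, 100),  # fails: overlap exceeds size
}
```

With frame_size = 50, the check only passes when the overlap in effect is greater than 0 and no larger than 50.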

density factor implementation

Hi there,
I was reading the original paper and your implementation. Correct me if I am wrong, but there seems to be no provision for the density factor, so there will be issues identifying adjacent clusters.
Unless this is implemented in the standard DBSCAN algorithm in scikit-learn, but I can't find any information on that either.

fit_frame_split - ValueError: Length of values does not match length of index

Hey! As mentioned in #7, there seem to be edge cases where the labels computed by fit_frame_split() don't match the row count of X fed to it. Not quite sure what's causing it at first glance!

The error in question:

ValueError: Length of values (17465) does not match length of index (17612)

The use in question (it is sorted by timestamp ascending before it goes in):

clustering = ST_DBSCAN(eps1=0.25, eps2=250, min_samples=10).fit_frame_split(sub_df.loc[:, ["timestamp","x","y"]].values, 2000)
sub_df["cluster"] = clustering.labels

Attached is the subset CSV of ordered timestamp/x/y data that yielded this for me. Timestamp is unix_millis, x/y are in an arbitrary space for particle data for a side project.

I'm currently looking into a temporary rewrite of it because of the memory constraints I'm fighting with. I turned to fit_frame_split because, with fit(), some very large (>100k rows) position datasets that are only a couple of hundred MB in Pandas turned out, according to memory_profile, to cause up to a 6.8 GB increment in memory use. That eats the heap and crashes smaller workers on my compute cluster, probably because the matrices end up not so sparse.

ST_DBSCAN_2024_03_14.csv

Units or metrics

What are the units of the values in eps1 = 0.05, eps2 = 10? Are they m/km or s/min?
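
In the block quoted in the "Usage of squareform" issue, eps2 is compared directly with distances between time values and eps1 with distances between the remaining columns, which suggests the thresholds carry whatever units the input columns carry. A minimal sketch of that pairwise test, assuming Euclidean distance; the function name and the points are made up.

```python
import numpy as np


def are_neighbours(a, b, eps1, eps2):
    """Rows are [time, x, y]; neighbours need time gap <= eps2 and spatial gap <= eps1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    time_gap = abs(a[0] - b[0])
    spatial_gap = np.linalg.norm(a[1:] - b[1:])
    return bool(time_gap <= eps2 and spatial_gap <= eps1)


p = [0.0, 0.00, 0.0]
q = [5.0, 0.03, 0.0]   # close in both time and space
r = [20.0, 0.03, 0.0]  # close in space, too far apart in time
s = [5.0, 0.30, 0.0]   # close in time, too far apart in space

pq = are_neighbours(p, q, eps1=0.05, eps2=10)
pr = are_neighbours(p, r, eps1=0.05, eps2=10)
ps = are_neighbours(p, s, eps1=0.05, eps2=10)
```

Under this reading, if x and y are stored in kilometres and time in seconds, eps1 = 0.05 means 0.05 km and eps2 = 10 means 10 s.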
