broadbandclustering's Introduction

Broadband data clustering and bootstrapping

In this project, broadband acoustic data is clustered with K-Means (KM), Agglomerative Clustering (AC) and other algorithms. The aim is to find a suitable number of clusters, that is, of candidate species in the scene.

Specifically, this code obtains the reflectances of possible species (unsupervised clusters) in the broadband acoustic dataset.

Note that both the test data and the larger survey data will be used, to see which one gives results that are more reasonable to biologists.

For details about the project and tasks, please contact: Mette DalGaard Agersted ([email protected]) & Yi Liu ([email protected])

1. The experimental settings & input:

  • Experimental data
    1. The test dataset (believed to be of better quality; to be confirmed)
    2. The large survey data
  • Data cleaning
    1. Drop irrelevant items (time/depth, Alo, Ath, PingNo, Range, etc.)
    2. Choose frequencies from 54 to 78 kHz (excluding the lowest frequencies, where there is something 'strange' with the data)
    3. Discard samples with missing values (NaN, "not a number")
    4. Discard samples whose reflectance is either too weak (noise) or too strong (the seabed, for example, which is irrelevant)
  • Preprocessing - Dimension reduction
    1. PCA - for visualization purposes
    2. Original - for clustering (see earlier work that shows KM and AC are capable of handling this high dimensionality)
  • Candidate algorithms (* marks the two selected, KM & AC):
    1. *K-Means
    2. AP (Affinity Propagation) (not scalable)
    3. AAP (Adjusted Affinity Propagation) (not scalable)
    4. Spectral Clustering (not scalable)
    5. Mean-Shift (not scalable)
    6. *Agglomerative Clustering
    7. DBSCAN
    8. HDBSCAN
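
As a minimal sketch of the setup above, the following clusters in the original feature space with the two starred algorithms and uses PCA only to get 2-D coordinates for plotting. It assumes scikit-learn; the function name `cluster_and_project`, the synthetic matrix `X_demo` and the choice of 4 clusters are illustrative assumptions, not the project's actual code or data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA


def cluster_and_project(X, n_clusters=4, seed=0):
    """Cluster in the original space; use PCA only for visualization."""
    km_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    # AgglomerativeClustering uses Ward linkage by default.
    ac_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)
    # 2-D coordinates for scatter plots; not used for the clustering itself.
    coords = PCA(n_components=2).fit_transform(X)
    return km_labels, ac_labels, coords


# Synthetic stand-in for the cleaned target-strength matrix (rows = targets).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 25)) for c in (-70, -60, -50, -40)])
km_labels, ac_labels, coords = cluster_and_project(X_demo, n_clusters=4)
print(coords.shape, len(set(km_labels)), len(set(ac_labels)))
```

Colouring a scatter plot of `coords` by `km_labels` or `ac_labels` gives the kind of component-space view used to compare the methods.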

2. Progress

2.1 Selected the KM and AC algorithms from the candidate clustering methods and proceeded with experiments on

  • Test data - KM
  • Test data - AC
  • Survey data - KM
  • Survey data - AC

2.2 PCA is not strictly necessary for better clustering

  • PCA helps improve the clustering results, though not significantly
  • PCA components help a lot in visualizing the distinguishing capability of a clustering method

2.3 AC outperforms KM

  • Idea - Clustering with the original data & visualizing in component space (transformed from original space)
  • AC significantly outperforms KM in
    1. more reasonable clusters when scattered in PCA space
    2. more reasonable boundaries among clusters
    3. more robust results [always the same clustering results across random sampling repetitions]
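
One way to put a number on the robustness point above is to re-cluster random subsamples and compare each result with the full-data labelling using the adjusted Rand index (ARI), which ignores how cluster ids are named. This is only a sketch of such a check, not necessarily the procedure used in the project; `subsample_stability`, the 80% subsample fraction and the synthetic `X_demo` are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score


def subsample_stability(make_model, X, n_repeats=5, frac=0.8, seed=0):
    """Mean ARI between full-data labels and labels from random subsamples."""
    rng = np.random.default_rng(seed)
    reference = make_model().fit_predict(X)
    scores = []
    for _ in range(n_repeats):
        # Draw a random subsample without replacement and re-cluster it.
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        labels = make_model().fit_predict(X[idx])
        scores.append(adjusted_rand_score(reference[idx], labels))
    return float(np.mean(scores))


# Synthetic, well-separated stand-in data (not the survey data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(60, 10)) for c in (-70, -55, -40)])
ac_score = subsample_stability(lambda: AgglomerativeClustering(n_clusters=3), X_demo)
km_score = subsample_stability(lambda: KMeans(n_clusters=3, n_init=10, random_state=0), X_demo)
print(f"AC stability: {ac_score:.3f}, KM stability: {km_score:.3f}")
```

A score close to 1 means the subsample clusterings agree with the full-data clustering; on real data the two methods would be expected to differ more than on this toy example.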

2.4 The elbow method suggests an optimal cluster number in the range [4, 8]

More specifically,

  • with the cluster number ranging from 4 to 8, similar results are obtained, with more detailed targets spotted
  • one anomalous target/cluster was spotted, with most of its samples located at the same point; it is ruled out by discarding the highest and lowest portions of the data values
  • the elbow method only suggests the number of clusters statistically, so more than 8 clusters, such as 9-12, can also be considered
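
For reference, a common form of the elbow method plots the K-Means inertia (within-cluster sum of squares) against the number of clusters and looks for the point where the curve flattens. The sketch below assumes that variant; the source does not state which criterion was used, and `elbow_curve` and the synthetic `X_demo` are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_curve(X, k_values, seed=0):
    """Return the K-Means inertia (within-cluster sum of squares) for each k."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


# Synthetic stand-in with four well-separated groups (not the survey data).
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 25)) for c in (-70, -60, -50, -40)])

k_values = list(range(2, 13))  # covers both the 4-8 and the 9-12 ranges
inertias = elbow_curve(X_demo, k_values)
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

Because inertia keeps decreasing as k grows, the elbow is a judgment call, which is consistent with also considering cluster numbers beyond 8.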

3. Experiment output

Among the clustering algorithms considered, we observed that the AC algorithm performs best on both the test and the site datasets, and its results are consistent between the two. Based on these results, we obtain the reflectance of each cluster as a function of frequency.
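
A per-cluster frequency response can be sketched as a group-wise summary of the frequency columns. The column names `F_54` and `F_55` and the tiny table below are hypothetical; whether to average target strength in the logarithmic (dB) domain, as done here for simplicity, or in the linear domain is left open.

```python
import numpy as np
import pandas as pd


def cluster_spectra(features, labels):
    """Mean and standard deviation per frequency column, for each cluster."""
    grouped = features.groupby(np.asarray(labels))
    return grouped.mean(), grouped.std()


# Hypothetical frequency columns and cluster labels, for illustration only.
features = pd.DataFrame({
    "F_54": [-60.0, -62.0, -40.0, -42.0],
    "F_55": [-61.0, -63.0, -41.0, -43.0],
})
labels = [0, 0, 1, 1]
mean_spec, std_spec = cluster_spectra(features, labels)
print(mean_spec)
```

Each row of `mean_spec` is one cluster and each column one frequency, so plotting the rows gives one curve per cluster.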

4. To do

Cluster the whole dataset and assign a label to each sample. The labels should be consistent between multiple runs of clustering. In addition, the reflectances of the clusters will be analyzed biologically, and their distribution and characteristics along depth will be inspected and discussed.
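
Cluster ids are arbitrary, so two runs can find the same groups under different numbers. One possible way to make labels consistent, sketched here under the assumption that both runs label the same samples, is to match the ids of a new run to a reference run with the Hungarian algorithm on their overlap matrix. `align_labels` is a hypothetical helper, not part of the project code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def align_labels(reference, labels):
    """Relabel `labels` so that they agree with `reference` as much as possible."""
    reference = np.asarray(reference)
    labels = np.asarray(labels)
    classes = np.unique(np.concatenate([reference, labels]))
    # overlap[i, j] = number of samples with reference id i and new id j.
    overlap = confusion_matrix(reference, labels, labels=classes)
    # Negate the matrix so that the assignment maximises the total overlap.
    ref_idx, new_idx = linear_sum_assignment(-overlap)
    mapping = {classes[j]: classes[i] for i, j in zip(ref_idx, new_idx)}
    return np.array([mapping[label] for label in labels])


run_a = [0, 0, 1, 1, 2, 2]
run_b = [2, 2, 0, 0, 1, 1]  # same grouping, different ids
aligned = align_labels(run_a, run_b)
print(aligned.tolist())
```

When the runs use different samples, an alternative is to match clusters by the distance between their centroids instead of by sample overlap.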

Appendix. Dataset Info.

There are two different data files containing information on target strength at different frequencies for single targets.

  • Each row is one target. The columns with names like "F_.." are the different frequencies, and each value is the target strength of that target at that frequency.

  • There are also columns including date, pingNo, Range (distance of target from the transducer), position in the beam (Alo and Ath) and depth. Depth is the last column after all the frequencies are listed.

  • When running the PCA (or AP) just use data from 54 to 78 kHz (excluding the lowest frequencies as there is something strange with the data here).

  • Note that the first column just contains running numbers and does not have a header.
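
Given this layout, the cleaning steps from section 1 could be sketched with pandas as below. Several details are assumptions: frequency columns are taken to be named `F_<kHz>` (for example `F_54`), "too weak" and "too strong" are approximated by quantile cut-offs on each target's mean target strength, and the cut-off values and the helper name `clean_targets` are placeholders.

```python
import numpy as np
import pandas as pd


def clean_targets(df, f_min=54.0, f_max=78.0, lower_q=0.05, upper_q=0.95):
    """Keep the 54-78 kHz columns and drop NaN, very weak and very strong targets."""
    # Assumption: frequency columns are named "F_<kHz>", e.g. "F_54" or "F_54.5".
    freq_cols = [
        c for c in df.columns
        if str(c).startswith("F_") and f_min <= float(str(c)[2:]) <= f_max
    ]
    # Selecting only frequency columns also drops PingNo, Range, Alo, Ath, depth.
    spectra = df[freq_cols].dropna()
    # Assumption: weak/strong outliers are defined by quantiles of the mean value.
    mean_ts = spectra.mean(axis=1)
    low = mean_ts.quantile(lower_q)
    high = mean_ts.quantile(upper_q)
    return spectra[(mean_ts >= low) & (mean_ts <= high)]


# Tiny made-up table with the column kinds described above.
df = pd.DataFrame({
    "PingNo": [1, 2, 3, 4, 5],
    "Range": [10.0, 11.0, 12.0, 13.0, 14.0],
    "F_50": [-59.0, -60.0, -61.0, -19.0, -99.0],
    "F_54": [-60.0, -61.0, np.nan, -20.0, -100.0],
    "F_78": [-62.0, -63.0, -64.0, -22.0, -102.0],
    "Depth": [100.0, 110.0, 120.0, 130.0, 140.0],
})
cleaned = clean_targets(df, lower_q=0.25, upper_q=0.75)
print(cleaned)
```

Since the first column of the files holds running numbers without a header, it would typically be read as the index rather than as a feature.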
