umap-js's Introduction

UMAP-JS

This is a JavaScript reimplementation of UMAP, ported from the Python implementation at https://github.com/lmcinnes/umap.

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

There are a few important differences between the Python implementation and the JS port.

  • The optimization step is seeded with a random embedding rather than a spectral embedding. This gives comparable results for smaller datasets. The spectral embedding computation relies on efficient eigenvalue / eigenvector computations that are not easily done in JS.
  • There is no specialized functionality for angular distances or sparse data representations.

Usage

Installation

yarn add umap-js

Synchronous fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const embedding = umap.fit(data);

Asynchronous fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const embedding = await umap.fitAsync(data, epochNumber => {
  // check progress and give user feedback, or return `false` to stop
});

Step-by-step fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const nEpochs = umap.initializeFit(data);
for (let i = 0; i < nEpochs; i++) {
  umap.step();
}
const embedding = umap.getEmbedding();

Supervised projection using labels

import { UMAP } from 'umap-js';

const umap = new UMAP();
umap.setSupervisedProjection(labels);
const embedding = umap.fit(data);

Transforming additional points after fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
umap.fit(data);
const transformed = umap.transform(additionalData);

Parameters

The UMAP constructor can accept a number of hyperparameters via a UMAPParameters object, with the most common described below. See umap.ts for more details.

Parameter | Description | Default
nComponents | The number of components (dimensions) to project the data to | 2
nEpochs | The number of epochs to optimize embeddings via SGD | (computed automatically)
nNeighbors | The number of nearest neighbors to construct the fuzzy manifold | 15
minDist | The effective minimum distance between embedded points, used with spread to control the clumped/dispersed nature of the embedding | 0.1
spread | The effective scale of embedded points, used with minDist to control the clumped/dispersed nature of the embedding | 1.0
random | A pseudo-random-number generator for controlling stochastic processes | Math.random
distanceFn | A custom distance function to use | euclidean

const umap = new UMAP({
  nComponents: 2,
  nEpochs: 400,
  nNeighbors: 15,
});
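
For reproducible layouts, the random parameter can be given a seeded generator in place of Math.random. The sketch below uses mulberry32, a small 32-bit PRNG; the helper name and the seed value are illustrative and are not part of umap-js.

```javascript
// mulberry32: a tiny seeded PRNG returning floats in [0, 1),
// the same contract as Math.random. Illustrative helper, not part of umap-js.
function mulberry32(seed) {
  let state = seed >>> 0;
  return function random() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators built from the same seed produce the same sequence.
const randomA = mulberry32(42);
const randomB = mulberry32(42);
const sequenceA = [randomA(), randomA(), randomA()];
const sequenceB = [randomB(), randomB(), randomB()];
```

Such a generator would then be supplied through the constructor, for example new UMAP({ random: mulberry32(42) }).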

Testing

umap-js uses jest for testing.

yarn test

This is not an officially supported Google product

umap-js's People

Contributors

cannoneyed, cjh1, dependabot[bot], drew-diamantoukos, fil, kevinrobinson, productiverage, tafsiri, tmcw

umap-js's Issues

Is it possible to optimize on a subset of the data?

Hi, I'm building an interactive scatter plot where you can zoom in or show subsets of the points. I perform an initial optimization with all the data, and then I'd like to optimize the embeddings for a subset of the data, starting from the embeddings found in the initial optimization.

Any recommendation you have to achieve this?

P.S. Thanks for the library, it's super cool

Assigning labels/classes to unlabeled (-1) data in semi-supervised fit

Hi there,

Thanks for a wonderful library. Amazingly fast! Although I'm not quite sure how it's going to work for the set I'm using just yet.

Am I missing something, or is it just not possible to have labels assigned to previously unlabeled (-1) values?

e.g., say I have a set of 100 labeled vectors, and a thousand or so that are unlabeled (that's not quite right, but it'll do). I've tried appending the thousand, using -1 for their labels, with umap.setSupervisedProjection() then umap.fit(). I've also tried running the above with just the labeled 100, then using umap.transform() with the extra 1000. Either way, everything gets projected into the reduced dimensional space, which is great. But, I don't see a way to extract which label umap identifies as most likely for unlabeled rows.

Short version is that I'm looking for a way to run a quick-n-not-that-dirty classification stage on some data prior to further analysis. And now I'm not sure whether this is something umap just doesn't do, or if I'm overlooking something super obvious, or if it's something that could be done through a step I'm also overlooking.
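
One way to get labels for the unlabeled rows is a separate step after projection: a nearest-neighbor vote in the embedded space. The sketch below is a hypothetical helper (propagateLabels is not a umap-js API) and assumes the embedding is an array of coordinate arrays and that -1 marks an unlabeled row.

```javascript
// Assign a label to each unlabeled (-1) row by majority vote among its
// k nearest labeled neighbors in the embedding. Hypothetical helper.
function propagateLabels(embedding, labels, k = 5) {
  const labeled = [];
  labels.forEach((label, i) => {
    if (label !== -1) labeled.push(i);
  });

  const squaredDistance = (a, b) =>
    a.reduce((sum, value, d) => sum + (value - b[d]) ** 2, 0);

  return labels.map((label, i) => {
    if (label !== -1) return label; // keep known labels unchanged

    // Rank labeled rows by distance to row i and keep the k closest.
    const nearest = labeled
      .map(j => ({ label: labels[j], dist: squaredDistance(embedding[i], embedding[j]) }))
      .sort((p, q) => p.dist - q.dist)
      .slice(0, k);

    // Count votes per label and return the most frequent one.
    const votes = new Map();
    for (const neighbor of nearest) {
      votes.set(neighbor.label, (votes.get(neighbor.label) || 0) + 1);
    }
    let best = -1;
    let bestCount = 0;
    for (const [candidate, count] of votes) {
      if (count > bestCount) {
        best = candidate;
        bestCount = count;
      }
    }
    return best;
  });
}

// Toy 2-D embedding: two labeled clusters plus two unlabeled points.
const embedding = [[0, 0], [0.1, 0], [5, 5], [5.1, 5], [0.05, 0.1], [4.9, 5.2]];
const labels = [0, 0, 1, 1, -1, -1];
const predicted = propagateLabels(embedding, labels, 2);
// predicted -> [0, 0, 1, 1, 0, 1]
```

In practice the embedding would come from umap.fit() or umap.transform(); how well the vote works depends on how cleanly the classes separate in the projection.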

umap.transform does infinite loop if data.length < nNeighbors*4

var UMAP = require('umap-js').UMAP
var d3 = require('d3')

var steps = d3.range(70).reverse()

function rand2dArray(nrows, ncols){
  return d3.range(nrows).map(_ => d3.range(ncols).map(Math.random))
}

function testLimitedData(nrows, nNeighbors, isTransform=true){
  var umap = new UMAP({nNeighbors})
  umap.fit(rand2dArray(nrows, 5))
  if (isTransform) umap.transform(rand2dArray(1, 5))

  console.log(nrows)
}


// locks up after 60
// steps.forEach(i => testLimitedData(i, 15))

// locks up after 32
// steps.forEach(i => testLimitedData(i, 8))

// locks up after 24
// steps.forEach(i => testLimitedData(i, 6))

// locks up after 16
// steps.forEach(i => testLimitedData(i, 4))

// throws error after 11
steps.forEach(i => testLimitedData(i, 4, false))
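
Until the cause is fixed, a caller could guard against the hang with a size check before calling transform. The sketch below encodes only the threshold observed in this report (fewer than nNeighbors * 4 fitted rows); the function names are hypothetical and the factor 4 is empirical, not documented behavior.

```javascript
// Empirical guard based on the lock-ups reported above: transform() was
// observed to hang when the fitted data had fewer than nNeighbors * 4 rows.
function hasEnoughRowsForTransform(nRows, nNeighbors) {
  return nRows >= nNeighbors * 4;
}

// Throwing variant, intended to run before umap.transform().
function assertEnoughRowsForTransform(nRows, nNeighbors) {
  if (!hasEnoughRowsForTransform(nRows, nNeighbors)) {
    throw new Error(
      `transform() may hang: ${nRows} rows is fewer than nNeighbors * 4 = ${nNeighbors * 4}`
    );
  }
}

const okAtSixty = hasEnoughRowsForTransform(60, 15); // true
const okAtFiftyNine = hasEnoughRowsForTransform(59, 15); // false
```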

TypeError with production build

Uncaught (in promise) TypeError: i.set is not a function
    at Function.value [as identity] (abstractMatrix.js:164)
    at a (index.js:99)
    at Object.e.exports [as default] (index.js:193)
    at S (umap.js:708)
    at t.initializeOptimization (umap.js:488)
    at t.initializeFit (umap.js:170)
    at t.<anonymous> (umap.js:129)
    at umap.js:32
    at Object.next (umap.js:13)
    at umap.js:7

This seems to be resolved by bumping the version of ml-levenberg-marquardt, see #34

How to reset the model?

Hello,

Thanks for the great library - very easy to use!

I have a question about the best way to reset the model between runs. I am trying to create an interactive demonstration and it seems that each time I run the fit that the result is tending towards incredibly clumped outputs even when changing the minDist, spread, neighbors hyperparameters.

My code is here:

https://github.com/jamesb93/umap-vis/blob/main/src/lib/Map.svelte

Each time doStep() is called I am trying to reset things by creating a new UMAP() object but even then...

https://github.com/jamesb93/umap-vis/blob/1f5bc1853ed9f7753555d8a6100170168f9d91df/src/lib/Map.svelte#L163

The model is definitely capturing the changes in the parameters too; I am using console.log(umap) to check.

Can you advise on the best way to reinstantiate a UMAP() between runs? I'd like to achieve something similar to the demonstrations that you've made on your own site @ https://pair-code.github.io/understanding-umap/

Feature Request: Accept custom distance function based on labels

I have a distance matrix precalculated for each pair of labels, and I would like to pass that distance instead of Euclidean or some other method.

It would be helpful if distanceFn accepted a function that takes two labels as arguments and returns the distance.
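
A possible workaround with the existing distanceFn parameter is to encode each point as a one-element vector holding its label index and look the distance up in the precomputed matrix. The sketch below is an assumption-laden illustration: the label names, the matrix values, and makeLabelDistance are all hypothetical, and it assumes distanceFn is called with two data rows and returns a number.

```javascript
// Hypothetical setup: three labels and a precomputed symmetric distance matrix.
const labelNames = ['cat', 'dog', 'car'];
const labelDistances = [
  [0.0, 0.2, 0.9],
  [0.2, 0.0, 0.8],
  [0.9, 0.8, 0.0],
];

// Each data point is a one-element vector holding its label index, so a
// function of the form (a, b) => number can look up the matrix entry.
function makeLabelDistance(matrix) {
  return (a, b) => matrix[a[0]][b[0]];
}

const labelDistance = makeLabelDistance(labelDistances);
const data = ['cat', 'car', 'dog'].map(name => [labelNames.indexOf(name)]);
const catToCar = labelDistance(data[0], data[1]); // 0.9
```

The function could then be tried as new UMAP({ distanceFn: labelDistance }). Whether the library's neighbor search behaves well on index-encoded points is untested here.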

Fix flaky travis tests

Sometimes the clustering tests don't work... we'll need to make sure they are deterministic (which they are locally....)

Can `JL lemma` be a replacement for `Spectral embedding`?

Since eigenvalues and eigenvectors cannot be computed efficiently in JS, and given the guarantees of the JL lemma and the speed of its "random projection", might it be worth using a JL-style random projection instead of the spectral embedding?
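
For reference, a Gaussian random projection of the kind the JL lemma describes can be sketched in a few lines. This is an illustration of the idea raised in the question, not code from umap-js; the function names are hypothetical. Note that the JL distance-preservation guarantee applies to target dimensions on the order of log(n)/ε², so a projection straight down to 2 dimensions is not covered by it.

```javascript
// Box-Muller transform: turns two uniform samples into one standard normal sample.
function gaussian(random) {
  const u = 1 - random(); // shift into (0, 1] so that Math.log(u) is finite
  const v = random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Project each row of `data` (n x d) to `targetDim` dimensions using a random
// Gaussian matrix scaled by 1 / sqrt(targetDim).
function randomProjection(data, targetDim, random = Math.random) {
  const sourceDim = data[0].length;
  const scale = 1 / Math.sqrt(targetDim);
  const projection = Array.from({ length: sourceDim }, () =>
    Array.from({ length: targetDim }, () => gaussian(random) * scale)
  );
  return data.map(row =>
    Array.from({ length: targetDim }, (_, j) =>
      row.reduce((sum, value, i) => sum + value * projection[i][j], 0)
    )
  );
}

const points = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [1, 1, 1, 1, 1]];
const initial = randomProjection(points, 2);
```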

Is the 'getCSR' method for the SparseMatrix incorrect?

I was having a poke around in the code here and I noticed something that looked a bit odd to me about the 'getCSR' method for the SparseMatrix - when it sorts the entries by row and column, it does this:

entries.sort((a, b) => {
    if (a.row === b.row) {
        return a.col - b.col;
    } else {
        return a.row - b.col;
    }
});

.. this looked like it was intended to say "sort the entries by row and then by column" but the code to do that would look like this:

entries.sort((a, b) => {
    if (a.row === b.row) {
        return a.col - b.col;
    } else {
        return a.row - b.row; // Note: This line changed
    }
});

There is a unit test for this method, so I thought that I'd look there to see if it could help me understand what's going on, but it's only confused me more! The test method 'getCSR' tests against a 3x3 matrix that is fully populated -

const rows = [0, 0, 0, 1, 1, 1, 2, 2, 2];
const cols = [0, 1, 2, 0, 1, 2, 0, 1, 2];
const vals = [1, 2, 3, 4, 5, 6, 7, 8, 9];
const dims = [3, 3];
A = new SparseMatrix(rows, cols, vals, dims);

.. which looks like this:

1 2 3
4 5 6
7 8 9

And so the CSR representation would have values 1 2 3 4 5 6 7 8 9 and indices 0 1 2 0 1 2 0 1 2 and indptr 0 3 6, shouldn't it? (Possibly indptr 0 3 6 9 to enable slicing into rows).

The unit test actually expects values 1, 2, 3, 7, 4, 5, 6, 8, 9 and indices 0, 1, 2, 0, 0, 1, 2, 1, 2 and indptr 0, 3, 4, 7 and the test passes. Is the test correct, though??

I must confess that I wasn't familiar with the CSR format before looking into your code, so it's very possible that I've got the wrong end of the stick. To try to get my head around it, I've been looking at the scipy.sparse.csr_matrix documentation page and at a YouTube video.
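
To make the expected result concrete, here is a standalone coordinate-to-CSR conversion that sorts by row and then by column. It is a minimal sketch that does not use the library's SparseMatrix class; toCSR and its argument order are hypothetical. The indptr array follows the scipy convention of nRows + 1 entries.

```javascript
// Convert coordinate (row, col, value) triples to CSR arrays.
function toCSR(rows, cols, vals, nRows) {
  const entries = rows.map((row, i) => ({ row, col: cols[i], value: vals[i] }));

  // Sort by row first, then by column within a row.
  entries.sort((a, b) => (a.row === b.row ? a.col - b.col : a.row - b.row));

  const values = entries.map(e => e.value);
  const indices = entries.map(e => e.col);

  // indptr[r] is the offset of row r's first entry; indptr[nRows] is the total count.
  const indptr = new Array(nRows + 1).fill(0);
  for (const entry of entries) indptr[entry.row + 1] += 1;
  for (let r = 0; r < nRows; r++) indptr[r + 1] += indptr[r];

  return { values, indices, indptr };
}

const csr = toCSR(
  [0, 0, 0, 1, 1, 1, 2, 2, 2],
  [0, 1, 2, 0, 1, 2, 0, 1, 2],
  [1, 2, 3, 4, 5, 6, 7, 8, 9],
  3
);
// csr.values  -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
// csr.indices -> [0, 1, 2, 0, 1, 2, 0, 1, 2]
// csr.indptr  -> [0, 3, 6, 9]
```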

umap-js throws error if imported in a web worker

Any idea how to use umap-js in a worker? When I try to use it I get an error (other scripts import just fine). A simple example to reproduce the problem is:

index.html

<script>
var worker = new Worker("./worker.js");
</script>

worker.js

importScripts("https://unpkg.com/[email protected]/lib/umap-js.js");

In the console (using Chrome on a Mac) the errors are:

Uncaught null
Script error.

transform api call not returning

I am using the umap-js library for my project. umap.fit and the other methods work fine, but when I transform additional points after fitting by calling umap.transform, it hangs and never returns a value. The dataset I am trying is a simple one with fewer than 50 rows. Any help would be much appreciated.

flattenTree throws error if tree is a leaf

See the following code inside flattenTree:

  const hyperplanes = utils
    .range(nNodes)
    .map(() => utils.zeros(tree.hyperplane!.length));

When a tree is created in makeEuclideanTree, if the number of vectors is not greater than the leaf size, tree.hyperplane is undefined. I see the typescript code checks for that with a Non-null assertion operator, but the transpiled JS code does not:

    var hyperplanes = utils
        .range(nNodes)
        .map(function () { return utils.zeros(tree.hyperplane.length); });

Perhaps there is a problem with the typescript configuration?

Feature request: ability to serialize and deserialize state

Since fitting large data sets can take quite a long time, it would be useful to be able to save the state to disk or elsewhere in memory so computation can be resumed elsewhere.

A couple use cases:

  1. Resume asynchronous fitting if process needs to abort before it's done.
  2. Save results of fitting in a database so additional points can be transformed later, on other devices or in a parallel process (see #11).

Thanks

hamming distance?

Great piece of code - a useful find for our work -- thank you!!

One question: the readme states that the code only works with Euclidean distance. Is it possible for us to replace the "euclidean" function with a hamming distance function, or will that not work for some reason?

Thanks! Nick
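
For illustration, a Hamming distance with the two-vectors-in, number-out shape that the distanceFn parameter suggests could look like the sketch below. The function is not part of umap-js, and how well the library's neighbor search works with it is not confirmed here.

```javascript
// Hamming distance: number of positions at which two equal-length vectors differ.
function hamming(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let differing = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) differing += 1;
  }
  return differing;
}

const distance = hamming([1, 0, 1, 1], [1, 1, 0, 1]); // 2
```

It would be supplied as new UMAP({ distanceFn: hamming }).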

Retrieve connectivity

Thanks for this awesome tool!

Is there a way to extract connectivity information from the embedding? I would like to draw the underlying graph as well.
Something comparable to umap.plot.connectivity in the Python implementation.
