pair-code / umap-js

JavaScript implementation of UMAP

License: Apache License 2.0

JavaScript 57.88% TypeScript 42.12%

javascript umap dimensionality-reduction

umap-js's Introduction

UMAP-JS

This is a JavaScript reimplementation of UMAP, ported from the Python implementation found at https://github.com/lmcinnes/umap.

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction.

There are a few important differences between the Python implementation and the JS port.

The optimization step is seeded with a random embedding rather than a spectral embedding. This gives comparable results for smaller datasets. The spectral embedding computation relies on efficient eigenvalue / eigenvector computations that are not easily done in JS.
There is no specialized functionality for angular distances or sparse data representations.

Usage

Installation

yarn add umap-js

Synchronous fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const embedding = umap.fit(data);

Asynchronous fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const embedding = await umap.fitAsync(data, epochNumber => {
  // check progress and give user feedback, or return `false` to stop
});

Step-by-step fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
const nEpochs = umap.initializeFit(data);
for (let i = 0; i < nEpochs; i++) {
  umap.step();
}
const embedding = umap.getEmbedding();

Supervised projection using labels

import { UMAP } from 'umap-js';

const umap = new UMAP();
umap.setSupervisedProjection(labels);
const embedding = umap.fit(data);

Transforming additional points after fitting

import { UMAP } from 'umap-js';

const umap = new UMAP();
umap.fit(data);
const transformed = umap.transform(additionalData);

Parameters

The UMAP constructor can accept a number of hyperparameters via a UMAPParameters object, with the most common described below. See umap.ts for more details.

Parameter	Description	Default
`nComponents`	The number of components (dimensions) to project the data to	2
`nEpochs`	The number of epochs to optimize embeddings via SGD	(computed automatically)
`nNeighbors`	The number of nearest neighbors to construct the fuzzy manifold	15
`minDist`	The effective minimum distance between embedded points, used with `spread` to control the clumped/dispersed nature of the embedding	0.1
`spread`	The effective scale of embedded points, used with `minDist` to control the clumped/dispersed nature of the embedding	1.0
`random`	A pseudo-random-number generator for controlling stochastic processes	`Math.random`
`distanceFn`	A custom distance function to use	`euclidean`

const umap = new UMAP({
  nComponents: 2,
  nEpochs: 400,
  nNeighbors: 15,
});

Testing

umap-js uses jest for testing.

yarn test

This is not an officially supported Google product

umap-js's Issues

Is it possible to optimize on a subset of the data?

Hi, I'm doing an interactive scatter plot where you can zoom in or show subsets of the points. I perform an initial optimization with all the data, and then I'd like to optimize the embeddings for a subset of the data, starting from the embeddings found in the initial optimization.

Any recommendation you have to achieve this?

P.S. Thanks for the library, it's super cool

Assigning labels/classes to unlabeled (-1) data in semi-supervised fit

Hi there,

Thanks for a wonderful library. Amazingly fast! Although I'm not quite sure how it's going to work for the set I'm using just yet.

Am I missing something, or is it just not possible to have labels assigned to previously unlabeled (-1) values?

e.g., say I have a set of 100 labeled vectors, and a thousand or so that are unlabeled (that's not quite right, but it'll do). I've tried appending the thousand, using -1 for their labels, with umap.setSupervisedProjection() then umap.fit(). I've also tried running the above with just the labeled 100, then using umap.transform() with the extra 1000. Either way, everything gets projected into the reduced dimensional space, which is great. But, I don't see a way to extract which label umap identifies as most likely for unlabeled rows.

Short version is that I'm looking for a way to run a quick-n-not-that-dirty classification stage on some data prior to further analysis. And now I'm not sure whether this is something umap just doesn't do, or if I'm overlooking something super obvious, or if it's something that could be done through a step I'm also overlooking.
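One possible approach, sketched here as an assumption rather than as a umap-js feature: UMAP produces coordinates, not class predictions, so a label for each unlabeled row can be estimated afterwards with a k-nearest-neighbor vote over the embedding. The helper name `knnLabels`, the value of `k`, and the convention that `-1` marks an unlabeled row are illustrative choices.

```javascript
// Squared Euclidean distance between two embedded points.
function squaredDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return sum;
}

// For every row labeled -1, return the majority label among its k nearest
// labeled rows in the embedding. Labeled rows keep their label.
// Ties go to the label whose nearest member is closest.
function knnLabels(embedding, labels, k = 5) {
  const labeled = [];
  labels.forEach((label, i) => {
    if (label !== -1) labeled.push(i);
  });
  return labels.map((label, i) => {
    if (label !== -1) return label;
    const nearest = labeled
      .map(j => ({ label: labels[j], dist: squaredDistance(embedding[i], embedding[j]) }))
      .sort((a, b) => a.dist - b.dist)
      .slice(0, k);
    const votes = new Map();
    for (const n of nearest) votes.set(n.label, (votes.get(n.label) || 0) + 1);
    let best = -1;
    let bestCount = 0;
    for (const [candidate, count] of votes) {
      if (count > bestCount) {
        best = candidate;
        bestCount = count;
      }
    }
    return best;
  });
}
```

Here `embedding` would be the array returned by umap.fit() (or the fitted rows followed by the umap.transform() output), with `labels` in the same row order.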

umap.transform does infinite loop if data.length < nNeighbors*4

var UMAP = require('umap-js').UMAP
var d3 = require('d3')

var steps = d3.range(70).reverse()

function rand2dArray(nrows, ncols){
  return d3.range(nrows).map(_ => d3.range(ncols).map(Math.random))
}

function testLimitedData(nrows, nNeighbors, isTransform=true){
  var umap = new UMAP({nNeighbors})
  umap.fit(rand2dArray(nrows, 5))
  if (isTransform) umap.transform(rand2dArray(1, 5))

  console.log(nrows)
}


// locks up after 60
// steps.forEach(i => testLimitedData(i, 15))

// locks up after 32
// steps.forEach(i => testLimitedData(i, 8))

// locks up after 24
// steps.forEach(i => testLimitedData(i, 6))

// locks up after 16
// steps.forEach(i => testLimitedData(i, 4))

// throws error after 11
steps.forEach(i => testLimitedData(i, 4, false))

TypeError with production build

Uncaught (in promise) TypeError: i.set is not a function
    at Function.value [as identity] (abstractMatrix.js:164)
    at a (index.js:99)
    at Object.e.exports [as default] (index.js:193)
    at S (umap.js:708)
    at t.initializeOptimization (umap.js:488)
    at t.initializeFit (umap.js:170)
    at t.<anonymous> (umap.js:129)
    at umap.js:32
    at Object.next (umap.js:13)
    at umap.js:7

This seems to be resolved by bumping the version of ml-levenberg-marquardt, see #34

How to reset the model?

Hello,

Thanks for the great library - very easy to use!

I have a question about the best way to reset the model between runs. I am trying to create an interactive demonstration, and it seems that each time I run the fit, the result tends towards incredibly clumped outputs, even when I change the minDist, spread, and neighbors hyperparameters.

My code is here:

https://github.com/jamesb93/umap-vis/blob/main/src/lib/Map.svelte

Each time doStep() is called I am trying to reset things by creating a new UMAP() object but even then...

https://github.com/jamesb93/umap-vis/blob/1f5bc1853ed9f7753555d8a6100170168f9d91df/src/lib/Map.svelte#L163

The model is definitely capturing the changes in the parameters too; I am using console.log(umap) to check.

Can you advise on the best way to reinstantiate a UMAP() between runs? I'd like to achieve something similar to the demonstrations that you've made on your own site @ https://pair-code.github.io/understanding-umap/

Feature Request: Accept custom distance function based on labels

I have a distance matrix precalculated for each pair of labels, and I would like to pass that distance instead of Euclidean or some other method.

It would be helpful if distanceFn accepted a function that takes two labels as arguments and returns the distance.
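A possible workaround, offered as a sketch and not as a confirmed library feature: if `distanceFn` is called with two data rows, each item can be encoded as a one-element row holding its label index, and the function can look the distance up in the precomputed matrix. The helper name `makeLabelDistanceFn` and the one-element encoding are assumptions; the library may also use the raw row values elsewhere (for example when building its neighbor search structures), so results should be checked.

```javascript
// labelDistances[i][j] holds the precomputed distance between label i and label j.
function makeLabelDistanceFn(labelDistances) {
  return function labelDistance(a, b) {
    // Each row is assumed to be [labelIndex].
    const row = labelDistances[a[0]];
    const d = row === undefined ? undefined : row[b[0]];
    if (d === undefined) {
      throw new RangeError(`no precomputed distance for labels ${a[0]} and ${b[0]}`);
    }
    return d;
  };
}

// Hypothetical usage (requires umap-js):
//   const data = labelIndices.map(index => [index]);
//   const umap = new UMAP({ distanceFn: makeLabelDistanceFn(labelDistances) });
//   const embedding = umap.fit(data);
```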

Add tslint

Where should output_metric='haversine' for spherical embeddings be implemented?

I would like to implement output_metric='haversine' for spherical embeddings (if it's not too complicated), as in https://github.com/lmcinnes/umap. Where should I do it?
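For reference, the haversine distance itself is short to write. The sketch below assumes each point is a [latitude, longitude] pair in radians on a unit sphere. It covers only the distance formula: an output metric also affects the gradient used during optimization, and where that would hook into umap-js is not answered by the source.

```javascript
// Great-circle distance on a unit sphere between two [lat, lon] points in radians.
function haversine(x, y) {
  const sinLat = Math.sin(0.5 * (x[0] - y[0]));
  const sinLon = Math.sin(0.5 * (x[1] - y[1]));
  const h = sinLat * sinLat + Math.cos(x[0]) * Math.cos(y[0]) * sinLon * sinLon;
  // Clamp to 1 so rounding error cannot push the argument outside asin's domain.
  return 2 * Math.asin(Math.min(1, Math.sqrt(h)));
}
```

Two points on the equator a quarter turn apart, `haversine([0, 0], [0, Math.PI / 2])`, are about π/2 apart.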

Include the dist folder instead of the lib folder

Fix flaky travis tests

Sometimes the clustering tests don't work... we'll need to make sure they are deterministic (which they are locally....)

Can `JL lemma` be a replacement for `Spectral embedding`?

Because eigenvalues and eigenvectors cannot be computed efficiently in JS, and given the guarantees of the JL lemma and the speed of its "random projection", might it be worth using a JL random projection instead of spectral embedding?
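To make the suggestion concrete, here is a minimal sketch of a Gaussian random projection. The names `gaussian` and `randomProjection` are illustrative and not part of umap-js, and how such an initial layout would be fed into the optimizer is not described in the source. Note that the JL lemma's distance guarantees need a target dimension that grows with the logarithm of the number of points, so they are weak for a 2-D target.

```javascript
// Standard normal sample via the Box-Muller transform.
function gaussian(random) {
  const u1 = 1 - random(); // in (0, 1], avoids log(0)
  const u2 = random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Project each row of `data` onto `nComponents` random Gaussian directions,
// scaled by 1 / sqrt(nComponents).
function randomProjection(data, nComponents = 2, random = Math.random) {
  if (data.length === 0) return [];
  const nFeatures = data[0].length;
  const scale = 1 / Math.sqrt(nComponents);
  const projection = Array.from({ length: nFeatures }, () =>
    Array.from({ length: nComponents }, () => gaussian(random) * scale)
  );
  return data.map(row =>
    Array.from({ length: nComponents }, (_, c) =>
      row.reduce((sum, value, f) => sum + value * projection[f][c], 0)
    )
  );
}
```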

Is the 'getCSR' method for the SparseMatrix incorrect?

I was having a poke around in the code here and I noticed something that looked a bit odd to me about the 'getCSR' method for the SparseMatrix - when it sorts the entries by row and column, it does this:

entries.sort((a, b) => {
    if (a.row === b.row) {
        return a.col - b.col;
    } else {
        return a.row - b.col;
    }
});

.. this looked like it was intended to say "sort the entries by row and then by column" but the code to do that would look like this:

entries.sort((a, b) => {
    if (a.row === b.row) {
        return a.col - b.col;
    } else {
        return a.row - b.row; // Note: This line changed
    }
});

There is a unit test for this method, so I thought that I'd look there to see if it could help me understand what's going on, but it's only confused me more! The test method 'getCSR' tests against a 3x3 matrix that is fully populated -

const rows = [0, 0, 0, 1, 1, 1, 2, 2, 2];
const cols = [0, 1, 2, 0, 1, 2, 0, 1, 2];
const vals = [1, 2, 3, 4, 5, 6, 7, 8, 9];
const dims = [3, 3];
A = new SparseMatrix(rows, cols, vals, dims);

.. which looks like this:

1 2 3
4 5 6
7 8 9

And so the CSR representation would have values 1 2 3 4 5 6 7 8 9 and indices 0 1 2 0 1 2 0 1 2 and indptr 0 3 6, shouldn't it? (Possibly indptr 0 3 6 9 to enable slicing into rows).

The unit test actually expects values 1, 2, 3, 7, 4, 5, 6, 8, 9 and indices 0, 1, 2, 0, 0, 1, 2, 1, 2 and indptr 0, 3, 4, 7 and the test passes. Is the test correct, though??

I must confess that I wasn't familiar with the CSR format before looking into your code, so it's very possible that I've got the wrong end of the stick. To try to get my head around it, I've been looking at the scipy.sparse.csr_matrix documentation page and at this YouTube video.
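For comparison, a standalone sketch of a coordinate-list to CSR conversion, following the convention documented for scipy.sparse.csr_matrix (an `indptr` of length nRows + 1). The function `toCSR` is illustrative and is not the library's getCSR; it uses the row-then-column comparator proposed above.

```javascript
// Build CSR arrays from parallel row / column / value arrays.
function toCSR(rows, cols, vals, nRows) {
  const entries = rows.map((row, i) => ({ row, col: cols[i], value: vals[i] }));
  // Sort by row first, then by column within a row.
  entries.sort((a, b) => (a.row === b.row ? a.col - b.col : a.row - b.row));

  const values = entries.map(e => e.value);
  const indices = entries.map(e => e.col);

  // indptr[r] .. indptr[r + 1] delimits the slice of values belonging to row r.
  const indptr = new Array(nRows + 1).fill(0);
  for (const e of entries) indptr[e.row + 1]++;
  for (let r = 0; r < nRows; r++) indptr[r + 1] += indptr[r];

  return { values, indices, indptr };
}
```

For the fully populated 3x3 matrix above this yields values 1 to 9 in order, indices 0 1 2 0 1 2 0 1 2, and indptr 0 3 6 9, regardless of the order in which the entries are supplied.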

umap-js throws error if imported in a web worker

Any idea how to use umap-js in a worker? When I try to use it I get an error (other scripts import just fine). A simple example to reproduce the problem is:

index.html

<script>
var worker = new Worker("./worker.js");
</script>

worker.js

importScripts("https://unpkg.com/umap-js@<version>/lib/umap-js.js");

In the console (using Chrome on a Mac) the errors are:

Uncaught null
Script error.

Investigate using a web worker?

Add densmap support

densmap is an iteration on umap and it'd be nice if this library offered it.

See https://github.com/hhcho/densvis

umap-js results change on every rerun with the same parameters and dataset

I understand that umap-js uses a random initialization for the embedding positions. Is there a way to avoid the randomness or to preset the positions? Could you briefly explain why umap-js is not as stable as the Python version, and what can be done to overcome that?
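One way to make runs repeatable, sketched under the assumption that all of the library's randomness flows through the `random` constructor parameter listed in the Parameters table: pass a seeded generator instead of the default `Math.random`. The `mulberry32` helper below is a common small PRNG and is not part of umap-js; the seed value is arbitrary.

```javascript
// mulberry32: a small seeded PRNG that returns floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function random() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical usage (requires umap-js):
//   const umap = new UMAP({ random: mulberry32(42) });
//   const embedding = umap.fit(data);
```

A fresh generator with the same seed has to be created for each run; reusing one generator across runs continues its sequence and gives different results.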

Import single values / methods from files rather than * from

transform api call not returning

I am using the umap-js library for my project. umap.fit and other methods are working fine. But when I transform additional points after fitting by calling umap.transform, it hangs and does not return any value. The dataset I am trying is a simple one with fewer than 50 rows. Any help would be much appreciated.
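This may be related to the separate report in this tracker that umap.transform loops forever when data.length < nNeighbors * 4; with the default nNeighbors of 15 that threshold is 60 rows. A small pre-check, sketched below, makes the condition explicit. The threshold comes from that report only, not from library documentation, and both helper names are illustrative.

```javascript
// True when the fitted dataset is smaller than the size at which the report
// observed transform() hanging (nRows < nNeighbors * 4).
function isBelowReportedTransformThreshold(nRows, nNeighbors = 15) {
  return nRows < nNeighbors * 4;
}

// Largest nNeighbors that stays at or above the reported threshold for nRows.
function largestReportedSafeNeighbors(nRows) {
  return Math.floor(nRows / 4);
}
```

For a 50-row dataset, `largestReportedSafeNeighbors(50)` gives 12, so constructing the model with nNeighbors of 12 or less would avoid the reported condition.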

flattenTree throws error if tree is a leaf

See the following code inside flattenTree:

  const hyperplanes = utils
    .range(nNodes)
    .map(() => utils.zeros(tree.hyperplane!.length));

When a tree is created in makeEuclideanTree, if the number of vectors is not greater than the leaf size, tree.hyperplane is undefined. I see the TypeScript code checks for that with a non-null assertion operator, but the transpiled JS code does not:

    var hyperplanes = utils
        .range(nNodes)
        .map(function () { return utils.zeros(tree.hyperplane.length); });

Perhaps there is a problem with the typescript configuration?
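A note on the mechanism: the non-null assertion operator only tells the TypeScript compiler to treat the value as defined, and it is erased on compilation, so the transpiled output shown above is expected and not a configuration problem. A runtime guard is needed instead. The sketch below shows one possible guard; the helper names `zeros` and `allocateHyperplanes` are stand-ins, and using a length of 0 for a leaf-only tree is an assumption.

```javascript
// Stand-in for the library's zeros utility.
function zeros(n) {
  return new Array(n).fill(0);
}

// Allocate one hyperplane slot per node, tolerating a tree that is a single
// leaf and therefore has no hyperplane.
function allocateHyperplanes(tree, nNodes) {
  const dim = tree.hyperplane ? tree.hyperplane.length : 0;
  return Array.from({ length: nNodes }, () => zeros(dim));
}
```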

Feature request: ability to serialize and deserialize state

Since fitting large data sets can take quite a long time, it would be useful to be able to save the state to disk or elsewhere in memory so computation can be resumed elsewhere.

A couple use cases:

Resume asynchronous fitting if process needs to abort before it's done.
Save results of fitting in a database so additional points can be transformed later, on other devices or in a parallel process (see #11).

Thanks

hamming distance?

Great piece of code - a useful find for our work -- thank you!!

One question: the readme states that the code only works with Euclidean distance. Is it possible for us to replace the "euclidean" function with a hamming distance function, or will that not work for some reason?

Thanks! Nick
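The README's Parameters table lists a `distanceFn` option, so a Hamming distance can presumably be supplied there. The sketch below assumes the function is called with two equal-length numeric arrays and returns a number; it returns the fraction of positions that differ. Whether every part of the pipeline honors a custom metric is not confirmed by the source.

```javascript
// Normalized Hamming distance: the fraction of positions where x and y differ.
function hamming(x, y) {
  if (x.length !== y.length) {
    throw new RangeError("vectors must have the same length");
  }
  if (x.length === 0) return 0;
  let mismatches = 0;
  for (let i = 0; i < x.length; i++) {
    if (x[i] !== y[i]) mismatches++;
  }
  return mismatches / x.length;
}

// Hypothetical usage (requires umap-js):
//   const umap = new UMAP({ distanceFn: hamming });
```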