
s-space's People

Contributors

davidjurgens, fozziethebeat, imwenyz, johann-petrak, oroanto, shulhi, tobiowo, tuxdna, younes-abouelnagah


s-space's Issues

TF-IDF not calculated correctly?

Hi SSpace team,

I believe TF-IDF is not calculated correctly. Am I right?

Why is the value in the following divided by docTermCount[column]?
I think it should just be tf = value;, since
tf stands for the term frequency in a given document, not the probability of the term in the document (the division case). At least, the Wikipedia article referenced in the code says so.

class TfIdfTransform in edu.ucla.sspace.matrix:

public double transform(int row, int column, double value) {
    double tf = value / docTermCount[column];
    double idf = Math.log(totalDocCount / (termDocCount[row] + 1));
    return tf * idf;
}

The same applies in the following method:

public double transform(int row, DoubleVector column) {
    ...
}
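To make the two definitions concrete, here is a stand-alone sketch contrasting them. The names (docTermCount, termDocCount, totalDocCount) mirror the snippet above, but this is an illustrative class, not S-Space's actual TfIdfTransform:

```java
public class TfIdfSketch {
    // tf as a raw count, as suggested in this issue
    static double rawTf(double value) {
        return value;
    }

    // tf normalized by the document's total term count, as the current code does
    static double normalizedTf(double value, double docTermCount) {
        return value / docTermCount;
    }

    static double idf(double totalDocCount, double termDocCount) {
        return Math.log(totalDocCount / (termDocCount + 1));
    }

    public static void main(String[] args) {
        // term occurs 3 times in a 10-term document; 100 docs, term in 9 of them
        double idf = idf(100, 9);                      // log(10)
        System.out.println(rawTf(3) * idf);            // raw-count variant
        System.out.println(normalizedTf(3, 10) * idf); // normalized variant
    }
}
```

Both variants only rescale tf by a per-document constant, so rankings within one document are unchanged; the difference matters when comparing weights across documents of different lengths.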

Cheers,
Luboš

Distributional semantics using contexts rather than documents

Dear developers,

I found that VsmMain computes the word-document matrix, which records the co-occurrences of words and documents. Could I instead generate distributional representations using the context within a window of a certain size (say, 10), and use PMI rather than tf-idf as the entries of the word-context matrix?
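For reference, PMI over word-context counts can be sketched independently of S-Space's classes (whose transforms may differ) as pmi(w, c) = log(p(w, c) / (p(w) * p(c))), estimated from raw co-occurrence counts:

```java
public class PmiSketch {
    // countWC: co-occurrences of word w and context c inside the window;
    // countW, countC: marginal counts; total: total number of (w, c) pairs
    static double pmi(double countWC, double countW, double countC, double total) {
        double pWC = countWC / total;
        double pW = countW / total;
        double pC = countC / total;
        return Math.log(pWC / (pW * pC));
    }

    public static void main(String[] args) {
        // toy counts: "dog" and context "park" co-occur 4 times out of 100 pairs
        System.out.println(pmi(4, 10, 20, 100)); // log 2
    }
}
```

In practice one usually clips negatives (positive PMI) to keep the word-context matrix sparse.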

Thanks,
Jiang

Forgotten throw() in SingularValueDecompositionMatlab.factorize() [line 157]

Hi, the old matlabSVDS function in the SVD class returns U, S and V in a Matrix array, so it is no problem that an UnsupportedOperationException("Matlab svds is not correctly installed on this system") is thrown at the end, because the array is returned before the throw is reached.
In the new code, matlabSVDS has been changed into the factorize function in the SingularValueDecompositionMatlab class. That function returns nothing, but its last line still throws the same exception, so every call to this function ends in an exception.
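The control-flow bug described above can be sketched abstractly (this is not the actual S-Space source): a trailing throw that was dead code in a method ending with a return becomes unconditionally reached once the method is made void.

```java
public class ThrowGuardSketch {
    // Old shape: the throw after the return is never reached.
    static int[] oldStyle() {
        int[] result = {1, 2, 3};
        if (true)
            return result;
        throw new UnsupportedOperationException("tool not installed");
    }

    // New shape (buggy): void method, so the throw always executes.
    static void newStyleBuggy() {
        // ... do the work ...
        throw new UnsupportedOperationException("tool not installed");
    }

    // Fixed shape: only throw when the external tool really is missing.
    static void newStyleFixed(boolean toolInstalled) {
        if (!toolInstalled)
            throw new UnsupportedOperationException("tool not installed");
        // ... do the work ...
    }

    public static void main(String[] args) {
        newStyleFixed(true); // completes normally
        System.out.println(oldStyle().length);
    }
}
```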

SemanticSpaceIO/load fails when format explicitly specified

So if I save a SemanticSpace specifying a SSpaceFormat, and then load it again specifying the format, I get exceptions, e.g.:

NumberFormatException For input string: "^@s^@03"

for the TEXT format, or

EOFException java.io.DataInputStream.readInt

for BINARY.

If I omit the manually specified format and let it auto-detect the format, everything works fine. (It looks like there are a few bytes of binary data specifying the format, which only the format-autodetecting code checks for and which trips up the other code.)

Also note that the docs point to SemanticSpaceUtils when talking about loading/saving, but it seems this was moved to SemanticSpaceIO.

ArrayIndexOutOfBoundsException at SparseHashMatrix.getRowVector(SparseHashMatrix.java:105) after commit f6038a3501e69c4ce6011de53a0aa2bf877710c2

After commit f6038a3 (or potentially a later commit), code that worked just fine now throws an ArrayIndexOutOfBoundsException.

Here is the exception:

java.lang.ArrayIndexOutOfBoundsException: 4
    at edu.ucla.sspace.matrix.SparseHashMatrix.getRowVector(SparseHashMatrix.java:105)
    at edu.ucla.sspace.matrix.SparseHashMatrix.getRowVector(SparseHashMatrix.java:42)
    at edu.ucla.sspace.common.GenericTermDocumentVectorSpace.getVector(GenericTermDocumentVectorSpace.java:315)
    at edu.ucla.sspace.common.DocumentVectorBuilder.buildVector(DocumentVectorBuilder.java:129)

Here is a minimal Groovy script that illustrates the problem:

import edu.ucla.sspace.common.DocumentVectorBuilder;
import edu.ucla.sspace.common.SemanticSpace;
import edu.ucla.sspace.vector.DenseVector;
import edu.ucla.sspace.vector.DoubleVector;
import edu.ucla.sspace.vsm.VectorSpaceModel;

import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Properties;

SemanticSpace candidateListSemSpace = new VectorSpaceModel();
candidateListSemSpace.processDocument(new BufferedReader(new StringReader("This is some text")));

Properties tficfConfig = new Properties();
tficfConfig.put(VectorSpaceModel.MATRIX_TRANSFORM_PROPERTY, "edu.ucla.sspace.matrix.TfIdfTransform");
candidateListSemSpace.processSpace(tficfConfig);
DocumentVectorBuilder tficfVectorBuilder = new DocumentVectorBuilder(candidateListSemSpace);
DoubleVector tficfContextVector =
    tficfVectorBuilder.buildVector(
        new BufferedReader(new StringReader("This is also some text")),
        new DenseVector(candidateListSemSpace.getVectorLength()));

Run this using groovy -cp <neededJars> file.groovy; it will either throw the exception (when run with the latest sspace jar on the classpath) or create a file in /tmp (when run with an sspace build from before commit f6038a3...).

Checking whether the Sparse SVD is supported by Octave is broken

SVD.isOctaveSupported() works by calling octave, but it doesn't check whether svds or eigs is available, which is necessary.

For now, the unit test is set to @Ignore, but the functionality needs to be fixed and incorporated.

Also, the SingularValueDecomposition interface should probably be retrofitted to add an isAvailable() method that does all of this resource checking, rather than pushing it into the static-method-based SVD class.

Maven failure

Dear S-Space team,

Any help?

using :
Apache Maven 3.1.1 (0728685237757ffbf44136acec0402957f723d9a; 2013-09-17 12:22:22-0300)
Maven home: /usr/local/apache-maven/apache-maven-3.1.1
Java version: 1.6.0_32, vendor: Sun Microsystems Inc.
Java home: /usr/local/openjdk6/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "freebsd", version: "9.1-release-p15", arch: "amd64", family: "unix"

openjdk version "1.6.0_32"
OpenJDK Runtime Environment (build 1.6.0_32-b27)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

thanks a lot,
ivo

[INFO] ------------------------------------------------------------------------
[INFO] Building S-Space Package 2.0.4
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.0:enforce (enforce-maven) @ sspace ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ sspace ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /usr/home/ivo/S-Space/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ sspace ---
[INFO] Compiling 662 source files to /usr/home/ivo/S-Space/target/classes
[INFO] -------------------------------------------------------------
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] /usr/home/ivo/S-Space/src/main/java/edu/ucla/sspace/graph/Fanmod.java:[175,27] cannot find symbol
symbol : method add(edu.ucla.sspace.graph.isomorphism.TypedIsomorphicGraphCounter<java.lang.Object,edu.ucla.sspace.graph.Multigraph<T,E>>)
location: interface java.util.List<edu.ucla.sspace.graph.isomorphism.TypedIsomorphicGraphCounter<T,edu.ucla.sspace.graph.Multigraph<T,E>>>
[INFO] 1 error
[INFO] -------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.024s
[INFO] Finished at: Tue Oct 08 15:47:23 BRT 2013
[INFO] Final Memory: 18M/309M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project sspace: Compilation failure
[ERROR] /usr/home/ivo/S-Space/src/main/java/edu/ucla/sspace/graph/Fanmod.java:[175,27] cannot find symbol
[ERROR] symbol : method add(edu.ucla.sspace.graph.isomorphism.TypedIsomorphicGraphCounter<java.lang.Object,edu.ucla.sspace.graph.Multigraph<T,E>>)
[ERROR] location: interface java.util.List<edu.ucla.sspace.graph.isomorphism.TypedIsomorphicGraphCounter<T,edu.ucla.sspace.graph.Multigraph<T,E>>>
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
[ivo@Arpoador] ~/S-Space%

LSA fails when reducing dimensions

I get the following error message when using LSAMain even though my command line arguments are correct (-n 100). It works with a smaller portion of the same corpus. What could be wrong?

….
Info: Scaling the entropy of the rows
jan 21, 2014 8:51:52 FM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
Info: reducing to 100 dimensions
Exception in thread "main" java.lang.IllegalArgumentException: dimensions must be positive
    at edu.ucla.sspace.matrix.OnDiskMatrix.<init>(OnDiskMatrix.java:98)
    at edu.ucla.sspace.matrix.Matrices.create(Matrices.java:216)
    at edu.ucla.sspace.matrix.MatrixIO.readDenseTextMatrix(MatrixIO.java:927)
    at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:795)
    at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:762)
    at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionOctave.factorize(SingularValueDecompositionOctave.java:137)
    at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:439)
    at edu.ucla.sspace.mains.GenericMain.processDocumentsAndSpace(GenericMain.java:514)
    at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:443)
    at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:167)

SVDLIBC generated the incorrect number of dimensions: 3 versus 300

Hi, I'm getting the above error when running LSAMain with the following arguments:
-d data/input2.txt data/output/my_lsa_output.sspace

input2.txt is just a very simple text file (for testing) and it contains:
The man walked the dog.
The man took the dog to the park.
The dog went to the park.

System output:
Saving matrix using edu.ucla.sspace.matrix.SvdlibcSparseBinaryMatrixBuilder@5e2de80c
Saw 8 terms, 7 unique
Saw 5 terms, 5 unique
Saw 6 terms, 6 unique
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@2fae8f9
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@3553305b
edu.ucla.sspace.lsa.LatentSemanticAnalysis@406a31db processing doc edu.ucla.sspace.util.SparseIntHashArray@390b4f54
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Jan 25, 2015 1:33:24 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.RuntimeException: SVDLIBC generated the incorrect number of dimensions: 3 versus 300
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibC.readSVDLIBCsingularVector(SingularValueDecompositionLibC.java:198)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibC.factorize(SingularValueDecompositionLibC.java:161)
at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:463)
at edu.ucla.sspace.mains.GenericMain.processDocumentsAndSpace(GenericMain.java:514)
at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:443)
at edu.ucla.sspace.mains.LSAMain.main(LSAMain.java:167)

FYI, the environment is: 64-bit Windows 7, svdlibc compiled with Cygwin. Is this issue caused by the input file? I've tried using a wiki dump corpus, but the issue persists. Any help is greatly appreciated.

Thank You

Exception using Matlab for LSA

Hi,
if I pass the -S MATLAB option, Java throws:

IllegalArgumentException: dimensions must be positive
    at edu.ucla.sspace.matrix.OnDiskMatrix.<init>(OnDiskMatrix.java:98)
    at edu.ucla.sspace.matrix.Matrices.create(Matrices.java:216)
    at edu.ucla.sspace.matrix.MatrixIO.readDenseTextMatrix(MatrixIO.java:924)

In Matrices.create(Matrices.java:216) I find the lines:

case SPARSE_ON_DISK:
    //return new SparseOnDiskMatrix(rows, cols);
    // REMINDER: implement me
    return new OnDiskMatrix(rows, cols);

Is the MATLAB matrix format not implemented yet or is it a bug?

The output above the exception suggests a general problem with reading the Matlab output:

Nov 09, 2012 11:21:52 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 15262 rows and 0 cols
Nov 09, 2012 11:21:52 AM edu.ucla.sspace.matrix.MatrixIO readDenseTextMatrix
FINE: reading in text matrix with 100 rows and 0 cols

And Matlab gives the warning:

Warning: Imaginary part of complex variable 'U' not saved to ASCII file.
Warning: Imaginary part of complex variable 'V' not saved to ASCII file.
But the three Matlab output matrices seem normal.

I'm using Mac OS X 10.7, Matlab 2012a, and the S-Space 2.0 code (but this happened with earlier code, too).

DocumentVectorBuilder ignores USE_TERM_FREQUENCIES_PROPERTY

I tried to use the DocumentVectorBuilder class while honouring term frequencies.
This fails because of a bug (IMHO) in the add method:

public void add(DoubleVector dest, Vector src, int factor) {
    if (src instanceof SparseVector) {
        int[] nonZeros = ((SparseVector) src).getNonZeroIndices();
        for (int i : nonZeros)
            dest.add(i, src.getValue(i).doubleValue());
    } else {
        for (int i = 0; i < src.length(); ++i)
            dest.add(i, src.getValue(i).doubleValue());
    }
}

The 'factor' parameter is not used at all!?
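A sketch of the proposed fix, using plain arrays in place of S-Space's DoubleVector/Vector types: each source value is scaled by 'factor' before being accumulated, so term frequencies are actually honoured.

```java
public class AddWithFactorSketch {
    // dest and src stand in for DoubleVector and Vector; the only change
    // from the method quoted above is the "* factor" in the accumulation.
    static void add(double[] dest, double[] src, int factor) {
        for (int i = 0; i < src.length; ++i)
            dest[i] += src[i] * factor;
    }

    public static void main(String[] args) {
        double[] dest = {0.0, 1.0};
        add(dest, new double[] {2.0, 3.0}, 2);
        System.out.println(dest[0] + " " + dest[1]); // 4.0 7.0
    }
}
```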

VF2State - backtrack assert

Hi there,

first of all, great job on porting VF2 to Java!
However, I think there is a bug in the method addPair(int, int) in edu.ucla.sspace.graph.isomorphism.VF2State.

Before incrementing coreLen (line 387), origCoreLen should be updated:

    origCoreLen = coreLen;

Otherwise line 497 results in an AssertionError when trying to backtrack after more than one match has been found.

Regards,
Julian

ArrayIndexOutOfBoundsException when using LSA

Hi
I am getting the following error, as reported in #13.
I realised that the default dimension of 300 is causing this; when I set it to 1, I don't get any errors.
What I do not understand is that I have one input file with over 900 documents (one per line), each document consisting of a few words. In fact, in the debugger I see that there are over 3500 unique words being stored.
Is this normal?
Because later on, when I run with dimension=1 and then iterate over the space using

System.out.printf("%s maps to %s%n", word, VectorIO.toString(algo.getVector(word)));

I see [0.0] in the output for each word!
It looks like I am doing something wrong.

any help is appreciated.

Best regards
Tezcan

Fixing Random Seeds for RandomIndexing

I've been having some trouble fixing the Random seeds used in Random Indexing.

I'd like to have predictable output across runs, so I can run through a set of fixed seeds and see how much of an impact the random initialization and other parameters have on retrieval.

The example:

RandomIndexing ri = new RandomIndexing(new Properties());
ri.RANDOM.setSeed(SEED);

doesn't give me predictable output, because the RandomIndexVectorGenerator class has a random number source I can't fix for testing.

One way to do this would be to make the random seed an optional property - same as vectorLength.

(IncrementalSemanticAnalysis also uses some of the Random Indexing classes, so it might need the same change. I'll make a pull request once I'm done.)
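The proposed optional property could look like the following sketch. The property name "randomSeed" is hypothetical, not an existing S-Space property; the point is only that seeding one shared Random from a property makes runs reproducible.

```java
import java.util.Properties;
import java.util.Random;

public class SeedPropertySketch {
    // Build the shared random source from an optional property.
    static Random fromProperties(Properties props) {
        String seed = props.getProperty("randomSeed"); // hypothetical property name
        return (seed != null) ? new Random(Long.parseLong(seed)) : new Random();
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("randomSeed", "42");
        // Two instances seeded identically produce the same stream of values.
        System.out.println(fromProperties(props).nextInt() == fromProperties(props).nextInt());
    }
}
```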

mvn package error

Dear S-Space team,

Hi. I am trying to produce the jar libraries, since most of the functionality is described in terms of jars, but I am facing some errors: the command "mvn package" produces the output below.

My environment is
Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.8.0_101, vendor: Oracle Corporation
Java home: /usr/local/java/jdk1.8.0_101/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "3.13.0-32-generic", arch: "i386", family: "unix"

I solved the "mvn compile" issue using this link (https://groups.google.com/forum/#!topic/s-space-users/2DLs--h13Fc), but I couldn't find a solution for the problem I am currently facing.

I'd appreciate any comments.

Regards,

Hangil

Results :

Failed tests: testNoConfig(edu.ucla.sspace.text.IteratorFactoryTests): expected:<[is]> but was:<[my]>
test(edu.ucla.sspace.ri.TestRandomIndexing): expected:<[the, quick, brown, fox, jumps, over, lazy, dog]> but was:<[]>

Tests run: 1339, Failures: 2, Errors: 0, Skipped: 6

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 6.400s
[INFO] Finished at: Mon Sep 05 07:54:10 PDT 2016
[INFO] Final Memory: 13M/32M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.12:test (default-test) on project sspace: There are test failures.
[ERROR]
[ERROR] Please refer to /suny/lsa/S-Space/target/surefire-reports for the individual test results.
[ERROR] -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException

Matrix transformation produces strange floating-point values with commas

Dear Developers,

I'm using a library that calls edu.ucla.sspace.matrix.SVD.svd(File matrix, Algorithm alg, Format format, int dimensions) to perform an SVD.
However, during execution a java.lang.NumberFormatException occurs on the string "0,273696". Obviously the comma is causing the problem, but how the comma comes to exist in the first place is unclear to me. The algorithm used is ANY.

The error:
Exception in thread "main" java.lang.NumberFormatException: For input string: "0,273696"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
at java.lang.Double.parseDouble(Double.java:540)
at edu.ucla.sspace.matrix.MatrixIO.readMatlabSparse(MatrixIO.java:1136)
at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:798)
at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:723)
at edu.ucla.sspace.matrix.MatrixIO.readMatrixArray(MatrixIO.java:697)
at edu.ucla.sspace.matrix.SVD.svd(SVD.java:426)
at edu.ucla.sspace.matrix.SVD.svd(SVD.java:430)
...

So far I have been able to trace the origin of the error to the transformation of a matrix. The matrix itself looks fine, but the transformed variant suddenly contains floating-point values with commas instead of points.

(Screenshots of the normal and the transformed matrix omitted.)

The transformation is called in this fashion:

import edu.ucla.sspace.matrix.Transform;
...
Transform transform = new LogEntropyTransform();
File transformedMatrix = transform.transform(termDocumentMatrix,
                                             termDocumentMatrixBuilder.getMatrixFormat());
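A likely cause, sketched below, is locale-sensitive formatting: under a default locale that uses a decimal comma (e.g. German), String.format writes "0,273696", which Double.parseDouble cannot read back. Formatting with Locale.US (or Locale.ROOT) keeps the decimal point. This is a general Java illustration, not a confirmed diagnosis of the S-Space code path.

```java
import java.util.Locale;

public class LocaleCommaSketch {
    public static void main(String[] args) {
        double v = 0.273696;
        // Comma locale: exactly the string from the exception above.
        System.out.println(String.format(Locale.GERMANY, "%f", v)); // 0,273696
        // Point locale: parseable by Double.parseDouble.
        System.out.println(String.format(Locale.US, "%f", v));      // 0.273696
        System.out.println(Double.parseDouble(String.format(Locale.US, "%f", v)));
    }
}
```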

I hope you can help me.

Yours,
Laura

NoClassDefFoundError: gnu/trove/map/TObjectIntMap

Dear developers,

I passed the "mvn compile" and "mvn test" steps, but I ran into the following problem at runtime:

java -cp target/classes edu.ucla.sspace.mains.VsmMain -d data/test.txt data/ -o text

java.lang.NoClassDefFoundError: gnu/trove/map/TObjectIntMap
    at edu.ucla.sspace.common.GenericTermDocumentVectorSpace.processDocument(GenericTermDocumentVectorSpace.java:199)
    at edu.ucla.sspace.mains.GenericMain$1.run(GenericMain.java:586)
    at edu.ucla.sspace.util.WorkQueue$CountingRunnable.run(WorkQueue.java:361)
    at edu.ucla.sspace.util.WorkerThread.run(WorkerThread.java:110)
Caused by: java.lang.ClassNotFoundException: gnu.trove.map.TObjectIntMap
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:423)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:356)
    ... 4 more

How can I solve this?

SVDLIBC and SVDLIBJ are broken in different ways

I computed a log-entropy term-document matrix over 92,000 documents from the New York Times, year 2003. When reducing it via SVD, the three different implementations give wildly and significantly different results:

Octave/Matlab: I consider this the most correct. When using the reduced U and V matrices in word-similarity evaluations and document classification, results were high.
SVDLIBC: Broken when saving to dense binary formats (it seems to work fine for dense text). For the most part, the U and V matrices matched those from Octave, except that LIBC adds a new first row to U and V that does not match any of the matrices from Octave/Matlab.
SVDLIBJ: Sometimes this matches Octave/Matlab perfectly. Other times it generates mystery magic. This should no longer be used.

Based on these results, and the fact that I only recently found out that dense text formats with LIBC work correctly, I did a search for Java-based SVD implementations. Beyond the ones we've already considered, there are Apache's Commons Math library and Mahout's SVD implementation, but both are unsatisfactory: Commons Math computes the full SVD and does not use sparse matrices, and Mahout only seems to return the U matrix, as best I can tell.

This probably won't see a solution for a while, but this is just to track any progress we do make.

java.lang.ArrayIndexOutOfBoundsException

Feb 18, 2016 11:28:35 AM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Feb 18, 2016 11:28:35 AM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 300 dimensions
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:94)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)

The number of words in my corpus turns out to be 6000+. Is the code unable to reduce the vector size from 6000+ to 300? What is the solution?

17 Errors when testing with Maven

Hi,
I encountered two problems when getting started with S-Space. My operating system is Debian, and I have recently installed Matlab and Maven.

Your tutorial mentions that after compiling with Maven, the output looks like this:

user@machine$ mvn compile
[INFO] Scanning for projects...
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building S-Space Package 2.0
[INFO] ------------------------------------------------------------------------
[INFO] Compiling 495 source files to /home/stevens35/devel/S-Space/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 12.309s
[INFO] Finished at: Thu Oct 27 08:56:28 PDT 2011
[INFO] Final Memory: 24M/361M
[INFO] ------------------------------------------------------------------------

whereas my output is

[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building S-Space Package 2.0.4
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.0:enforce (enforce-maven) @ sspace ---
[INFO]
[INFO] --- maven-resources-plugin:2.3:resources (default-resources) @ sspace ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/tobi/S-Space/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ sspace ---
[INFO] Compiling 5 source files to /home/tobi/S-Space/target/classes
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.854s
[INFO] Finished at: Sun Feb 15 19:22:07 CET 2015
[INFO] Final Memory: 12M/89M
[INFO] ------------------------------------------------------------------------

where only 5 source files have been compiled to /S-Space/target/classes.

Also, when testing, I get 17 errors, which occur when an iterator is used:

Tests in error:
testIterator(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): For input string: "1,500000"

testRemove(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>

testEmptyNext(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

testIterator(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): For input string: "2,300000"

testRemove(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>

testEmptyNext(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

testTransform(edu.ucla.sspace.matrix.SvdlibcSparseTextFileTransformerTest): For input string: "2,000000"

testIterator(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): For input string: "1,500000"

testRemove(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>

testEmptyNext(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

testTransform(edu.ucla.sspace.matrix.MatlabSparseFileTransformerTest): For input string: "2,000000"

testIterator(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): For input string: "1,500000"

testRemove(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>

testEmptyNext(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

testIterator(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): For input string: "1,500000"

testRemove(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>

testEmptyNext(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

I guess it is just a simple bug, but I can't find it.
Thank you for your help!

Tobias

Bug in SemanticSpaceIO.writeText

When using a SemanticSpace with sparse vectors, SemanticSpaceIO.writeText() incorrectly saves the vectors in sparse format rather than dense format. This is due to calling VectorIO.toString(), which treats SparseVector instances specially.

The fix is to either remove the dependency on VectorIO so that all the values are written, or to update VectorIO to refactor the SparseVector functionality into a separate method (perhaps a SparseVector overload?).

In the current state, I don't think the .sspace files written in this hybrid format are readable, though it's amazing that we hadn't noticed this for so long.
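The refactoring suggested above could be sketched as a dedicated dense writer that always emits every component, even for sparsely stored vectors. The types here (a map of index to value standing in for SparseVector, and a fixed vector length) are simplified stand-ins, not S-Space's actual VectorIO API:

```java
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;

public class DenseWriteSketch {
    // Write all 'length' components, filling in 0.0 for absent indices,
    // so the on-disk text is always in dense format.
    static String toDenseString(Map<Integer, Double> sparse, int length) {
        StringJoiner sj = new StringJoiner(" ");
        for (int i = 0; i < length; ++i)
            sj.add(String.valueOf(sparse.getOrDefault(i, 0.0)));
        return sj.toString();
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = new TreeMap<>();
        v.put(1, 2.0); // only index 1 is non-zero
        System.out.println(toDenseString(v, 4)); // 0.0 2.0 0.0 0.0
    }
}
```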

Temporal Random Indexing

Hi,

I'm trying to run the ft-tri.jar with the command:
java -server -Xmx4g -jar fd-tri-v2.0.jar --timespan=1y --verbose --threads=4 --vectorLength=10000 --tokenFilter=exclude=stoplist.txt --docFile=corpus.txt --interestingTokenList=wordsToWatch.txt --savePartitions --printInterestingTokenNeighbors=50 --printInterestingTokenShifts output

and I have some problem with the output.
the script returns me

  • a .sspace file containing all the vectors
  • a file named "word to watch"-"date".txt containing 50 neighbors
  • en empty file .temporal-changes.txt in which I'm interested

What's wrong? how can I get the .temporal-changes output?

Thanks for the help.

Java 7 warns against Raw Types

Compiling with Java 7 now creates numerous warnings of the style

/home/stevens35/devel/S-Space/src/edu/ucla/sspace/index/DefaultPermutationFunction.java:135: warning: [rawtypes] found raw type: Vector
[javac] public Vector permute(Vector v , int numPermutations) {

This primarily occurs for unparameterized uses of Vector and Class.

We should decide how to address these warnings at some point, but they don't block any progress.
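Eliminating such a warning usually just means parameterizing the signature. The following illustration uses a minimal stand-in interface named Vec, not S-Space's edu.ucla.sspace.vector.Vector:

```java
public class RawTypeSketch {
    interface Vec<T extends Number> {
        int length();
    }

    // Raw version (analogous to the warning above):
    //   static Vec permute(Vec v, int numPermutations) { ... }
    // Parameterized version, which compiles without a rawtypes warning:
    static <T extends Number> Vec<T> permute(Vec<T> v, int numPermutations) {
        return v; // identity permutation, for illustration only
    }

    public static void main(String[] args) {
        Vec<Double> v = () -> 3; // lambda implementing the one-method interface
        System.out.println(permute(v, 2).length()); // 3
    }
}
```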

Class not found

Hi all,

I tried to test the LSA class. Following the instructions on the Getting Started wiki page, I ran this command from the S-Space directory:
user@machine$ java -cp classes edu.ucla.sspace.mains.LSAMain

But I received an error:
Error: Could not find or load main class edu.ucla.sspace.mains.LSAMain

In addition, when I ran LSAMain.java, I got this:

Saving matrix using edu.ucla.sspace.matrix.SvdlibcSparseBinaryMatrixBuilder@12edcd21
Exception in thread "WorkerThread-1" java.util.NoSuchElementException
at edu.ucla.sspace.util.CombinedIterator.next(CombinedIterator.java:119)
at edu.ucla.sspace.mains.GenericMain$1.run(GenericMain.java:582)
at edu.ucla.sspace.util.WorkQueue$CountingRunnable.run(WorkQueue.java:361)
at edu.ucla.sspace.util.WorkerThread.run(WorkerThread.java:110)
Saw 16 terms, 15 unique
Saw 17 terms, 16 unique
Saw 15 terms, 15 unique
edu.ucla.sspace.lsa.LatentSemanticAnalysis@33918ac4 processing doc edu.ucla.sspace.util.SparseIntHashArray@20abfa41
edu.ucla.sspace.lsa.LatentSemanticAnalysis@33918ac4 processing doc edu.ucla.sspace.util.SparseIntHashArray@48b45e8f
edu.ucla.sspace.lsa.LatentSemanticAnalysis@33918ac4 processing doc edu.ucla.sspace.util.SparseIntHashArray@3d5662c6
May 11, 2015 1:52:29 PM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
May 11, 2015 1:52:29 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
May 11, 2015 1:52:29 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
May 11, 2015 1:52:29 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
May 11, 2015 1:52:29 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 2 dimensions
output File: my-lsa-output.sspace

Even though it produces an output file, there is still an error:
Exception in thread "WorkerThread-1" java.util.NoSuchElementException

Could you please tell me how to fix this error?
Thank you.

Fresh pull failing with 6 test failures and 2 errors

Platform: OSX 10.7.4

Failed tests: testSpearman2(edu.ucla.sspace.common.SimilarityTest): expected:<-0.55> but was:<-0.5>
testProcessDocument(edu.ucla.sspace.wordsi.GeneralContextExtractorTest)
testProcessDocumentWithHeader(edu.ucla.sspace.wordsi.GeneralContextExtractorTest): expected:<[CHICKEN:]> but was:<[]>
testProcessDocument(edu.ucla.sspace.wordsi.psd.PseudoWordContextExtractorTest)
testProcessDocumentWithDefaultSeparator(edu.ucla.sspace.wordsi.semeval.SemEvalContextExtractorTest)
testProcessDocumentWithNonDefaultSeparator(edu.ucla.sspace.wordsi.semeval.SemEvalContextExtractorTest)

Tests in error:
testMatrixReduction(edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibCTest)
testSvdlibcCluto(edu.ucla.sspace.matrix.SvdTests): SVDLIBC is not correctly installed on this system

Tests run: 1304, Failures: 6, Errors: 2, Skipped: 0

Probability LSA

Dear all,

Does S-Space support probabilistic LSA (PLSA)? If so, could you please tell me which class I should look at to test this method?

Thanks a lot.

NumberFormatException when attempting to run LatentSemanticAnalysis class

Hi, I'm facing an error while calling the LSA class as a library. The error was thrown during the processSpace() call.

void initialize() throws IOException {
    //....
    LatentSemanticAnalysis lsa = new LatentSemanticAnalysis(3);

    File input = new File("data/input2.txt");

    BufferedReader br = new BufferedReader(new FileReader(input));

    lsa.processDocument(br);

    lsa.processSpace(System.getProperties()); // <--- Error happens within this method
}

System Output:
Initializing MyLSAmain
Saving matrix using edu.ucla.sspace.matrix.SvdlibcSparseBinaryMatrixBuilder@60e53b93
Saw 19 terms, 8 unique
edu.ucla.sspace.lsa.LatentSemanticAnalysis@7adf9f5f processing doc edu.ucla.sspace.util.SparseIntHashArray@85ede7b
Jan 26, 2015 2:03:01 PM edu.ucla.sspace.common.GenericTermDocumentVectorSpace processSpace
INFO: performing log-entropy transform
Jan 26, 2015 2:03:01 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the total row counts
Jan 26, 2015 2:03:01 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Computing the entropy of each row
Jan 26, 2015 2:03:01 PM edu.ucla.sspace.matrix.LogEntropyTransform$LogEntropyGlobalTransform
INFO: Scaling the entropy of the rows
Jan 26, 2015 2:03:01 PM edu.ucla.sspace.lsa.LatentSemanticAnalysis processSpace
INFO: reducing to 3 dimensions
Exception in thread "main" java.lang.NumberFormatException: For input string: "nan"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at edu.ucla.sspace.matrix.MatrixIO.readDenseSVDLIBCtext(MatrixIO.java:994)
at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:809)
at edu.ucla.sspace.matrix.MatrixIO.readMatrix(MatrixIO.java:762)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibC.factorize(SingularValueDecompositionLibC.java:153)
at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:463)
at edu.ucla.sspace.mains.MyMain.initialize(MyMain.java:62)
at edu.ucla.sspace.mains.MyMain.(MyMain.java:23)
at edu.ucla.sspace.mains.MyMain.main(MyMain.java:33)

This is a follow-up to #58, where I managed to run LSAMain successfully. Am I missing something?
Thanks
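One observation on the immediate exception: Java's Double.parseDouble accepts the spelling "NaN" but not the lowercase "nan" that appears in the SVDLIBC output file here. A lenient parse like the sketch below would avoid the NumberFormatException, though NaN singular values suggest the real problem is upstream in the factorization (parseLenient is illustrative, not an S-Space method):

```java
public class NanParse {
    // Double.parseDouble accepts "NaN" but not "nan"; normalizing the
    // token first avoids the NumberFormatException seen in the trace.
    static double parseLenient(String s) {
        if (s.equalsIgnoreCase("nan"))
            return Double.NaN;
        return Double.parseDouble(s);
    }

    public static void main(String[] args) {
        System.out.println(parseLenient("nan"));  // NaN
        System.out.println(parseLenient("1.5"));  // 1.5
    }
}
```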

Ability to pass pre-tokenized text to processDocument

Just a small feature request: it'd be nice if it were possible to pass a pre-tokenized Iterator<String> to SemanticSpace.processDocument, for cases where SSpace is being used as one component in a bigger system and the tokenization has already been handled elsewhere. Perhaps I'm missing something, but at the moment it seems I have to work around this by supplying a BufferedReader that joins my tokens together, only for them to be re-tokenized at the other end.

Removing the hardcoded calls to static methods of IteratorFactory might also have the benefit of allowing different tokenizers or tokenization settings to be used without having to modify global state via the static IteratorFactory.setProperties.
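A sketch of the kind of overload being requested, with a plain token-count map standing in for the real semantic-space update (the processDocument method here is illustrative, not the actual SemanticSpace API):

```java
import java.util.*;

public class PreTokenized {
    // Illustrative stand-in for the requested API: accept tokens
    // directly instead of re-tokenizing a joined string supplied
    // through a BufferedReader.
    static Map<String, Integer> processDocument(Iterator<String> tokens) {
        Map<String, Integer> counts = new HashMap<>();
        while (tokens.hasNext())
            counts.merge(tokens.next(), 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            processDocument(Arrays.asList("a", "b", "a").iterator());
        System.out.println(counts.get("a")); // 2
    }
}
```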

Text file output is not readable?

Dear all,
I tested the output with the "TEXT" and "SPARSE_TEXT" formats, but both produce files that are not readable; they look the same as the binary files.
I tried to follow the code in SemanticSpaceIO, but could not work out the cause.
Could you please tell me how to view the contents of these files?
Thanks a lot.

Model creation fails

Hi there

We are using s-space to create a semantic space model from a corpus of scientific articles (1.5 GB).
The command we run is:

java -server -Xms50g -Xmx64g -cp classes edu.ucla.sspace.mains.IsaMain -s 3 -h 50 -i 0.0003 -d /homes/abraham/pubmed_doc-per-file_lower_nopunct.txt ISA_Model_PMC

The error we get is:

ISA_Model_PMC/IncrementSemanticAnalysis--1800v-3w-noPermutations.sspace
java.io.UTFDataFormatException: encoded string too long: 168579 bytes
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:364)
at java.io.DataOutputStream.writeUTF(DataOutputStream.java:323)
at
edu.ucla.sspace.common.SemanticSpaceIO.writeSparseBinary(SemanticSpaceIO.java:587)
at
edu.ucla.sspace.common.SemanticSpaceIO.save(SemanticSpaceIO.java:351)
at
edu.ucla.sspace.mains.GenericMain.saveSSpace(GenericMain.java:500)
at edu.ucla.sspace.mains.GenericMain.run(GenericMain.java:480)
at edu.ucla.sspace.mains.IsaMain.main(IsaMain.java:194)

Can you please help? We tried with a small subset of 237 articles, and that didn't give any errors.
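The failure comes from DataOutputStream.writeUTF, which stores the encoded string length in an unsigned 16-bit field, so any string whose modified-UTF-8 encoding exceeds 65535 bytes cannot be written. A minimal reproduction (not S-Space code):

```java
import java.io.*;

public class WriteUtfLimit {
    // Returns true if writeUTF rejects the string: writeUTF records the
    // encoded length in an unsigned 16-bit field, so anything over
    // 65535 modified-UTF-8 bytes throws UTFDataFormatException.
    static boolean exceedsWriteUtfLimit(String s) {
        try (DataOutputStream out =
                 new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s);
            return false;
        } catch (UTFDataFormatException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 70000; i++)
            sb.append('a');
        System.out.println(exceedsWriteUtfLimit(sb.toString())); // true
        System.out.println(exceedsWriteUtfLimit("short"));       // false
    }
}
```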

9 failures and 23 errors on a fresh pull

Hello,

I get 9 failures and 23 errors when I run mvn test on a freshly cloned S-Space. I know from a closed issue on a similar topic that some of these are not serious, but I got several more, so I thought I should report it. I mark the failures that are new compared to issue #15 with NEW at the beginning of the line.

I am on Ubuntu 12.04.1 LTS, compiling with JAVA_HOME set to /usr/lib/svm/java-7-openjdk-amd6 (somehow it does not work with java-6).

BTW, I think it would be great to add to the "Getting Started" page a note that one has to get jama.jar and jaws-bin.jar before running opt/add_non_maven_jars.sh (also, this script is not in ., but in opt/).

Failed tests:
testSpearmanDoubleTies(edu.ucla.sspace.common.SimilarityTest): expected:<-0.55> but was:<-0.6324555320336759>
NEW ---> writeDenseTextMatrixFile(edu.ucla.sspace.matrix.MatrixIOTest): expected:<3> but was:<0>
NEW ---> testSetAddSumIsZero(edu.ucla.sspace.vector.SparseHashIntegerVectorTests): expected:<0> but was:<1>
testProcessDocument(edu.ucla.sspace.wordsi.psd.PseudoWordContextExtractorTest)
testProcessDocumentWithDefaultSeparator(edu.ucla.sspace.wordsi.semeval.SemEvalContextExtractorTest)
testProcessDocumentWithNonDefaultSeparator(edu.ucla.sspace.wordsi.semeval.SemEvalContextExtractorTest)
testProcessDocument(edu.ucla.sspace.wordsi.GeneralContextExtractorTest)
testProcessDocumentWithHeader(edu.ucla.sspace.wordsi.GeneralContextExtractorTest): expected:<[CHICKEN:]> but was:<[]>
NEW ---> test(edu.ucla.sspace.ri.TestRandomIndexing): expected:<[the, quick, brown, fox, jumps, over, lazy, dog]> but was:<[]>

Tests in error:
testIterator(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): For input string: "1,500000"
testRemove(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>
testEmptyNext(edu.ucla.sspace.matrix.SvdlibcSparseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>
testIterator(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): For input string: "2,300000"
testRemove(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>
testEmptyNext(edu.ucla.sspace.matrix.MatlabSparseFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>
readTransposedMatrixFromDenseText(edu.ucla.sspace.matrix.MatrixIOTest): line 2 contains an inconsistent number of columns. 0 columns expected versus 1 found.
readMatrixFromDenseText(edu.ucla.sspace.matrix.MatrixIOTest): line 2 contains an inconsistent number of columns. 0 columns expected versus 1 found.
readMatrixArrayFromDenseText(edu.ucla.sspace.matrix.MatrixIOTest): line 2 contains an inconsistent number of columns. 0 columns expected versus 1 found.
testTransform(edu.ucla.sspace.matrix.MatlabSparseFileTransformerTest): For input string: "2,000000"
testTransform(edu.ucla.sspace.matrix.SvdlibcSparseTextFileTransformerTest): For input string: "2,000000"
testSvdlibcCluto(edu.ucla.sspace.matrix.SvdTests): SVDLIBC is not correctly installed on this system
testMatrixReduction(edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibCTest): Use of this class requires the SVDLIBC command line program, which is either not installed on this system or is not available to be executed from the command line. Check that your PATH settings are correct or see http://tedlab.mit.edu/~dr/SVDLIBC/ to download and install the program.
testMatrixReduction(edu.ucla.sspace.matrix.factorization.SingularValueDecompositionOctaveTest): dimensions must be positive
testIterator(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): For input string: "1,500000"
testRemove(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>
testEmptyNext(edu.ucla.sspace.matrix.SvdlibcDenseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>
testIterator(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): For input string: "1,500000"
testRemove(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>
testEmptyNext(edu.ucla.sspace.matrix.ClutoSparseFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>
testIterator(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): For input string: "1,500000"
testRemove(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): Unexpected exception, expected<java.lang.UnsupportedOperationException> but was<java.lang.NumberFormatException>
testEmptyNext(edu.ucla.sspace.matrix.DenseTextFileIteratorTests): Unexpected exception, expected<java.util.NoSuchElementException> but was<java.lang.NumberFormatException>

Tests run: 1321, Failures: 9, Errors: 23, Skipped: 0

Statistics.entropy computes Entropy incorrectly (or at least non-intuitively)

The current code for computing entropy in the Statistics utility class counts up the number of times elements in an array occur and then computes the entropy over the frequencies with which those array elements occur. However, entropy is often called when the given array already contains counts of individual events, where each column holds the count for a particular type of event. Therefore, entropy should treat the input arrays as unnormalized probability distributions, rather than as listings of particular events.

For example. Consider the array counting the number of times "the" co-occurs with some words in a particular sentence:

double[] array =  {2d, 5d, 4d, 3d, 6d, 4d, 3d};
Statistics.entropy(array);
// Returns 2.2359263506290326

In R, using the following standard entropy definition, we'd get:

shannon.entropy <- function(p)  {
    p.norm <- p[p>0]/sum(p)
   -sum(log2(p.norm)*p.norm)
}
shannon.entropy(c(2.0, 5.0, 4.0, 3.0, 6.0, 4.0, 3.0))
# Returns 2.731584

Secondly, the entropy function uses log base 2 without documentation; this should either be noted in the javadoc or be made available as a parameter.
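For comparison, a Java sketch of entropy computed the way the report suggests, treating the array as an unnormalized distribution and matching the R result above (the method name shannonEntropy is illustrative, not part of the Statistics API):

```java
public class EntropyExample {
    // Shannon entropy (base 2) of an array treated as an unnormalized
    // probability distribution: normalize by the total, then compute
    // -sum(p * log2 p) over the non-zero entries.
    static double shannonEntropy(double[] counts) {
        double total = 0;
        for (double c : counts)
            total += c;
        double h = 0;
        for (double c : counts) {
            if (c > 0) {
                double p = c / total;
                h -= p * (Math.log(p) / Math.log(2)); // log base 2
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] array = {2d, 5d, 4d, 3d, 6d, 4d, 3d};
        // Matches the R result above, approximately 2.731584
        System.out.println(shannonEntropy(array));
    }
}
```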

Subgraph Isomorphism

Hi, is there any way to calculate subgraph isomorphism?
I've searched through the source code, but there's only VF2State, which handles graph isomorphism, not subgraph isomorphism.

Thank you.

error on LatentSemanticAnalysis

I'm developing a Java application that uses your library's LatentSemanticAnalysis. When I call lsaDiff2.processSpace(System.getProperties()), I get this exception:

java.lang.ArrayIndexOutOfBoundsException
at edu.ucla.sspace.matrix.DiagonalMatrix.checkIndices(DiagonalMatrix.java:78)
at edu.ucla.sspace.matrix.DiagonalMatrix.get(DiagonalMatrix.java:85)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionLibJ.factorize(SingularValueDecompositionLibJ.java:89)
at edu.ucla.sspace.lsa.LatentSemanticAnalysis.processSpace(LatentSemanticAnalysis.java:360)
at unal.edu.co.semantic.SemanticAnalysis.semanticAnalysisDiff(SemanticAnalysis.java:117)
at unal.edu.co.semantic.SemanticAnalysis.(SemanticAnalysis.java:58)
at unal.edu.co.persistence.CommitDiffTest.testLSI(CommitDiffTest.java:133)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

Please, help me.

Update error message for SVDLIBC to check that it's installed

When the SVD driver can't find the SVDLIBC executable, it usually has a message like "java.io.IOException: Cannot run program "svd": error=20, Not a directory" in the Throwable. It would probably help end users if we caught the exception, looked at the message, and reported something more reasonable (e.g., "couldn't find the executable") so they could try debugging on their own, rather than propagating the IOException to the top and leaving them wondering what went wrong.

Euclidian distance is calculated incorrectly for sparse vectors

Hi,
I believe there is a small bug in edu.ucla.sspace.Similarity,
method: "public static double euclideanDistance(DoubleVector a, DoubleVector b)"
The sqrt of the sum before the return is missing in one branch of the method:

if (a instanceof SparseVector && b instanceof SparseVector) {
    SparseVector svA = (SparseVector)a;
    SparseVector svB = (SparseVector)b;
    int[] aNonZero = svA.getNonZeroIndices();
    int[] bNonZero = svB.getNonZeroIndices();
    HashSet<Integer> sparseIndicesA = new HashSet<Integer>(
        aNonZero.length);
    double sum = 0;
    for (int nonZero : aNonZero) {
        sum += Math.pow((a.get(nonZero) - b.get(nonZero)), 2);
        sparseIndicesA.add(nonZero);
    }
    for (int nonZero : bNonZero)
        if (!sparseIndicesA.contains(nonZero))
            sum += Math.pow(b.get(nonZero), 2);
    return sum;
}
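For reference, a self-contained sketch of what the branch should compute, with plain arrays standing in for SparseVector; the only substantive change from the quoted code is taking the square root before returning:

```java
public class EuclideanExample {
    // Euclidean distance: the square root of the sum of squared
    // differences. The quoted sparse branch returns the sum itself,
    // omitting the final Math.sqrt.
    static double euclideanDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum); // the sqrt missing from the report
    }

    public static void main(String[] args) {
        double[] a = {3, 0, 0};
        double[] b = {0, 4, 0};
        System.out.println(euclideanDistance(a, b)); // 5.0
    }
}
```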

Can I limit the vocabulary?

Dear developers,

I did not find an option to limit the vocabulary. For example, I don't want to learn representations for words which occur fewer than 50 times in my corpus.
The reason is that if I use all the words (or exclude the stop words), the vocabulary will be very large, which is undesired.

I am wondering whether there is a convenient way for doing this?
Thanks very much,
Jiang
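In the absence of a built-in option, one workaround is a two-pass pre-filter applied to the token stream before the corpus is handed to the semantic space: first count token frequencies, then drop tokens below the threshold. This sketch is illustrative and not part of the S-Space API:

```java
import java.util.*;

public class VocabFilter {
    // First pass counts token frequencies; second pass keeps only
    // tokens seen at least minCount times, preserving order.
    static List<String> filterByFrequency(List<String> tokens, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens)
            counts.merge(t, 1, Integer::sum);
        List<String> kept = new ArrayList<>();
        for (String t : tokens)
            if (counts.get(t) >= minCount)
                kept.add(t);
        return kept;
    }

    public static void main(String[] args) {
        List<String> toks = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(filterByFrequency(toks, 2)); // [a, a, a]
    }
}
```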

HAL is incorrectly implemented

In HyperspaceAnalogueToLanguage.java there is a link to Lund's paper, which describes the HAL algorithm. However, the processDocument method does not follow that description. Instead, it does some magic (hard to say exactly what: for every word, the method stores its weighted forerunners (occurrences of forerunners are wrongly weighted when the default linear weighting is used) and its (incorrectly computed) followers in one matrix X; the numbers from X are then used to create the second part of the final matrix XY).

To prove my claim: try HAL on a single document containing "the horse raced past the barn fell perion" from the paper, with the standard HAL settings (set "-o text" to see the result in a text file). In the result, you can even see numbers higher than 6 (the highest expected value in the correct version for Lund's sentence), which is not correct for the default window size of 5 and linear weighting.

I believe (following Lund's paper) the correct implementation of the processDocument method should do the following: create a matrix X where the weighted forerunners are stored, then store the weighted followers implied by X in some matrix Y; the resulting matrix is the concatenation of X (forerunners of focused words) and Y (followers of focused words).

My code for what I believe is the correct processDocument implementation follows:

public void  processDocument(BufferedReader document) throws IOException {
    Queue<String> nextWords = new ArrayDeque<String>();
    Queue<String> prevWords = new ArrayDeque<String>();

    Iterator<String> documentTokens = 
        IteratorFactory.tokenizeOrdered(document);

    String focus = null;

    // Rather than updating the matrix every time an occurrence is seen,
    // keep a thread-local count of what needs to be modified in the matrix
    // and update after the document has been processed.  This saves
    // potential contention from concurrent writes.
    Map<Pair<Integer>,Double> matrixEntryToCount = 
        new HashMap<Pair<Integer>,Double>();

    //Load the first windowSize words into the Queue        
    for(int i = 0;  i < windowSize && documentTokens.hasNext(); i++)
        nextWords.offer(documentTokens.next());

    while(!nextWords.isEmpty()) {

        // Load the top of the nextWords Queue into the focus word
        focus = nextWords.remove();

        // Add the next word to nextWords queue (if possible)
        if (documentTokens.hasNext()) {        
            String windowEdge = documentTokens.next();
            nextWords.offer(windowEdge);
        }            

        // If the filter does not accept this word, skip the semantic
        // processing, continue with the next word
        if (focus.equals(IteratorFactory.EMPTY_TOKEN)) {
        // shift the window
            prevWords.offer(focus);
            if (prevWords.size() > windowSize)
                prevWords.remove();
            continue;
        }

        int focusIndex = getIndexFor(focus);

        // in prevWords the first words are the most far from the focused word!
        // the prevWords does not have to be full (having the windowSize size)
        int wordDistance = windowSize - (windowSize - prevWords.size()); // in front of the focus word
        for (String before : prevWords) {
            // skip adding co-occurence values for words that are not
            // accepted by the filter
            if (!before.equals(IteratorFactory.EMPTY_TOKEN)) {
                int index = getIndexFor(before);

                // LK change: store the co-occurrence into focusIndex x index indices
                // Weight the word appropriately based on distance
                Pair<Integer> p = new Pair<Integer>(focusIndex, index);
                double value = weighting.weight(wordDistance, windowSize);
                Double curCount = matrixEntryToCount.get(p);
                matrixEntryToCount.put(p, (curCount == null)
                                       ? value : value + curCount);
            }
            wordDistance--;
        }

        // last, put this focus word in the prev words and shift off the
        // front if it is larger than the window
        prevWords.offer(focus);
        if (prevWords.size() > windowSize)
            prevWords.remove();
    }

    // Once the document has been processed, update the co-occurrence matrix
    // accordingly.
    for (Map.Entry<Pair<Integer>,Double> e : matrixEntryToCount.entrySet()){
        Pair<Integer> p = e.getKey();
        cooccurrenceMatrix.addAndGet(p.x, p.y, e.getValue());
    }
    // LK change: hack to ensure n * 2n matrix size..
    // without this the size can be n * (2n-1)
    if (cooccurrenceMatrix.get(termToIndex.size() - 1, termToIndex.size() - 1) == 0.0) {
        cooccurrenceMatrix.set(termToIndex.size() - 1, termToIndex.size() - 1, 0.0);
    }
}

java.lang.IndexOutOfBoundsException

Hi,

I got an IndexOutOfBoundsException when attempting to compute the SVD of a large matrix (3404 x 21951). The matrix is created using edu.ucla.sspace.matrix.SparseOnDiskMatrix. numDimensions is set to the maximum dimension of the matrix (21951) so that I can get the diagonal matrix S for the large matrix. It runs OK for small matrices. Any idea why this exception happens? Thanks, Jerry

Exception in thread "main" java.lang.IndexOutOfBoundsException: 8268500
at java.nio.DirectDoubleBufferS.get(DirectDoubleBufferS.java:253)
at edu.ucla.sspace.matrix.OnDiskMatrix.get(OnDiskMatrix.java:161)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab.factorize(SingularValueDecompositionMatlab.java:141)
at edu.ucla.sspace.matrix.factorization.SingularValueDecompositionMatlab.factorize(SingularValueDecompositionMatlab.java:63)
...

Cluto cluster assignments

I'm using ClutoClustering and get the Assignments object after calling cluster().
Calling Assignments.clusters() and Assignments.getSparseCentroids() (and, I suppose, getCentroids() as well) results in an IndexOutOfBounds error.
This is caused by the fact that for some input vectors the assigned cluster id is -1, as CLUTO prefers not to assign them to any of the clusters. Please handle those cases.
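Until that's handled in the library, a caller-side guard along these lines avoids the error; the grouping code is an illustrative stand-in for whatever consumes the assignments:

```java
import java.util.*;

public class ClusterGuard {
    // assignments[i] is the cluster id of vector i, or -1 if CLUTO
    // chose not to assign it (per the report above). Skipping the -1
    // entries avoids indexing a cluster array at -1.
    static Map<Integer, List<Integer>> group(int[] assignments) {
        Map<Integer, List<Integer>> clusters = new HashMap<>();
        for (int i = 0; i < assignments.length; i++) {
            if (assignments[i] < 0)
                continue; // unassigned vector; leave it out
            clusters.computeIfAbsent(assignments[i],
                                     k -> new ArrayList<>()).add(i);
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> clusters =
            group(new int[]{0, -1, 1, 0});
        System.out.println(clusters.get(0)); // [0, 3]
    }
}
```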

Generic Main should fail fast on converting options to enums.

Running

java -Xmx2g -cp target/sspace-2.0.3-jar-with-dependencies.jar edu.ucla.sspace.mains.LSAMain -d test.doc -o dense_text .

processes the corpus using SVD and only then attempts to convert the value of -o (dense_text) to an edu.ucla.sspace.common.SemanticSpaceIO.SSpaceFormat enum, at which point it fails because the value is invalid. This should fail before any processing, to save time and computational resources.
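A minimal sketch of the fail-fast approach: validate the option with Enum.valueOf before any corpus work starts. The Format enum and method names here are illustrative, not the actual SSpaceFormat type:

```java
public class FailFast {
    // Illustrative stand-in for an output-format enum.
    enum Format { TEXT, SPARSE_TEXT, BINARY, SPARSE_BINARY }

    // Convert the command-line value up front; an unrecognized value
    // (e.g. "dense_text") fails here, before any processing begins.
    static Format parseFormat(String arg) {
        try {
            return Format.valueOf(arg.toUpperCase());
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "unrecognized output format: " + arg);
        }
    }

    public static void main(String[] args) {
        System.out.println(parseFormat("sparse_text")); // SPARSE_TEXT
    }
}
```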

Fatal error in S-space saving for LSA?!

Hi,
I think there is a bug in handling the sspace file. In the AbstractSVD class, the dataClasses method is used to combine the U and S matrices:
dataClasses.set(r, c, U.get(r, c) * singularValues[c]);

When the sspace file is loaded and similarity is computed, this preprocessed data is used; more precisely, the formula q·U·S projects the query q into the semantic space. But this is wrong: the correct formula is q·U·S^(-1), with inverted singular values.
This bug leads to significantly wrong results for LSA.

In the classFeatures() function an analogous calculation appears at the same place, but I don't know if it is a bug there too, because I am not aware of that method's usage.

Incorrect application of transform in LSA projection

When you project a document to the LSA space, the original transform is not applied (the "transformed" variable is unused):

DoubleVector transformed = transform.transform(docVec);
// Represent the document as a 1-column matrix
Matrix queryAsMatrix = new ArrayMatrix(1, numDims);
for (int nz : docVec.getNonZeroIndices())
    queryAsMatrix.set(0, nz, docVec.get(nz));
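A hedged sketch of the fix, using plain arrays in place of DoubleVector/Matrix and an illustrative log weighting in place of the real transform; the point is only that the copy loop should read from the transformed vector, not the raw docVec:

```java
public class ProjectionFix {
    // Illustrative stand-in for the configured transform; the real code
    // uses whatever Transform the space was built with.
    static double[] transform(double[] docVec) {
        double[] out = new double[docVec.length];
        for (int i = 0; i < docVec.length; i++)
            out[i] = Math.log(1 + docVec[i]);
        return out;
    }

    static double[] project(double[] docVec) {
        double[] transformed = transform(docVec);
        double[] queryAsMatrix = new double[docVec.length];
        for (int nz = 0; nz < transformed.length; nz++)
            queryAsMatrix[nz] = transformed[nz]; // read transformed, not docVec
        return queryAsMatrix;
    }

    public static void main(String[] args) {
        double[] q = project(new double[]{1, 0, 2});
        System.out.println(q[0]); // log(2) ~ 0.693
    }
}
```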

Spectral Clustering usage

Hi,

I'm finding it difficult to create an appropriate generator to instantiate SpectralClustering. Is there an example somewhere?
