jsksxs360 / word2vec
A further wrapper around ansj's Word2VEC_java that also implements common word-similarity and sentence-similarity calculations.
License: Apache License 2.0
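How would I adapt this to most other languages?
Do I need to find a corpus myself and train on it? And does the word segmentation method differ from language to language?
How do I train a Java-format model (is word segmentation still required), and how large does the corpus need to be?
I want to build a question-answering system for a restricted domain. Since my corpus is fairly small, can I use this model for that? (Thanks in advance.)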
Hello! Which papers are the two sentence-similarity methods, fastSentenceSimilarity() and sentenceSimilarity(), based on?
Hello, I am very interested in your code, but when I run the following code:
package Hello;

import java.io.IOException;
import java.util.List;
import java.util.Set;

import me.xiaosheng.util.Segment;
import me.xiaosheng.word2vec.*;

public class hello {
    public static void main(String[] args) throws Exception
    {
        Word2Vec vec = new Word2Vec();
        try {
            vec.loadGoogleModel("/home/ztgong/work/language/datasets/wiki_chinese_word2vec(Google).model");
        } catch (IOException e) {
            e.printStackTrace();
        }
        String s1 = "苏州有多条公路正在施工,造成局部地区汽车行驶非常缓慢。";
        String s2 = "苏州最近有多条公路在施工,导致部分地区交通拥堵,汽车难以通行。";
        String s3 = "苏州是一座美丽的城市,四季分明,雨量充沛。";
        // Segment each sentence into a list of words
        List<String> wordList1 = Segment.getWords(s1);
        List<String> wordList2 = Segment.getWords(s2);
        List<String> wordList3 = Segment.getWords(s3);
        // Sentence similarity (all word weights set to 1)
        System.out.println("s1|s1: " + vec.sentenceSimilarity(wordList1, wordList1));
        System.out.println("s1|s2: " + vec.sentenceSimilarity(wordList1, wordList2));
        System.out.println("s1|s3: " + vec.sentenceSimilarity(wordList1, wordList3));
        // Sentence similarity (nouns and verbs weighted 1, all other words 0.8)
        float[] weightArray1 = Segment.getPOSWeightArray(Segment.getPOS(s1));
        float[] weightArray2 = Segment.getPOSWeightArray(Segment.getPOS(s2));
        float[] weightArray3 = Segment.getPOSWeightArray(Segment.getPOS(s3));
        System.out.println("s1|s1: " + vec.sentenceSimilarity(wordList1, wordList1, weightArray1, weightArray1));
        System.out.println("s1|s2: " + vec.sentenceSimilarity(wordList1, wordList2, weightArray1, weightArray2));
        System.out.println("s1|s3: " + vec.sentenceSimilarity(wordList1, wordList3, weightArray1, weightArray3));
    }
}
I get the following error:
Exception in thread "main" java.lang.NoClassDefFoundError: org/ansj/recognition/Recognition
at Hello.hello.main(hello.java:34)
Caused by: java.lang.ClassNotFoundException: org.ansj.recognition.Recognition
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 1 more
How can I fix this?
Thanks!!
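A NoClassDefFoundError caused by a ClassNotFoundException generally means the class was not found on the runtime classpath, even though the code compiled. Here the missing class is in the org.ansj package, so one plausible cause (an assumption, not confirmed in this thread) is that the ansj segmentation jar is missing at run time or is a version that does not contain that class. The sketch below is a small diagnostic; ClasspathCheck and isPresent are hypothetical names for illustration and are not part of this library.

```java
public class ClasspathCheck {
    /** Returns true if the named class can be located by this class's class loader. */
    public static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but unusable (for example, a broken dependency chain)
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace above
        String name = "org.ansj.recognition.Recognition";
        System.out.println(name + " on classpath: " + isPresent(name));
    }
}
```

Running this with the same classpath as the failing program should show whether the class is visible; if it is not, checking which ansj jar and version the build pulls in would be the next step.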
What algorithm do you use to compute sentence similarity?
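For readers looking for a reference point: one common baseline for sentence similarity with word vectors is to average the vectors of the words in each sentence and take the cosine of the two averages. The sketch below shows that baseline only; it is not confirmed to be what fastSentenceSimilarity() or sentenceSimilarity() do in this library. The class name, the Map-based embedding table, and the toy two-dimensional vectors are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AvgVectorSimilarity {
    /** Averages the vectors of the words found in the table; unknown words are skipped. */
    public static float[] average(List<String> words, Map<String, float[]> table, int dim) {
        float[] sum = new float[dim];
        int count = 0;
        for (String w : words) {
            float[] v = table.get(w);
            if (v == null) continue;          // out-of-vocabulary word
            for (int i = 0; i < dim; i++) sum[i] += v[i];
            count++;
        }
        if (count > 0) {
            for (int i = 0; i < dim; i++) sum[i] /= count;
        }
        return sum;                           // all zeros if no word was known
    }

    /** Cosine similarity; returns 0 when either vector has zero length. */
    public static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0 || nb == 0) return 0.0;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static double similarity(List<String> s1, List<String> s2,
                                    Map<String, float[]> table, int dim) {
        return cosine(average(s1, table, dim), average(s2, table, dim));
    }

    public static void main(String[] args) {
        // Toy embedding table (made-up values, not from any trained model)
        Map<String, float[]> table = new HashMap<>();
        table.put("road", new float[] {1f, 0f});
        table.put("traffic", new float[] {0.8f, 0.2f});
        table.put("city", new float[] {0f, 1f});
        System.out.println(similarity(Arrays.asList("road", "traffic"),
                                      Arrays.asList("traffic"), table, 2));
        System.out.println(similarity(Arrays.asList("road"),
                                      Arrays.asList("city"), table, 2));
    }
}
```

Averaging ignores word order and treats every word equally. Per-word weights, such as the part-of-speech weights used in the code earlier on this page, are one way to refine it.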
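In Word2Vec, the loadModel variable is set to true in public void loadGoogleModel(String modelPath), but it is set to false in public void loadJavaModel(String modelPath). After training a Java model and then loading it with loadJavaModel(String modelPath), I get null. Could the author check whether this is caused by the variable being set to false?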
Word2Vec.trainJavaModel("data/train.txt", "data/test.model");
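Hello, could you provide samples of data/train.txt and data/test.model?
For example, if I have 10 sentences, what should train.txt look like after word segmentation?
Should similar words be separated by spaces and placed on the same line? Or should it be 10 sentences, one per line, with the words separated by spaces?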
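The original word2vec tool reads plain text in which tokens are separated by whitespace. Assuming trainJavaModel follows the same convention (this thread does not confirm it), the second layout described above would be the natural one: one segmented sentence per line, with words separated by single spaces. The sketch below writes such a file. TrainFileWriter, its methods, and the hand-segmented sample sentences are hypothetical, and the segmentation shown is only illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TrainFileWriter {
    /** Turns each segmented sentence into one line of space-separated words. */
    public static List<String> toLines(List<List<String>> sentences) {
        List<String> lines = new ArrayList<>();
        for (List<String> words : sentences) {
            if (!words.isEmpty()) {
                lines.add(String.join(" ", words));
            }
        }
        return lines;
    }

    /** Writes the corpus as UTF-8 text, one sentence per line. */
    public static void write(Path out, List<List<String>> sentences) throws IOException {
        Files.write(out, toLines(sentences), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Hand-segmented sample sentences (illustrative only)
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("苏州", "是", "一座", "美丽", "的", "城市"),
            Arrays.asList("多条", "公路", "正在", "施工"));
        // Writes to the working directory; adjust the path (e.g. data/train.txt) as needed
        write(Paths.get("train.txt"), sentences);
    }
}
```

With this input the file would contain two lines, the first being 苏州 是 一座 美丽 的 城市. Ten sentences would not be enough to train useful vectors; the sample only shows the layout.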
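Can this be used for English? Is using it with English similar to using it with Chinese, and is there a recommended English corpus for training?
[salute]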