liu-nlper / sltk
A sequence labeling toolkit that implements the BLSTM-CNN-CRF model in PyTorch. It reaches an F1 score of 91.10% on the CoNLL 2003 English NER test set (word and char features).
Has anyone else run into this problem?
The txt file must start with a 'vocabulary size vector dimension' header line. How did everyone solve this?
Traceback (most recent call last):
  File "../test.py", line 77, in <module>
    targets_list = sl_model.predict(sample_batched)
  File "/xunku/SLTK-master/TorchNN/layers/bilstm_crf.py", line 106, in predict
    path_score, best_paths = self.crf(lstm_feats, mask)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/xunku/SLTK-master/TorchNN/layers/crf.py", line 182, in forward
    path_score, best_path = self._viterbi_decode(feats, mask)
  File "/xunku/SLTK-master/TorchNN/layers/crf.py", line 130, in _viterbi_decode
    mask = 1 + (-1) * mask
RuntimeError: value cannot be converted to type uint8_t without overflow: -1
I don't know what is causing this.
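The failing line in the traceback multiplies a uint8 mask by -1, and -1 cannot be represented in an unsigned 8-bit type. One way to invert such a mask without a negative scalar is sketched below. It assumes a recent PyTorch release in which boolean tensors exist, and `invert_mask` is a hypothetical helper, not part of SLTK.

```python
import torch


def invert_mask(mask: torch.Tensor) -> torch.Tensor:
    """Return a copy of mask with 0 and 1 swapped, keeping the input dtype.

    Hypothetical helper, not part of SLTK. Going through bool avoids
    multiplying an unsigned tensor by -1.
    """
    return (~mask.bool()).to(mask.dtype)


# 1 marks real tokens, 0 marks padding.
mask = torch.tensor([[1, 1, 0], [1, 0, 0]], dtype=torch.uint8)
inverted = invert_mask(mask)
```

The result has the same shape and dtype as the input, so it could stand in for the original expression wherever the inverted mask is used afterwards.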
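Hello, I get an error when testing with my own file. The file has only two columns, word and tag, so I used word.yml as the configuration file. The data was originally BIO-tagged and I converted it to BIESO with your tool. Thank you very much! torch 0.4.1
The error output is as follows:
Reading files...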
./data/output.txt
: 619
./data/output.txt
: 619
Extracting pretrained word vectors...
Feature: word
Using pretrained word vectors ./data/resources/glove.6B.100d.txt
:
C:\Users\ma\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Exact matches: 2038 / 2715
Fuzzy matches: 356 / 2715
OOV: 321 / 2715
convert data to hdf5...
./data/output.txt.hdf5
: 619
./data/output.txt.hdf5
: 619
SLModel(
  (word_feature_layer): WordFeature(
    (feature_embedding_list): ModuleList(
      (0): Embedding(2716, 100)
    )
  )
  (char_feature_layer): CharFeature(
    (char_embedding): Embedding(64, 30)
    (char_encoders): ModuleList(
      (0): Conv3d(1, 30, kernel_size=(1, 3, 30), stride=(1, 1, 1))
    )
  )
  (dropout_feature): Dropout(p=0.5)
  (rnn_layer): RNN(
    (rnn): LSTM(130, 100, bidirectional=True)
  )
  (dropout_rnn): Dropout(p=0.5)
  (crf_layer): CRF()
  (hidden2tag): Linear(in_features=200, out_features=8, bias=True)
)
learning rate: 0.015
Epoch 1 / 1000: 557 / 557
Traceback (most recent call last):
  File "G:/phd/8.8/SLTK-master/main.py", line 584, in <module>
    main()
  File "G:/phd/8.8/SLTK-master/main.py", line 578, in main
    train_model(configs)
  File "G:/phd/8.8/SLTK-master/main.py", line 539, in train_model
    model_trainer.fit()
  File "G:\phd\8.8\SLTK-master\sltk\train\sequence_labeling_trainer.py", line 88, in fit
    logits = self.model(**feed_tensor_dict)
  File "C:\Users\ma\AppData\Local\Programs\Python\Python35\lib\site-packages\torch\nn\modules\module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "G:\phd\8.8\SLTK-master\sltk\nn\modules\sequence_labeling_model.py", line 130, in forward
    word_feature = torch.cat([word_feature, char_feature], 2)
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2 at c:\new-builder_2\win-wheel\pytorch\aten\src\th\generic/THTensorMath.cpp:3607
I don't quite understand this error. Thank you for any guidance.
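The report above mentions converting BIO tags to BIESO before training. For readers unfamiliar with the two schemes, the conversion can be sketched as follows. `bio_to_bieso` is a hypothetical helper written for illustration; it is not the conversion script shipped with the toolkit, and it assumes well-formed BIO input.

```python
from typing import List


def bio_to_bieso(tags: List[str]) -> List[str]:
    """Convert one sentence of BIO tags to BIESO (hypothetical helper)."""
    converted = []
    for i, tag in enumerate(tags):
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        if prefix == "B":
            converted.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            converted.append(("I-" if continues else "E-") + label)
    return converted


example = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG", "I-ORG"]
converted = bio_to_bieso(example)
print(converted)
```

Single-token entities become `S-`, and the last token of a longer entity becomes `E-`, so the tag set grows compared with BIO.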
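@liu-nlper Hello, when I run the code it reports that this file does not exist:
FileNotFoundError: [Errno 2] No such file or directory: './data/alphabet/word.pkl'
What is a feature vocabulary? Does it mean a dictionary that I have to find externally? I don't know what form the vocabulary needs to take. Thanks for your help.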
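In toolkits of this kind, a feature vocabulary is usually a token-to-index mapping built from the training data rather than an external dictionary. The sketch below illustrates that idea only. `build_vocab`, the special tokens, and the pickle layout are assumptions made for illustration, and the format that SLTK actually expects in `word.pkl` may differ.

```python
import pickle
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Map each token to an integer id; 0 is padding, 1 is unknown."""
    counts = Counter(token for sent in sentences for token in sent)
    vocab = {"<pad>": 0, "<unk>": 1}
    for token, count in sorted(counts.items()):
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


sentences = [["EU", "rejects", "German", "call"], ["German", "call"]]
vocab = build_vocab(sentences)

# Serialize and restore the mapping in memory. Writing it to a path such as
# ./data/alphabet/word.pkl would use pickle.dump with an open file instead.
restored = pickle.loads(pickle.dumps(vocab))
```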
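Another question for the author: I'm training on a Chinese labeling task and found that, when matching against the pretrained word vectors, the share of OOV tokens is very large. Is this because many Chinese word vectors are for segmented two-, three- and four-character words, whereas the tokens in the training corpus train.txt are all single characters, which leads to more OOV entries? Is the situation below normal? Can I go ahead with training?
Extracting pretrained word vectors...
Feature: word
Using pretrained word vectors ./data/word2vec.txt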
:
Exact matches: 3365 / 7099
Fuzzy matches: 4 / 7099
OOV: 3730 / 7099
Thanks in advance!
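To see why single-character tokens can miss in a word-level embedding file, here is a rough sketch of how exact, fuzzy and OOV counts like those in the log could be computed. It assumes that a fuzzy match means a case-insensitive lookup, which is a guess; the toolkit's actual matching rule may differ, and `match_stats` is a hypothetical helper.

```python
def match_stats(vocab, pretrained):
    """Count exact, case-insensitive and out-of-vocabulary tokens."""
    lowered = {word.lower() for word in pretrained}
    exact = fuzzy = oov = 0
    for token in vocab:
        if token in pretrained:
            exact += 1
        elif token.lower() in lowered:
            fuzzy += 1
        else:
            oov += 1
    return exact, fuzzy, oov


# Character-level vocabulary checked against word-level embeddings.
vocab = ["中", "国", "人", "Beijing"]
pretrained = {"中国", "人", "beijing"}
exact, fuzzy, oov = match_stats(vocab, pretrained)
print(exact, fuzzy, oov)
```

Here "中" and "国" are both OOV even though "中国" has a vector, which is the effect described in the question. Embeddings trained at the character level would be expected to match far more of a character-tagged corpus.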
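@liu-nlper Hello, I'm working on an NER task. When I use the trained model to find entities in a raw data file, it requires an hdf5 file with the corresponding name. When I renamed the raw data file to match an existing hdf5 file, the results were extremely poor. Admittedly the test data comes from a different domain, but it does contain some of the same entities.
I can't tell whether the cause is that the training data and the target data belong to different domains (with overlapping entities), or a problem with the hdf5 file. I'd appreciate an answer.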
I also have some numeric and boolean features that I would like to add to the model, but I don't want to create embeddings for them.
Any suggestions?
Hello, while reading the code I noticed that the neg_log_likelihood_loss function in crf.py contains:
    if self.average_batch:
        return (forward_score - gold_score) / batch_size
    return forward_score - gold_score
and that the loss function in sequence_labeling_model.py, which calls it, also contains:
    if not self.use_crf:
        batch_size, max_len = feats.size(0), feats.size(1)
        lstm_feats = feats.view(batch_size * max_len, -1)
        tags = tags.view(-1)
        return self.loss_function(lstm_feats, tags)
    else:
        loss_value = self.loss_function(feats, mask, tags)
        print ('loss_value:', loss_value)
        if self.average_batch:
            batch_size = feats.size(0)
            loss_value /= float(batch_size)
        return loss_value
Doesn't this average over the batch one time too many?
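The effect of the two divisions can be checked with plain arithmetic. The numbers below are made up for illustration, and the sketch assumes that `average_batch` is enabled in both places, which the quoted snippets do not confirm.

```python
batch_size = 4
forward_score = 50.0  # illustrative summed log-partition over the batch
gold_score = 30.0     # illustrative summed score of the gold paths

# First division, as in neg_log_likelihood_loss with average_batch enabled.
crf_loss = (forward_score - gold_score) / batch_size

# Second division, as in the model's loss method with average_batch enabled.
model_loss = crf_loss / float(batch_size)

print(crf_loss, model_loss)
```

Under that assumption the reported loss is the summed loss divided by the square of the batch size. A constant factor on the loss scales every gradient by the same factor, so with plain SGD it would behave like a smaller learning rate rather than change the optimum.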
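"Word vector download: glove.6B.zip. The vectors must be converted to the word2vec format, i.e. the first line of the txt file must contain the 'vocabulary size vector dimension' information."
Could the author explain what the first line of the file should look like? I'm using the 100-dimensional GloVe vectors and don't know how to change the file.
Many thanks!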
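The word2vec text format differs from the GloVe text format only by one extra first line that holds the number of vectors and their dimension, separated by a space. A minimal sketch of adding that line follows. `add_word2vec_header` is a hypothetical helper, and the tiny three-dimensional file exists only to make the example runnable.

```python
import os
import tempfile


def add_word2vec_header(glove_path, output_path):
    """Copy a GloVe text file, prepending a 'count dim' header line."""
    # First pass: read the dimension from line one and count the vectors.
    with open(glove_path, encoding="utf-8") as src:
        first = src.readline()
        dim = len(first.rstrip("\n").split(" ")) - 1
        count = 1 + sum(1 for _ in src)
    # Second pass: write the header, then copy every vector unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(output_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(count, dim))
        for line in src:
            dst.write(line)
    return count, dim


with tempfile.TemporaryDirectory() as tmp:
    glove = os.path.join(tmp, "glove.txt")
    out = os.path.join(tmp, "w2v.txt")
    with open(glove, "w", encoding="utf-8") as f:
        f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
    count, dim = add_word2vec_header(glove, out)
    with open(out, encoding="utf-8") as f:
        header = f.readline().strip()
```

For glove.6B.100d.txt the header should come out as `400000 100`, since that file holds 400,000 vectors of 100 dimensions.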
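Hello, does your code not provide a way to compute the model's F1 score on the dev and test sets? How did you determine the accuracy of the model built by your code?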
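Hello, did you use the officially released conlleval.pl script when evaluating your experimental results?
I have repeated the experiment several times and cannot reproduce the results from the paper. I hope you can help.
Thank you!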
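For a quick check without conlleval.pl, entity-level precision, recall and F1 can be computed directly from BIESO tag sequences. The sketch below is an independent illustration, not the toolkit's or the paper's evaluation code; both function names are hypothetical, and its handling of malformed tag sequences may differ from conlleval.pl.

```python
def extract_entities(tags):
    """Return the set of (start, end, label) spans in a BIESO sequence."""
    entities, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            entities.add((i, i, label))
            start = None
        elif prefix == "B":
            start = (i, label)
        elif start is None or start[1] != label:
            start = None  # I- or E- without a matching B- is ignored
        elif prefix == "E":
            entities.add((start[0], i, label))
            start = None
    return entities


def precision_recall_f1(gold_sentences, pred_sentences):
    """Score predictions against gold tags at the entity level."""
    gold_total = pred_total = correct = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans = extract_entities(gold)
        pred_spans = extract_entities(pred)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
        correct += len(gold_spans & pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [["B-PER", "E-PER", "O", "S-LOC"]]
pred = [["B-PER", "E-PER", "O", "O"]]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

An entity counts as correct only when its boundaries and its label both match, which is the usual convention for CoNLL-style NER scores.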
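I'm using Weibo posts as the training corpus. When a sentence has about 130 characters, training on that one sentence takes several minutes. This is with batch size = 1, training on CPU. Is this normal?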
Hi, I ran your code on PyTorch 0.3. It runs successfully on your dataset, but on my own dataset, or on data made from the first 200 lines of your dataset, it fails with the error:
"RuntimeError: dimension specified as 0 but tensor has no dimensions"
Line 224 in cc65f33
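As the title says: the infer method under test currently just writes its output to a file. Could it also produce a statistical report on the test results, and on the validation set as well? I'm new to PyTorch and don't know it well yet, so I'd appreciate the author's guidance.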
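If a batch happens to contain exactly one sample, dimension 1 is dropped around the torch.cat operation, so word_feature and char_feature end up with different numbers of dimensions and an error is raised.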
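One common way for a dimension to disappear when the batch holds a single sample is a `squeeze()` call without an argument, which removes every axis of size 1, including the batch axis. The sketch below reproduces the shape mismatch under that assumption. The real cause in SLTK is not shown in this thread, and the tensor names and sizes are stand-ins.

```python
import torch

batch_size, max_len, char_dim = 1, 5, 30
# Stand-in for a char-CNN output with a leftover singleton axis at the end.
conv_out = torch.zeros(batch_size, max_len, char_dim, 1)

squeezed_all = conv_out.squeeze()     # drops every size-1 axis, batch too
squeezed_last = conv_out.squeeze(-1)  # drops only the trailing axis

word_feature = torch.zeros(batch_size, max_len, 100)
# Concatenation works because both tensors are still three-dimensional.
combined = torch.cat([word_feature, squeezed_last], 2)
```

Passing the axis explicitly keeps the batch dimension intact whatever the batch size. Another workaround is to drop the final incomplete batch, for example with `drop_last=True` on a `torch.utils.data.DataLoader`, if the training loop uses one.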
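Hello! I want to replace your data with my own and train on about 10 MB of training data, but every run runs out of GPU memory. How can I solve this?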