liu-nlper / sltk

362.0 14.0 84.0 683 KB

A sequence labeling toolkit that implements the BLSTM-CNN-CRF model in PyTorch. It reaches an F1 of 91.10% on the CoNLL 2003 English NER test set (word and char features).

Python 85.46% Shell 0.31% Perl 14.23%

pytorch bilstm-crf bilstm crf sequence-labeling

sltk's Issues

[Errno 2] No such file or directory: './data/resources/glove.6B.100d.bin'

Has anyone else run into this problem?

Running ./test.sh raises RuntimeError: value cannot be converted to type uint8_t without overflow: -1

Traceback (most recent call last):
File "../test.py", line 77, in <module>
targets_list = sl_model.predict(sample_batched)
File "/xunku/SLTK-master/TorchNN/layers/bilstm_crf.py", line 106, in predict
path_score, best_paths = self.crf(lstm_feats, mask)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/xunku/SLTK-master/TorchNN/layers/crf.py", line 182, in forward
path_score, best_path = self._viterbi_decode(feats, mask)
File "/xunku/SLTK-master/TorchNN/layers/crf.py", line 130, in _viterbi_decode
mask = 1 + (-1) * mask
RuntimeError: value cannot be converted to type uint8_t without overflow: -1

I don't know what is causing this.
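
One plausible reading of the traceback is that `mask` is a ByteTensor (uint8), so the scalar `-1` in `1 + (-1) * mask` cannot be represented in that type. Below is a minimal sketch of an equivalent inversion that never produces a negative value. It assumes `mask` holds only 0 and 1; the tensor is made up for illustration, and this is not a fix confirmed by the author.

```python
import torch

# Hypothetical padding mask: 1 marks a real token, 0 marks padding.
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]], dtype=torch.uint8)

# Flip 0 <-> 1 by widening to int64 first, so no uint8 value is ever
# multiplied by -1.
inverted = 1 - mask.long()
```

If later code needs a byte mask again, `inverted.byte()` converts it back.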

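Error at runtime

Hello, I get an error when testing with my own file. The file has only two columns (word, tag), so I used word.yml as the config file. The data was originally BIO-tagged and I converted it to BIESO with your tool, thank you very much! I am on torch 0.4.1.

The error output is as follows:
Reading files...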
./data/output.txt: 619
./data/output.txt: 619
Extracting pretrained word embeddings...
Feature word uses pretrained word embeddings ./data/resources/glove.6B.100d.txt:
C:\Users\ma\AppData\Local\Programs\Python\Python35\lib\site-packages\gensim\utils.py:1209: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Exact match: 2038 / 2715
Fuzzy match: 356 / 2715
OOV: 321 / 2715
convert data to hdf5...
./data/output.txt.hdf5: 619
./data/output.txt.hdf5: 619
SLModel(
(word_feature_layer): WordFeature(
(feature_embedding_list): ModuleList(
(0): Embedding(2716, 100)
)
)
(char_feature_layer): CharFeature(
(char_embedding): Embedding(64, 30)
(char_encoders): ModuleList(
(0): Conv3d(1, 30, kernel_size=(1, 3, 30), stride=(1, 1, 1))
)
)
(dropout_feature): Dropout(p=0.5)
(rnn_layer): RNN(
(rnn): LSTM(130, 100, bidirectional=True)
)
(dropout_rnn): Dropout(p=0.5)
(crf_layer): CRF()
(hidden2tag): Linear(in_features=200, out_features=8, bias=True)
)
learning rate: 0.015
Epoch 1 / 1000: 557 / 557
Traceback (most recent call last):
File "G:/phd/8.8/SLTK-master/main.py", line 584, in <module>
main()
File "G:/phd/8.8/SLTK-master/main.py", line 578, in main
train_model(configs)
File "G:/phd/8.8/SLTK-master/main.py", line 539, in train_model
model_trainer.fit()
File "G:\phd\8.8\SLTK-master\sltk\train\sequence_labeling_trainer.py", line 88, in fit
logits = self.model(**feed_tensor_dict)
File "C:\Users\ma\AppData\Local\Programs\Python\Python35\lib\site-packages\torch\nn\modules\module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "G:\phd\8.8\SLTK-master\sltk\nn\modules\sequence_labeling_model.py", line 130, in forward
word_feature = torch.cat([word_feature, char_feature], 2)
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2 at c:\new-builder_2\win-wheel\pytorch\aten\src\th\generic/THTensorMath.cpp:3607

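I don't quite understand this. Thank you for any pointers.

What is the feature vocabulary?

@liu-nlper Hello, when I run the code it reports that this file is missing:
FileNotFoundError: [Errno 2] No such file or directory: './data/alphabet/word.pkl'
What is the feature vocabulary? Does it mean a dictionary that I have to find externally? I don't know what format the vocabulary should take. Sorry for the trouble.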
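
The thread does not say what `word.pkl` contains. In sequence labeling code, a feature vocabulary is commonly a token-to-id mapping built from the training data rather than an external dictionary. The sketch below shows that general idea only: the function name `build_vocab`, the reserved `<pad>`/`<unk>` entries, and the dict layout are all assumptions, not SLTK's actual format.

```python
import pickle
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Map each token to an integer id; ids 0 and 1 are reserved (assumed layout)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, n in counts.most_common():
        if n >= min_count:
            vocab[tok] = len(vocab)
    return vocab

# Toy training sentences, already split into tokens.
sentences = [["EU", "rejects", "German", "call"], ["German", "call"]]
vocab = build_vocab(sentences)

# Round-trip through pickle, as writing and reading a .pkl file would.
restored = pickle.loads(pickle.dumps(vocab))
```

If the file is missing, it may simply mean that a preprocessing step which generates it has not been run yet; that is a guess based on the path, not something the thread confirms.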

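The mask is required to be of Byte type

The mask is required to be of Byte type, but I still get an error after converting the LongTensor to a ByteTensor, and converting through numpy fails as well. Could the author say how this was handled at the time?
Looking forward to your reply, thanks.

Follow-up question: in a Chinese labeling task, is a large OOV count with pretrained word embeddings normal?

A follow-up question for the author: I am training on a Chinese labeling task and found that a large share of the vocabulary is OOV when matched against the pretrained word embeddings. Is this because Chinese word embeddings mostly cover segmented two-, three-, and four-character words, while the tokens in the training corpus train.txt are all single characters, so many of them end up as OOV? Is the situation below normal, and is it fine to continue training?
Extracting pretrained word embeddings...
Feature word uses pretrained word embeddings ./data/word2vec.txt:
Exact match: 3365 / 7099
Fuzzy match: 4 / 7099
OOV: 3730 / 7099
Thanks in advance!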
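
The hypothesis in this question can be checked with a small count of how many training tokens appear in the embedding vocabulary. The sketch below uses toy data and treats "fuzzy match" as a lowercase lookup, which is an assumption; the rule SLTK actually applies is not shown in this thread.

```python
def match_stats(vocab, pretrained):
    """Count exact matches, lowercase-only matches, and OOV tokens."""
    exact = sum(1 for w in vocab if w in pretrained)
    fuzzy = sum(1 for w in vocab
                if w not in pretrained and w.lower() in pretrained)
    oov = len(vocab) - exact - fuzzy
    return exact, fuzzy, oov

# Word-level embeddings (multi-character words) vs. character-level tokens.
pretrained = {"中国", "北京", "the", "中"}
vocab = ["中", "国", "北", "京", "The"]

exact, fuzzy, oov = match_stats(vocab, pretrained)
```

Here only "中" is found directly, "The" is found after lowercasing, and the remaining single characters are OOV even though the words containing them are covered, which is the pattern the question describes.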

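Missing hdf5 file when using a trained model for NER

@liu-nlper Hello, I am working on an NER task. When I use the trained model to find entities in a raw data file, it requires an hdf5 file with the corresponding name. When I renamed the raw data file to match an existing hdf5 file, the results were extremely poor. That said, the test data comes from a different domain, although it does contain some of the same entities.
I am not sure whether the cause is that the training data and the target data belong to different domains (with overlapping entities), or a problem with the hdf5 file. I hope you can clarify.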

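How to add numeric and boolean features for words

I also have some numeric and boolean features and would like to add them to the model, but I do not want to create embeddings for them.
Any suggestions?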
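
A common general approach, independent of SLTK, is to concatenate the raw feature values with the embedding output along the feature axis before the recurrent layer. The sketch below illustrates this with made-up sizes; the tensor names and the three example features are hypothetical, and wiring this into SLTK's own feature layers would need changes that are not shown here.

```python
import torch
import torch.nn as nn

batch_size, max_len = 2, 5
embedding = nn.Embedding(100, 16)

word_ids = torch.randint(0, 100, (batch_size, max_len))
# Hypothetical per-token features, e.g. is_capitalized, is_digit, scaled length.
extra = torch.rand(batch_size, max_len, 3)

# Append the raw values to each token's embedding: (2, 5, 16 + 3).
features = torch.cat([embedding(word_ids), extra], dim=2)

# The recurrent layer's input size must grow by the number of raw features.
lstm = nn.LSTM(input_size=16 + 3, hidden_size=8,
               bidirectional=True, batch_first=True)
output, _ = lstm(features)
```

Numeric features with a large range are usually scaled first so they do not dominate the embedding values.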

The CRF loss appears to apply batch averaging twice

Hello, while reading the code I noticed that the neg_log_likelihood_loss function in crf.py contains:
if self.average_batch:
    return (forward_score - gold_score) / batch_size
return forward_score - gold_score
and the loss function in sequence_labeling_model.py, which calls it, also contains:
if not self.use_crf:
    batch_size, max_len = feats.size(0), feats.size(1)
    lstm_feats = feats.view(batch_size * max_len, -1)
    tags = tags.view(-1)
    return self.loss_function(lstm_feats, tags)
else:
    loss_value = self.loss_function(feats, mask, tags)
    print ('loss_value:', loss_value)
    if self.average_batch:
        batch_size = feats.size(0)
        loss_value /= float(batch_size)
    return loss_value
Doesn't this average over the batch one time too many?
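
The effect described in this issue can be reproduced with plain numbers. The sketch below mirrors the two quoted snippets as simplified stand-in functions (`crf_loss` and `model_loss` are hypothetical names) and assumes both `average_batch` flags are true at the same time, which the quoted code does not confirm.

```python
def crf_loss(forward_score, gold_score, batch_size, average_batch):
    """Stand-in for the CRF layer: optionally divide by the batch size."""
    loss = forward_score - gold_score
    return loss / batch_size if average_batch else loss

def model_loss(forward_score, gold_score, batch_size, average_batch):
    """Stand-in for the model: divides again if its own flag is set."""
    loss = crf_loss(forward_score, gold_score, batch_size, average_batch)
    if average_batch:
        loss /= float(batch_size)
    return loss

batch_size = 4
once = crf_loss(20.0, 12.0, batch_size, True)     # 8.0 / 4
twice = model_loss(20.0, 12.0, batch_size, True)  # 8.0 / 4 / 4
```

If the CRF layer is constructed with its flag turned off, only one division happens and the result is correct, so the constructor arguments are worth checking before changing either function.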

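About the GloVe embedding format

"Embedding download: glove.6B.zip. The embeddings must be converted to the word2vec text format, i.e. the first line of the txt file must contain the 'vocabulary size and vector dimension' information."
Could the author explain what the first line of the file should look like? I am using the 100-dimensional GloVe vectors and don't know how to change it.
Many thanks!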
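
In the word2vec text format, the first line holds two integers separated by a space: the number of vectors and the vector dimension. The remaining lines are the same as in the GloVe file. A minimal sketch of the conversion follows; the function name `glove_to_word2vec` is made up here, and in-memory streams stand in for real files.

```python
import io

def glove_to_word2vec(src, dst):
    """Copy GloVe text vectors to dst, prefixed with a 'count dim' header line."""
    lines = [ln for ln in src.read().split("\n") if ln.strip()]
    dim = len(lines[0].split()) - 1  # first field is the word itself
    dst.write(f"{len(lines)} {dim}\n")
    dst.write("\n".join(lines) + "\n")
    return len(lines), dim

# Tiny stand-in for a GloVe file with two 3-dimensional vectors.
src = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
dst = io.StringIO()
count, dim = glove_to_word2vec(src, dst)
```

With real files, pass handles opened with `encoding="utf-8"` instead of the `StringIO` objects. For glove.6B.100d.txt, which is documented as a 400K-word vocabulary, the header line should come out as `400000 100`.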

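Question about model accuracy

Hello, does your code not provide a way to compute the F1 score on the dev and test sets? How did you determine the accuracy of the models built with your code?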
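
For reference, entity-level F1 can be computed from the predicted and gold tag sequences directly. The sketch below handles BIESO tags for a single sequence. It is an independent illustration with made-up function names, not SLTK code and not a reimplementation of conlleval.pl, so its numbers may differ from the official script on malformed tag sequences.

```python
def extract_spans(tags):
    """Return a set of (start, end, type) spans from a BIESO tag sequence."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            spans.add((i, i, label))
            start = None
        elif prefix == "B":
            start, kind = i, label
        elif prefix == "E" and start is not None and label == kind:
            spans.add((start, i, label))
            start = None
        elif prefix == "I" and start is not None and label == kind:
            continue
        else:  # "O" or an inconsistent tag closes any open span
            start = None
    return spans

def span_f1(gold, pred):
    """Entity-level F1: a span counts only if boundaries and type both match."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["B-PER", "E-PER", "O", "S-LOC"]
pred = ["B-PER", "E-PER", "O", "O"]
score = span_f1(gold, pred)
```

In this toy case the prediction finds one of the two gold entities with no false positives, so precision is 1.0 and recall is 0.5. For a whole corpus, spans would need a sentence index added to keep positions from different sentences apart.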

Gensim version 3.8.3 is required

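Was conlleval.pl used to evaluate the CoNLL 2003 results?

Hello, did you use the officially released conlleval.pl script when evaluating your experimental results?
I have repeated the experiment several times and cannot reproduce the results reported in the paper. I hope you can help.
Thank you!

Training is very slow on even slightly longer sentences

I am using Weibo posts as training data. If a sentence has about 130 characters, training on that one sentence takes several minutes. This is with batch size = 1, training on CPU. Is this normal?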

RuntimeError: dimension specified as 0 but tensor has no dimensions

Hi, I ran your code on PyTorch 0.3. It runs successfully on your dataset, but on my own dataset, or on data made from the first 200 lines of your dataset, it fails with the error:
"RuntimeError: dimension specified as 0 but tensor has no dimensions"

what does `mask` mean?

SLTK/TorchNN/layers/crf.py

Line 224 in cc65f33

def neg_log_likelihood_loss(self, feats, mask, tags):
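
The thread gives no answer to this question. In padded-batch sequence models, a mask conventionally marks which positions hold real tokens and which are padding, so that padded positions do not contribute to the loss or to decoding. The sketch below builds such a mask from sentence lengths; the lengths are made up, and whether SLTK uses 1 for real tokens or for padding is not confirmed here.

```python
import torch

# Hypothetical lengths of two sentences padded to the same width.
lengths = torch.tensor([4, 2])
max_len = int(lengths.max())

# Position indices 0..max_len-1, compared against each sentence's length.
positions = torch.arange(max_len).unsqueeze(0)   # shape (1, max_len)
mask = positions < lengths.unsqueeze(1)          # shape (batch, max_len)
```

Each row of `mask` is true for the real tokens of that sentence and false for the padding after them.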

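Question for the author: could you add evaluation statistics for the valid and test sets?

As the title says: the infer method under test currently just writes the results to a file. Could there be a statistical report on the test results, and on the validation set as well? I am new to PyTorch and don't know it well yet, so any guidance is appreciated.

invalid argument 0: Tensors must have same number of dimensions: got 3 and 2

If a batch happens to contain exactly one sample, the torch.cat operation trims away the first dimension, so word_feature and char_feature end up with different numbers of dimensions and an error is raised.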

SLTK/sltk/nn/modules/sequence_labeling_model.py

Line 130 in b3edc58

word_feature = torch.cat([word_feature, char_feature], 2)
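
A frequent cause of this pattern is a `squeeze()` call without an axis argument somewhere before the `torch.cat` line quoted above, since it removes every size-1 axis, including the batch axis when the batch holds one sample. Whether that is what happens in SLTK has not been checked against its source. The sketch below reproduces the shape mismatch with a made-up char encoder output and shows the axis-specific alternative.

```python
import torch

batch_size, max_len = 1, 5
word_feature = torch.zeros(batch_size, max_len, 100)

# Hypothetical char encoder output with a trailing singleton axis.
char_out = torch.zeros(batch_size, max_len, 30, 1)

bad = char_out.squeeze()     # drops all size-1 axes: shape (5, 30)
good = char_out.squeeze(-1)  # drops only the last axis: shape (1, 5, 30)

# Concatenating with `bad` would fail (3 dims vs 2 dims); `good` works.
combined = torch.cat([word_feature, good], 2)
```

The combined width of 130 matches the `LSTM(130, 100, ...)` input size printed in the model summary earlier in this thread (100 word dimensions plus 30 char dimensions).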

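Running out of GPU memory on a large dataset

Hello! I want to replace your data with my own training data of about 10 MB, but every run fails for lack of GPU memory. How can I solve this?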
