
chinesener's People

Contributors

zjy-ucas

chinesener's Issues

pre-trained embedding not used in input layer?

Hi,
I noticed that the pre-trained embedding file is not used in the embedding layer; a lookup table is used to generate the character and seg embeddings, and the pre-trained embeddings are only used when building char_to_id. I want to know whether I have misunderstood this. If so, why not use the pre-trained embeddings to generate the input? Thanks!
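For reference, the pre-trained vectors do appear to reach the embedding layer indirectly: create_model (quoted in a later issue below) assigns them into the char lookup variable after initialization. A minimal TF1-style sketch of that pattern, with made-up shapes:

import numpy as np
import tensorflow as tf

# the char lookup is created with a random initializer...
vocab_size, char_dim = 5, 4
char_lookup = tf.get_variable("char_embedding", [vocab_size, char_dim])

# ...and then overwritten with rows parsed from an embedding file
# such as wiki_100.utf8 (dummy values here)
pretrained = np.ones([vocab_size, char_dim], dtype=np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(char_lookup.assign(pretrained))  # embedding layer now uses them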

What is the purpose of using seg in the embedding layer?

self.seg_lookup = tf.get_variable(
    name="seg_embedding",
    shape=[self.num_segs, self.seg_dim],
    initializer=self.initializer)

What is the purpose of adding these lines to the embedding layer, together with the line embed = tf.concat(embedding, axis=-1)?
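For context, a sketch with illustrative shapes of what those lines do: the seg inputs are word-segmentation features that get their own small embedding, concatenated onto the char embedding at each position, adding a weak word-boundary signal on top of pure characters:

import tensorflow as tf

char_inputs = tf.placeholder(tf.int32, [None, None])   # [batch, num_steps]
seg_inputs = tf.placeholder(tf.int32, [None, None])    # [batch, num_steps]

char_lookup = tf.get_variable("char_emb", [4000, 100])
seg_lookup = tf.get_variable("seg_emb", [4, 20])       # num_segs=4, seg_dim=20

embedding = [
    tf.nn.embedding_lookup(char_lookup, char_inputs),  # [batch, steps, 100]
    tf.nn.embedding_lookup(seg_lookup, seg_inputs),    # [batch, steps, 20]
]
embed = tf.concat(embedding, axis=-1)                  # [batch, steps, 120]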

A question about the error absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed

File "/home/PycharmProjects/NER/ChineseNER-master/main.py", line 54, in
assert FLAGS.clip < 5.1, "gradient clip should't be too much"
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in getattr
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed.
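A sketch of one common fix (assuming the flags are defined via tf.app.flags): parse argv explicitly before any module-level access to FLAGS, or move the assert into main(), which tf.app.run() only calls after parsing:

import sys
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_float("clip", 5.0, "gradient clip")
FLAGS = flags.FLAGS

FLAGS(sys.argv)  # force parsing; without this, module-level FLAGS.clip raises
assert FLAGS.clip < 5.1, "gradient clip shouldn't be too much"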

Incremental training

Hi:
The model currently supports three entity types. To extend it to more, I need to add the corresponding training data, but as the number of entity types grows, so does training time. After adding entity classes, how can I do incremental training to cut down the training time?
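Not an authoritative answer, but a common warm-start pattern is sketched below (hypothetical paths). One caveat: adding entity types changes num_tags, so the projection and CRF transition variables no longer match the old checkpoint and would need to be excluded and re-initialized:

import tensorflow as tf

w = tf.get_variable("shared_layer_w", [10, 10])  # stand-in for reusable weights
saver = tf.train.Saver()  # in practice: a Saver over only the reusable variables

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.get_checkpoint_state("ckpt/")
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)  # warm start
    # ...continue the normal training loop on the extended corpus...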

After shrinking the dataset, error: ValueError: setting an array element with a sequence.

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py', wdir='E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj')

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 246, in <module>
    train()

  File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 192, in train
    step, batch_loss = model.run_step(sess, True, batch)

  File "E:\【重点代码】ChineseNER-master-bishe\Gradu_Prj\model.py", line 221, in run_step
    feed_dict)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1097, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: setting an array element with a sequence.

I deleted part of the sentences in example.train, example.test, and example.dev, saved the result as txt files, and then got this error at runtime.
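For reference, the usual cause of this ValueError (not specific to this dataset) is a feed_dict containing lists of unequal length, which np.asarray cannot turn into a rectangular array. A minimal reproduction and fix:

import numpy as np

batch = [[1, 2, 3], [4, 5], [6]]
# np.asarray(batch, dtype=np.int32)  # raises: setting an array element with a sequence.

max_len = max(len(s) for s in batch)
padded = [s + [0] * (max_len - len(s)) for s in batch]
print(np.asarray(padded, dtype=np.int32).shape)  # (3, 3)

So after editing the data files, the thing to check is whether any sentence lost its blank-line separator or ended up with mismatched char/tag lengths, leaving ragged batches for pad_data.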

outputs, final_states = tf.nn.bidirectional_dynamic_rnn() raises NotImplementedError: Abstract method every time it runs

Hello, I wanted to study your code, but while test-running it I hit an error I cannot resolve. Earlier, the line with rnn_cell_impl.LSTMStateTuple reported that the method could not be found, which I fixed by switching to tf.contrib.rnn.LSTMStateTuple. But every time execution reaches outputs, final_states = tf.nn.bidirectional_dynamic_rnn(lstm_cell["forward"], lstm_cell["backward"], lstm_inputs, dtype=tf.float32, sequence_length=lengths), it raises NotImplementedError: Abstract method, and I cannot find the cause. I would appreciate your help, thanks!
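For what it's worth, this error usually means an abstract RNNCell, or the cell class itself rather than an instance, was handed to bidirectional_dynamic_rnn. A sketch of a setup that avoids it (hypothetical shapes):

import tensorflow as tf

lstm_inputs = tf.placeholder(tf.float32, [None, None, 120])
lengths = tf.placeholder(tf.int32, [None])

lstm_cell = {
    "forward": tf.contrib.rnn.LSTMCell(100),   # note: instances, not classes,
    "backward": tf.contrib.rnn.LSTMCell(100),  # and a separate cell per direction
}
outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
    lstm_cell["forward"], lstm_cell["backward"],
    lstm_inputs, dtype=tf.float32, sequence_length=lengths)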

About the wiki_100.utf8 file

Hello, sorry to bother you. Most of the Chinese work I have seen uses word2vec 'word' vectors, but for Chinese NER the mainstream approaches work at the 'character' level, so I would like to ask how your character vectors were trained. Is there any reference material on this?
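Not the author, but one common recipe is to split the corpus into characters and train ordinary word2vec over those character tokens. A sketch assuming gensim (pre-4.0 argument names; newer versions use vector_size, and this is not necessarily how wiki_100.utf8 itself was produced):

from gensim.models import Word2Vec

sentences = ["今天天气很好", "我爱北京天安门"]
char_sents = [list(s) for s in sentences]  # every character becomes a token

model = Word2Vec(char_sents, size=100, window=5, min_count=1, sg=1)
model.wv.save_word2vec_format("char_vec.utf8")  # same text layout as wiki_100.utf8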

main.py runtime error

[screenshot of the main.py error output]

Hello, when I run main.py from the command line I get the error shown in the screenshot above. How should I fix it?

What format must the input sentences have at test time?

line = input("请输入测试句子:")
print(line)
result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag)

What format requirements are there for the input test sentence?
Entering Chinese, e.g. 北京***, raises an error.
Entering digits, e.g. 3232132312, raises an error.

TF1.2 restore bug

Hi,
Did you encounter a bug like the following when running python main.py?

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4341,100] rhs shape= [3637,100]

The role of wiki_100.utf8

The model uses the vectors provided in wiki_100. English words such as chanel get split into c, h, a, n, e, l. Is there a way to improve the handling of English input?

Error when training with the default parameters

As the title says, the following error came up while training the model:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\cloudy\AppData\Local\Temp\jieba.cache
Loading model cost 1.237 seconds.
Prefix dict has been built succesfully.
Found 4313 unique words (979180 in total)
Loading pretrained embeddings from wiki_100.utf8...
Found 13 unique named entity tags
20864 / 0 / 4636 sentences in train / dev / test.
Traceback (most recent call last):
  File "main.py", line 225, in <module>
    tf.app.run(main)
  File "D:\Anaconda3\envs\keras\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "main.py", line 219, in main
    train()
  File "main.py", line 150, in train
    train_manager = BatchManager(train_data, FLAGS.batch_size)
  File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 285, in __init__
    self.batch_data = self.sort_and_pad(data, batch_size)
  File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 293, in sort_and_pad
    batch_data.append(self.pad_data(sorted_data[i*batch_size: (i+1)*batch_size]))
TypeError: slice indices must be integers or None or have an __index__ method

I tried a few approaches myself but couldn't solve it. I'd be very grateful for help!
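For reference, the usual fix: under Python 3, len(data) / batch_size is a float, so the slice bounds in sort_and_pad need an explicit int cast. A sketch of the repaired function (simplified from the repo's version):

import math

def sort_and_pad(data, batch_size):
    num_batch = int(math.ceil(len(data) / batch_size))  # int(...) added
    sorted_data = sorted(data, key=lambda x: len(x[0]))
    return [sorted_data[int(i * batch_size): int((i + 1) * batch_size)]
            for i in range(num_batch)]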

A question about versions

  1. The ChineseNER package works very well and achieves high scores.
  2. One problem, though: if the python3 in the run commands is changed to python2 (e.g. python3 main.py becomes python2 main.py), it fails with the error below. Is there a known solution?
Caused by op u'char_embedding/concat', defined at:
  File "main.py", line 232, in <module>
    tf.app.run(main)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 227, in main
    evaluate_line()
  File "main.py", line 199, in evaluate_line
    model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/utils.py", line 174, in create_model
    model = Model_class(config)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 54, in __init__
    embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 110, in embedding_layer
    embed = tf.concat(embedding, axis=-1)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1048, in concat
    name=name)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 495, in _concat_v2
    name=name)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,10,100] vs. shape[1] = [1,6,20]
[[Node: char_embedding/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](char_embedding/embedding_lookup, char_embedding/seg_embedding/embedding_lookup, char_embedding/concat/axis)]]

Why should we expand the shape of logits to [self.num_tags + 1, self.num_tags + 1] ?

For example, when defining the loss function, you pad logits and targets with an extra tag, so the transition matrix becomes [self.num_tags + 1, self.num_tags + 1].

def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss"  if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths+1)
        return tf.reduce_mean(-log_likelihood)

But in fact, the model works fine with the original logits and targets, as in the code below, so what is the purpose of doing this? Thanks!

def loss_layer(self, project_logits, lengths, name=None):
    self.trans = tf.get_variable(
        "transitions",
        shape=[self.num_tags, self.num_tags],
        initializer=self.initializer)
    log_likelihood, self.trans = crf_log_likelihood(
        inputs=self.logits,
        tag_indices=self.targets,
        transition_params=self.trans,
        sequence_lengths=lengths)
    return tf.reduce_mean(-log_likelihood)

NameError: name 'os' is not defined

Hello! I ran your code but got the error below:


Traceback (most recent call last):
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 225, in <module>
    if __name__ == "__main__":
  File "D:\perhack\Anaconda3\envs\my_pytorch\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 219, in main
    clean(FLAGS)
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 114, in train
    # create maps if not exist
NameError: name 'os' is not defined



I have installed the os module, and it imports correctly on its own. What's going wrong?
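os ships with the Python standard library, so there is nothing to install; the NameError simply means the module that raised it never imported it. A minimal illustration of the one-line fix at the top of main.py (or whichever file raised the error):

import os

print(os.path.exists("ckpt"))  # the kind of call that was failing before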

About sentence entity extraction results

Do I just run python main.py and type a sentence at the prompt to see the computed result?
The extracted entities don't look very good; is there something else I should be doing? Thanks!

Input test sentence: 老张开车去东北玩。
Result:
[{'end': 3, 'start': 1, 'type': 'PER', 'word': '老张开'},
{'end': 4, 'start': 1, 'type': 'PER', 'word': '车'},
{'end': 5, 'start': 4, 'type': 'LOC', 'word': '去'},
{'end': 6, 'start': 5, 'type': 'LOC', 'word': '东'},
{'end': 7, 'start': 6, 'type': 'LOC', 'word': '北'},
{'end': 8, 'start': 7, 'type': 'LOC', 'word': '玩'},
{'end': 9, 'start': 8, 'type': 'LOC', 'word': '。'}]

At prediction time, word, start, and end in the result do not match the actual text.

Sentence: 他的检验报告等。
Annotation: "报告"
Position: 4, 6
Using the evaluate_line method in model.py produces results like:

  1. word: 报告  start: 3  end: 6
  2. word: 验报告  start: 4  end: 6
  3. word: 检验报告  start: 4  end: 6

Changes needed for tensorflow 1.10

  • tensorflow 1.10 has removed rnn_cell from tensorflow.python.ops; tensorflow.contrib.rnn provides similar functionality. Line 4 of model.py can be changed to import tensorflow.contrib.rnn as rnn_cell (a quick-and-dirty fix).

  • The argument order of tf.concat() changed: every rnn_inputs = tf.concat(2, [rnn_inputs, self.features]) should become rnn_inputs = tf.concat([rnn_inputs, self.features], 2).

  • tf.batch_matmul() has been removed; use tf.matmul() instead. (All three changes are shown together in the sketch after this list.)
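An illustrative sketch applying the three changes above at once (placeholder shapes):

import tensorflow as tf
import tensorflow.contrib.rnn as rnn_cell  # replaces the removed rnn_cell in tensorflow.python.ops

rnn_inputs = tf.placeholder(tf.float32, [None, None, 100])
features = tf.placeholder(tf.float32, [None, None, 20])

# tf.concat: tensor list first, axis last
rnn_inputs = tf.concat([rnn_inputs, features], 2)

# tf.batch_matmul is gone; tf.matmul batches over leading dimensions
a = tf.placeholder(tf.float32, [None, 4, 5])
b = tf.placeholder(tf.float32, [None, 5, 6])
c = tf.matmul(a, b)  # [None, 4, 6]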

Is the embedding fed into the model for <UNK> randomly initialized?

def create_model(session, Model_class, path, load_vec, config, id_to_char, logger):
    # create model, reuse parameters if exists
    model = Model_class(config)
    ckpt = tf.train.get_checkpoint_state(path)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        logger.info("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
        if config["pre_emb"]:
            emb_weights = session.run(model.char_lookup.read_value())
            emb_weights = load_vec(config["emb_file"], id_to_char, config["char_dim"], emb_weights)
            session.run(model.char_lookup.assign(emb_weights))
            logger.info("Load pre-trained embedding.")
    return model

Sorry for the trouble, and thanks!

input_from_line has a bug

def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line.replace(" ", "$")
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs
line.replace(" ", "$")

has no effect; line is unchanged. Should it be changed to

line = re.sub('\s', '$', line)

?
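Either form works; the underlying point is that Python strings are immutable, so str.replace returns a new string that must be assigned back. A minimal sketch:

line = "北京 欢迎 你"
line = line.replace(" ", "$")  # the minimal fix for the dropped result

# or, to normalize tabs and other whitespace too (the re.sub variant above):
import re
line = re.sub(r"\s", "$", line)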

About the dimension of the input vector to project_layer in model.py

Hello. The comment on line 138 in the project_layer method of model.py says :param lstm_outputs: [batch_size, num_steps, emb_size]. Since project_layer sits between the bilstm layer and the logits layer, its input should be the output of bilstm_layer, i.e. [batch_size, num_steps, 2*lstm_dim]. Is my understanding correct?

Question about the demo training corpus

Hello! A beginner question: what tool was used to organize the training and prediction corpora in origin_data into that format? Could you share the code? Thanks!
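Not an official answer, but for reference the conversion is usually only a few lines. A sketch assuming a hypothetical bracket-style source annotation like "{北京/LOC}欢迎你" (the repo's actual source format may differ); it emits one "char tag" pair per line, BIO-style, which is the layout of example.train:

import re

def to_bio(sentence):
    lines, pos = [], 0
    for m in re.finditer(r"\{(.+?)/([A-Z]+)\}", sentence):
        lines += ["%s O" % ch for ch in sentence[pos:m.start()]]  # untagged span
        word, tag = m.group(1), m.group(2)
        lines += ["%s %s-%s" % (ch, "B" if i == 0 else "I", tag)
                  for i, ch in enumerate(word)]                   # tagged entity
        pos = m.end()
    lines += ["%s O" % ch for ch in sentence[pos:]]
    return "\n".join(lines)

print(to_bio("{北京/LOC}欢迎你"))  # 北 B-LOC / 京 I-LOC / 欢 O / 迎 O / 你 O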

Runtime error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
  File "main.py", line 227, in <module>
    if __name__ == "__main__":
  File "C:\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "main.py", line 221, in main
    clean(FLAGS)
  File "main.py", line 187, in train

  File "main.py", line 87, in evaluate
    ner_results = model.evaluate(sess, data, id_to_tag)
  File "C:\pyproject\ChineseNER-master\utils.py", line 66, in test_ner
    eval_lines = return_report(output_file)
  File "C:\pyproject\ChineseNER-master\conlleval.py", line 284, in return_report
    counts = evaluate(f)
  File "C:\pyproject\ChineseNER-master\conlleval.py", line 74, in evaluate
    for line in iterable:
  File "C:\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

About the datasets

Hi, could you explain how the sighan.dev dataset differs from the training set and the test set?

'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 2358, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 1778, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "ChineseNER-master/main.py", line 225, in <module>
    tf.app.run(main)
  File "tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "ChineseNER-master/main.py", line 219, in main
    train()
  File "ChineseNER-master/main.py", line 185, in train
    best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger)
  File "ChineseNER-master/main.py", line 85, in evaluate
    eval_lines = test_ner(ner_results, FLAGS.result_path)
  File "ChineseNER-master\utils.py", line 66, in test_ner
    eval_lines = return_report(output_file)
  File "ChineseNER-master\conlleval.py", line 282, in return_report
    counts = evaluate(f)
  File "ChineseNER-master\conlleval.py", line 74, in evaluate
    for line in iterable:
  File "tensorflow\lib\codecs.py", line 713, in next
    return next(self.reader)
  File "tensorflow\lib\codecs.py", line 644, in next
    line = self.readline()
  File "tensorflow\lib\codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "tensorflow\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

I'm on tensorflow 1.3. Has anyone run into a similar problem? Any solutions?
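A sketch of one common remedy, under the assumption (suggested by byte 0xa3) that one of the data or result files on disk is GBK/GB2312-encoded rather than UTF-8; the filename is a placeholder:

import io

# re-encode the offending file to UTF-8 once, up front
with io.open("some_file.utf8", "r", encoding="gbk") as f:
    text = f.read()
with io.open("some_file.utf8", "w", encoding="utf-8") as f:
    f.write(text)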

Why does project_layer use two hidden layers?

The project_layer method at line 135 of model.py defines a hidden layer with shape [self.lstm_dim*2, self.lstm_dim] and then a pred layer with shape [self.lstm_dim, self.num_tags].

Why not define a single hidden layer with shape [self.lstm_dim*2, self.num_tags]?
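For comparison, a sketch of the single-layer alternative the question proposes (illustrative shapes; whether the extra hidden layer helps accuracy is an empirical question):

import tensorflow as tf

lstm_dim, num_tags = 100, 13
lstm_outputs = tf.placeholder(tf.float32, [None, None, 2 * lstm_dim])

# flatten time steps, then project straight from 2*lstm_dim to num_tags
output = tf.reshape(lstm_outputs, [-1, 2 * lstm_dim])
W = tf.get_variable("proj_W", [2 * lstm_dim, num_tags])
b = tf.get_variable("proj_b", [num_tags])
pred = tf.nn.xw_plus_b(output, W, b)  # [batch*num_steps, num_tags]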

About structure of word/character in Chinese

  • Could you explain the structure of words in Chinese? Is it similar to English (a sentence consists of several words, and a word is a combination of several characters)?
  • Which dataset did you post in the data folder: example.train/ .dev/ .test?
  • In Python, can we use the code below to extract words and characters from a sentence?
for word in sentence:
    for char in word:
        # do something
    if word.lower() == word:
        # do something
    if word[0].upper() == word:
        # do something

Thank you in advance!

Datasets

Why do I get the following error when training on my own dataset?
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
