
ChineseNER's Issues

On the dimension of the input vector to project_layer in model.py

Hello. In the project_layer method in model.py, the comment at line 138 reads :param lstm_outputs: [batch_size, num_steps, emb_size]. Since project_layer sits between the BiLSTM layer and the logits layer, its input should be the output of bilstm_layer, i.e. [batch_size, num_steps, 2*lstm_dim]. Is my understanding correct?
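
For reference, a minimal TF 1.x sketch (all shapes hypothetical) showing that concatenating a BiLSTM's forward and backward outputs yields a last dimension of 2*lstm_dim, which is what the question assumes:

    import tensorflow as tf

    lstm_dim = 100
    inputs = tf.zeros([8, 20, 120])  # hypothetical [batch_size, num_steps, emb_size]
    fw = tf.contrib.rnn.LSTMCell(lstm_dim)
    bw = tf.contrib.rnn.LSTMCell(lstm_dim)
    # bidirectional_dynamic_rnn returns a (forward, backward) pair of outputs
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, inputs, dtype=tf.float32)
    lstm_outputs = tf.concat(outputs, axis=-1)
    print(lstm_outputs.shape)  # (8, 20, 200) == [batch_size, num_steps, 2*lstm_dim]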

TF1.2 restore bug

Hi,
Did you encounter a bug like the following:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4341,100] rhs shape= [3637,100]

when running python main.py?

After shrinking the dataset: ValueError: setting an array element with a sequence.

Traceback (most recent call last):

File "", line 1, in
runfile('E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py', wdir='E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj')

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 246, in
train()

File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 192, in train
step, batch_loss = model.run_step(sess, True, batch)

File "E:\【重点代码】ChineseNER-master-bishe\Gradu_Prj\model.py", line 221, in run_step
feed_dict)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1097, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)

ValueError: setting an array element with a sequence.

I deleted some of the sentences from the three files example.train, example.test, and example.dev, saved them as txt files, and got this error when running.

input_from_line has a bug

def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line.replace(" ", "$")
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs
The call

    line.replace(" ", "$")

has no effect: str.replace returns a new string, so line is left unchanged. Should it be changed to

    line = re.sub(r'\s', '$', line)

?
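
A minimal demonstration of the no-op and of the proposed fix (the sample string is hypothetical):

    import re

    # str.replace does not mutate the string; it returns a new one, so
    # calling it without assigning the result back is a no-op.
    line = "北京 大学"
    line.replace(" ", "$")
    print(line)                       # 北京 大学  (unchanged)

    line = re.sub(r'\s', '$', line)   # assign the result back
    print(line)                       # 北京$大学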

outputs, final_states = tf.nn.bidirectional_dynamic_rnn() raises NotImplementedError: Abstract method every time

Hello, I would like to study your code, but while test-running it I hit an error I cannot resolve. Earlier, the line using rnn_cell_impl.LSTMStateTuple reported that the method could not be found; I fixed that by switching to tf.contrib.rnn.LSTMStateTuple. But every time execution reaches outputs, final_states = tf.nn.bidirectional_dynamic_rnn(lstm_cell["forward"], lstm_cell["backward"], lstm_inputs, dtype=tf.float32, sequence_length=lengths), it raises NotImplementedError: Abstract method, and I cannot find the cause. I hope you can help, thanks.
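
For comparison, a minimal TF 1.x sketch that builds concrete cells and runs bidirectional_dynamic_rnn; NotImplementedError: Abstract method usually means an abstract RNNCell class (or a mismatched import of one) was instantiated instead of a concrete cell such as tf.contrib.rnn.LSTMCell. All names and shapes below are hypothetical:

    import tensorflow as tf

    lstm_dim = 100
    lstm_cell = {}
    for direction in ["forward", "backward"]:
        with tf.variable_scope(direction):
            # A concrete cell class; instantiating the abstract RNNCell
            # base class is what triggers "Abstract method".
            lstm_cell[direction] = tf.contrib.rnn.LSTMCell(lstm_dim)

    lstm_inputs = tf.zeros([8, 20, 120])  # hypothetical [batch, steps, dim]
    lengths = tf.fill([8], 20)            # hypothetical sequence lengths
    outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
        lstm_cell["forward"], lstm_cell["backward"],
        lstm_inputs, dtype=tf.float32, sequence_length=lengths)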

Why does project_layer use two hidden layers?

The project_layer method at line 135 of model.py defines a hidden layer with weights of shape [self.lstm_dim*2, self.lstm_dim], and then a pred layer of shape [self.lstm_dim, self.num_tags].

Why not define just a single hidden layer of shape [self.lstm_dim*2, self.num_tags]?
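
For reference, a sketch of the two-stage projection being asked about (TF 1.x; the variable names are hypothetical, not the repo's):

    import tensorflow as tf

    def project_layer(lstm_outputs, lstm_dim, num_tags):
        # lstm_outputs: [batch_size, num_steps, 2*lstm_dim]
        output = tf.reshape(lstm_outputs, shape=[-1, lstm_dim * 2])
        # Stage 1: compress the BiLSTM output with a tanh nonlinearity ...
        W = tf.get_variable("W_hidden", shape=[lstm_dim * 2, lstm_dim])
        b = tf.get_variable("b_hidden", shape=[lstm_dim],
                            initializer=tf.zeros_initializer())
        hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))
        # Stage 2: ... then map to tag scores. A single [2*lstm_dim, num_tags]
        # matrix would also type-check; the intermediate layer only adds an
        # extra nonlinearity between the BiLSTM and the logits.
        W_p = tf.get_variable("W_pred", shape=[lstm_dim, num_tags])
        b_p = tf.get_variable("b_pred", shape=[num_tags],
                              initializer=tf.zeros_initializer())
        return tf.nn.xw_plus_b(hidden, W_p, b_p)  # [batch*steps, num_tags]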

main.py runtime error

[screenshot of the error omitted]

Hello, when I run main.py from the command line I get the error shown in the screenshot above. How can I fix it?

Error: absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed

File "/home/PycharmProjects/NER/ChineseNER-master/main.py", line 54, in
assert FLAGS.clip < 5.1, "gradient clip should't be too much"
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in getattr
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed.

Why should we expand the shape of logits to [self.num_tags + 1, self.num_tags + 1] ?

For example, when defining the loss function, you pad logits and targets with an extra tag so that the CRF transition matrix has shape [self.num_tags + 1, self.num_tags + 1].

def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss"  if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths+1)
        return tf.reduce_mean(-log_likelihood)
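
To make the padding concrete, a small NumPy sketch of the resulting shapes (sizes are hypothetical; in the real code the targets come from self.targets):

    import numpy as np

    batch_size, num_steps, num_tags = 2, 5, 3
    small = -1000.0
    project_logits = np.zeros([batch_size, num_steps, num_tags], dtype=np.float32)

    # One extra tag column per step (always -1000 at real positions) ...
    pad_logits = small * np.ones([batch_size, num_steps, 1], dtype=np.float32)
    logits = np.concatenate([project_logits, pad_logits], axis=-1)

    # ... and one extra leading step where only that extra tag is viable.
    start_logits = np.concatenate(
        [small * np.ones([batch_size, 1, num_tags], dtype=np.float32),
         np.zeros([batch_size, 1, 1], dtype=np.float32)], axis=-1)
    logits = np.concatenate([start_logits, logits], axis=1)

    # Targets get the extra "start" tag id (num_tags) prepended.
    targets = np.concatenate(
        [num_tags * np.ones([batch_size, 1], dtype=np.int32),
         np.zeros([batch_size, num_steps], dtype=np.int32)], axis=-1)

    print(logits.shape, targets.shape)  # (2, 6, 4) (2, 6)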

But in fact, the model seems to work fine with the original, unpadded logits and targets, as in the code below. So what is the purpose of the padding? Thanks!

def loss_layer(self, project_logits, lengths, name=None):
    self.trans = tf.get_variable(
        "transitions",
        shape=[self.num_tags, self.num_tags],
        initializer=self.initializer)
    log_likelihood, self.trans = crf_log_likelihood(
        inputs=self.logits,
        tag_indices=self.targets,
        transition_params=self.trans,
        sequence_lengths=lengths)
    return tf.reduce_mean(-log_likelihood)

Dataset

Why do I get the following error when training with my own dataset?
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Incremental training

Hi:
The model currently supports three entity types. To extend it to more entities I need to add the corresponding corpora and retrain, but as the number of entity types grows, training time grows with it. After adding entity classes, how can I do incremental training to reduce the training time?

Runtime error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
File "main.py", line 227, in
if __name__ == "__main__":
File "C:\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 221, in main
clean(FLAGS)
File "main.py", line 187, in train

File "main.py", line 87, in evaluate
ner_results = model.evaluate(sess, data, id_to_tag)
File "C:\pyproject\ChineseNER-master\utils.py", line 66, in test_ner
eval_lines = return_report(output_file)
File "C:\pyproject\ChineseNER-master\conlleval.py", line 284, in return_report
counts = evaluate(f)
File "C:\pyproject\ChineseNER-master\conlleval.py", line 74, in evaluate
for line in iterable:
File "C:\Python35\lib\codecs.py", line 711, in next
return next(self.reader)
File "C:\Python35\lib\codecs.py", line 642, in next
line = self.readline()
File "C:\Python35\lib\codecs.py", line 555, in readline
data = self.read(readsize, firstline=True)
File "C:\Python35\lib\codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 2358, in
globals = debugger.run(setup['file'], None, None, is_module)
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 1778, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "ChineseNER-master/main.py", line 225, in
tf.app.run(main)
File "tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "ChineseNER-master/main.py", line 219, in main
train()
File "ChineseNER-master/main.py", line 185, in train
best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger)
File "ChineseNER-master/main.py", line 85, in evaluate
eval_lines = test_ner(ner_results, FLAGS.result_path)
File "ChineseNER-master\utils.py", line 66, in test_ner
eval_lines = return_report(output_file)
File "ChineseNER-master\conlleval.py", line 282, in return_report
counts = evaluate(f)
File "ChineseNER-master\conlleval.py", line 74, in evaluate
for line in iterable:
File "tensorflow\lib\codecs.py", line 713, in next
return next(self.reader)
File "tensorflow\lib\codecs.py", line 644, in next
line = self.readline()
File "tensorflow\lib\codecs.py", line 557, in readline
data = self.read(readsize, firstline=True)
File "tensorflow\lib\codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

I am on TensorFlow 1.3. Has anyone run into a similar problem? Any solutions?
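
Byte 0xa3 is common in GBK-encoded text, so one likely cause is that the prediction file consumed by conlleval was written with the Windows default encoding but read back as UTF-8. A sketch of the usual remedy, pinning UTF-8 when the result file is written (file name and contents hypothetical):

    import codecs

    # Hypothetical tagged output; each inner list is one sentence's lines.
    results = [["中 B-LOC B-LOC", "国 I-LOC I-LOC"]]

    # Write the prediction file explicitly as UTF-8 so the UTF-8 reader
    # in conlleval.py can decode it even when the platform default is GBK.
    with codecs.open("ner_predict.utf8", "w", encoding="utf-8") as f:
        for block in results:
            f.write("\n".join(block) + "\n\n")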

Error when training with default parameters

As the title says, the following error appeared while training the model:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\cloudy\AppData\Local\Temp\jieba.cache
Loading model cost 1.237 seconds.
Prefix dict has been built succesfully.
Found 4313 unique words (979180 in total)
Loading pretrained embeddings from wiki_100.utf8...
Found 13 unique named entity tags
20864 / 0 / 4636 sentences in train / dev / test.
Traceback (most recent call last):
File "main.py", line 225, in <module>
tf.app.run(main)
File "D:\Anaconda3\envs\keras\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 219, in main
train()
File "main.py", line 150, in train
train_manager = BatchManager(train_data, FLAGS.batch_size)
File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 285, in __init__
self.batch_data = self.sort_and_pad(data, batch_size)
File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 293, in sort_and_pad
batch_data.append(self.pad_data(sorted_data[i*batch_size: (i+1)*batch_size]))
TypeError: slice indices must be integers or None or have an __index__ method
I tried a few fixes myself without success. Any help would be much appreciated!
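
This TypeError is typical of Python 3, where / returns a float that then gets used as a slice index in sort_and_pad. A minimal sketch of the usual fix, assuming the batch count in data_utils.py is computed with a plain division:

    import math

    data = list(range(10))   # stand-in for the sorted training data
    batch_size = 3

    # Under Python 3, len(data) / batch_size is a float and cannot be used
    # as a slice index; force an integer batch count instead.
    num_batch = int(math.ceil(len(data) / batch_size))
    batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(num_batch)]
    print(num_batch)  # 4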

What format is required for test sentences?

line = input("请输入测试句子:")
print line
result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag)

What format requirements apply to the input test sentence?
Entering Chinese (北京***) raises an error.
Entering digits (3232132312) raises an error.

What is the purpose of the seg features in the embedding layer?

    self.seg_lookup = tf.get_variable(
        name="seg_embedding",
        shape=[self.num_segs, self.seg_dim],
        initializer=self.initializer)

What is the effect of adding these lines to the embedding layer, together with the line embed = tf.concat(embedding, axis=-1)?
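
For context, a sketch of how such seg features are typically produced; this appears to mirror get_seg_features in data_utils.py, which encodes jieba word boundaries as 0/1/2/3 (hence num_segs == 4), and those ids are what index seg_lookup before being concatenated with the character embeddings:

    import jieba

    def get_seg_features(string):
        # 0 = single-character word, 1 = word begin, 2 = word middle, 3 = word end
        seg_feature = []
        for word in jieba.cut(string):
            if len(word) == 1:
                seg_feature.append(0)
            else:
                tmp = [2] * len(word)
                tmp[0] = 1
                tmp[-1] = 3
                seg_feature.extend(tmp)
        return seg_feature

    print(get_seg_features("我爱北京天安门"))
    # e.g. [0, 0, 1, 3, 1, 2, 3], depending on the jieba dictionary

The intent, then, is for the concatenated embed to carry both character identity and word-boundary information at each position.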

NameError: name 'os' is not defined

Hello! I ran your code but hit the error below:


Traceback (most recent call last):
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 225, in
if name == "main":
File "D:\perhack\Anaconda3\envs\my_pytorch\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 219, in main
clean(FLAGS)
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 114, in train
# create maps if not exist
NameError: name 'os' is not defined



I have the os module installed, and it imports correctly on its own. What's wrong here?

About the dataset

Hi, could you explain how the sighan.dev dataset differs from the training set and the test set?

At prediction time, the word/start/end in the results do not match the actual text.

Sentence: 他的检验报告等。
Annotation: "报告"
Position: 4, 6
Using the evaluate_line method in model.py produces results such as:

  1. word: 报告 start: 3 end: 6
  2. word: 验报告 start: 4 end: 6
  3. word: 检验报告 start: 4 end: 6
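
For comparison, a generic BIO-to-span decoder (not the repo's result_to_json; the tag names are hypothetical) that shows how start/end offsets should line up with the sentence:

    def bio_to_spans(chars, tags):
        spans, start, prev_type = [], None, None
        for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
            if start is not None and (tag == "O" or tag.startswith("B-")):
                spans.append({"word": "".join(chars[start:i]),
                              "start": start, "end": i, "type": prev_type})
                start = None
            if tag.startswith("B-"):
                start, prev_type = i, tag[2:]
        return spans

    print(bio_to_spans(list("他的检验报告等。"),
                       ["O", "O", "O", "O", "B-TERM", "I-TERM", "O", "O"]))
    # [{'word': '报告', 'start': 4, 'end': 6, 'type': 'TERM'}]

If evaluate_line returns start: 3 end: 6 for 报告, either the predicted tags or the tag-to-span conversion disagree with this convention.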

Changes needed for TensorFlow 1.10

  • TensorFlow 1.10 has removed rnn_cell from tensorflow.python.ops; the closest equivalent is tensorflow.contrib.rnn. You can change the fourth line of model.py to import tensorflow.contrib.rnn as rnn_cell (a quick-and-dirty fix).

  • The argument order of tf.concat() has changed, so every rnn_inputs = tf.concat(2, [rnn_inputs, self.features]) should become rnn_inputs = tf.concat([rnn_inputs, self.features], 2).

  • tf.batch_matmul() has been removed and should be replaced with tf.matmul(). A combined sketch of these migrations follows this list.
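
Taken together, a minimal sketch of the migrated calls (TF 1.x; the tensors and shapes are hypothetical stand-ins for the repo's rnn_inputs and self.features):

    import tensorflow as tf
    import tensorflow.contrib.rnn as rnn_cell  # replaces the removed rnn_cell module

    cell = rnn_cell.LSTMCell(100)        # the old rnn_cell.* classes live here now

    rnn_inputs = tf.zeros([8, 20, 100])  # [batch, steps, char_dim]
    features = tf.zeros([8, 20, 20])     # [batch, steps, seg_dim]

    # tf.concat now takes the tensor list first and the axis second:
    rnn_inputs = tf.concat([rnn_inputs, features], 2)

    # tf.batch_matmul was removed; tf.matmul handles batched inputs directly:
    weights = tf.zeros([8, 120, 13])
    logits = tf.matmul(rnn_inputs, weights)  # [8, 20, 13]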

About sentence entity extraction results

Is the workflow just: run python main.py, type a sentence at the input prompt, and read off the computed result?
The extracted entities do not look very good; is there some other way to run it? Thanks!

Input test sentence: 老张开车去东北玩。
Result:
[{'end': 3, 'start': 1, 'type': 'PER', 'word': '老张开'},
{'end': 4, 'start': 1, 'type': 'PER', 'word': '车'},
{'end': 5, 'start': 4, 'type': 'LOC', 'word': '去'},
{'end': 6, 'start': 5, 'type': 'LOC', 'word': '东'},
{'end': 7, 'start': 6, 'type': 'LOC', 'word': '北'},
{'end': 8, 'start': 7, 'type': 'LOC', 'word': '玩'},
{'end': 9, 'start': 8, 'type': 'LOC', 'word': '。'}]

The role of wiki_100.utf8

The model uses the vectors provided in wiki_100. English words such as chanel get split into the characters c, h, a, n, e, l. Is there a way to improve the handling of English input?

Is the embedding fed to the model for <UNK> randomly initialized?


def create_model(session, Model_class, path, load_vec, config, id_to_char, logger):
    # create model, reuse parameters if exists
    model = Model_class(config)
    ckpt = tf.train.get_checkpoint_state(path)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        logger.info("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
        if config["pre_emb"]:
            emb_weights = session.run(model.char_lookup.read_value())
            emb_weights = load_vec(config["emb_file"], id_to_char, config["char_dim"], emb_weights)
            session.run(model.char_lookup.assign(emb_weights))
            logger.info("Load pre-trained embedding.")
    return model

Sorry for the trouble, and thanks!
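
If load_vec behaves like a typical word2vec loader, then yes: only rows for characters present in the embedding file are overwritten, and characters missing from it, including <UNK>, keep the random initialization from tf.global_variables_initializer(). A hypothetical sketch (not the repo's exact load_word2vec):

    import numpy as np

    def load_vec_sketch(emb_lines, id_to_char, char_dim, old_weights):
        # Collect vectors from the embedding file's lines.
        pre_trained = {}
        for line in emb_lines:
            parts = line.rstrip().split()
            if len(parts) == char_dim + 1:
                pre_trained[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        # Overwrite only the rows that have a pre-trained vector.
        for i, ch in id_to_char.items():
            if ch in pre_trained:
                old_weights[i] = pre_trained[ch]
        return old_weights

    weights = np.random.rand(2, 3).astype(np.float32)  # random init, as in the model
    weights = load_vec_sketch(["北 0.1 0.2 0.3"], {0: "<UNK>", 1: "北"}, 3, weights)
    print(weights[0])  # <UNK> row: still its random initialization
    print(weights[1])  # [0.1 0.2 0.3]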

About structure of word/character in Chinese

  • Could you explain the structure of words in Chinese to me? Is it similar to English (a sentence consists of several words, and a word is a combination of several characters)?
  • Which dataset did you post in the data folder as example.train / .dev / .test?
  • In Python, can we use the code below to extract words and characters from a sentence:
for word in sentence:
    for char in word:
        # do something
    if word.lower() == word:
        # do something
    if word[0].upper() == word:
        # do something

Thank you in advance!

pre-trained embedding not used in input layer?

Hi,
I noticed that the pre-trained embedding file is not used in the embedding layer; a plain lookup table generates the character embeddings and seg embeddings, and the pre-trained embeddings only appear during char_to_id generation. I want to know whether I have misunderstood this. If not, why not use the pre-trained embeddings to generate the input? Thanks!

A question about versions

  1. The ChineseNER package works very well and achieves high scores.
  2. But there is one problem: if python3 is changed to python2 (e.g. python3 main.py becomes python2 main.py), the following error appears. Is there a solution?
    Caused by op u'char_embedding/concat', defined at:
    File "main.py", line 232, in
    tf.app.run(main)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
    File "main.py", line 227, in main
    evaluate_line()
    File "main.py", line 199, in evaluate_line
    model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/utils.py", line 174, in create_model
    model = Model_class(config)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 54, in init
    embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 110, in embedding_layer
    embed = tf.concat(embedding, axis=-1)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1048, in concat
    name=name)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 495, in _concat_v2
    name=name)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,10,100] vs. shape[1] = [1,6,20]
[[Node: char_embedding/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](char_embedding/embedding_lookup, char_embedding/seg_embedding/embedding_lookup, char_embedding/concat/axis)]]

About the wiki_100.utf8 file

Hello, sorry to bother you. Most of the Chinese embeddings I have seen are word2vec "word" vectors, but mainstream Chinese NER methods currently operate at the "character" level. May I ask how your character vectors were trained? Is there any reference material I could consult?

Question about the demo training corpus

Hi! A beginner question: what tool was used to convert the training and prediction corpora in origin_data into that format? Could you share the code? Thanks!
