
ChineseNER's Issues

On the dimension of the input vector to project_layer in model.py

Hello. In the project_layer method in model.py, the comment at line 138 reads :param lstm_outputs: [batch_size, num_steps, emb_size]. Since project_layer sits between the BiLSTM layer and the logits layer, its input should be the output of bilstm_layer, i.e. [batch_size, num_steps, 2*lstm_dim]. Is my understanding correct?
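
For reference, a minimal TF 1.x sketch (all shapes hypothetical) showing that concatenating a BiLSTM's forward and backward outputs yields a last dimension of 2*lstm_dim, which is what the question assumes:

    import tensorflow as tf

    lstm_dim = 100
    inputs = tf.zeros([8, 20, 120])  # hypothetical [batch_size, num_steps, emb_size]
    fw = tf.contrib.rnn.LSTMCell(lstm_dim)
    bw = tf.contrib.rnn.LSTMCell(lstm_dim)
    # bidirectional_dynamic_rnn returns a (forward, backward) pair of outputs
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(fw, bw, inputs, dtype=tf.float32)
    lstm_outputs = tf.concat(outputs, axis=-1)
    print(lstm_outputs.shape)  # (8, 20, 200) == [batch_size, num_steps, 2*lstm_dim]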

TF1.2 restore bug

Hi,
Did you encounter a bug like the following:

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4341,100] rhs shape= [3637,100]

when running python main.py?

After shrinking the dataset: ValueError: setting an array element with a sequence.

Traceback (most recent call last):

File "", line 1, in
runfile('E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py', wdir='E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj')

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 246, in
train()

File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 192, in train
step, batch_loss = model.run_step(sess, True, batch)

File "E:\【重点代码】ChineseNER-master-bishe\Gradu_Prj\model.py", line 221, in run_step
feed_dict)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
run_metadata_ptr)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1097, in _run
np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)

File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)

ValueError: setting an array element with a sequence.

I deleted some of the sentences from the three files example.train, example.test, and example.dev, saved them as txt files, and got this error when running.

input_from_line has a bug

def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line.replace(" ", "$")
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs
The call

    line.replace(" ", "$")

has no effect: str.replace returns a new string, so line is left unchanged. Should it be changed to

    line = re.sub(r'\s', '$', line)

?
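
A minimal demonstration of the no-op and of the proposed fix (the sample string is hypothetical):

    import re

    # str.replace does not mutate the string; it returns a new one, so
    # calling it without assigning the result back is a no-op.
    line = "北京 大学"
    line.replace(" ", "$")
    print(line)                       # 北京 大学  (unchanged)

    line = re.sub(r'\s', '$', line)   # assign the result back
    print(line)                       # 北京$大学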

outputs, final_states = tf.nn.bidirectional_dynamic_rnn() raises NotImplementedError: Abstract method every time

Hello, I would like to study your code, but while test-running it I hit an error I cannot resolve. Earlier, the line using rnn_cell_impl.LSTMStateTuple reported that the method could not be found; I fixed that by switching to tf.contrib.rnn.LSTMStateTuple. But every time execution reaches outputs, final_states = tf.nn.bidirectional_dynamic_rnn(lstm_cell["forward"], lstm_cell["backward"], lstm_inputs, dtype=tf.float32, sequence_length=lengths), it raises NotImplementedError: Abstract method, and I cannot find the cause. I hope you can help, thanks.
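
For comparison, a minimal TF 1.x sketch that builds concrete cells and runs bidirectional_dynamic_rnn; NotImplementedError: Abstract method usually means an abstract RNNCell class (or a mismatched import of one) was instantiated instead of a concrete cell such as tf.contrib.rnn.LSTMCell. All names and shapes below are hypothetical:

    import tensorflow as tf

    lstm_dim = 100
    lstm_cell = {}
    for direction in ["forward", "backward"]:
        with tf.variable_scope(direction):
            # A concrete cell class; instantiating the abstract RNNCell
            # base class is what triggers "Abstract method".
            lstm_cell[direction] = tf.contrib.rnn.LSTMCell(lstm_dim)

    lstm_inputs = tf.zeros([8, 20, 120])  # hypothetical [batch, steps, dim]
    lengths = tf.fill([8], 20)            # hypothetical sequence lengths
    outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
        lstm_cell["forward"], lstm_cell["backward"],
        lstm_inputs, dtype=tf.float32, sequence_length=lengths)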

Why does project_layer use two hidden layers?

The project_layer method at line 135 of model.py defines a hidden layer with weights of shape [self.lstm_dim*2, self.lstm_dim], and then a pred layer of shape [self.lstm_dim, self.num_tags].

Why not define just a single hidden layer of shape [self.lstm_dim*2, self.num_tags]?
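
For reference, a sketch of the two-stage projection being asked about (TF 1.x; the variable names are hypothetical, not the repo's):

    import tensorflow as tf

    def project_layer(lstm_outputs, lstm_dim, num_tags):
        # lstm_outputs: [batch_size, num_steps, 2*lstm_dim]
        output = tf.reshape(lstm_outputs, shape=[-1, lstm_dim * 2])
        # Stage 1: compress the BiLSTM output with a tanh nonlinearity ...
        W = tf.get_variable("W_hidden", shape=[lstm_dim * 2, lstm_dim])
        b = tf.get_variable("b_hidden", shape=[lstm_dim],
                            initializer=tf.zeros_initializer())
        hidden = tf.tanh(tf.nn.xw_plus_b(output, W, b))
        # Stage 2: ... then map to tag scores. A single [2*lstm_dim, num_tags]
        # matrix would also type-check; the intermediate layer only adds an
        # extra nonlinearity between the BiLSTM and the logits.
        W_p = tf.get_variable("W_pred", shape=[lstm_dim, num_tags])
        b_p = tf.get_variable("b_pred", shape=[num_tags],
                              initializer=tf.zeros_initializer())
        return tf.nn.xw_plus_b(hidden, W_p, b_p)  # [batch*steps, num_tags]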

main.py runtime error

[screenshot of the error omitted]

Hello, when I run main.py from the command line I get the error shown in the screenshot above. How can I fix it?

Error: absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed

File "/home/PycharmProjects/NER/ChineseNER-master/main.py", line 54, in
assert FLAGS.clip < 5.1, "gradient clip should't be too much"
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in getattr
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed.

Why should we expand the shape of logits to [self.num_tags + 1, self.num_tags + 1] ?

For example, when defining the loss function, you pad logits and targets with an extra tag so that the CRF transition matrix has shape [self.num_tags + 1, self.num_tags + 1].

def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss"  if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths+1)
        return tf.reduce_mean(-log_likelihood)
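
To make the padding concrete, a small NumPy sketch of the resulting shapes (sizes are hypothetical; in the real code the targets come from self.targets):

    import numpy as np

    batch_size, num_steps, num_tags = 2, 5, 3
    small = -1000.0
    project_logits = np.zeros([batch_size, num_steps, num_tags], dtype=np.float32)

    # One extra tag column per step (always -1000 at real positions) ...
    pad_logits = small * np.ones([batch_size, num_steps, 1], dtype=np.float32)
    logits = np.concatenate([project_logits, pad_logits], axis=-1)

    # ... and one extra leading step where only that extra tag is viable.
    start_logits = np.concatenate(
        [small * np.ones([batch_size, 1, num_tags], dtype=np.float32),
         np.zeros([batch_size, 1, 1], dtype=np.float32)], axis=-1)
    logits = np.concatenate([start_logits, logits], axis=1)

    # Targets get the extra "start" tag id (num_tags) prepended.
    targets = np.concatenate(
        [num_tags * np.ones([batch_size, 1], dtype=np.int32),
         np.zeros([batch_size, num_steps], dtype=np.int32)], axis=-1)

    print(logits.shape, targets.shape)  # (2, 6, 4) (2, 6)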

But in fact, the model seems to work fine with the original, unpadded logits and targets, as in the code below. So what is the purpose of the padding? Thanks!

def loss_layer(self, project_logits, lengths, name=None):
    self.trans = tf.get_variable(
        "transitions",
        shape=[self.num_tags, self.num_tags],
        initializer=self.initializer)
    log_likelihood, self.trans = crf_log_likelihood(
        inputs=self.logits,
        tag_indices=self.targets,
        transition_params=self.trans,
        sequence_lengths=lengths)
    return tf.reduce_mean(-log_likelihood)

Dataset

Why do I get the following error when training with my own dataset?
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Incremental training

Hi:
The model currently supports three entity types. To extend it to more entities I need to add the corresponding corpora and retrain, but as the number of entity types grows, training time grows with it. After adding entity classes, how can I do incremental training to reduce the training time?

Runtime error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
File "main.py", line 227, in
if __name__ == "__main__":
File "C:\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 221, in main
clean(FLAGS)
File "main.py", line 187, in train

File "main.py", line 87, in evaluate
ner_results = model.evaluate(sess, data, id_to_tag)
File "C:\pyproject\ChineseNER-master\utils.py", line 66, in test_ner
eval_lines = return_report(output_file)
File "C:\pyproject\ChineseNER-master\conlleval.py", line 284, in return_report
counts = evaluate(f)
File "C:\pyproject\ChineseNER-master\conlleval.py", line 74, in evaluate
for line in iterable:
File "C:\Python35\lib\codecs.py", line 711, in next
return next(self.reader)
File "C:\Python35\lib\codecs.py", line 642, in next
line = self.readline()
File "C:\Python35\lib\codecs.py", line 555, in readline
data = self.read(readsize, firstline=True)
File "C:\Python35\lib\codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 2358, in
globals = debugger.run(setup['file'], None, None, is_module)
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 1778, in run
pydev_imports.execfile(file, globals, locals) # execute the script
File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev_pydev_imps_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "ChineseNER-master/main.py", line 225, in
tf.app.run(main)
File "tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "ChineseNER-master/main.py", line 219, in main
train()
File "ChineseNER-master/main.py", line 185, in train
best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger)
File "ChineseNER-master/main.py", line 85, in evaluate
eval_lines = test_ner(ner_results, FLAGS.result_path)
File "ChineseNER-master\utils.py", line 66, in test_ner
eval_lines = return_report(output_file)
File "ChineseNER-master\conlleval.py", line 282, in return_report
counts = evaluate(f)
File "ChineseNER-master\conlleval.py", line 74, in evaluate
for line in iterable:
File "tensorflow\lib\codecs.py", line 713, in next
return next(self.reader)
File "tensorflow\lib\codecs.py", line 644, in next
line = self.readline()
File "tensorflow\lib\codecs.py", line 557, in readline
data = self.read(readsize, firstline=True)
File "tensorflow\lib\codecs.py", line 501, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

I am on TensorFlow 1.3. Has anyone run into a similar problem? Any solutions?
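
Byte 0xa3 is common in GBK-encoded text, so one likely cause is that the prediction file consumed by conlleval was written with the Windows default encoding but read back as UTF-8. A sketch of the usual remedy, pinning UTF-8 when the result file is written (file name and contents hypothetical):

    import codecs

    # Hypothetical tagged output; each inner list is one sentence's lines.
    results = [["中 B-LOC B-LOC", "国 I-LOC I-LOC"]]

    # Write the prediction file explicitly as UTF-8 so the UTF-8 reader
    # in conlleval.py can decode it even when the platform default is GBK.
    with codecs.open("ner_predict.utf8", "w", encoding="utf-8") as f:
        for block in results:
            f.write("\n".join(block) + "\n\n")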

Error when training with default parameters

As the title says, the following error appeared while training the model:
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\cloudy\AppData\Local\Temp\jieba.cache
Loading model cost 1.237 seconds.
Prefix dict has been built succesfully.
Found 4313 unique words (979180 in total)
Loading pretrained embeddings from wiki_100.utf8...
Found 13 unique named entity tags
20864 / 0 / 4636 sentences in train / dev / test.
Traceback (most recent call last):
File "main.py", line 225, in <module>
tf.app.run(main)
File "D:\Anaconda3\envs\keras\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
_sys.exit(main(argv))
File "main.py", line 219, in main
train()
File "main.py", line 150, in train
train_manager = BatchManager(train_data, FLAGS.batch_size)
File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 285, in __init__
self.batch_data = self.sort_and_pad(data, batch_size)
File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 293, in sort_and_pad
batch_data.append(self.pad_data(sorted_data[i*batch_size: (i+1)*batch_size]))
TypeError: slice indices must be integers or None or have an __index__ method
I tried a few fixes myself without success. Any help would be much appreciated!
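
This TypeError is typical of Python 3, where / returns a float that then gets used as a slice index in sort_and_pad. A minimal sketch of the usual fix, assuming the batch count in data_utils.py is computed with a plain division:

    import math

    data = list(range(10))   # stand-in for the sorted training data
    batch_size = 3

    # Under Python 3, len(data) / batch_size is a float and cannot be used
    # as a slice index; force an integer batch count instead.
    num_batch = int(math.ceil(len(data) / batch_size))
    batches = [data[i * batch_size:(i + 1) * batch_size] for i in range(num_batch)]
    print(num_batch)  # 4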

What format is required for test sentences?

line = input("请输入测试句子:")
print line
result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag)

What format requirements apply to the input test sentence?
Entering Chinese (北京***) raises an error.
Entering digits (3232132312) raises an error.

What is the purpose of the seg features in the embedding layer?

    self.seg_lookup = tf.get_variable(
        name="seg_embedding",
        shape=[self.num_segs, self.seg_dim],
        initializer=self.initializer)

What is the effect of adding these lines to the embedding layer, together with the line embed = tf.concat(embedding, axis=-1)?
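
For context, a sketch of how such seg features are typically produced; this appears to mirror get_seg_features in data_utils.py, which encodes jieba word boundaries as 0/1/2/3 (hence num_segs == 4), and those ids are what index seg_lookup before being concatenated with the character embeddings:

    import jieba

    def get_seg_features(string):
        # 0 = single-character word, 1 = word begin, 2 = word middle, 3 = word end
        seg_feature = []
        for word in jieba.cut(string):
            if len(word) == 1:
                seg_feature.append(0)
            else:
                tmp = [2] * len(word)
                tmp[0] = 1
                tmp[-1] = 3
                seg_feature.extend(tmp)
        return seg_feature

    print(get_seg_features("我爱北京天安门"))
    # e.g. [0, 0, 1, 3, 1, 2, 3], depending on the jieba dictionary

The intent, then, is for the concatenated embed to carry both character identity and word-boundary information at each position.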

NameError: name 'os' is not defined

Hello! I ran your code but hit the error below:


Traceback (most recent call last):
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 225, in
if name == "main":
File "D:\perhack\Anaconda3\envs\my_pytorch\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 219, in main
clean(FLAGS)
File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 114, in train
# create maps if not exist
NameError: name 'os' is not defined



I have the os module installed, and it imports correctly on its own. What's wrong here?

About the dataset

Hi, could you explain how the sighan.dev dataset differs from the training set and the test set?

At prediction time, the word/start/end in the results do not match the actual text.

Sentence: 他的检验报告等。
Annotation: "报告"
Position: 4, 6
Using the evaluate_line method in model.py produces results such as:

  1. word: 报告 start: 3 end: 6
  2. word: 验报告 start: 4 end: 6
  3. word: 检验报告 start: 4 end: 6
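
For comparison, a generic BIO-to-span decoder (not the repo's result_to_json; the tag names are hypothetical) that shows how start/end offsets should line up with the sentence:

    def bio_to_spans(chars, tags):
        spans, start, prev_type = [], None, None
        for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
            if start is not None and (tag == "O" or tag.startswith("B-")):
                spans.append({"word": "".join(chars[start:i]),
                              "start": start, "end": i, "type": prev_type})
                start = None
            if tag.startswith("B-"):
                start, prev_type = i, tag[2:]
        return spans

    print(bio_to_spans(list("他的检验报告等。"),
                       ["O", "O", "O", "O", "B-TERM", "I-TERM", "O", "O"]))
    # [{'word': '报告', 'start': 4, 'end': 6, 'type': 'TERM'}]

If evaluate_line returns start: 3 end: 6 for 报告, either the predicted tags or the tag-to-span conversion disagree with this convention.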

Changes needed for TensorFlow 1.10

  • TensorFlow 1.10 has removed rnn_cell from tensorflow.python.ops; the closest equivalent is tensorflow.contrib.rnn. You can change the fourth line of model.py to import tensorflow.contrib.rnn as rnn_cell (a quick-and-dirty fix).

  • The argument order of tf.concat() has changed, so every rnn_inputs = tf.concat(2, [rnn_inputs, self.features]) should become rnn_inputs = tf.concat([rnn_inputs, self.features], 2).

  • tf.batch_matmul() has been removed and should be replaced with tf.matmul(). A combined sketch of these migrations follows this list.
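
Taken together, a minimal sketch of the migrated calls (TF 1.x; the tensors and shapes are hypothetical stand-ins for the repo's rnn_inputs and self.features):

    import tensorflow as tf
    import tensorflow.contrib.rnn as rnn_cell  # replaces the removed rnn_cell module

    cell = rnn_cell.LSTMCell(100)        # the old rnn_cell.* classes live here now

    rnn_inputs = tf.zeros([8, 20, 100])  # [batch, steps, char_dim]
    features = tf.zeros([8, 20, 20])     # [batch, steps, seg_dim]

    # tf.concat now takes the tensor list first and the axis second:
    rnn_inputs = tf.concat([rnn_inputs, features], 2)

    # tf.batch_matmul was removed; tf.matmul handles batched inputs directly:
    weights = tf.zeros([8, 120, 13])
    logits = tf.matmul(rnn_inputs, weights)  # [8, 20, 13]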

About sentence entity extraction results

Is the workflow just: run python main.py, type a sentence at the input prompt, and read off the computed result?
The extracted entities do not look very good; is there some other way to run it? Thanks!

Input test sentence: 老张开车去东北玩。
Result:
[{'end': 3, 'start': 1, 'type': 'PER', 'word': '老张开'},
{'end': 4, 'start': 1, 'type': 'PER', 'word': '车'},
{'end': 5, 'start': 4, 'type': 'LOC', 'word': '去'},
{'end': 6, 'start': 5, 'type': 'LOC', 'word': '东'},
{'end': 7, 'start': 6, 'type': 'LOC', 'word': '北'},
{'end': 8, 'start': 7, 'type': 'LOC', 'word': '玩'},
{'end': 9, 'start': 8, 'type': 'LOC', 'word': '。'}]

The role of wiki_100.utf8

The model uses the vectors provided in wiki_100. English words such as chanel get split into the characters c, h, a, n, e, l. Is there a way to improve the handling of English input?

Is the embedding fed to the model for <UNK> randomly initialized?


def create_model(session, Model_class, path, load_vec, config, id_to_char, logger):
    # create model, reuse parameters if exists
    model = Model_class(config)
    ckpt = tf.train.get_checkpoint_state(path)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        logger.info("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
        if config["pre_emb"]:
            emb_weights = session.run(model.char_lookup.read_value())
            emb_weights = load_vec(config["emb_file"], id_to_char, config["char_dim"], emb_weights)
            session.run(model.char_lookup.assign(emb_weights))
            logger.info("Load pre-trained embedding.")
    return model

Sorry for the trouble, and thanks!
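
If load_vec behaves like a typical word2vec loader, then yes: only rows for characters present in the embedding file are overwritten, and characters missing from it, including <UNK>, keep the random initialization from tf.global_variables_initializer(). A hypothetical sketch (not the repo's exact load_word2vec):

    import numpy as np

    def load_vec_sketch(emb_lines, id_to_char, char_dim, old_weights):
        # Collect vectors from the embedding file's lines.
        pre_trained = {}
        for line in emb_lines:
            parts = line.rstrip().split()
            if len(parts) == char_dim + 1:
                pre_trained[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        # Overwrite only the rows that have a pre-trained vector.
        for i, ch in id_to_char.items():
            if ch in pre_trained:
                old_weights[i] = pre_trained[ch]
        return old_weights

    weights = np.random.rand(2, 3).astype(np.float32)  # random init, as in the model
    weights = load_vec_sketch(["北 0.1 0.2 0.3"], {0: "<UNK>", 1: "北"}, 3, weights)
    print(weights[0])  # <UNK> row: still its random initialization
    print(weights[1])  # [0.1 0.2 0.3]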

About structure of word/character in Chinese

  • Could you explain the structure of words in Chinese to me? Is it similar to English (a sentence consists of several words, and a word is a combination of several characters)?
  • Which dataset did you post in the data folder as example.train / .dev / .test?
  • In Python, can we use the code below to extract words and characters from a sentence:
for word in sentence:
    for char in word:
        # do something
    if word.lower() == word:
        # do something
    if word[0].upper() == word:
        # do something

Thank you in advance!

pre-trained embedding not used in input layer?

Hi,
I noticed that the pre-trained embedding file is not used in the embedding layer; a plain lookup table generates the character embeddings and seg embeddings, and the pre-trained embeddings only appear during char_to_id generation. I want to know whether I have misunderstood this. If not, why not use the pre-trained embeddings to generate the input? Thanks!

A question about versions

  1. The ChineseNER package works very well and achieves high scores.
  2. But there is one problem: if python3 is changed to python2 (e.g. python3 main.py becomes python2 main.py), the following error appears. Is there a solution?
    Caused by op u'char_embedding/concat', defined at:
    File "main.py", line 232, in
    tf.app.run(main)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
    File "main.py", line 227, in main
    evaluate_line()
    File "main.py", line 199, in evaluate_line
    model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/utils.py", line 174, in create_model
    model = Model_class(config)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 54, in init
    embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config)
    File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 110, in embedding_layer
    embed = tf.concat(embedding, axis=-1)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1048, in concat
    name=name)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 495, in _concat_v2
    name=name)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
    File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in init
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,10,100] vs. shape[1] = [1,6,20]
[[Node: char_embedding/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](char_embedding/embedding_lookup, char_embedding/seg_embedding/embedding_lookup, char_embedding/concat/axis)]]

About the wiki_100.utf8 file

Hello, sorry to bother you. Most of the Chinese embeddings I have seen are word2vec "word" vectors, but mainstream Chinese NER methods currently operate at the "character" level. May I ask how your character vectors were trained? Is there any reference material I could consult?

Question about the demo training corpus

Hi! A beginner question: what tool was used to convert the training and prediction corpora in origin_data into that format? Could you share the code? Thanks!
