
chinesener's People

Contributors

zjy-ucas

chinesener's Issues

pre-trained embedding not used in input layer?

Hi,
I noticed that the pre-trained embedding file is not used in the embedding layer; a lookup table is used to generate the character and seg embeddings, and the pre-trained embeddings are only used when building char_to_id. I want to know whether I have misunderstood this. If so, why not use the pre-trained embeddings to generate the input? Thanks!
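For reference, the pre-trained vectors do appear to reach the embedding layer indirectly: create_model (quoted in a later issue below) assigns them into the char lookup variable after initialization. A minimal TF1-style sketch of that pattern, with made-up shapes:

import numpy as np
import tensorflow as tf

# the char lookup is created with a random initializer...
vocab_size, char_dim = 5, 4
char_lookup = tf.get_variable("char_embedding", [vocab_size, char_dim])

# ...and then overwritten with rows parsed from an embedding file
# such as wiki_100.utf8 (dummy values here)
pretrained = np.ones([vocab_size, char_dim], dtype=np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(char_lookup.assign(pretrained))  # embedding layer now uses them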

What is the purpose of using seg in the embedding layer?

self.seg_lookup = tf.get_variable(
    name="seg_embedding",
    shape=[self.num_segs, self.seg_dim],
    initializer=self.initializer)

What is the purpose of adding these lines to the embedding layer, together with the line embed = tf.concat(embedding, axis=-1)?
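For context, a sketch with illustrative shapes of what those lines do: the seg inputs are word-segmentation features that get their own small embedding, concatenated onto the char embedding at each position, adding a weak word-boundary signal on top of pure characters:

import tensorflow as tf

char_inputs = tf.placeholder(tf.int32, [None, None])   # [batch, num_steps]
seg_inputs = tf.placeholder(tf.int32, [None, None])    # [batch, num_steps]

char_lookup = tf.get_variable("char_emb", [4000, 100])
seg_lookup = tf.get_variable("seg_emb", [4, 20])       # num_segs=4, seg_dim=20

embedding = [
    tf.nn.embedding_lookup(char_lookup, char_inputs),  # [batch, steps, 100]
    tf.nn.embedding_lookup(seg_lookup, seg_inputs),    # [batch, steps, 20]
]
embed = tf.concat(embedding, axis=-1)                  # [batch, steps, 120]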

A question about the error absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed

File "/home/PycharmProjects/NER/ChineseNER-master/main.py", line 54, in
assert FLAGS.clip < 5.1, "gradient clip should't be too much"
File "/usr/local/lib/python3.5/dist-packages/absl/flags/_flagvalues.py", line 488, in getattr
raise _exceptions.UnparsedFlagAccessError(error_message)
absl.flags._exceptions.UnparsedFlagAccessError: Trying to access flag --clip before flags were parsed.
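A sketch of one common fix (assuming the flags are defined via tf.app.flags): parse argv explicitly before any module-level access to FLAGS, or move the assert into main(), which tf.app.run() only calls after parsing:

import sys
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_float("clip", 5.0, "gradient clip")
FLAGS = flags.FLAGS

FLAGS(sys.argv)  # force parsing; without this, module-level FLAGS.clip raises
assert FLAGS.clip < 5.1, "gradient clip shouldn't be too much"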

Incremental training

Hi:
The model currently supports three entity types. To extend it to more, I need to add the corresponding training data, but as the number of entity types grows, so does training time. After adding entity classes, how can I do incremental training to cut down the training time?
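Not an authoritative answer, but a common warm-start pattern is sketched below (hypothetical paths). One caveat: adding entity types changes num_tags, so the projection and CRF transition variables no longer match the old checkpoint and would need to be excluded and re-initialized:

import tensorflow as tf

w = tf.get_variable("shared_layer_w", [10, 10])  # stand-in for reusable weights
saver = tf.train.Saver()  # in practice: a Saver over only the reusable variables

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ckpt = tf.train.get_checkpoint_state("ckpt/")
    if ckpt and ckpt.model_checkpoint_path:
        saver.restore(sess, ckpt.model_checkpoint_path)  # warm start
    # ...continue the normal training loop on the extended corpus...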

After shrinking the dataset, error: ValueError: setting an array element with a sequence.

Traceback (most recent call last):

  File "", line 1, in <module>
    runfile('E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py', wdir='E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj')

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
    execfile(filename, namespace)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 246, in <module>
    train()

  File "E:/【重点代码】ChineseNER-master-bishe/Gradu_Prj/main.py", line 192, in train
    step, batch_loss = model.run_step(sess, True, batch)

  File "E:\【重点代码】ChineseNER-master-bishe\Gradu_Prj\model.py", line 221, in run_step
    feed_dict)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 895, in run
    run_metadata_ptr)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1097, in _run
    np_val = np.asarray(subfeed_val, dtype=subfeed_dtype)

  File "E:\anaconda INSTALL\envs\tensorflow\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
    return array(a, dtype, copy=False, order=order)

ValueError: setting an array element with a sequence.

I deleted part of the sentences in example.train, example.test, and example.dev, saved the result as txt files, and then got this error at runtime.
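For reference, the usual cause of this ValueError (not specific to this dataset) is a feed_dict containing lists of unequal length, which np.asarray cannot turn into a rectangular array. A minimal reproduction and fix:

import numpy as np

batch = [[1, 2, 3], [4, 5], [6]]
# np.asarray(batch, dtype=np.int32)  # raises: setting an array element with a sequence.

max_len = max(len(s) for s in batch)
padded = [s + [0] * (max_len - len(s)) for s in batch]
print(np.asarray(padded, dtype=np.int32).shape)  # (3, 3)

So after editing the data files, the thing to check is whether any sentence lost its blank-line separator or ended up with mismatched char/tag lengths, leaving ragged batches for pad_data.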

outputs, final_states = tf.nn.bidirectional_dynamic_rnn() raises NotImplementedError: Abstract method every time it runs

Hello, I wanted to study your code, but while test-running it I hit an error I cannot resolve. Earlier, the line with rnn_cell_impl.LSTMStateTuple reported that the method could not be found, which I fixed by switching to tf.contrib.rnn.LSTMStateTuple. But every time execution reaches outputs, final_states = tf.nn.bidirectional_dynamic_rnn(lstm_cell["forward"], lstm_cell["backward"], lstm_inputs, dtype=tf.float32, sequence_length=lengths), it raises NotImplementedError: Abstract method, and I cannot find the cause. I would appreciate your help, thanks!
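For what it's worth, this error usually means an abstract RNNCell, or the cell class itself rather than an instance, was handed to bidirectional_dynamic_rnn. A sketch of a setup that avoids it (hypothetical shapes):

import tensorflow as tf

lstm_inputs = tf.placeholder(tf.float32, [None, None, 120])
lengths = tf.placeholder(tf.int32, [None])

lstm_cell = {
    "forward": tf.contrib.rnn.LSTMCell(100),   # note: instances, not classes,
    "backward": tf.contrib.rnn.LSTMCell(100),  # and a separate cell per direction
}
outputs, final_states = tf.nn.bidirectional_dynamic_rnn(
    lstm_cell["forward"], lstm_cell["backward"],
    lstm_inputs, dtype=tf.float32, sequence_length=lengths)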

About the wiki_100.utf8 file

Hello, sorry to bother you. Most of the Chinese work I have seen uses word2vec 'word' vectors, but for Chinese NER the mainstream approaches work at the 'character' level, so I would like to ask how your character vectors were trained. Is there any reference material on this?
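Not the author, but one common recipe is to split the corpus into characters and train ordinary word2vec over those character tokens. A sketch assuming gensim (pre-4.0 argument names; newer versions use vector_size, and this is not necessarily how wiki_100.utf8 itself was produced):

from gensim.models import Word2Vec

sentences = ["今天天气很好", "我爱北京天安门"]
char_sents = [list(s) for s in sentences]  # every character becomes a token

model = Word2Vec(char_sents, size=100, window=5, min_count=1, sg=1)
model.wv.save_word2vec_format("char_vec.utf8")  # same text layout as wiki_100.utf8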

main.py runtime error

[screenshot of the main.py error output]

Hello, when I run main.py from the command line I get the error shown in the screenshot above. How should I fix it?

What format must the input sentences have at test time?

line = input("请输入测试句子:")
print(line)
result = model.evaluate_line(sess, input_from_line(line, char_to_id), id_to_tag)

What format requirements are there for the input test sentence?
Entering Chinese, e.g. 北京***, raises an error.
Entering digits, e.g. 3232132312, raises an error.

TF1.2 restore bug

Hi,
Did you encounter a bug like the following when running python main.py?

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [4341,100] rhs shape= [3637,100]

The role of wiki_100.utf8

The model uses the vectors provided in wiki_100. English words such as chanel get split into c, h, a, n, e, l. Is there a way to improve the handling of English input?

Error when training with the default parameters

As the title says, the following error came up while training the model:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\cloudy\AppData\Local\Temp\jieba.cache
Loading model cost 1.237 seconds.
Prefix dict has been built succesfully.
Found 4313 unique words (979180 in total)
Loading pretrained embeddings from wiki_100.utf8...
Found 13 unique named entity tags
20864 / 0 / 4636 sentences in train / dev / test.
Traceback (most recent call last):
  File "main.py", line 225, in <module>
    tf.app.run(main)
  File "D:\Anaconda3\envs\keras\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "main.py", line 219, in main
    train()
  File "main.py", line 150, in train
    train_manager = BatchManager(train_data, FLAGS.batch_size)
  File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 285, in __init__
    self.batch_data = self.sort_and_pad(data, batch_size)
  File "C:\Users\cloudy\Desktop\ChineseNER\data_utils.py", line 293, in sort_and_pad
    batch_data.append(self.pad_data(sorted_data[i*batch_size: (i+1)*batch_size]))
TypeError: slice indices must be integers or None or have an __index__ method

I tried a few approaches myself but couldn't solve it. I'd be very grateful for help!
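For reference, the usual fix: under Python 3, len(data) / batch_size is a float, so the slice bounds in sort_and_pad need an explicit int cast. A sketch of the repaired function (simplified from the repo's version):

import math

def sort_and_pad(data, batch_size):
    num_batch = int(math.ceil(len(data) / batch_size))  # int(...) added
    sorted_data = sorted(data, key=lambda x: len(x[0]))
    return [sorted_data[int(i * batch_size): int((i + 1) * batch_size)]
            for i in range(num_batch)]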

A question about versions

  1. The ChineseNER package works very well and achieves high scores.
  2. One problem, though: if the python3 in the run commands is changed to python2 (e.g. python3 main.py becomes python2 main.py), it fails with the error below. Is there a known solution?
Caused by op u'char_embedding/concat', defined at:
  File "main.py", line 232, in <module>
    tf.app.run(main)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 227, in main
    evaluate_line()
  File "main.py", line 199, in evaluate_line
    model = create_model(sess, Model, FLAGS.ckpt_path, load_word2vec, config, id_to_char, logger)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/utils.py", line 174, in create_model
    model = Model_class(config)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 54, in __init__
    embedding = self.embedding_layer(self.char_inputs, self.seg_inputs, config)
  File "/data00/home/dengjiangdong/workspace/lab_basic_ner_v1/model.py", line 110, in embedding_layer
    embed = tf.concat(embedding, axis=-1)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/array_ops.py", line 1048, in concat
    name=name)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 495, in _concat_v2
    name=name)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/data00/home/dengjiangdong/miniconda3/envs/py2_tf/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,10,100] vs. shape[1] = [1,6,20]
[[Node: char_embedding/concat = ConcatV2[N=2, T=DT_FLOAT, Tidx=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](char_embedding/embedding_lookup, char_embedding/seg_embedding/embedding_lookup, char_embedding/concat/axis)]]

Why should we expand the shape of logits to [self.num_tags + 1, self.num_tags + 1] ?

For example, when defining the loss function, you pad logits and targets with an extra tag, so the transition matrix becomes [self.num_tags + 1, self.num_tags + 1].

def loss_layer(self, project_logits, lengths, name=None):
    """
    calculate crf loss
    :param project_logits: [1, num_steps, num_tags]
    :return: scalar loss
    """
    with tf.variable_scope("crf_loss"  if not name else name):
        small = -1000.0
        # pad logits for crf loss
        start_logits = tf.concat(
            [small * tf.ones(shape=[self.batch_size, 1, self.num_tags]), tf.zeros(shape=[self.batch_size, 1, 1])], axis=-1)
        pad_logits = tf.cast(small * tf.ones([self.batch_size, self.num_steps, 1]), tf.float32)
        logits = tf.concat([project_logits, pad_logits], axis=-1)
        logits = tf.concat([start_logits, logits], axis=1)
        targets = tf.concat(
            [tf.cast(self.num_tags*tf.ones([self.batch_size, 1]), tf.int32), self.targets], axis=-1)
        self.trans = tf.get_variable(
            "transitions",
            shape=[self.num_tags + 1, self.num_tags + 1],
            initializer=self.initializer)
        log_likelihood, self.trans = crf_log_likelihood(
            inputs=logits,
            tag_indices=targets,
            transition_params=self.trans,
            sequence_lengths=lengths+1)
        return tf.reduce_mean(-log_likelihood)

But in fact, the model works fine with the original logits and targets, as in the code below, so what is the purpose of doing this? Thanks!

def loss_layer(self, project_logits, lengths, name=None):
    self.trans = tf.get_variable(
        "transitions",
        shape=[self.num_tags, self.num_tags],
        initializer=self.initializer)
    log_likelihood, self.trans = crf_log_likelihood(
        inputs=self.logits,
        tag_indices=self.targets,
        transition_params=self.trans,
        sequence_lengths=lengths)
    return tf.reduce_mean(-log_likelihood)

NameError: name 'os' is not defined

Hello! I ran your code but got the error below:


Traceback (most recent call last):
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 225, in <module>
    if __name__ == "__main__":
  File "D:\perhack\Anaconda3\envs\my_pytorch\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 219, in main
    clean(FLAGS)
  File "F:/yyhaker/software/project/NamedEntityRecognition/src/ChineseNER/main.py", line 114, in train
    # create maps if not exist
NameError: name 'os' is not defined



I have installed the os module, and it imports correctly on its own. What's going wrong?
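os ships with the Python standard library, so there is nothing to install; the NameError simply means the module that raised it never imported it. A minimal illustration of the one-line fix at the top of main.py (or whichever file raised the error):

import os

print(os.path.exists("ckpt"))  # the kind of call that was failing before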

About sentence entity extraction results

Do I just run python main.py and type a sentence at the prompt to see the computed result?
The extracted entities don't look very good; is there something else I should be doing? Thanks!

Input test sentence: 老张开车去东北玩。
Result:
[{'end': 3, 'start': 1, 'type': 'PER', 'word': '老张开'},
{'end': 4, 'start': 1, 'type': 'PER', 'word': '车'},
{'end': 5, 'start': 4, 'type': 'LOC', 'word': '去'},
{'end': 6, 'start': 5, 'type': 'LOC', 'word': '东'},
{'end': 7, 'start': 6, 'type': 'LOC', 'word': '北'},
{'end': 8, 'start': 7, 'type': 'LOC', 'word': '玩'},
{'end': 9, 'start': 8, 'type': 'LOC', 'word': '。'}]

At prediction time, word, start, and end in the result do not match the actual text.

Sentence: 他的检验报告等。
Annotation: "报告"
Position: 4, 6
Using the evaluate_line method in model.py produces results like:

  1. word: 报告  start: 3  end: 6
  2. word: 验报告  start: 4  end: 6
  3. word: 检验报告  start: 4  end: 6

Changes needed for tensorflow 1.10

  • tensorflow 1.10 has removed rnn_cell from tensorflow.python.ops; tensorflow.contrib.rnn provides similar functionality. Line 4 of model.py can be changed to import tensorflow.contrib.rnn as rnn_cell (a quick-and-dirty fix).

  • The argument order of tf.concat() changed: every rnn_inputs = tf.concat(2, [rnn_inputs, self.features]) should become rnn_inputs = tf.concat([rnn_inputs, self.features], 2).

  • tf.batch_matmul() has been removed; use tf.matmul() instead. (All three changes are shown together in the sketch after this list.)
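An illustrative sketch applying the three changes above at once (placeholder shapes):

import tensorflow as tf
import tensorflow.contrib.rnn as rnn_cell  # replaces the removed rnn_cell in tensorflow.python.ops

rnn_inputs = tf.placeholder(tf.float32, [None, None, 100])
features = tf.placeholder(tf.float32, [None, None, 20])

# tf.concat: tensor list first, axis last
rnn_inputs = tf.concat([rnn_inputs, features], 2)

# tf.batch_matmul is gone; tf.matmul batches over leading dimensions
a = tf.placeholder(tf.float32, [None, 4, 5])
b = tf.placeholder(tf.float32, [None, 5, 6])
c = tf.matmul(a, b)  # [None, 4, 6]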

Is the embedding fed into the model for <UNK> randomly initialized?

def create_model(session, Model_class, path, load_vec, config, id_to_char, logger):
    # create model, reuse parameters if exists
    model = Model_class(config)
    ckpt = tf.train.get_checkpoint_state(path)
    if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
        logger.info("Reading model parameters from %s" % ckpt.model_checkpoint_path)
        model.saver.restore(session, ckpt.model_checkpoint_path)
    else:
        logger.info("Created model with fresh parameters.")
        session.run(tf.global_variables_initializer())
        if config["pre_emb"]:
            emb_weights = session.run(model.char_lookup.read_value())
            emb_weights = load_vec(config["emb_file"], id_to_char, config["char_dim"], emb_weights)
            session.run(model.char_lookup.assign(emb_weights))
            logger.info("Load pre-trained embedding.")
    return model

Sorry for the trouble, and thanks!

input_from_line has a bug

def input_from_line(line, char_to_id):
    """
    Take sentence data and return an input for
    the training or the evaluation function.
    """
    line = full_to_half(line)
    line = replace_html(line)
    inputs = list()
    inputs.append([line])
    line.replace(" ", "$")
    inputs.append([[char_to_id[char] if char in char_to_id else char_to_id["<UNK>"]
                   for char in line]])
    inputs.append([get_seg_features(line)])
    inputs.append([[]])
    return inputs
line.replace(" ", "$")

has no effect; line is unchanged. Should it be changed to

line = re.sub('\s', '$', line)

?
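Either form works; the underlying point is that Python strings are immutable, so str.replace returns a new string that must be assigned back. A minimal sketch:

line = "北京 欢迎 你"
line = line.replace(" ", "$")  # the minimal fix for the dropped result

# or, to normalize tabs and other whitespace too (the re.sub variant above):
import re
line = re.sub(r"\s", "$", line)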

About the dimension of the input vector to project_layer in model.py

Hello. The comment on line 138 in the project_layer method of model.py says :param lstm_outputs: [batch_size, num_steps, emb_size]. Since project_layer sits between the bilstm layer and the logits layer, its input should be the output of bilstm_layer, i.e. [batch_size, num_steps, 2*lstm_dim]. Is my understanding correct?

Question about the demo training corpus

Hello! A beginner question: what tool was used to organize the training and prediction corpora in origin_data into that format? Could you share the code? Thanks!
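Not an official answer, but for reference the conversion is usually only a few lines. A sketch assuming a hypothetical bracket-style source annotation like "{北京/LOC}欢迎你" (the repo's actual source format may differ); it emits one "char tag" pair per line, BIO-style, which is the layout of example.train:

import re

def to_bio(sentence):
    lines, pos = [], 0
    for m in re.finditer(r"\{(.+?)/([A-Z]+)\}", sentence):
        lines += ["%s O" % ch for ch in sentence[pos:m.start()]]  # untagged span
        word, tag = m.group(1), m.group(2)
        lines += ["%s %s-%s" % (ch, "B" if i == 0 else "I", tag)
                  for i, ch in enumerate(word)]                   # tagged entity
        pos = m.end()
    lines += ["%s O" % ch for ch in sentence[pos:]]
    return "\n".join(lines)

print(to_bio("{北京/LOC}欢迎你"))  # 北 B-LOC / 京 I-LOC / 欢 O / 迎 O / 你 O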

Runtime error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
  File "main.py", line 227, in <module>
    if __name__ == "__main__":
  File "C:\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 126, in run
    _sys.exit(main(argv))
  File "main.py", line 221, in main
    clean(FLAGS)
  File "main.py", line 187, in train

  File "main.py", line 87, in evaluate
    ner_results = model.evaluate(sess, data, id_to_tag)
  File "C:\pyproject\ChineseNER-master\utils.py", line 66, in test_ner
    eval_lines = return_report(output_file)
  File "C:\pyproject\ChineseNER-master\conlleval.py", line 284, in return_report
    counts = evaluate(f)
  File "C:\pyproject\ChineseNER-master\conlleval.py", line 74, in evaluate
    for line in iterable:
  File "C:\Python35\lib\codecs.py", line 711, in __next__
    return next(self.reader)
  File "C:\Python35\lib\codecs.py", line 642, in __next__
    line = self.readline()
  File "C:\Python35\lib\codecs.py", line 555, in readline
    data = self.read(readsize, firstline=True)
  File "C:\Python35\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

About the datasets

Hi, could you explain how the sighan.dev dataset differs from the training set and the test set?

'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

Traceback (most recent call last):
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 2358, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\pydevd.py", line 1778, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "E:\python2.7\pycharm\PyCharm 4.5.5\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "ChineseNER-master/main.py", line 225, in <module>
    tf.app.run(main)
  File "tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "ChineseNER-master/main.py", line 219, in main
    train()
  File "ChineseNER-master/main.py", line 185, in train
    best = evaluate(sess, model, "dev", dev_manager, id_to_tag, logger)
  File "ChineseNER-master/main.py", line 85, in evaluate
    eval_lines = test_ner(ner_results, FLAGS.result_path)
  File "ChineseNER-master\utils.py", line 66, in test_ner
    eval_lines = return_report(output_file)
  File "ChineseNER-master\conlleval.py", line 282, in return_report
    counts = evaluate(f)
  File "ChineseNER-master\conlleval.py", line 74, in evaluate
    for line in iterable:
  File "tensorflow\lib\codecs.py", line 713, in next
    return next(self.reader)
  File "tensorflow\lib\codecs.py", line 644, in next
    line = self.readline()
  File "tensorflow\lib\codecs.py", line 557, in readline
    data = self.read(readsize, firstline=True)
  File "tensorflow\lib\codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa3 in position 0: invalid start byte

I'm on tensorflow 1.3. Has anyone run into a similar problem? Any solutions?
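A sketch of one common remedy, under the assumption (suggested by byte 0xa3) that one of the data or result files on disk is GBK/GB2312-encoded rather than UTF-8; the filename is a placeholder:

import io

# re-encode the offending file to UTF-8 once, up front
with io.open("some_file.utf8", "r", encoding="gbk") as f:
    text = f.read()
with io.open("some_file.utf8", "w", encoding="utf-8") as f:
    f.write(text)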

Why does project_layer use two hidden layers?

The project_layer method at line 135 of model.py defines a hidden layer with shape [self.lstm_dim*2, self.lstm_dim] and then a pred layer with shape [self.lstm_dim, self.num_tags].

Why not define a single hidden layer with shape [self.lstm_dim*2, self.num_tags]?
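For comparison, a sketch of the single-layer alternative the question proposes (illustrative shapes; whether the extra hidden layer helps accuracy is an empirical question):

import tensorflow as tf

lstm_dim, num_tags = 100, 13
lstm_outputs = tf.placeholder(tf.float32, [None, None, 2 * lstm_dim])

# flatten time steps, then project straight from 2*lstm_dim to num_tags
output = tf.reshape(lstm_outputs, [-1, 2 * lstm_dim])
W = tf.get_variable("proj_W", [2 * lstm_dim, num_tags])
b = tf.get_variable("proj_b", [num_tags])
pred = tf.nn.xw_plus_b(output, W, b)  # [batch*num_steps, num_tags]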

About structure of word/character in Chinese

  • Could you explain the structure of words in Chinese? Is it similar to English (a sentence consists of several words, and a word is a combination of several characters)?
  • Which dataset did you post in the data folder: example.train/ .dev/ .test?
  • In Python, can we use the code below to extract words and characters from a sentence?
for word in sentence:
    for char in word:
        # do something
    if word.lower() == word:
        # do something
    if word[0].upper() == word:
        # do something

Thank you in advance!

Datasets

Why do I get the following error when training on my own dataset?
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.
