
plm-nlp-code's People

Contributors

carfly, jiangfeng1124, ymcui


plm-nlp-code's Issues

Chapter 3: question about the sent_split function

from ltp import StnSplit
from ltp import LTP

ltp = LTP()

sents2 = StnSplit().batch_split(["南京市长江大桥。", "汤姆生病了。他去了医院。"])
sents2

['南京市长江大桥。', '汤姆生病了。', '他去了医院。']

segment = ltp.pipeline(sents2, tasks=['cws'], return_dict=False)
segment

([['南京市', '长江', '大桥', '。'],
['汤姆', '生病', '了', '。'],
['他', '去', '了', '医院', '。']],)

segment = ltp.pipeline(sents2)
segment['pos']

[['ns', 'ns', 'n', 'wp'], ['nh', 'v', 'u', 'wp'], ['r', 'v', 'u', 'n', 'wp']]

Hi, Chapter 4's lstm_sent_polarity.py fails to run

As the title says, running it directly raises:
RuntimeError: 'lengths' argument should be a 1D CPU int64 tensor, but got 1D cuda:0 Long tensor

The 'lengths' tensor has to be moved to the CPU for it to work.

Change line 40 from
x_pack = pack_padded_sequence(embeddings, lengths, batch_first=True, enforce_sorted=False)
to
x_pack = pack_padded_sequence(embeddings, lengths.cpu(), batch_first=True, enforce_sorted=False)

Similarly, on a machine with both CUDA and CPU available, transformer_sent_polarity.py fails with
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Line 25 of Chapter 4's utils.py,
mask = torch.arange(max_len).expand(lengths.shape[0], max_len) < lengths.unsqueeze(1)
should likewise be changed to:
mask = torch.arange(max_len).expand(lengths.shape[0], max_len).cuda() < lengths.unsqueeze(1)
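A more portable fix than hard-coding `.cuda()` is to build the position indices on whatever device `lengths` already lives on, so the code runs unchanged on CPU-only machines. The sketch below is a reconstruction of the mask logic under that assumption, not the book's original utils.py:

```python
import torch

def length_to_mask(lengths):
    # Build the position indices on the same device as `lengths`,
    # so the comparison never mixes CPU and CUDA tensors.
    max_len = lengths.max().item()
    positions = torch.arange(max_len, device=lengths.device)
    return positions.expand(lengths.shape[0], max_len) < lengths.unsqueeze(1)

mask = length_to_mask(torch.tensor([2, 3, 1]))
# mask[i, j] is True for the first lengths[i] positions of row i
```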

Chapter 7: fine-tuning code optimization — SSC task drops from 36 hours to 2 hours on CPU

Hello, I found a spot in the Chapter 7 code that can be optimized. In the tokenizer call, padding='max_length' can be removed; it wastes computation. The data_collator parameter of the Trainer constructor provided by transformers defaults to dynamic padding, which pads each batch only to its own longest sequence and saves compute.

On my CPU, the estimated time dropped from 36 hours to 2 hours (I did not run to completion; the estimate comes from the progress bar).
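Conceptually, the per-batch padding that transformers' DataCollatorWithPadding performs (as I understand it, the Trainer's default collator when a tokenizer is supplied) looks like this pure-Python sketch; the token ids are made up for illustration:

```python
def pad_batch(batch, pad_id=0):
    # Dynamic padding: pad only up to the longest sequence in THIS batch,
    # not to the model's maximum length (e.g. 512).
    max_len = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

# A batch of short sentences now costs length-5 tensors instead of length-512.
batch = pad_batch([[101, 2769, 102], [101, 2769, 4263, 872, 102]])
# -> [[101, 2769, 102, 0, 0], [101, 2769, 4263, 872, 102]]
```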

Errata I have collected (updating)

4.2.2
In one line of code,
outputs_pool2 = pool1(outputs2) — pool1 should be pool2.
The code from git clone may already be correct and this may only be a printing error; I have not verified.

4.5.1
The formula does not fully reflect the Bernoulli distribution.
"More fundamentally, the right-hand side of the cross-entropy loss formula is the log-likelihood function (Log-Likelihood) from maximum-likelihood estimation over the distribution of the outputs (a Bernoulli distribution)."
[formula image omitted; the complete binary cross-entropy term is -y_j^(i) log ŷ_j^(i) - (1 - y_j^(i)) log(1 - ŷ_j^(i))]
When y_j^(i) = 0, the term should be -(1 - y_j^(i)) log(1 - ŷ_j^(i)); the author wrote only half of it (the y = 1 part). The surrounding conclusion is still correct.

4.5.2
Original sentence: "log_probs = F.log_softmax(outputs, dim=1)  # take the log to avoid softmax overflow"
Actually, taking the log serves another purpose: the nn.NLLLoss used later applies no log itself (by default, NLLLoss only negates and gathers the target probabilities).
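This can be checked directly: feeding F.log_softmax outputs into nn.NLLLoss reproduces nn.CrossEntropyLoss exactly, confirming that NLLLoss performs no log of its own. A small self-contained check:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
outputs = torch.randn(4, 3)           # 4 examples, 3 classes (raw scores)
targets = torch.tensor([0, 2, 1, 2])

# NLLLoss only negates and gathers: loss = mean(-log_probs[i, targets[i]]).
log_probs = F.log_softmax(outputs, dim=1)
loss_nll = nn.NLLLoss()(log_probs, targets)

# CrossEntropyLoss fuses log_softmax and NLLLoss into one step.
loss_ce = nn.CrossEntropyLoss()(outputs, targets)

assert torch.allclose(loss_nll, loss_ce)
```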

Chapter 3, Section 3.4.3.1: wikiextractor problems

Installation has quite a few problems (corpus: https://dumps.wikimedia.org/zhwiki/latest/)

1. If you hit an error like
   raise source.error('global flags not at the start '
   re.error: global flags not at the start of the expression at position 4
   downgrade Python to 3.10 (my Anaconda environment used 3.11 and kept failing).

Example:
conda create --name py310 python=3.10
conda activate py310
pip install wikiextractor

2. If python -m wikiextractor.WikiExtractor jawiki-latest-pages-articles.xml.bz2 runs for a long time, printing lines like
   ...xxx pages ...
   ...xxx pages ...
   and then suddenly fails with a 'fork'-related error, one workaround is:

pip install git+https://github.com/prokotg/wikiextractor

This rolls wikiextractor back from 3.0.6 to 3.0.4, after which

python -m wikiextractor.WikiExtractor jawiki-latest-pages-articles.xml.bz2

runs fine.

3.2.1: the LTP word segmentation example is broken

from ltp import LTP
ltp = LTP()
# segment, hidden = ltp.seg(['南京市长江大桥。'])  # raises an error
# change to:
segment = ltp.pipeline(['南京市长江大桥。'], tasks=['cws'], return_dict=False)
print(segment)

Code in Section 7.4.4.2 fails to run

When finetune_bert_mrc.py loads its data, it raises the following error:
ConnectionError: Couldn't reach https://raw.githubusercontent.com/huggingface/datasets/1.10.2/datasets/squad/squad.py
The cause is that this URL is unreachable from mainland China.

Comment in ffnnlm.py appears to be wrong

Based on the book's chapter structure, the first-line comment of Chapter 5's ffnnlm.py should read # Defined in Section 5.1.3.2.

Chapter 5 rnnlm.py runs out of GPU memory

As the title says: even after reducing batch_size to 32, running this code on a 3060 still exhausts GPU memory. Is there any way to optimize this? Thanks.
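One generic mitigation, offered here as a hedged sketch rather than a fix from the book's code, is gradient accumulation: split each batch into micro-batches so peak activation memory shrinks while the effective batch size stays the same. The Linear model below is only a stand-in for the RNN language model:

```python
import torch

# Simulate a batch of 32 with 4 micro-batches of 8, cutting peak
# activation memory roughly 4x at the cost of more forward passes.
accum_steps = 4

model = torch.nn.Linear(10, 2)        # stand-in for the RNN LM
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.CrossEntropyLoss()

optimizer.zero_grad()
for step in range(accum_steps):
    inputs = torch.randn(8, 10)       # one micro-batch
    targets = torch.randint(0, 2, (8,))
    loss = criterion(model(inputs), targets) / accum_steps
    loss.backward()                   # gradients accumulate across micro-batches
optimizer.step()
```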

Chapter 5 code question

In Chapter 5, class GloveDataset defines
def collate_fn(self, examples)
but examples never appears anywhere else in the file.
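The examples argument is supplied by PyTorch's DataLoader, not by the book's code: when a collate_fn is passed to the DataLoader, it is called with the list of samples drawn for each batch. A minimal self-contained illustration (ToyDataset is a made-up stand-in for GloveDataset):

```python
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, i):
        return self.data[i]
    def collate_fn(self, examples):
        # `examples` is the list of items the DataLoader drew for one batch.
        return list(examples)

dataset = ToyDataset([1, 2, 3, 4])
loader = DataLoader(dataset, batch_size=2, collate_fn=dataset.collate_fn)
batches = list(loader)
# batches == [[1, 2], [3, 4]]
```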
