
fastnlp's Introduction

fastNLP

fastNLP is a lightweight natural language processing (NLP) toolkit. Its goal is to reduce the amount of engineering code in user projects, such as data-processing loops, training loops, and multi-GPU execution.

fastNLP has the following features:

  • Convenient. During data processing, the apply function lets you avoid writing loops and use multiple processes for speed; the training loop is easy to customize.
  • Efficient. Switch to fp16, multi-GPU training, ZeRO optimization, and more, without changing your code.
  • Compatible. fastNLP supports multiple deep learning frameworks as backends.

⚠️ To stay compatible with different deep learning frameworks, fastNLP versions after 1.0.0 use a redesigned architecture. They are therefore not fully compatible with earlier fastNLP releases, and code written against older fastNLP versions needs some adjustment.

fastNLP Documentation

Chinese documentation

Installation Guide

fastNLP can be installed with the following command (the requirement is quoted so that the shell does not interpret >= as a redirection):

pip install "fastNLP>=1.0.0alpha"

To install an earlier version of fastNLP, specify the version number, for example:

pip install fastNLP==0.7.1

In addition, install the deep learning framework corresponding to the backend you plan to use.
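For example (the version floors below match the examples in this README; consult each framework's installation page for the build that matches your hardware):

pip install "torch>=1.6.0"                            # PyTorch backend
pip install "paddlepaddle>=2.2.0" "paddlenlp>=2.3.3"  # Paddle backend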

Pytorch: below is a text-classification example using pytorch. It requires torch>=1.6.0.
from fastNLP.io import ChnSentiCorpLoader
from functools import partial
from fastNLP import cache_results
from fastNLP.transformers.torch import BertTokenizer

# The cache_results decorator caches the return value of prepare_data to caches/cache.pkl. On later runs,
# if that file still exists, the cached result is read back automatically instead of re-running the preprocessing.
@cache_results('caches/cache.pkl')
def prepare_data():
    # The data is downloaded automatically; as documented, the returned dataset contains the "raw_chars" and "target" fields
    data_bundle = ChnSentiCorpLoader().load()
    # Tokenize the data with the tokenizer
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm')
    tokenize = partial(tokenizer, max_length=256)  # cap the maximum sequence length
    data_bundle.apply_field_more(tokenize, field_name='raw_chars', num_proc=4)  # adds "input_ids", "attention_mask", etc. as new fields of the dataset
    data_bundle.apply_field(int, field_name='target', new_field_name='labels')  # apply int to each target and store the result in a new labels field
    return data_bundle
data_bundle = prepare_data()
print(data_bundle.get_dataset('train')[:4])

# Initialize the model and optimizer
from fastNLP.transformers.torch import BertForSequenceClassification
from torch import optim
model = BertForSequenceClassification.from_pretrained('hfl/chinese-bert-wwm')
optimizer = optim.AdamW(model.parameters(), lr=2e-5)

# Prepare the dataloaders
from fastNLP import prepare_dataloader
dls = prepare_dataloader(data_bundle, batch_size=32)

# Prepare for training
from fastNLP import Trainer, Accuracy, LoadBestModelCallback, TorchWarmupCallback, Event
callbacks = [
    TorchWarmupCallback(warmup=0.1, schedule='linear'),   # adjust the learning rate during training
    LoadBestModelCallback()  # load the best-performing model after training ends
]
# Hook custom operations into particular moments of training. Different events expose different arguments; see the documentation of Trainer.on for the arguments available at each event.
@Trainer.on(Event.on_before_backward())
def print_loss(trainer, outputs):
    if trainer.global_forward_batches % 10 == 0:  # print the loss every 10 batches
        print(outputs.loss.item())

trainer = Trainer(model=model, train_dataloader=dls['train'], optimizers=optimizer,
                  device=0, evaluate_dataloaders=dls['dev'], metrics={'acc': Accuracy()},
                  callbacks=callbacks, monitor='acc#acc', n_epochs=5,
                  # Accuracy's update() takes two arguments, pred and target; the mappings below supply the matching fields.
                  evaluate_input_mapping={'labels': 'target'},  # during evaluation, rename the labels fed to the model to target
                  evaluate_output_mapping={'logits': 'pred'}  # during evaluation, rename the logits in the model output to pred
                  )
trainer.run()

# Evaluate on the test set
from fastNLP import Evaluator
evaluator = Evaluator(model=model, dataloaders=dls['test'], metrics={'acc': Accuracy()},
                      # Accuracy's update() takes two arguments, pred and target; the mappings below supply the matching fields.
                      output_mapping={'logits': 'pred'},
                      input_mapping={'labels': 'target'})
evaluator.run()

For more details, see the following links:

Quick Start

Detailed Tutorials

Paddle: below is a text-classification example using paddle. It requires paddle>=2.2.0 and paddlenlp>=2.3.3.
from fastNLP.io import ChnSentiCorpLoader
from functools import partial

# The data is downloaded automatically; as documented, the returned dataset contains the "raw_chars" and "target" fields
data_bundle = ChnSentiCorpLoader().load()

# Tokenize the data with the tokenizer
from paddlenlp.transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm')
tokenize = partial(tokenizer, max_length=256)  # cap the maximum sequence length
data_bundle.apply_field_more(tokenize, field_name='raw_chars', num_proc=4)  # adds "input_ids", "attention_mask", etc. as new fields of the dataset
data_bundle.apply_field(int, field_name='target', new_field_name='labels')  # apply int to each target and store the result in a new labels field
print(data_bundle.get_dataset('train')[:4])

# Initialize the model
from paddlenlp.transformers import BertForSequenceClassification, LinearDecayWithWarmup
from paddle import optimizer, nn
class SeqClsModel(nn.Layer):
    def __init__(self, model_checkpoint, num_labels):
        super(SeqClsModel, self).__init__()
        self.num_labels = num_labels
        self.bert = BertForSequenceClassification.from_pretrained(model_checkpoint, num_classes=num_labels)  # size the classification head to num_labels

    def forward(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        logits = self.bert(input_ids, token_type_ids, position_ids, attention_mask)
        return logits

    def train_step(self, input_ids, labels, token_type_ids=None, position_ids=None, attention_mask=None):
        logits = self(input_ids, token_type_ids, position_ids, attention_mask)
        loss_fct = nn.CrossEntropyLoss()
        loss = loss_fct(logits.reshape((-1, self.num_labels)), labels.reshape((-1, )))
        return {
            "logits": logits,
            "loss": loss,
        }
    
    def evaluate_step(self, input_ids, token_type_ids=None, position_ids=None, attention_mask=None):
        logits = self(input_ids, token_type_ids, position_ids, attention_mask)
        return {
            "logits": logits,
        }

model = SeqClsModel('hfl/chinese-bert-wwm', num_labels=2)

# Prepare the dataloaders
from fastNLP import prepare_dataloader
dls = prepare_dataloader(data_bundle, batch_size=16)

# Adjust the learning rate during training.
scheduler = LinearDecayWithWarmup(2e-5, total_steps=20 * len(dls['train']), warmup=0.1)
optimizer = optimizer.AdamW(parameters=model.parameters(), learning_rate=scheduler)

# Prepare for training
from fastNLP import Trainer, Accuracy, LoadBestModelCallback, Event
callbacks = [
    LoadBestModelCallback()  # load the best-performing model after training ends
]
# Hook custom operations into particular moments of training. Different events expose different arguments; see the documentation of Trainer.on for the arguments available at each event.
@Trainer.on(Event.on_before_backward())
def print_loss(trainer, outputs):
    if trainer.global_forward_batches % 10 == 0:  # print the loss every 10 batches
        print(outputs["loss"].item())

trainer = Trainer(model=model, train_dataloader=dls['train'], optimizers=optimizer,
                  device=0, evaluate_dataloaders=dls['dev'], metrics={'acc': Accuracy()},
                  callbacks=callbacks, monitor='acc#acc',
                  # Accuracy's update() takes two arguments, pred and target; the mappings below supply the matching fields.
                  evaluate_output_mapping={'logits': 'pred'},
                  evaluate_input_mapping={'labels': 'target'}
                  )
trainer.run()

# Evaluate on the test set
from fastNLP import Evaluator
evaluator = Evaluator(model=model, dataloaders=dls['test'], metrics={'acc': Accuracy()},
                      # Accuracy's update() takes two arguments, pred and target; the mappings below supply the matching fields.
                      output_mapping={'logits': 'pred'},
                      input_mapping={'labels': 'target'})
evaluator.run()

For more details, see the following links:

Quick Start

Detailed Tutorials

The oneflow and jittor backends are also supported; their examples follow the same pattern as the pytorch and paddle ones above.

Project Structure

The project structure of fastNLP is as follows:

fastNLP the open-source natural language processing library
fastNLP.core implements the core functionality, including data-processing components, the trainer, the tester, etc.
fastNLP.models implements a number of complete neural network models
fastNLP.modules implements many components for building neural network models
fastNLP.embeddings implements conversion of index sequences into vector sequences, including loading pretrained embeddings
fastNLP.io implements I/O, including data loading and preprocessing, model saving and loading, and automatic download of data and models

fastnlp's People

Contributors

2017alan, augc000, chenkaiyu1997, dqwang122, fengziyjun, fftyyy, gosicfly, h00jiang, hy-struggle, keezen, kunyaa, leesureman, letianlee, linzehui, lxr-tech, lyhuang18, morningforest, nlpqq, rogerdjq, srwyg, violetyao, willqvq, wlhgtc, x54-729, xiaoxiong-liu, xpqiu, xuyige, xyltt, yhcc, zide05


fastnlp's Issues

The tutorial Jupyter notebook fails to open

The error is:
NotJSONError("Notebook does not appear to be JSON: '\\n\\n\\n\\n\\n\\n<!DOCTYPE html>\\n<html lang...",)

[screenshot attached in the original issue]

SelfAttention only masks padding when the pad index is 0

In line 66 of self_attention.py, a position is masked whenever input_origin == 0. If the user did not assign 0 as the padding index, this will produce wrong results.
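A minimal sketch of a more robust alternative (an illustrative helper, not existing fastNLP code): pass the pad index in instead of hard-coding 0.

def build_mask(input_origin, pad_idx):
    # True for real tokens, False for padding; works element-wise on a torch
    # tensor and uses the user's actual pad index instead of assuming 0
    return input_origin != pad_idx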

We need to separate the test code and the example code

Is your feature request related to a problem? Please describe.
We need to separate the test code from the example code.

Describe the solution you'd like
Move them into different directories; the 'test' directory should be used only for unit/functional tests.

A question about the multi-process prefetch in fastNLP.core.batch

While reading the source of core/batch.py I ran into some questions and would appreciate an explanation.

batch.py defines a global variable _python_is_exit and registers it with atexit. But in a multi-process program, a child process only copies the parent's global values; modifying a global in the parent (or child) does not change its value in the child (or parent). In other words, globals cannot be used for inter-process communication, and an atexit registration applies only to the current process.
The static method _run_fetch contains this logic: once the queue is Full, if _python_is_exit is true, return None. My question is: since _python_is_exit cannot be modified by hand before the program exits, i.e. it stays False, under what circumstances would this branch ever trigger?
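To illustrate the point about globals across processes, here is a minimal self-contained sketch (generic Python, not fastNLP code): the child never observes the parent's later assignment.

import multiprocessing as mp
import time

_python_is_exit = False  # each process ends up with its own copy of this global

def worker():
    time.sleep(0.5)
    # With the 'fork' start method this prints the value copied at fork time;
    # with 'spawn' it is re-initialized at module import. Either way the
    # parent's later assignment is invisible here, so this prints False.
    print('child sees:', _python_is_exit)

if __name__ == '__main__':
    p = mp.Process(target=worker)
    p.start()
    _python_is_exit = True  # modifies only the parent's copy
    p.join()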

Proposal: Tensorboard for FastNLP

Tensorboard is a good way to monitor (or record) what happens during training, including the loss, the learning rate, model weights, and any evaluation metric. FastNLP is expected to provide a comprehensive view of the training process, and support for tensorboard would be a distinguishing feature.

There are many open-source projects showing how to use Tensorboard with pytorch. However, FastNLP should minimize its reliance on third-party packages. That is why I suggest developing such a thing ourselves.

reference:
https://github.com/lanpa/tensorboardX
https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/04-utils/tensorboard
https://github.com/torrvision/crayon
https://github.com/TeamHG-Memex/tensorboard_logger
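As a starting point, here is a minimal sketch of what such support could look like, built on torch.utils.tensorboard, which ships inside pytorch and so adds no third-party dependency (the callback class and hook names below are hypothetical, not an existing FastNLP API):

from torch.utils.tensorboard import SummaryWriter

class TensorboardCallback:
    def __init__(self, log_dir='runs/exp'):
        self.writer = SummaryWriter(log_dir)

    def on_step_end(self, step, loss, lr):
        # record scalars each step; the same writer can also log weight histograms
        self.writer.add_scalar('train/loss', loss, step)
        self.writer.add_scalar('train/lr', lr, step)

    def on_train_end(self):
        self.writer.close()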

Proposal changes of class names

Since the trainer, tester, and preprocessor have no subclasses, there is no need to call them base classes.

BaseTrainer ----> Trainer
BaseTester  ----> Tester
BasePreprocess ----> Preprocess

Suggest a change to the sequence labeling model's forward

current:

                                      truth
x ----(forward)----> y  ---(crf)----> y_t ------> loss
                    |---(viterbi)---> prediction
    def forward(self, word_seq, word_seq_origin_len):
        """
        :param word_seq: LongTensor, [batch_size, max_len]
        :param word_seq_origin_len: LongTensor, [batch_size,], the original lengths of the sequences.
        :return y: [batch_size, max_len, tag_size]
        """
    def loss(self, x, y):
        """
        Negative log likelihood loss.
        :param x: Tensor, [batch_size, max_len, tag_size]
        :param y: Tensor, [batch_size, max_len]
        :return loss: a scalar Tensor

        """
    def prediction(self, x):
        """
        :param x: FloatTensor, [batch_size, max_len, tag_size]
        :return prediction: list of [decode path(list)]
        """

changes:

If only x is given, output prediction.
x ----(forward)-----(viterbi)---> prediction

If x and ground truth are given, output loss.
x -----(forward)----(crf)-->  loss
truth
    def forward(self, word_seq, word_seq_origin_len, truth=None):
        """
        :param word_seq: LongTensor, [batch_size, max_len]
        :param word_seq_origin_len: LongTensor, [batch_size,], the original lengths of the sequences.
        :param truth: if None, return prediction; otherwise return loss.
        """

Import error after installing fastNLP

After installing fastNLP under Python 3.5, the installation reports success, but import fastNLP fails with
File "D:\anaconda\lib\site-packages\fastNLP\core\instance.py", line 40
f" type={(str(type(self.fields[field_name]))).split(s)[1]}" for field_name in self.fields) + "}"
^
SyntaxError: invalid syntax
Python 3.6 and Python 3.7 do not work either: installation succeeds, but the import raises an error.

The delete_instance method of the DataSet class raises an error

from fastNLP import DataSet
dataset = DataSet({'a': list(range(-5, 5))})
dataset.delete_instance(3)


AttributeError Traceback (most recent call last)
in
1 from fastNLP import DataSet
2 dataset = DataSet({'a': list(range(-5, 5))})
----> 3 dataset.delete_instance(3)

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/fastNLP/core/dataset.py in delete_instance(self, index)
482 else:
483 for field in self.field_arrays.values():
--> 484 field.pop(index)
485
486 def delete_field(self, field_name):

AttributeError: 'FieldArray' object has no attribute 'pop'

TypeError: to() got an unexpected keyword argument 'non_blocking'

Instantiate a Trainer with the model and data, then train.

First fit on test_data (to make sure the model implementation is correct):

copy_model = deepcopy(model)
overfit_trainer = Trainer(model=copy_model, train_data=test_data, dev_data=test_data,
loss=loss,
metrics=metric,
save_path=None,
batch_size=32,
n_epochs=5)
overfit_trainer.train()

TypeError Traceback (most recent call last)
in
7 save_path=None,
8 batch_size=32,
----> 9 n_epochs=5)
10 overfit_trainer.train()

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/trainer.py in init(self, train_data, model, loss, metrics, n_epochs, batch_size, print_every, validate_every, dev_data, save_path, optimizer, check_code_level, metric_key, sampler, prefetch, use_tqdm, use_cuda, callbacks)
103 _check_code(dataset=train_data, model=model, losser=losser, metrics=metrics, dev_data=dev_data,
104 metric_key=metric_key, check_level=check_code_level,
--> 105 batch_size=min(batch_size, DEFAULT_CHECK_BATCH_SIZE))
106
107 self.train_data = train_data

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/trainer.py in _check_code(dataset, model, losser, metrics, batch_size, dev_data, metric_key, check_level)
436 batch = Batch(dataset=dataset, batch_size=batch_size, sampler=SequentialSampler())
437 for batch_count, (batch_x, batch_y) in enumerate(batch):
--> 438 _move_dict_value_to_device(batch_x, batch_y, device=model_devcie)
439 # forward check
440 if batch_count==0:

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/utils.py in _move_dict_value_to_device(device, non_blocking, *args)
203 for key, value in arg.items():
204 if isinstance(value, torch.Tensor):
--> 205 arg[key] = value.to(device, non_blocking=non_blocking)
206 else:
207 raise TypeError("Only support dict type right now.")

TypeError: to() got an unexpected keyword argument 'non_blocking'

Train on train_data and validate on test_data:

trainer = Trainer(model=model, train_data=train_data, dev_data=test_data,
loss=CrossEntropyLoss(pred="output", target="label_seq"),
metrics=AccuracyMetric(pred="predict", target="label_seq"),
save_path=None,
batch_size=32,
n_epochs=5)
trainer.train()
print('Train finished!')

TypeError Traceback (most recent call last)
in
5 save_path=None,
6 batch_size=32,
----> 7 n_epochs=5)
8 trainer.train()
9 print('Train finished!')

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/trainer.py in init(self, train_data, model, loss, metrics, n_epochs, batch_size, print_every, validate_every, dev_data, save_path, optimizer, check_code_level, metric_key, sampler, prefetch, use_tqdm, use_cuda, callbacks)
103 _check_code(dataset=train_data, model=model, losser=losser, metrics=metrics, dev_data=dev_data,
104 metric_key=metric_key, check_level=check_code_level,
--> 105 batch_size=min(batch_size, DEFAULT_CHECK_BATCH_SIZE))
106
107 self.train_data = train_data

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/trainer.py in _check_code(dataset, model, losser, metrics, batch_size, dev_data, metric_key, check_level)
436 batch = Batch(dataset=dataset, batch_size=batch_size, sampler=SequentialSampler())
437 for batch_count, (batch_x, batch_y) in enumerate(batch):
--> 438 _move_dict_value_to_device(batch_x, batch_y, device=model_devcie)
439 # forward check
440 if batch_count==0:

/anaconda2/envs/python3.6/lib/python3.6/site-packages/fastNLP/core/utils.py in _move_dict_value_to_device(device, non_blocking, *args)
203 for key, value in arg.items():
204 if isinstance(value, torch.Tensor):
--> 205 arg[key] = value.to(device, non_blocking=non_blocking)
206 else:
207 raise TypeError("Only support dict type right now.")

TypeError: to() got an unexpected keyword argument 'non_blocking'

Proposal for a concise data flow

A concise data flow:

dataset.loadCoNLL()            # load data from raw text
dataset.preprocess()
train, test = dataset.split()  # optional for data without a dev set; this should extract the split from preprocess
dataset.save_to_pickle()
dataset.load_from_pickle()     # load the preprocessed data

# code for model construction
model.fit(train, hyperparameters)
model.evaluate(test)

@FengZiYjun

Two bugs

Describe the bug
[epoch: 1 step: 15623] train loss: 1.4 time: 0:16:43
[epoch: 1 step: 15624] train loss: 1.4 time: 0:16:43
Traceback (most recent call last):
File "main.py", line 55, in
trainer.train(model,train_data , dev_data)
File "/home/sjwang/anaconda3/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 142, in train
if self.save_best_dev and self.best_eval_result(validator):
File "/home/sjwang/anaconda3/lib/python3.6/site-packages/fastNLP/core/trainer.py", line 373, in best_eval_result
_, _, accuracy = validator.metrics()
File "/home/sjwang/anaconda3/lib/python3.6/site-packages/fastNLP/core/tester.py", line 254, in metrics
y_prob, y_true = zip(*self.eval_history)
ValueError: not enough values to unpack (expected 2, got 0)

When I test the self-attention model I hit this bug: during testing it tries to compare the current test accuracy with the history, but the history list is empty, even though I have already set save_output = true and save_loss = true.

The second bug occurs during initialization: the config_loader reports
"cannot load attribute epochs in section train", although my config file contains the line "epoches = 30".
The program does not crash on this line, and it runs two epochs before entering the test module.

A new function for argparse

We should provide a function for argument parsing so that we can support invocations like "python fastnlp.py --arg1 value1 --arg2 value2" and so on.

If we go this way, what arguments should we have?
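A minimal sketch using the standard-library argparse (the argument names below are hypothetical, just to illustrate the shape of such a helper):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description='fastNLP command line')
    parser.add_argument('--task', type=str, required=True, help='e.g. pos_tag, cws, ner')
    parser.add_argument('--train_path', type=str, default=None, help='path to the training data')
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--n_epochs', type=int, default=10)
    return parser.parse_args()

# usage: python fastnlp.py --task pos_tag --batch_size 16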

Suggestions about dataset and datasetloader.

  1. The dataset class should be enough for all tasks; I don't understand why its sub-classes exist. Instead, the different preprocessing methods can be implemented in the dataset class.

  2. The dataset loader should return a 'dataset' object consisting of instances. Since the fields of an instance depend on the raw data format, some of the code in dataset.convert() should be moved into the dataset loader.

When will the full star-transformer code be released? The experiments cannot be reproduced at all; results on SST-5 are 6 points off

To Reproduce
Using your star-transformer code, trained with allennlp (GloVe 42B word vectors), the final result (shown in the screenshot) is 6 points below the number reported in the paper.

Please explain, and please release the complete version of the code, i.e. one that fully reproduces the results.

[screenshot attached in the original issue]

Wrong arguments when calling the STSeqLabel model

Running the example code under Usage in the README raises an error:
model = STSeqLabel(vocab_size=10000, num_cls=50, emb_dim=300)
Error: TypeError: __init__() got an unexpected keyword argument 'vocab_size'
Has anyone run into the same problem, and how can it be solved?

ModuleNotFoundError:

When using fastnlp, the very first import raises an error, and I have not found the cause.
Below is the error I get when running advance_tutorial locally.
What could be the reason?
// the fastnlp version is the latest, 0.4.0


ModuleNotFoundError Traceback (most recent call last)
in
1 # 声明部件
2 import torch
----> 3 import fastNLP
4 from fastNLP import DataSet
5 from fastNLP import Instance

d:\program\python\python3.7\lib\site-packages\fastNLP\__init__.py in
56
57 from .core import *
---> 58 from . import models
59 from . import modules

d:\program\python\python3.7\lib\site-packages\fastNLP\models\__init__.py in
25 ]
26
---> 27 from .base_model import BaseModel
28 from .bert import BertForMultipleChoice, BertForQuestionAnswering, BertForSequenceClassification,
29 BertForTokenClassification

d:\program\python\python3.7\lib\site-packages\fastNLP\models\base_model.py in
1 import torch
2
----> 3 from ..modules.decoder.mlp import MLP
4
5

d:\program\python\python3.7\lib\site-packages\fastNLP\modules\__init__.py in
48
49 from . import aggregator
---> 50 from . import decoder
51 from . import encoder
52 from .aggregator import *

d:\program\python\python3.7\lib\site-packages\fastNLP\modules\decoder\__init__.py in
6 ]
7
----> 8 from .crf import ConditionalRandomField
9 from .mlp import MLP
10 from .utils import viterbi_decode

ModuleNotFoundError: No module named 'fastNLP.modules.decoder.crf'

Proposal to abolish the use of a dictionary as the argument passed to the trainer

def __init__(self, train_args):

To make the code more consistent and debuggable, I propose we abolish passing a dictionary as the argument to the trainer class. We can use **kwargs instead, with argument checking. Default argument values can be stored in a template dictionary.

def foo(**kwargs):

    tmplt = {'arg1': 'value1', 'arg2': 'value2', 'arg3': 'value3', 'arg4': 'value4', 'arg5': 'value5'}
    for k in kwargs:
        if k not in tmplt:
            raise ValueError('Argument error: %s'%k)
        if type(tmplt[k]) != type(kwargs[k]):
            raise ValueError('Argument %s type mismatch: expected %s while get %s' %
                             (k, type(tmplt[k]), type(kwargs[k])))
        tmplt[k] = kwargs[k]
    print(tmplt)

When called,

foo(arg1='hello', arg3='there')

foo(arg1='hello', arg_error='there')

foo(arg1='hello', arg3=1)

The corresponding results would be,

{'arg1': 'hello', 'arg2': 'value2', 'arg3': 'there', 'arg4': 'value4', 'arg5': 'value5'}

The second call fails because of an invalid argument name:

Traceback (most recent call last):
  File "test.py", line 16, in <module>
    foo(arg1='hello', arg_error='there')
  File "test.py", line 6, in foo
    raise ValueError('Argument error: %s'%k)
ValueError: Argument error: arg_error

The third call uses the valid argument name arg3 but passes a value of the wrong type (an integer):

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    foo(arg1='hello', arg3=1)
  File "test.py", line 9, in foo
    (k, type(tmplt[k]), type(kwargs[k])))
ValueError: Argument arg3 type mismatch: expected <type 'str'> while get <type 'int'>

Users will know exactly what is going on when they pass a wrong argument name or a wrong argument type. And from the user's perspective, it makes user code more consistent and elegant.

Source code for Style transformer

Is your feature request related to a problem? Please describe.
Hi, it was mentioned in https://arxiv.org/abs/1905.05621 that the source code will be made available some time in the future. Is there a timeline for the release?

Describe the solution you'd like
I am interested in seeing the released code so that I could do some experiments with it.

Additional context
The paper was accepted to the upcoming ACL 2019. Will we be able to see the code before the event? Thanks a lot.

Re-train all models !!!

Since we added a Vocabulary class to represent dictionaries in previous commits, word2id.pkl and class2id.pkl are no longer Python dicts but Vocabulary instances. They were released with the models and are used by the high-level interface.
Therefore, we need to re-train all the models, including CWS, POS tagging, and NER.
Also, this is a good chance to set up a complete model information record here and create a "script" folder of training scripts.

An Exception found in the tutorial jupyter notebook.

Bug Description

The ImportError: IntProgress not found was triggered when running a cell in https://github.com/fastnlp/fastNLP/blob/master/tutorials/fastnlp_1_minute_tutorial.ipynb

Reproduction

1. Use jupyter to launch https://github.com/fastnlp/fastNLP/blob/master/tutorials/fastnlp_1_minute_tutorial.ipynb, then run each cell from top to bottom.
2. When you reach cell [8]:

from fastNLP import Trainer, CrossEntropyLoss, AccuracyMetric
trainer = Trainer(model=model, 
                  train_data=train_data, 
                  dev_data=dev_data,
                  loss=CrossEntropyLoss(),
                  metrics=AccuracyMetric()
                  )
trainer.train()
print('Train finished!')

It produced:

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/tqdm/_tqdm_notebook.py in status_printer(_, total, desc, ncols)
    102             if total:
--> 103                 pbar = IntProgress(min=0, max=total)
    104             else:  # No total? Show info style bar with no progress tqdm status

NameError: name 'IntProgress' is not defined

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-8-4b34d005949c> in <module>
      6                   metrics=AccuracyMetric()
      7                   )
----> 8 trainer.train()
      9 print('Train finished!')

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/fastNLP/core/trainer.py in train(self)
    163                 self._summary_writer = SummaryWriter(path)
    164             if self.use_tqdm:
--> 165                 self._tqdm_train()
    166             else:
    167                 self._print_train()

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/fastNLP/core/trainer.py in _tqdm_train(self)
    177         total_steps = data_iterator.num_batches*self.n_epochs
    178         epoch = 1
--> 179         with tqdm(total=total_steps, postfix='loss:{0:<6.5f}', leave=False, dynamic_ncols=True) as pbar:
    180             ava_loss = 0
    181             for epoch in range(1, self.n_epochs+1):

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/tqdm/_tqdm_notebook.py in __init__(self, *args, **kwargs)
    210         # Replace with IPython progress bar display (with correct total)
    211         self.sp = self.status_printer(
--> 212             self.fp, self.total, self.desc, self.ncols)
    213         self.desc = None  # trick to place description before the bar
    214 

~/anaconda3/envs/fastnlp/lib/python3.6/site-packages/tqdm/_tqdm_notebook.py in status_printer(_, total, desc, ncols)
    109             # #187 #451 #558
    110             raise ImportError(
--> 111                 "IntProgress not found. Please update jupyter and ipywidgets."
    112                 " See https://ipywidgets.readthedocs.io/en/stable"
    113                 "/user_install.html")

ImportError: IntProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

Speculation

The library ipywidgets is missing; after running "pip install ipywidgets" the bug is fixed. (Since the Travis CI does not exercise the jupyter notebooks, the exception is not detected by CI.)

Proposal

Add the lib ipywidgets in https://github.com/fastnlp/fastNLP/blob/master/requirements.txt


A proposal for path join.

ModelSaver(self.pickle_path + model_name).save_pytorch(network)

It may be better to use os.path.join(self.pickle_path, model_name), which is more robust.

For example, os.path.join('save', 'a.txt') and os.path.join('save/', 'a.txt') both return save/a.txt, while 'save' + 'a.txt' returns 'savea.txt', which is not expected.

Improvement of to_index and to_word

When a word-index Vocabulary is built, some words in the test set are not present in the vocabulary, so I chose to add a new parameter use_unknown to control the behavior. If we want the index of '<unk>' to be returned for a word that is not in the Vocabulary, set self.use_unknown=True and Vocabulary.to_index will return the index of '<unk>'.
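A minimal sketch of the proposed behavior (illustrative, following the issue's use_unknown flag rather than the actual fastNLP implementation):

class Vocabulary:
    def __init__(self, use_unknown=False):
        self.word2idx = {'<unk>': 0}
        self.use_unknown = use_unknown

    def to_index(self, word):
        if word in self.word2idx:
            return self.word2idx[word]
        if self.use_unknown:
            return self.word2idx['<unk>']  # fall back to the unknown token
        raise KeyError('word %r is not in the vocabulary' % word)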

Analyze the current fastNLP and discuss future work

Up to now, fastNLP has become a semi-finished product. I think we should analyze the current fastNLP, including its structure, documentation, code, and so on, and discuss what we need to change and what our future work should be.

Implementation details for AutoML

A few questions about fastNLP's AutoML:

  1. My current idea is to make AutoML a model inside fastNLP.models, and when AutoModel.fit is called, to train it with an AutoML trainer rather than the normal trainer. So I would mainly add an AutoMLTrainer class on the Trainer side and an AutoMLModel class on the Model side. Is that acceptable?

  2. If we implement it by adding a model and a trainer, how should the Dataset handling change? For example, searching over whether to add character-level features, parsing features, or other text-level features?

  3. If the user only needs fastNLP.FastNLP.fit(X_data_file, Y_data_file), would that be more user-friendly? But would it then sit awkwardly with the main fastNLP framework?

  4. AutoKeras's overall design mainly uses Bayesian optimization to search CNN architectures. Shouldn't ours search over more: hyperparameters, whether to add character features and other parsing features, which pretrained models and embeddings to use, what kind of CNN/LSTM/Transformer/CRF structure and depth to choose, and so on? Shouldn't our AutoML search all of these to guarantee its usefulness?

Matrix initialization issue in embed_loader.py

Describe the bug
In lines 149-161 of embed_loader.py, the std and mean used for initialization are not those of the embedding vectors but those of the random numbers.

Suggested fix
Swap lines 149-161 with lines 163-165.
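A minimal sketch of the intended order of operations (variable names are illustrative, not the actual embed_loader.py code): compute the statistics from the rows actually loaded from the pretrained file, then fill the missing rows from those statistics.

import numpy as np

def fill_missing_rows(matrix, found_mask):
    # found_mask[i] is True when row i was loaded from the pretrained file
    found = matrix[found_mask]
    mean, std = found.mean(), found.std()  # statistics of the real embeddings, not of random numbers
    matrix[~found_mask] = np.random.normal(mean, std, size=matrix[~found_mask].shape)
    return matrix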

A warning from torch about LSTM

UserWarning: ONNX export failed on LSTM because batch_first not supported
It happens when sequence labeling models are used.
I can't find any solution to this warning.

i guess we need to add something to help the user debug

i find a bug ,that is label's dictionary is wrong ,however my IDE tell me there is something wrong with my loss. and it take me some hours to read all the code to check it .

so i think can there add some module,like assert function in python ,to avoid it .
and in dataprocess's error should not occur in the trainner part.

the wrong line number is 243.

thank you.

Default value for train args.

self.validate = train_args["validate"]
self.save_best_dev = train_args["save_best_dev"]
self.model_saved_path = train_args["model_saved_path"]
self.use_cuda = train_args["use_cuda"]

Should we set default values for train_args? Otherwise we have to pass all of these args every time, which is very redundant.
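A minimal sketch of one way to do this (the default values below are illustrative): keep a dictionary of defaults and let the user's arguments override it.

DEFAULT_TRAIN_ARGS = {
    'validate': True,
    'save_best_dev': True,
    'model_saved_path': './save/',
    'use_cuda': False,
}

def resolve_train_args(train_args):
    args = dict(DEFAULT_TRAIN_ARGS)  # start from the defaults
    args.update(train_args)          # user-supplied values take precedence
    return args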

Passing 2-D data into the trainer

Problem
My input is a 2-D matrix: each column represents one sentence, and each matrix represents one document. I pad at the document level (extending sentences of different lengths to vectors of the same length), but because documents differ in their number of sentences and in their longest sentence, the 2-D representations of different documents have different sizes. So when a batch is passed in, all the 2-D tensors in that batch need to be padded again; fastNLP's trainer does not seem to support this yet (see the sketch after the environment info below).

Bug
While debugging I found that after the Trainer calls forward, the data passed in is not a 3-D tensor but a 2-D one; to be precise, only 2 sentences of a single document are ever passed in.

Desktop

  • OS: Windows 10, CPU
  • numpy: 1.14.2
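A minimal sketch of the kind of batch-level padding being requested (plain pytorch; illustrative, not an existing fastNLP API):

import torch

def pad_2d_batch(docs, pad_value=0):
    # docs: a list of LongTensors, each of shape [n_sentences_i, sent_len_i]
    max_sents = max(d.size(0) for d in docs)
    max_len = max(d.size(1) for d in docs)
    batch = torch.full((len(docs), max_sents, max_len), pad_value, dtype=docs[0].dtype)
    for i, d in enumerate(docs):
        batch[i, :d.size(0), :d.size(1)] = d
    return batch  # [batch_size, max_sents, max_len]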

Could you provide README files in English, too?

