padeoe / cail2019
CAIL 2019 Similar Case Matching track, 2nd-place solution (dataset and documentation included); champion team of the CAIL 2020/2021 judicial exam track
Home Page: https://padeoe.com/cail-2019
License: Apache License 2.0
After running data.py you get a data folder containing the subfolders raw, test, and train. In train.py there is:
# TRAINING_DATASET = 'data/train/input.txt' # for quick dev
TRAINING_DATASET = "data/raw/CAIL2019-SCM-big/SCM_5k.json"
Does the choice of TRAINING_DATASET affect model training? What is the difference between ./data/train/input.txt and the original SCM_5k.json?
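One way to answer this kind of question yourself is to inspect both files directly. The sketch below is an assumption about the raw format (one JSON object per line with keys A, B, C, and label, as suggested by the issue threads here), not a confirmed description of the repo's files:

```python
import json

def parse_raw(line):
    """Parse one assumed JSON-lines record of SCM_5k.json into (A, B, C, label)."""
    d = json.loads(line)
    return d["A"], d["B"], d["C"], d.get("label", "B")

# Hypothetical sample mimicking one raw record.
raw_line = json.dumps({"A": "案情A", "B": "案情B", "C": "案情C", "label": "B"},
                      ensure_ascii=False)
print(parse_raw(raw_line)[3])  # the raw file keeps an explicit label field
```

Comparing the parsed records against the preprocessed lines in data/train/input.txt would show what data.py changes (splitting, label normalization, or augmentation).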
As the title says.
Hello! How can I download a PyTorch version of the BERT pretrained model? Many thanks!
Hello, I tried training yesterday. Without changing anything, the final accuracy was 0.56. From your GitHub page I noticed that data augmentation was not enabled.
After enabling data augmentation, the evaluate stage fails with the following error:
Epoch 1/2, Loss 0.1747398: 100%|████████████| 1865/1865 [21:49<00:00, 1.42it/s]
5964 1020 (these two numbers are values I printed, corresponding to len(predict_result) and len(real_label_list) respectively)
Traceback (most recent call last):
File "train.py", line 60, in
trainer.train(MODEL_DIR, 1)
File "cail2019-master/model.py", line 573, in train
acc, loss = self.evaluate(model, test_data, test_label_list)
File "cail2019-master/model.py", line 648, in evaluate
assert len(predict_result) == len(real_label_list)
AssertionError
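A plausible cause of this assert (a guess, not a diagnosis of the repo's code): if augmentation expands each evaluation triple into several model inputs, predict_result grows while real_label_list does not. One fix is to either disable augmentation at evaluation time, or collapse the augmented predictions back to one vote per original triple, as sketched below with a hypothetical helper:

```python
def aggregate(predictions, k):
    """Collapse k augmented predictions per original triple into one majority vote."""
    assert len(predictions) % k == 0
    merged = []
    for i in range(0, len(predictions), k):
        group = predictions[i:i + k]
        merged.append(max(set(group), key=group.count))  # majority label
    return merged

augmented_preds = ["B", "B", "C", "B", "C", "C"]  # 2 triples x k=3 augmented views
labels = ["B", "C"]
collapsed = aggregate(augmented_preds, k=3)
assert len(collapsed) == len(labels)  # the length check now passes
```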
"We tried computing a soft label consistent with the triplet ordering, and required this soft label to stay close to the original BERT model's predictions." Hi, could you explain this in more detail?
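A speculative reading of that sentence (my reconstruction, not the authors' exact formulation): derive a soft target from the triplet relation sim(A,B) > sim(A,C), e.g. [0.7, 0.3] instead of a hard [1.0, 0.0], then penalize the divergence between this soft target and the model's predicted distribution so the two stay close:

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

soft_label = [0.7, 0.3]      # assumed soft target for a triple labeled "B"
model_probs = [0.65, 0.35]   # hypothetical model output: P(B), P(C)

# The consistency penalty is smaller for the soft target than for a hard one,
# so the model is not pushed all the way to an overconfident prediction.
assert kl_div(soft_label, model_probs) < kl_div([1.0, 0.0], model_probs)
```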
Is the config file your code uses to load the BERT model the TF version or the PyTorch version?
Hello, thanks for sharing your work.
While reading the data augmentation code, I noticed a potential issue: the augmentation assumes B is the positive label, i.e. sim(A, B) > sim(A, C). However, each JSON record in the downloaded dataset carries a "label" field that can be either B or C. Doesn't that conflict with your assumption?
Looking forward to your reply~
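One common way to reconcile the two (a sketch of an assumption, not the repo's actual code): normalize every triple before augmentation by swapping B and C whenever the label is "C", so the invariant sim(A, B) > sim(A, C) always holds afterwards:

```python
def normalize_triple(a, b, c, label):
    """Return (A, B, C) with B always the more-similar candidate document."""
    if label == "C":   # C was the positive; swap so B takes that role
        b, c = c, b
    return a, b, c

print(normalize_triple("docA", "docB", "docC", "C"))  # ('docA', 'docC', 'docB')
```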
I use the already-trained 民事文书BERT, downloaded from https://github.com/thunlp/OpenCLaP, as my pretrained model.
And I set fp16=False.
I trained the model with your code and training went fine.
I copied some information from train.log.
But the prediction accuracy turns out to be 0.53 when I run main.py and judger.py.
Does the saved model have some problem?
Hi, I ran your code with the following settings:
TRAINING_DATASET = 'data/raw/CAIL2019-SCM-big/SCM_5k.json'
BERT_PRETRAINED_MODEL=民事文书BERT
config = {
"max_length": 512,
"epochs": 2,
"batch_size": 12,
"learning_rate": 2e-5,
"fp16": False
}
trainer.train(MODEL_DIR, 5)
The Acc on the test set is 85%.
However, if I set TRAINING_DATASET = 'data/train/input.txt' and trainer.train(MODEL_DIR, 1), the Acc on the test set is 66.6%. I don't know why there is such a big gap between 5-fold and 1-fold.
P.S. In both the 5-fold and 1-fold experiments I copied the original vocab.txt from BERT_PRETRAINED_MODEL to avoid the "vocabulary indices are not consecutive" error.
I suspect that in the 5-fold setting the training set includes many instances from the test set (created by data.py), which is why the test Acc can reach 85%. So I believe the 85% may not be solid.
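The leakage hypothesis is easy to test directly. A minimal sketch with a hypothetical helper (the record representation is assumed, not taken from data.py): load both splits as strings and count exact duplicates.

```python
def overlap_count(train_records, test_records):
    """Number of test records that also appear verbatim in the training set."""
    train_set = set(train_records)
    return sum(1 for r in test_records if r in train_set)

# Toy records standing in for lines of the train/test files.
train = ["case1\tcase2\tcase3", "case4\tcase5\tcase6"]
test = ["case4\tcase5\tcase6", "case7\tcase8\tcase9"]
print(overlap_count(train, test))  # 1 shared record -> evidence of leakage
```

A nonzero count on the real files would confirm that the 85% figure is inflated by train/test overlap.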
Hi, I ran your code and the score was only 0.52, which seems far off. I only changed epoch to 1 and batch_size to 8; everything else was unchanged.
For data labeled C, the heuristic augmentation in model.py is
pd.Series((x["C"], x["A"], x["B"], "C"))
but I believe the correct version should be
pd.Series((x["C"], x["B"], x["A"], "C"))
The heuristic + antisymmetric augmentation has the same problem.
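To make the disputed semantics concrete, here is one abstract reading (my interpretation, not the repo's code): represent each example as (anchor, positive, negative), meaning sim(anchor, positive) > sim(anchor, negative). A raw triple labeled "C" means C is the document more similar to A, and the antisymmetric view simply swaps the two candidates while flipping the label:

```python
def to_anchor_pos_neg(a, b, c, label):
    """Map a labeled (A, B, C) triple to (anchor, positive, negative)."""
    return (a, c, b) if label == "C" else (a, b, c)

def antisymmetric(anchor, pos, neg):
    """Swap positive and negative (the label flips accordingly); by symmetry
    of the similarity function this augmented view is always consistent."""
    return (anchor, neg, pos)

print(to_anchor_pos_neg("A", "B", "C", "C"))  # ('A', 'C', 'B')
```

Writing the augmentation in this form makes it mechanical to check whether a tuple like (C, A, B) or (C, B, A) preserves the intended ordering.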