padeoe / cail2019
CAIL 2019 Similar Case Matching track, 2nd-place solution (dataset and documentation included); champion team of the CAIL 2020/2021 judicial exam track
Home Page: https://padeoe.com/cail-2019
License: Apache License 2.0
After running data.py you get a data folder containing the subfolders raw, test, and train. In train.py there is:
# TRAINING_DATASET = 'data/train/input.txt' # for quick dev
TRAINING_DATASET = "data/raw/CAIL2019-SCM-big/SCM_5k.json"
Does the choice of TRAINING_DATASET affect model training? What is the difference between ./data/train/input.txt and the original SCM_5k.json?
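One way to answer this kind of question yourself is to inspect both files directly. The sketch below is an assumption about the raw format (one JSON object per line with keys A, B, C, and label, as suggested by the issue threads here), not a confirmed description of the repo's files:

```python
import json

def parse_raw(line):
    """Parse one assumed JSON-lines record of SCM_5k.json into (A, B, C, label)."""
    d = json.loads(line)
    return d["A"], d["B"], d["C"], d.get("label", "B")

# Hypothetical sample mimicking one raw record.
raw_line = json.dumps({"A": "案情A", "B": "案情B", "C": "案情C", "label": "B"},
                      ensure_ascii=False)
print(parse_raw(raw_line)[3])  # the raw file keeps an explicit label field
```

Comparing the parsed records against the preprocessed lines in data/train/input.txt would show what data.py changes (splitting, label normalization, or augmentation).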
As the title says.
Hello! How can I download a PyTorch version of the BERT pretrained model? Many thanks!
Hello, I tried training yesterday. Without changing anything, the final accuracy was 0.56. From your GitHub page I noticed that data augmentation was not enabled.
After enabling data augmentation, the evaluate stage fails with the following error:
Epoch 1/2, Loss 0.1747398: 100%|████████████| 1865/1865 [21:49<00:00, 1.42it/s]
5964 1020 (these two numbers are values I printed, corresponding to len(predict_result) and len(real_label_list) respectively)
Traceback (most recent call last):
File "train.py", line 60, in
trainer.train(MODEL_DIR, 1)
File "cail2019-master/model.py", line 573, in train
acc, loss = self.evaluate(model, test_data, test_label_list)
File "cail2019-master/model.py", line 648, in evaluate
assert len(predict_result) == len(real_label_list)
AssertionError
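A plausible cause of this assert (a guess, not a diagnosis of the repo's code): if augmentation expands each evaluation triple into several model inputs, predict_result grows while real_label_list does not. One fix is to either disable augmentation at evaluation time, or collapse the augmented predictions back to one vote per original triple, as sketched below with a hypothetical helper:

```python
def aggregate(predictions, k):
    """Collapse k augmented predictions per original triple into one majority vote."""
    assert len(predictions) % k == 0
    merged = []
    for i in range(0, len(predictions), k):
        group = predictions[i:i + k]
        merged.append(max(set(group), key=group.count))  # majority label
    return merged

augmented_preds = ["B", "B", "C", "B", "C", "C"]  # 2 triples x k=3 augmented views
labels = ["B", "C"]
collapsed = aggregate(augmented_preds, k=3)
assert len(collapsed) == len(labels)  # the length check now passes
```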
"We tried computing a soft label consistent with the triplet ordering, and required this soft label to stay close to the original BERT model's predictions." Hi, could you explain this in more detail?
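A speculative reading of that sentence (my reconstruction, not the authors' exact formulation): derive a soft target from the triplet relation sim(A,B) > sim(A,C), e.g. [0.7, 0.3] instead of a hard [1.0, 0.0], then penalize the divergence between this soft target and the model's predicted distribution so the two stay close:

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

soft_label = [0.7, 0.3]      # assumed soft target for a triple labeled "B"
model_probs = [0.65, 0.35]   # hypothetical model output: P(B), P(C)

# The consistency penalty is smaller for the soft target than for a hard one,
# so the model is not pushed all the way to an overconfident prediction.
assert kl_div(soft_label, model_probs) < kl_div([1.0, 0.0], model_probs)
```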
Is the config file your code uses to load the BERT model the TF version or the PyTorch version?
Hello, thanks for sharing your work.
While reading the data augmentation code, I noticed a potential issue: the augmentation assumes B is the positive label, i.e. sim(A, B) > sim(A, C). However, each JSON record in the downloaded dataset carries a "label" field that can be either B or C. Doesn't that conflict with your assumption?
Looking forward to your reply~
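One common way to reconcile the two (a sketch of an assumption, not the repo's actual code): normalize every triple before augmentation by swapping B and C whenever the label is "C", so the invariant sim(A, B) > sim(A, C) always holds afterwards:

```python
def normalize_triple(a, b, c, label):
    """Return (A, B, C) with B always the more-similar candidate document."""
    if label == "C":   # C was the positive; swap so B takes that role
        b, c = c, b
    return a, b, c

print(normalize_triple("docA", "docB", "docC", "C"))  # ('docA', 'docC', 'docB')
```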
I use the already-trained 民事文书BERT, downloaded from https://github.com/thunlp/OpenCLaP, as my pretrained model.
And I set fp16=False.
I trained the model with your code and training went fine.
I copied some information from train.log.
But the prediction accuracy turns out to be 0.53 when I run main.py and judger.py.
Does the saved model have some problem?
Hi, I ran your code with the following settings:
TRAINING_DATASET = 'data/raw/CAIL2019-SCM-big/SCM_5k.json'
BERT_PRETRAINED_MODEL=民事文书BERT
config = {
"max_length": 512,
"epochs": 2,
"batch_size": 12,
"learning_rate": 2e-5,
"fp16": False
}
trainer.train(MODEL_DIR, 5)
The Acc on the test set is 85%.
However, if I set TRAINING_DATASET = 'data/train/input.txt' and trainer.train(MODEL_DIR, 1), the Acc on the test set is 66.6%. I don't know why there is such a big gap between 5-fold and 1-fold.
P.S. In both the 5-fold and 1-fold experiments I copied the original vocab.txt from BERT_PRETRAINED_MODEL to avoid the "vocabulary indices are not consecutive" error.
I suspect that in the 5-fold setting the training set includes many instances from the test set (created by data.py), which is why the test Acc can reach 85%. So I believe the 85% may not be solid.
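The leakage hypothesis is easy to test directly. A minimal sketch with a hypothetical helper (the record representation is assumed, not taken from data.py): load both splits as strings and count exact duplicates.

```python
def overlap_count(train_records, test_records):
    """Number of test records that also appear verbatim in the training set."""
    train_set = set(train_records)
    return sum(1 for r in test_records if r in train_set)

# Toy records standing in for lines of the train/test files.
train = ["case1\tcase2\tcase3", "case4\tcase5\tcase6"]
test = ["case4\tcase5\tcase6", "case7\tcase8\tcase9"]
print(overlap_count(train, test))  # 1 shared record -> evidence of leakage
```

A nonzero count on the real files would confirm that the 85% figure is inflated by train/test overlap.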
Hi, I ran your code and the score was only 0.52, which seems far off. I only changed epoch to 1 and batch_size to 8; everything else was unchanged.
For data labeled C, the heuristic augmentation in model.py is
pd.Series((x["C"], x["A"], x["B"], "C"))
but I believe the correct version should be
pd.Series((x["C"], x["B"], x["A"], "C"))
The heuristic + antisymmetric augmentation has the same problem.
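To make the disputed semantics concrete, here is one abstract reading (my interpretation, not the repo's code): represent each example as (anchor, positive, negative), meaning sim(anchor, positive) > sim(anchor, negative). A raw triple labeled "C" means C is the document more similar to A, and the antisymmetric view simply swaps the two candidates while flipping the label:

```python
def to_anchor_pos_neg(a, b, c, label):
    """Map a labeled (A, B, C) triple to (anchor, positive, negative)."""
    return (a, c, b) if label == "C" else (a, b, c)

def antisymmetric(anchor, pos, neg):
    """Swap positive and negative (the label flips accordingly); by symmetry
    of the similarity function this augmented view is always consistent."""
    return (anchor, neg, pos)

print(to_anchor_pos_neg("A", "B", "C", "C"))  # ('A', 'C', 'B')
```

Writing the augmentation in this form makes it mechanical to check whether a tuple like (C, A, B) or (C, B, A) preserves the intended ordering.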