
multivers's Issues

Models license

To be able to use your pre-trained models in other projects and research, could you please add an open license for the model files?
I'd prefer Creative Commons because it is very open and allows for modification of your files (e.g., fine-tuning): https://creativecommons.org/choose/

environment error

I am trying to run your repo on Google Colab, but the pinned package versions are no longer supported. I replaced some packages:
!pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
!pip install pytorch-lightning

but some errors still occur. Could you update the requirements.txt file? Thank you.

Adapting dataset for pre-train

Hi @dwadden, I finished reading the MultiVerS paper, and I consider it great work on scientific claim verification.
I also suppose that pretraining on several datasets really matters. Could you provide the code for converting the original datasets into the form used for pretraining?
To be more specific, I want to know how to convert FEVER, EvidenceInference, and PubMedQA from their original form into what you use for pretraining.
Besides, I'm wondering if you could also provide the pretraining code.
Thanks a lot!

About train.py

Hi,@dwadden

I have a question: how can I get

import lib.longformer.data as dm
from lib.longformer.model import SciFactModel

in train.py? I haven't found a module named lib.longformer.

By the way, did you train the model on the SciFact, CovidFact, HealthVer, and FEVER datasets?

Looking forward to your reply.
Best regards

Doubts about inference time

Hi David!
First of all, thanks a lot for sharing all your code and data.
I have a couple of questions.
I'm trying to make inferences on my own data using your models and instructions.
I think it kind of works, but I still have some doubts, especially about the time it takes.
Is it expected that inference on a very small dataset (3 claims with 10 abstracts each) takes about 6 minutes on CPU, or 80 seconds on GPU?
And that's on the second and subsequent runs of predict.py; the first run takes even longer because it's downloading something (a tokenizer? data?). I guess I could debug it to see what's being downloaded (I haven't done that yet), but you'd probably know right away from the size (1.74G).
My question is: are these downloads needed for inference?

And, overall, am I using it correctly?
Here's exactly what I'm doing and my dummy data:
https://colab.research.google.com/drive/1dbY6ybcfezhqgVfN0CfUG_uJAkcU1IOV?usp=sharing

Thanks!

About reproducibility

Hi, David:)
Thanks a lot for sharing your great work!

I am currently struggling to reproduce the results.
I tried finetuning on the covidfact and scifact_20 datasets with one GPU (RTX 3090), but the results differ on each run.

Looking at the metrics.csv file (in the checkpoints folder) generated during finetuning, label_loss, rationale_loss, and loss are recorded differently each time, so it seems that randomness is not controlled during training.

When I looked at the code, I don't think the dataloader is the problem, because it is fixed.
I tried adding code to set the seed (in anaconda3/envs/multivers/lib/python3.8/site-packages/pytorch_lightning/utilities/seed.py), but training is still not reproducible.

I wonder if anything else needs to be modified, or if some parts are difficult to reproduce perfectly.
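One way to control randomness without patching seed.py inside site-packages is to seed everything from the training script itself. Below is a minimal, stdlib-only sketch of what such a helper does; in a real run you would also seed numpy and torch (for example via PyTorch Lightning's `seed_everything(seed, workers=True)` and `Trainer(deterministic=True)`), and note that some CUDA kernels remain nondeterministic regardless:

```python
import os
import random

def seed_everything(seed: int) -> None:
    """Seed every RNG the training loop touches. Stdlib-only sketch;
    a real setup would also seed numpy and torch (CPU and CUDA)."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # In a torch setup, additionally:
    #   numpy.random.seed(seed)
    #   torch.manual_seed(seed)
    #   torch.cuda.manual_seed_all(seed)

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
assert first == second  # identical seed, identical draws
```

If losses still drift between runs after seeding, the remaining nondeterminism usually comes from GPU kernels rather than the dataloader.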
Thank you~~

Release on HF Hub

To make it easier to use the MultiVerS checkpoints, it would be nice to have them published on the Hugging Face Hub.
Ideally, we could then just load the model from the Transformers library. But I'm not sure if that's possible or even intended.

Using your own data

Hey David -

Thanks for this repo (it's very legible and easy to work with; I also like PyTorch Lightning!)

If I want to quickly predict a label for text A given text B, is there a script to get raw text data into the right format? The predict script seems to rely on specific input files for each dataset, but I just want to use it on new text pairs. (The use case is factuality verification of generated scientific abstracts, given the body of the article.) I might have to feed each abstract sentence-by-sentence to keep it in line with the SciFact hypotheses.

Make predictions using the convenience wrapper script [script/predict.sh](https://github.com/dwadden/multivers/blob/main/script/predict.sh). This script accepts a dataset name as an argument, and makes predictions **using the correct input files** and model checkpoints for that dataset.
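For quick experiments, it may be easier to write your text pairs into the same jsonl files the dataset readers consume. The sketch below is hypothetical: the field names (`doc_id`/`title`/`abstract` for the corpus, `id`/`claim`/`doc_ids` for the claims) are an assumption based on the SciFact data format, so check the repo's actual data files before relying on it:

```python
import json

# Hypothetical raw inputs: one claim (text A) and one abstract (text B),
# with the abstract pre-split into sentences as SciFact expects.
corpus = [{
    "doc_id": 1,
    "title": "Generated abstract 1",
    "abstract": [
        "Sentence one of the abstract.",
        "Sentence two of the abstract.",
    ],
}]
claims = [{"id": 1, "claim": "Text A to verify.", "doc_ids": [1]}]

# The dataset readers consume one JSON object per line.
with open("corpus.jsonl", "w") as f:
    f.writelines(json.dumps(doc) + "\n" for doc in corpus)
with open("claims.jsonl", "w") as f:
    f.writelines(json.dumps(c) + "\n" for c in claims)
```

With files in this shape, the existing predict pipeline can be pointed at them instead of the bundled dataset files.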

hparams missing in the longformer_large_science checkpoint?

Hi David!
Thanks a lot for sharing your great work!

I am currently struggling to use the longformer_large_science checkpoint to make predictions on the scifact dataset.
It seems the model checkpoint doesn't have the 'hparams' needed to instantiate the model. I am getting the following error:

model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
TypeError: __init__() missing 1 required positional argument: 'hparams'

I wonder if there is an issue with saving the checkpoint or if additional modifications to the MultiVerSModel class can help run the checkpoint.

Thank you

Training on Custom data

Hi @dwadden,

I'm trying to pretrain your model on my custom data, but something seems wrong: the label_loss stays unchanged while the rationale_loss decreases normally at each iteration. The same thing happens when I run with the FEVER dataset you provided.

I ran with the command from your documentation:
!python script/pretrain.py --datasets fever --gpus=1

I don't know if I missed anything or if there is a problem with the training phase.

Thanks.

Issue with updating state_dict

I cloned from scratch and am encountering this error:

Traceback (most recent call last):
  File "multivers/predict.py", line 109, in <module>
    main()
  File "multivers/predict.py", line 101, in main
    predictions = get_predictions(args)
  File "multivers/predict.py", line 36, in get_predictions
    model = MultiVerSModel.load_from_checkpoint(checkpoint_path=args.checkpoint_path)
  File "/opt/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/opt/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/Users/anna/Desktop/mv/multivers/multivers/model.py", line 86, in __init__
    self.encoder = self._get_encoder(hparams)
  File "/Users/anna/Desktop/mv/multivers/multivers/model.py", line 170, in _get_encoder
    new_state_dict[name] = orig_state_dict[name]
KeyError: 'embeddings.position_ids'

This is from line 168 in model.py: ADD_TO_CHECKPOINT = ["embeddings.position_ids"]

So I tried changing this to: ADD_TO_CHECKPOINT = ["embeddings.position_embeddings.weight"]
since that seemed to be the item missing from the Hugging Face state_dict.

However, I then encountered this error:

Traceback (most recent call last):
  File "multivers/predict.py", line 109, in <module>
    main()
  File "multivers/predict.py", line 101, in main
    predictions = get_predictions(args)
  File "multivers/predict.py", line 36, in get_predictions
    model = MultiVerSModel.load_from_checkpoint(checkpoint_path=args.checkpoint_path)
  File "/opt/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/opt/miniconda3/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 204, in _load_model_state
    model.load_state_dict(checkpoint['state_dict'], strict=strict)
  File "/opt/miniconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for MultiVerSModel:
    Unexpected key(s) in state_dict: "encoder.embeddings.position_ids"

I've been unable to resolve this so far, and it's the same for all checkpoints (e.g., scifact, healthver). Thanks!
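This pattern of errors is consistent with a transformers version mismatch: newer transformers releases stopped saving the `embeddings.position_ids` buffer in state dicts, so the checkpoint and the instantiated encoder disagree on keys. Pinning transformers to the version in requirements.txt is the safest fix; alternatively, the keys can be reconciled before loading. A generic sketch of that reconciliation, with plain dicts standing in for the real state dicts:

```python
def reconcile_state_dict(orig_state_dict, model_keys):
    """Keep only the keys the model expects, and report the mismatch
    in both directions so nothing is dropped silently."""
    adapted = {k: v for k, v in orig_state_dict.items() if k in model_keys}
    missing = sorted(set(model_keys) - set(orig_state_dict))
    unexpected = sorted(set(orig_state_dict) - set(model_keys))
    return adapted, missing, unexpected

# Toy illustration of the mismatch reported in this issue:
checkpoint = {
    "encoder.embeddings.position_ids": [0, 1, 2],
    "encoder.embeddings.word_embeddings.weight": [[0.1]],
}
model_keys = {"encoder.embeddings.word_embeddings.weight"}
adapted, missing, unexpected = reconcile_state_dict(checkpoint, model_keys)
assert unexpected == ["encoder.embeddings.position_ids"]
# model.load_state_dict(adapted, strict=True) would now succeed;
# load_state_dict(..., strict=False) is the blunter alternative, but it
# also silences genuinely missing weights.
```

Either way, checking `missing` and `unexpected` before loading makes it clear whether the only discrepancy is the position_ids buffer or something more serious.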
