
meditron's Introduction

MediTron logo

Meditron is a suite of open-source medical Large Language Models (LLMs).

We release Meditron-7B and Meditron-70B, which are adapted to the medical domain from Llama-2 through continued pretraining on a comprehensively curated medical corpus, including selected PubMed papers and abstracts, a new dataset of internationally-recognized medical guidelines, and a general domain corpus.

Meditron-70B, finetuned on relevant data, outperforms Llama-2-70B, GPT-3.5 and Flan-PaLM on multiple medical reasoning tasks.

Advisory Notice
While Meditron is designed to encode medical knowledge from sources of high-quality evidence, it is not yet adapted to deliver this knowledge appropriately, safely, or within professional actionable constraints. We recommend against using Meditron in medical applications without extensive use-case alignment, as well as additional testing, specifically including randomized controlled trials in real-world practice settings.

Model Details

How to use

You can load the Meditron model directly from the HuggingFace model hub as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("epfl-llm/meditron-70b")
model = AutoModelForCausalLM.from_pretrained("epfl-llm/meditron-70b")
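Continuing the snippet above, here is a minimal, illustrative way to prompt the loaded base model (not an official example from the repository; the generation settings are placeholder values, not recommendations):

import torch

question = "What are the symptoms of anemia?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=128,              # illustrative value
        do_sample=True,
        temperature=0.7,                 # illustrative value
        pad_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))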

Pipeline

Medical Training Data

We release code to download and pre-process the data used to train Meditron.

MediTron’s domain-adaptive pre-training corpus GAP-Replay combines 48.1B tokens from four corpora:

  • Clinical Guidelines: a new corpus of 46K clinical practice guidelines from various healthcare-related sources, including hospitals and international organizations,
  • Paper Abstracts: 16.1M abstracts extracted from closed-access PubMed and PubMed Central papers,
  • Medical Papers: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers,
  • Replay dataset: 400M tokens of general domain pretraining data sampled from RedPajama-v1.

Download instructions

You can download and pre-process the entire GAP-Replay corpus by running ./download.sh in the gap-replay folder.

You can download 36K open-access articles from our Guidelines corpus from the HuggingFace datasets hub.

from datasets import load_dataset

dataset = load_dataset("epfl-llm/guidelines")
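Continuing the snippet above, the returned object can be inspected before use. This is a minimal sketch that assumes the default "train" split; check the dataset card for the actual splits and column names:

print(dataset)                         # available splits and row counts
print(dataset["train"].column_names)   # inspect the schema (assumes a "train" split)
print(dataset["train"][0])             # first guideline record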

You can scrape and clean all 46K guidelines (including closed-access sources) by running ./download.sh in the guidelines folder.

More details can be found in the GAP-Replay documentation.

Training Procedure

We used the Megatron-LLM distributed training library, a derivative of Nvidia's Megatron-LM project, to optimize training efficiency. The hardware consists of 16 nodes, each with 8x NVIDIA A100 (80GB) SXM GPUs connected by NVLink and NVSwitch, a single Nvidia ConnectX-6 DX network card, 2x AMD EPYC 7543 32-core processors, and 512 GB of RAM. The nodes are connected via RDMA over Converged Ethernet.

Our three-way parallelism scheme uses the following (a quick sanity check on the total GPU count follows the list):

  • Data Parallelism (DP -- different GPUs process different subsets of the batches) of 2,
  • Pipeline Parallelism (PP -- different GPUs process different layers) of 8,
  • Tensor Parallelism (TP -- different GPUs process different subtensors for matrix multiplication) of 8.
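The three parallelism degrees multiply to the total number of GPUs described above:

# DP x PP x TP must equal the total number of GPUs.
dp, pp, tp = 2, 8, 8
total_gpus = dp * pp * tp      # 128
nodes = total_gpus // 8        # 16 nodes with 8 GPUs each, as described above
print(total_gpus, nodes)       # 128 16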

Training Hyperparameters (7B)

bf16 true
lr 3e-4
eps 1e-5
betas [0.9, 0.95]
clip_grad 1
weight decay 0.1
DP size 16
TP size 4
PP size 1
seq length 2048
lr scheduler cosine
min lr 1e-6
warmup iteration 2000
micro batch size 10
global batch size 1600

Training Hyperparameters (70B)

bf16 true
lr 1.5e-4
eps 1e-5
betas [0.9, 0.95]
clip_grad 1
weight decay 0.1
DP size 2
TP size 8
PP size 8
seq length 4096
lr scheduler cosine
min lr 1e-6
warmup iteration 2000
micro batch size 2
global batch size 512
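Training itself runs through Megatron-LLM (see the pretraining script linked below), but as a rough, framework-agnostic sketch of what the 70B optimizer settings correspond to in plain PyTorch/transformers (the actual Megatron-LLM implementation differs, e.g. in how the learning-rate floor is enforced):

import torch
from transformers import get_cosine_schedule_with_warmup

# Hypothetical stand-in model; Meditron is trained with Megatron-LLM, not this loop.
model = torch.nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.5e-4,             # 70B peak learning rate
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Cosine decay with 2000 warmup iterations; this helper decays to 0,
# whereas the table above specifies a 1e-6 minimum learning rate.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2000,
    num_training_steps=20000,  # placeholder total; depends on tokens and batch size
)

# Gradient clipping at norm 1.0 would be applied each optimization step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)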

The script we used to pretrain our models with Megatron-LLM is available here: finetune.sh

Supervised Finetuning

We again used the Megatron-LLM distributed training library for supervised finetuning (single-node and multi-node). We provide a script, sft.py, that automatically handles tokenization and finetuning through Megatron-LLM. Here is an example of starting a multi-node finetuning run:

cd finetuning
python sft.py \
    --checkpoint=baseline \
    --size=70 \
    --run_name=cotmedqa \
    --data /pure-mlo-scratch/zechen/meditron/benchmarks/ft_preprocessed/medqa_cot_train.jsonl \
    --val /pure-mlo-scratch/zechen/meditron/benchmarks/ft_preprocessed/medqa_cot_validation.jsonl \
    --micro_batch=4 \
    --nodes=4 \
    --addr=<RANK0_HOST_NAME> \
    --save_interval=200 \
    --pp=4 \
    --seq 4096 \
    --rank=<CURRENT_RANK>

Run the above command on nodes rank-0, rank-1, rank-2, and rank-3 to start a 4-node finetuning process.

Important: Make sure to have the proper paths defined in sft.py and finetune_sft.sh.
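For a single-node run, the multi-node flags (--nodes, --addr, --rank) can presumably be omitted. The sketch below mirrors an invocation reported by users in the issues further down, so treat the exact flags as indicative rather than authoritative:

cd finetuning
python sft.py \
    --checkpoint=meditron \
    --size=7 \
    --run_name=pubmedqa \
    --data bigbio/pubmedqa \
    --micro_batch=4 \
    --seq 4096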

Finetuning Hyperparameters

bf16 true
lr 2e-5
eps 1e-5
betas [0.9, 0.95]
clip_grad 1
weight decay 0.1
DP size 16
TP size 4
PP size 1
seq length 2048 or 4096
lr scheduler cosine
min lr 2e-6
warmup ratio 0.1
added tokens [<|im_start|>, <|im_end|>]

Uses

Meditron-70B is being made available for further testing and assessment as an AI assistant to enhance clinical decision-making and democratize access to an LLM for healthcare use. Potential use cases may include but are not limited to:

  • Medical exam question answering
  • Supporting differential diagnosis
  • Disease information (symptoms, cause, treatment) query
  • General health information query

It is possible to use this model to generate text, which is useful for experimentation and understanding its capabilities. It should not be used directly for production or work that may impact people.

We do not recommend using this model for natural language generation in a production environment, finetuned or otherwise.

Downstream Use

Meditron-70B and Meditron-7B are both foundation models without finetuning or instruction-tuning. They can be finetuned, instruction-tuned, or RLHF-tuned for specific downstream tasks and applications. There are two ways we have used this model for downstream question-answering tasks.

  1. We apply in-context learning with k demonstrations (3 or 5 in our paper) added to the prompt (a generic sketch follows this list).
  2. We finetuned the models for downstream question-answering tasks using specific training sets.
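As an illustration of the first approach, here is a generic sketch of assembling a k-shot prompt; the demonstration content is illustrative only and this is not the exact prompt template from the paper:

def build_few_shot_prompt(demonstrations, question):
    """Assemble a k-shot prompt from (question, answer) demonstration pairs."""
    parts = []
    for demo_q, demo_a in demonstrations:
        parts.append(f"Question: {demo_q}\nAnswer: {demo_a}\n")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n".join(parts)

# Example with k=3 demonstrations (illustrative content only).
demos = [
    ("Which vitamin deficiency causes scurvy?", "Vitamin C"),
    ("What is the first-line treatment for anaphylaxis?", "Intramuscular epinephrine"),
    ("Which organism most commonly causes community-acquired pneumonia?", "Streptococcus pneumoniae"),
]
prompt = build_few_shot_prompt(demos, "Which electrolyte abnormality is associated with loop diuretics?")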

We encourage and look forward to the adaptation of the base model for more diverse applications.

If you want a more interactive way to prompt the model, we recommend using a high-throughput and memory-efficient inference engine with a UI that supports chat and text generation.

You can check out our deployment guide below, where we used FastChat with vLLM. We collected generations for our qualitative analysis through an interactive UI platform, BetterChatGPT. Here is the prompt format we used as an example:

(The prompt-format example is shown in the qualitative-analysis-prompt figure in the repository.)
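The figure is not reproduced here, but based on the ChatML-style special tokens added during finetuning (<|im_start|>, <|im_end|>) and the examples that appear in the issues further below, the format looks roughly like the sketch that follows; treat it as indicative rather than the exact prompt from the figure:

def format_prompt(system_msg, question):
    # ChatML-style template using the special tokens added during finetuning.
    # The spacing mirrors examples reported in the issues below; the exact
    # template in the figure may differ slightly.
    return (
        f"<|im_start|> system\n{system_msg}<|im_end|>\n"
        f" <|im_start|> user\n{question}<|im_end|>\n"
        f" <|im_start|> assistant\n"
    )

prompt = format_prompt(
    "You are a helpful, respectful and honest assistant.",
    "What are the risk factors for lung cancer?",
)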

Medical Benchmark Inference & Evaluation

Requirements

Before you start, please install the necessary packages:

vllm >= 0.2.1
transformers >= 4.34.0
datasets >= 2.14.6
torch >= 2.0.1
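For example, these can be installed in one step (a simple sketch using the version pins listed above; adjust to your environment):

pip install "vllm>=0.2.1" "transformers>=4.34.0" "datasets>=2.14.6" "torch>=2.0.1"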

For detailed instructions on running inference and evaluation with medical benchmarks, please read the inference & evaluation instructions.

Model Deployment

For detailed instructions on deploying Meditron models and running an interactive chat session, please read the Model Deployment documentation.
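The documented deployment uses FastChat on top of vLLM; as a minimal, self-contained sketch of serving the model with vLLM alone (offline inference, not the full documented setup), it looks roughly like this:

from vllm import LLM, SamplingParams

# Minimal offline-inference sketch with vLLM alone; the documented deployment
# additionally uses FastChat for the chat UI and API serving.
llm = LLM(model="epfl-llm/meditron-7b")   # the 70B model needs multiple GPUs (tensor_parallel_size)
sampling = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["What are the risk factors for lung cancer?"], sampling)
for out in outputs:
    print(out.outputs[0].text)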

Citation

If you use this software or our paper, please cite them:

@misc{chen2023meditron70b,
      title={MEDITRON-70B: Scaling Medical Pretraining for Large Language Models},
      author={Zeming Chen and Alejandro Hernández-Cano and Angelika Romanou and Antoine Bonnet and Kyle Matoba and Francesco Salvi and Matteo Pagliardini and Simin Fan and Andreas Köpf and Amirkeivan Mohtashami and Alexandre Sallinen and Alireza Sakhaeirad and Vinitra Swamy and Igor Krawczuk and Deniz Bayazit and Axel Marmet and Syrielle Montariol and Mary-Anne Hartley and Martin Jaggi and Antoine Bosselut},
      year={2023},
      eprint={2311.16079},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@software{epfmedtrn,
  author = {Zeming Chen and Alejandro Hernández-Cano and Angelika Romanou and Antoine Bonnet and Kyle Matoba and Francesco Salvi and Matteo Pagliardini and Simin Fan and Andreas Köpf and Amirkeivan Mohtashami and Alexandre Sallinen and Alireza Sakhaeirad and Vinitra Swamy and Igor Krawczuk and Deniz Bayazit and Axel Marmet and Syrielle Montariol and Mary-Anne Hartley and Martin Jaggi and Antoine Bosselut},
  title = {MediTron-70B: Scaling Medical Pretraining for Large Language Models},
  month = {November},
  year = 2023,
  url = {https://github.com/epfLLM/meditron}
}

meditron's People

Contributors

agbonnet, agromanou, alehd, athatheo, eltociear, eric11eca, josephrmartinez, jpcorb20, martinjaggi, vinitra, xxrjun


meditron's Issues

Accuracy calculation failure

I see that in evaluate.py accuracy is calculated in two different ways, and there is an assert statement to make sure they match.
In my tests, this assertion sometimes fails. Why is this assertion there, and what does its failure signify?

Thank you,

What system prompt do you recommend?

Hello Meditron team,

Happy New Year! Hope you are doing well. Thank you so much for releasing Meditron! I have the following questions on the recommended system prompt and how to input it to the model.

  1. It seems that during evaluation, different system prompts are used for different datasets (https://github.com/epfLLM/meditron/blob/main/evaluation/inference.py#L121). In general, when using the meditron models posted on huggingface (https://huggingface.co/epfl-llm/meditron-7b and https://huggingface.co/epfl-llm/meditron-70b), what system prompt do you recommend?

  2. Given a system prompt, is the following the proper way to input it to the model?

import torch
import transformers

model_path = "epfl-llm/meditron-7b"
prompt = "What are the symptoms of diabetes?"
prompt_template = f"system prompt beginning... {prompt} ... system prompt end"

# load model and tokenizer
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = transformers.AutoTokenizer.from_pretrained(model_path)

# tokenize input
prompt_tokens = tokenizer(prompt_template, return_tensors='pt')["input_ids"].to(model.device)

#generate output
output = model.generate(
   inputs=prompt_tokens,
   temperature=0.1,
   do_sample=True,
   eos_token_id=tokenizer.eos_token_id,
   pad_token_id=tokenizer.pad_token_id,
   max_new_tokens=512
   )

print(tokenizer.decode(output[0]))

Thank you very much!

Are you planning to release fine-tuned models?

Thank you for this great work and very detailed paper! In the paper, you write:

MEDITRON models (7B and 70B) with and without fine-tuning to the public to ensure access for real-world evaluation and to facilitate similar efforts in other domains.

Should we expect fine-tuned models to be released soon?

share python scripts for processing PubMed full articles

It is great to see you have released an open-source medical LLM with SOTA performance. From what I can find online, the Python scripts for processing PubMed full articles are not complete. Could you share your Python scripts for processing PubMed full articles? Thank you very much!

Issue with using model

Trying the prompt given in the paper, but the model just repeats the question without any helpful answers. Am I prompting it wrong?

(Screenshot of the model output attached to the original issue.)

Question about training hours

Hi, I've tried to calculate the training hours myself and got the following result:
(Screenshot: 2023-12-11 22-28-57)

It shows 430 hours, which is inconsistent with the 332 hours given in your Appendix A.
(Screenshot: 2023-12-11 22-31-55)

Issue with generation with standard HF generation

Great work and repo - however there is a tokenizer issue with the base version of the model.

When trying to simply prompt the base model with the suggested format, it runs into CUDA issues which seem to indicate weird tokenizer/embedding mismatches.

Working example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "epfl-llm/meditron-7b"

# BitsAndBytesConfig int-4 config 
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")

tokenizer = AutoTokenizer.from_pretrained(model_id)

def format_prompt(prompt):
    system_msg = "You are a helpful, respectful and honest assistant." + \
    "Always answer as helpfully as possible, while being safe." + \
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." + \
    "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct." + \
    "If you don't know the answer to a question, please don't share false information."

    system_msg = "You are a helpful, respectful and honest assistant."

    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"


med_prompt = format_prompt("What is a possible treatment for high blood pressure in a pregnant woman?")


Gives us this prompt:

'<|im_start|> system\nYou are a helpful, respectful and honest assistant.<|im_end|>\n <|im_start|> user\nmake a clinical note<|im_end|>\n <|im_start|> assistant\n'

Use vanilla HF pipeline:

# Use a pipeline for later
from transformers import pipeline

pipe = pipeline("text-generation",
                model=model,
                tokenizer= tokenizer,    
                max_new_tokens = 1024,
                do_sample=True,
                top_k=30,
                num_return_sequences=2,
                eos_token_id=tokenizer.eos_token_id,
                return_full_text=False,
                )

# generate from prompt
generated = pipe(med_prompt)

Leads to:

../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [642,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

But it all works fine if the special formatting is not provided. I understand the special formatting was only for the finetuned versions, but the tokenizer has these special tokens added for the base model too, which seems problematic.

I hope this is enough detail to go on, but it's throwing me a bit - it seems like the special tokens do not play nicely.

Environment details:

Python 3.9

Pip packages:

Package Version


accelerate 0.20.3
aiofiles 23.2.1
aiohttp 3.8.4
aiosignal 1.3.1
altair 5.1.2
annotated-types 0.6.0
anyio 3.7.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
bertopic 0.16.0
blis 0.7.11
catalogue 2.0.10
certifi 2023.5.7
charset-normalizer 3.1.0
click 8.1.7
cloudpathlib 0.16.0
cmake 3.26.4
colorama 0.4.6
comm 0.1.3
confection 0.1.4
contourpy 1.2.0
cycler 0.12.1
cymem 2.0.8
Cython 0.29.36
datasets 2.13.1
debugpy 1.6.7
decorator 5.1.1
dill 0.3.6
einops 0.6.1
en-core-web-sm 3.7.1
exceptiongroup 1.1.3
executing 1.2.0
fastapi 0.104.1
fastjsonschema 2.19.0
ffmpy 0.3.1
filelock 3.12.2
fonttools 4.44.0
frozenlist 1.3.3
fsspec 2023.6.0
gradio 4.2.0
gradio_client 0.7.0
h11 0.14.0
hdbscan 0.8.33
httpcore 1.0.2
httpx 0.25.1
huggingface-hub 0.15.1
idna 3.4
importlib-metadata 6.7.0
importlib-resources 6.1.1
ipykernel 6.23.3
ipython 8.14.0
jedi 0.18.2
Jinja2 3.1.2
joblib 1.3.2
jsonschema 4.19.2
jsonschema-specifications 2023.7.1
jupyter_client 8.3.0
jupyter_core 5.3.1
kiwisolver 1.4.5
langcodes 3.3.0
lit 16.0.6
llvmlite 0.41.1
markdown-it-py 3.0.0
MarkupSafe 2.1.3
matplotlib 3.8.1
matplotlib-inline 0.1.6
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
multiprocess 0.70.14
murmurhash 1.0.10
nbformat 5.9.2
nest-asyncio 1.5.6
networkx 3.1
nltk 3.8.1
numba 0.58.1
numpy 1.25.0
nvidia-cublas-cu11 11.10.3.66
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu11 8.5.0.96
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu11 10.9.0.58
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu11 10.2.10.91
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu11 11.7.4.91
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu11 2.14.3
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.3.52
nvidia-nvtx-cu11 11.7.91
nvidia-nvtx-cu12 12.1.105
orjson 3.9.10
packaging 23.1
pandas 2.0.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.1.0
pip 23.1.2
platformdirs 3.8.0
plotly 5.18.0
preshed 3.0.9
prompt-toolkit 3.0.38
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pyarrow 12.0.1
pydantic 2.4.2
pydantic_core 2.10.1
pydub 0.25.1
Pygments 2.15.1
pynndescent 0.5.11
pyparsing 3.1.1
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0
pyzmq 25.1.0
referencing 0.30.2
regex 2023.6.3
requests 2.31.0
rich 13.6.0
rpds-py 0.12.0
safetensors 0.3.1
scikit-learn 1.3.2
scipy 1.11.4
semantic-version 2.10.0
sentence-transformers 2.2.2
sentencepiece 0.1.99
setuptools 58.1.0
shellingham 1.5.4
six 1.16.0
smart-open 6.4.0
sniffio 1.3.0
spacy 3.7.2
spacy-legacy 3.0.12
spacy-loggers 1.0.5
srsly 2.4.8
stack-data 0.6.2
starlette 0.27.0
sympy 1.12
tenacity 8.2.3
thinc 8.2.1
threadpoolctl 3.2.0
tokenizers 0.13.3
tomlkit 0.12.0
toolz 0.12.0
torch 2.1.1
torchvision 0.16.1
tornado 6.3.2
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.1.0
typer 0.9.0
typing_extensions 4.8.0
tzdata 2023.3
umap-learn 0.5.5
urllib3 2.0.3
uvicorn 0.24.0.post1
wasabi 1.1.2
wcwidth 0.2.6
weasel 0.3.4
websockets 11.0.3
wheel 0.40.0
xxhash 3.2.0
yarl 1.9.2
zipp 3.15.0

load model size mismatch error

operations

I download the model file from https://huggingface.co/epfl-llm/meditron-7b/tree/main
then load the model using:
model = transformers.AutoModelForCausalLM.from_pretrained('./meditron-7b/', trust_remote_code=True, use_cache=True)

get the error:

size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32000, 4096]) from checkpoint, the shape in current model is torch.Size([32017, 4096]).

package

transformers version is 4.25.2

Where is my code error?

There isn't any output. Where is my code error?
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed

tokenizer = AutoTokenizer.from_pretrained("/Users/yutao/Documents/MyCode/meditron_7b/model")
model = AutoModelForCausalLM.from_pretrained("/Users/yutao/Documents/MyCode/meditron_7b/model")

question = "What to do about high blood pressure?"
inputs = tokenizer(question, return_tensors="pt")
set_seed(42)

answers = []

with torch.inference_mode():
    beam_output = model.generate(
        **inputs,
        max_new_tokens=1024,
        num_beams=1,
        pad_token_id=2,
        eos_token_id=2,
        early_stopping=False,
        do_sample=False,
    )
    answers.append(tokenizer.decode(beam_output[0], skip_special_tokens=True))

print("answers: "+answers)

Can't run finetuning script (wrong paths?)

Hello Meditron team,

Thank you so much for sharing your work! I'd like to follow your instructions to fine-tune the meditron model, but I get an error (potentially due to wrong paths). Specifically, I perform the following:

  1. Navigate in the meditron folder: cd path/meditron
  2. Run the script: python finetuning/sft.py --checkpoint=meditron --size=7 --run_name=pubmedqa --data bigbio/pubmedqa

But, I get the following error:

python finetuning/sft.py --checkpoint=meditron --size=7 --run_name=pubmedqa --data bigbio/pubmedqa
Tokenizing data!
Traceback (most recent call last):
  File "/n/home07/than157/desktop/llm-med/meditron/Megatron-LLM/tools/preprocess_instruct_data.py", line 28, in <module>
    from megatron.tokenizer import build_tokenizer
ModuleNotFoundError: No module named 'megatron.tokenizer'
Traceback (most recent call last):
  File "/n/home07/than157/desktop/llm-med/meditron/finetuning/sft.py", line 268, in <module>
    main(args)
  File "/n/home07/than157/desktop/llm-med/meditron/finetuning/sft.py", line 206, in main
    data_prefix = tokenize_data(
  File "/n/home07/than157/desktop/llm-med/meditron/finetuning/sft.py", line 85, in tokenize_data
    execute(cmd)
  File "/n/home07/than157/desktop/llm-med/meditron/finetuning/sft.py", line 41, in execute
    assert proc.wait() == 0
AssertionError

I've spent hours trying to figure out the right paths, but to no avail. I would be so grateful if you could help me with the following so I can run your script:

  1. How to fix the error above?
  2. How should I set CHECKPOINTS in sft.py to finetune the meditron-7b model that I downloaded from huggingface?

Thank you very much!

Can't benchmark on medqa

In the evaluation folder, running python inference.py --checkpoint mistral --checkpoint_name mistral --benchmark medqa fails with the error datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset.

I managed to fix the error by setting the MedQA benchmark attribute self.subsets = ['med_qa_en_source']

By the way, the repository's requirements.txt is missing the packages wandb, scikit-learn, and openai (the last unused, though), which are needed to run the benchmark suite.

Cannot find article data in papers-PubMed.jsonl

It is great to see you have released an open-source medical LLM with SOTA performance. When I ran "python load.py --dataset papers --key_path keys.json", it output papers-PubMed.jsonl, but I cannot find any full papers in this dataset, only some basic info for each article. Does anyone know what's wrong?
Thank you!

eval generation path issue

Hello, I'm trying to use your eval pipeline.
I ran ./inference_pipeline.sh -b pubmedqa -c gpt2 -s 0 -m 0 -out_dir out_dir
After it is done with generation I get the following error:

Stored pubmedqa generations to the following path: ../benchmarks/generations/pubmedqa-gpt2.jsonl
Traceback (most recent call last):
  File "evaluate.py", line 475, in <module>
    main(args)
  File "evaluate.py", line 390, in main
    data = load_jsonl(path)
  File "evaluate.py", line 39, in load_jsonl
    with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../benchmarks/generations/pubmedqa/pubmedqa-gpt2.jsonl'

It looks like the generations are saved in: meditron/benchmarks/generations/pubmedqa-gpt2.jsonl
But eval looks for them in: meditron/benchmarks/generations/pubmedqa/pubmedqa-gpt2.jsonl

Meditron-7b doesn't behave as expected

I've been experimenting with Meditron-7b for answering medical queries, but its performance is not what I expected compared to other LLMs.

I loaded the model and tokenizer and then used the standard HF pipeline:

import transformers

pipeline = transformers.pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    temperature=0.01,
    do_sample=True,
    top_k=3,
    top_p=0.01,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    max_new_tokens=200,
)

Then I used langchain wrapper:

from langchain.llms import HuggingFacePipeline
llm = HuggingFacePipeline(pipeline=pipeline)

For a simple greeting with llm(prompt="Hi, how are you?"), the model repetitively echoed the prompt:

'\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi, how are you?\n- Hi,'

When asked about lung cancer risk factors with llm(prompt="What are the risk factors for lung cancer?"), it provided a list of related questions instead of direct answers:

  • What are the symptoms of lung cancer?
  • What causes lung cancer?
  • What are the stages of lung cancer?
  • When to seek urgent medical care?
  • How to diagnose lung cancer?
  • How to treat lung cancer?
  • How to prevent lung cancer?
  • What to expect (Outlook/Prognosis)?

Further, using a formatted prompt based on a GitHub repository example, the response included the prompt format instructions verbatim, without addressing the medical query.

def format_prompt(prompt):
    system_msg = "You are a helpful, respectful and honest assistant." + \
    "Always answer as helpfully as possible, while being safe." + \
    "Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content." + \
    "Please ensure that your responses are socially unbiased and positive in nature.\n\n" + \
    "If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct." + \
    "If you don't know the answer to a question, please don't share false information."""
    return f"<|im_start|> system\n{system_msg}<|im_end|>\n <|im_start|> user\n{prompt}<|im_end|>\n <|im_start|> assistant\n"
example = {
        "prompt": """Four weeks after starting hydrochlorothiazide, a 49-year-old man with hypertension comes to the physician because of muscle cramps and weakness. His home medications also include amlodipine. His blood pressure today is 176/87 mm Hg. Physical examination shows no abnormalities. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?\n\nOptions:\nA. Torsemide \nB. Nifedipine \nC. Eplerenone \nD. Hydralazine""",
        "gold": "C",
        "steps": [
            "The patient has started hydrochlorothiazide.",
            "He now presents with muscle cramps and weakness and an ECG that supports the diagnosis of hypokalemia.",
            "(A) Torsemide is a loop diuretic and would likely aggravate the hypokalemia.",
            "(B) Nifedipine is a calcium antagonist and would not alleviate the hypocalcemia.",
            "(C) Eplerenone is a potassium-sparing diuretic and would likely decrease the chance of hypokalemia.",
            "(D) Hydralazine is a potent vasodilator and would not decrease the risk of hypokalemia.",
        ]
    }
prompt = format_prompt(example['prompt'])
res = llm(prompt=prompt)
print(res)

And this returned

You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.If you don't know the answer to a question, please don't share false information.<|im_end|>
<|im_start|> user
A 65-year-old man with a history of hypertension and hyperlipidemia presents with a 2-week history of progressive dyspnea on exertion. He has a history of smoking 1 pack of cigarettes per day for 30 years. He has no history of diabetes mellitus, coronary artery disease, or peripheral vascular disease. His blood pressure is 150/90 mm Hg, and his pulse is 80 beats per minute. Physical examination reveals a grade 3/6 systolic murmur at the apex. The precordial leads of a 12-lead ECG are shown. The addition of which of the following is most likely to have prevented this patient's condition?

Options:
A. Amlodipine
B. Lisinopril
C. Metoprolol
D. Nifedipine<|im_end|>
<|im_start|> assistant
You are a helpful, respectful and honest assistant.Always answer as helpfully as possible, while being safe.Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.Please ensure that your responses are socially unbiased and positive in nature.

Is this behavior typical for Meditron-7b, or might it be an issue with my prompting technique? Additionally, would Meditron-70b potentially yield better results?

Errors with three of the scrapers

Hello,

I was trying to scrape magic, drugs and guidelinecentral without success, while some others were fine. Any idea how to make them work? Drugs seemed to work, but 0 articles were in the JSONL. GuidelineCentral had some click issues. Finally, Magic printed errors for every article but one.

Thanks in advance,

Data preparation not working

Overall, this part doesn't work: the scripts seem to have wrong paths, and the Selenium part has issues, etc.

If you sort it out, we'll make a UNA version of your model; hope it can help with your research.

prompt Meditron for NER

Hi, thank you for releasing the model. I believe it's a great resource for biomedical research and applications.

I am trying to prompt Meditron (7b, 70b-4b) for a medical NER task (using the example prompt, i.e., you are a helpful...) and couldn't get good results. Any suggestions on how to do NER with this model?

llama.cpp Integration to Support Low-End Hardware Compatibility

Request for llama.cpp Integration to Support Low-End Hardware Compatibility

Description

I'm currently trying to integrate llama.cpp with Meditron for running models on lower-end hardware. Meditron is based on Llama, so in theory, this should be possible. However, I'm encountering issues when attempting to convert the Meditron model using llama.cpp.

Steps to Reproduce

  1. Either run python3 convert-hf-to-gguf.py ../meditron-7b/

    • Output:
      Loading model: meditron-7b
      Traceback (most recent call last):
      ...
      NotImplementedError: Architecture "LlamaForCausalLM" not supported!
      
  2. Or directly launching with llama.cpp using:

    ./build/bin/main --rope-freq-scale 8.0 -m ../meditron-7b/pytorch_model-00008-of-00008.bin -p "I have pain in my leg from toes to hip"
    
    • Output:
      Log start
      ...
      error loading model: llama_model_loader: failed to load model from ../meditron-7b/pytorch_model-00008-of-00008.bin
      

Expected Behavior

Successful integration of llama.cpp with Meditron, allowing the model to run on lower-end hardware.

Actual Behavior

Encountering a NotImplementedError for the architecture "LlamaForCausalLM" when trying to convert the model, and an error loading the model when launching directly with llama.cpp.

Possible Solution

Adjustments in llama.cpp to support the "LlamaForCausalLM" architecture used by Meditron. This could involve modifying the model conversion script or the model loading mechanism in llama.cpp.

Additional Context

Link to llama.cpp

Request

I kindly request the team to consider adding support for llama.cpp integration with Meditron, or to give advice on how to implement it. This would be a significant enhancement, enabling the use of Meditron models on more diverse hardware setups, especially those at the lower end.

Question about Figure 1

(Figure attached in the original issue.)

I read the paper and found that 70.2 comes from the task-specific fine-tuned version:
(Figure attached in the original issue.)

Are the values for ChatGPT (60.2) and GPT-4 (82.3) from in-context learning or from fine-tuning? If they come from in-context learning, then I think this figure is misleading and unfair.

Eval results aren't matching the paper

I'm not able to match the 3-shot eval results reported in the paper for the pretrained model.
I downloaded the Meditron-7b model from HF.
For example, for MedQA I get 0.353, while the paper reports 0.287±0.008
My command was: ./inference_pipeline.sh -b medqa4 -c meditron-7b -s 3 -m 0 -out_dir out_dir

On PubMedQA, I got 0.486, but the paper reports 0.693±0.151.

Mismatch in vocab_size between .bin files and .safetensors files

Hey !

I'm sorry if this is not an issue and it's just me not understanding the problem, I'm not an expert, rather a novice, in this field.

I'm trying to deploy the project according to your deployment guide.
However, since I don't have access to enough memory for the 70B version of the model, I want to use the --load-8bit parameter to enable model compression. (I should mention that I run the model on CPU, with the --device cpu flag.)

When I use this, I get the following error:

ValueError: Trying to set a tensor of shape torch.Size([32000, 8192]) in "weight" (which has shape torch.Size([32017, 8192])), this look incorrect

If I look at HF's upload log, I see that there were two main uploads of the model:

  • The first one with the .bin files, including the vocab_size value set to 32000
  • The second one with the .safetensors files, including the vocab_size value set to 32017

My understanding is that to enable model compression, the .bin files are needed, which no longer match the model configuration.

This is supported by a manual edit of the config.json file to set vocab_size back to 32000, which allows the model to load properly using --load-8bit.

Loading the guidelines with huggingface datasets fails

Running the following code

from datasets import load_dataset

dataset = load_dataset("epfl-llm/guidelines")

Gives me this error:

{
	"name": "DatasetGenerationError",
	"message": "An error occurred while generating the dataset",
	"stack": "---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1932, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1925     writer = writer_class(
   1926         features=writer._features,
   1927         path=fpath.replace(\"SSSSS\", f\"{shard_id:05d}\").replace(\"JJJJJ\", f\"{job_id:05d}\"),
   (...)
   1930         embed_local_files=embed_local_files,
   1931     )
-> 1932 writer.write_table(table)
   1933 num_examples_progress_update += len(table)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/arrow_writer.py:573, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
    572 pa_table = pa_table.combine_chunks()
--> 573 pa_table = table_cast(pa_table, self._schema)
    574 if self.embed_local_files:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2332, in table_cast(table, schema)
   2331 if table.schema != schema:
-> 2332     return cast_table_to_schema(table, schema)
   2333 elif table.schema.metadata != schema.metadata:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2291, in cast_table_to_schema(table, schema)
   2290     raise ValueError(f\"Couldn't cast\
{table.schema}\
to\
{features}\
because column names don't match\")
-> 2291 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2292 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2291, in <listcomp>(.0)
   2290     raise ValueError(f\"Couldn't cast\
{table.schema}\
to\
{features}\
because column names don't match\")
-> 2291 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
   2292 return pa.Table.from_arrays(arrays, schema=schema)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1834, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1833 if isinstance(array, pa.ChunkedArray):
-> 1834     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1835 else:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1834, in <listcomp>(.0)
   1833 if isinstance(array, pa.ChunkedArray):
-> 1834     return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
   1835 else:

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2147, in cast_array_to_feature(array, feature, allow_number_to_str)
   2146 elif not isinstance(feature, (Sequence, dict, list, tuple)):
-> 2147     return array_cast(array, feature(), allow_number_to_str=allow_number_to_str)
   2148 raise TypeError(f\"Couldn't cast array of type\
{array.type}\
to\
{feature}\")

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:1836, in _wrap_for_chunked_arrays.<locals>.wrapper(array, *args, **kwargs)
   1835 else:
-> 1836     return func(array, *args, **kwargs)

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/table.py:2029, in array_cast(array, pa_type, allow_number_to_str)
   2028 if pa.types.is_null(pa_type) and not pa.types.is_null(array.type):
-> 2029     raise TypeError(f\"Couldn't cast array of type {array.type} to {pa_type}\")
   2030 return array.cast(pa_type)

TypeError: Couldn't cast array of type string to null

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
model_playground.ipynb Cell 6 line 3
      model_playground.ipynb#W5sdnNjb2RlLXJlbW90ZQ%3D%3D?line=2'>3</a> dataset = load_dataset(\"epfl-llm/guidelines\")

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/load.py:2152, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2149 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   2151 # Download and prepare data
-> 2152 builder_instance.download_and_prepare(
   2153     download_config=download_config,
   2154     download_mode=download_mode,
   2155     verification_mode=verification_mode,
   2156     try_from_hf_gcs=try_from_hf_gcs,
   2157     num_proc=num_proc,
   2158     storage_options=storage_options,
   2159 )
   2161 # Build dataset for splits
   2162 keep_in_memory = (
   2163     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   2164 )

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:948, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    946     if num_proc is not None:
    947         prepare_split_kwargs[\"num_proc\"] = num_proc
--> 948     self._download_and_prepare(
    949         dl_manager=dl_manager,
    950         verification_mode=verification_mode,
    951         **prepare_split_kwargs,
    952         **download_and_prepare_kwargs,
    953     )
    954 # Sync info
    955 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1043, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1039 split_dict.add(split_generator.split_info)
   1041 try:
   1042     # Prepare split will record examples associated to the split
-> 1043     self._prepare_split(split_generator, **prepare_split_kwargs)
   1044 except OSError as e:
   1045     raise OSError(
   1046         \"Cannot find data file. \"
   1047         + (self.manual_download_instructions or \"\")
   1048         + \"\
Original error:\
\"
   1049         + str(e)
   1050     ) from None

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1805, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1803 job_id = 0
   1804 with pbar:
-> 1805     for job_id, done, content in self._prepare_split_single(
   1806         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1807     ):
   1808         if done:
   1809             result = content

File ~/miniconda3/envs/llm/lib/python3.10/site-packages/datasets/builder.py:1950, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1948     if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1949         e = e.__context__
-> 1950     raise DatasetGenerationError(\"An error occurred while generating the dataset\") from e
   1952 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset"
}

Abnormal evaluation result

I evaluated the llama-2-70b model on pubmedqa with the cot, sc_cot, and multi_seed + sc_cot inference modes, but I got some abnormal evaluation results.

For the cot inference mode: there are only 26 correct answers with 476 ignored; is that normal?
For the sc_cot and multi_seed + sc_cot results, I got about 52% accuracy, which differs from the result in your paper.

I want to know whether the released evaluation code is exactly the same as the code you used.

My evaluation result:
cot:

====================================
Report accuracy for pubmedqa-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.032
Accuracy (calibrated): 0.6153846153846154
Precision: 0.03709090909090909
Recall: 0.032
F1: 0.033303703703703696
------------------------------------
Correct: 16
Counted: 26
Total: 500
Unable to find answer: 474
Ignored prompts: 474
 ====================================

sc_cot

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================

Multi-seed + sc_cot:

====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-1234:
Accuracy: 0.458
Accuracy (calibrated): 0.5240274599542334
Precision: 0.36550423868216514
Recall: 0.458
F1: 0.4052266991967127
------------------------------------
Correct: 229
Counted: 437
Total: 500
Unable to find answer: 63
Ignored prompts: 63
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-432:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
====================================
Report accuracy for pubmedqa-sc_sc-cot-llama2-70b-base on pubmedqa-32:
Accuracy: nan
Accuracy (calibrated): -1
Precision: nan
Recall: nan
F1: nan
------------------------------------
Correct: 0
Counted: 0
Total: 0
Unable to find answer: 0
Ignored prompts: 0
====================================
