explodinggradients / ragas
Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines
Home Page: https://docs.ragas.io
License: Apache License 2.0
TODO
For measuring context_relevancy, the LLM tries to extract sentences from the given context that are actually useful for answering the question. But in this process, there is a slight chance that the LLM hallucinates and outputs a sentence that is not actually present in the given context.
It would be good to have some kind of check to ensure that only sentences present in the given context are selected by the LLM.
Another way to look at it is to change the prompt so that the LLM outputs the indices of candidate sentences from a given context. This could bypass the need for an extra check.
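As a rough illustration of the first option, a post-hoc check could keep only candidates that occur verbatim in the context (the names below are illustrative, not the ragas API):

def filter_hallucinated(candidates: list[str], context: str) -> list[str]:
    # Keep only candidate sentences that literally appear in the context.
    return [s.strip() for s in candidates if s.strip() and s.strip() in context]

context = "Paris is the capital of France. It hosts the Louvre."
llm_output = "Paris is the capital of France.\nParis has 10 million people."
print(filter_hallucinated(llm_output.split("\n"), context))
# -> ['Paris is the capital of France.']  (the hallucinated sentence is dropped)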
Current link in README.md
https://github.com/explodinggradients/ragas/blob/main/README.md points to https://discord.gg/5djav8GGNZ, which returns an invalid invite.
Hopefully you can refresh the invite link :). If you don't want people to join, you should probably remove the link.
Hey guys,
I have a problem that occurred all of a sudden. I am using the Azure OpenAI client and it worked until recently. Now I am getting the error "Did not find openai_api_key, please add an environment variable OPENAI_API_KEY which contains it, or pass openai_api_key as a named parameter."
I am setting the parameters like so at the beginning of the code:
import os
from langchain.llms import AzureOpenAI  # imports added for completeness
os.environ["OPENAI_API_TYPE"] = "azure"
os.environ["OPENAI_API_KEY"] = "..................................."
os.environ["OPENAI_API_VERSION"] = "....................................."
os.environ["OPENAI_API_BASE"] = "..........................."
llm = AzureOpenAI(deployment_name=".................................", openai_api_key=os.environ["OPENAI_API_KEY"])
Passing it as a named parameter to the evaluate method did not solve it by the way.
I have absolutely zero idea where this is coming from. Can anyone hint me in a direction?
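One pattern that may help isolate the problem, sketched under the assumption that your ragas version lets metrics take a langchain LLM (the Faithfulness(llm=...) style used elsewhere in this tracker), is to hand the already-configured Azure client to the metric explicitly instead of relying on the environment:

from ragas import evaluate
from ragas.metrics import Faithfulness  # import path assumed

# Hypothetical wiring: reuse the `llm` built above so ragas does not
# have to re-read OPENAI_API_KEY itself.
faithfulness_azure = Faithfulness(name="faithfulness_azure", llm=llm, batch_size=3)
result = evaluate(my_dataset, metrics=[faithfulness_azure])  # my_dataset: your HF Dataset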
Hi,
Great piece of work here, really well done.
It would be fantastic if we could use any supported LLM from langchain to do the evaluation.
I have a number of use cases which require sovereignty, which essentially means either using on-prem LLMs or Azure OpenAI locked to a certain region.
Happy to help wherever I can!
Can you please supply a notebook that demonstrates how to use ragas with an open-source LLM?
Thanks,
Eyal
We track very basic usage metrics to guide us in figuring out what our users want, what is working, and what's not. As a young startup, we have to be brutally honest about this, which is why we are tracking these metrics. We are also an Open Startup, which is a product or company that operates in the open and shares its statistics publicly.
All the data and the code we use for tracking will be open-sourced soon.
If you don't want to send tracking info, you can easily disable it by setting RAGAS_DO_NOT_TRACK to True.
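For example (a sketch; set the variable before ragas is imported, and note that the exact truthy values accepted may vary by version):

import os
os.environ["RAGAS_DO_NOT_TRACK"] = "true"  # must be set before importing ragas
import ragas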
Keeping this issue open for feedback from the community and further discussions.
Hey,
Thanks for creating and maintaining this repository.
I assume that you are using an LLM to get the scores for each metric. Or are you using some bespoke model for each metric, like coherence or faithfulness?
If you rely on an LLM, how do you get the score? Do you ask the LLM to spit out a score?
Please let me know!
Thanks!
Hi, thanks for this amazing framework.
I would like to know whether there is any support for using locally trained LLMs. I see that currently we can change the LLM via langchain, but I don't want to use langchain; I want to use a local LLM like Llama 2 directly.
I set OPENAI_API_BASE to my own deployed model, and then there were some errors in the data evaluated by ragas.
First, the same data produces different results each time it's evaluated. Second, there are values outside the range, like -1, in faithfulness.
Is there something wrong with my model? How can I locate the bug?
I cannot reproduce the results of the fiqa baseline in this notebook:
https://github.com/explodinggradients/ragas/blob/main/experiments/baselines/fiqa/dataset-exploration-and-baseline.ipynb
At the end of this notebook, it shows the score:
{'NLI_score': 0.8655555555555556, 'answer_relevancy': 0.8737666666666667, 'context_ relevancy': 0.8181444444444443, 'ragas_score': 0.8517704492684051}
But when I test, the score I get is:
{
"context_ relevancy": 0.10368744449698047,
"faithfulness": 1.0,
"answer_relevancy": 0.9286177818722253,
"context_recall": 0.6370370370370371,
"harmfulness": 0.0,
"ragas_score": 0.300955397960847,
}
I noticed that my context_relevancy is very low. I know that in the latest PR the prompt used to test context_relevancy was modified, but I am not running that latest version. I don't think that's the cause either, because this Jupyter notebook also seems to have been run with an old version.
In addition to the context_relevancy value, there seem to be some gaps in the other values. So I'm wondering, what could be causing this?
Hi guys,
I am evaluating the quality of the metrics in your framework for my RAG use case. So far, I am afraid to say, most prompts and responses are not delivering the expected results and are therefore not reliable. I am using the Azure OpenAI endpoint with GPT-3.5, so it is very possible that this has everything to do with that. Can anyone else confirm this observation?
Hello, I'm trying to use ragas for a simple evaluation on my dataset with only 2 columns ("question", "answer").
import datasets
import os
from ragas import evaluate  # added: evaluate was used below but never imported
from ragas.metrics import answer_relevancy
os.environ["OPENAI_API_KEY"] = "<my OpenAI key>"
data = datasets.Dataset.from_dict({"question": ["2+2", "what is it ragas"], "answer": ["4", "an evaluation metric"]})
results = evaluate(data, metrics=[answer_relevancy])
From this very simple example I receive an error message:
RuntimeError Traceback (most recent call last)
Cell In[9], line 1
----> 1 results = evaluate(data, metrics=[answer_relevancy])
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/evaluation.py:89, in evaluate(dataset, metrics)
87 scores = []
88 for metric in metrics:
---> 89 scores.append(metric.score(dataset).select_columns(metric.name))
91 # log the evaluation event
92 metrics_names = [m.name for m in metrics]
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:164, in AnswerRelevancy.score(self, dataset)
160 sentence_ds = dataset.map(
161 self._make_question_answer_pairs, batched=True, batch_size=10
162 )
163 # we loose memory here because we have to make it py_list
--> 164 scores = self.model.predict(sentence_ds["sentences"])
165 return Dataset.from_dict({f"{self.name}": scores})
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/ragas/metrics/answer_relevance.py:133, in QGen.predict(self, sentences, batch_size, show_progress)
131 inputs, labels = data
132 with torch.no_grad():
--> 133 logits = self.model(**inputs, output_hidden_states=False).logits
134 loss = self.get_loss(logits, labels)
135 predictions.append(loss)
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:1683, in T5ForConditionalGeneration.forward(self, input_ids, attention_mask, decoder_input_ids, decoder_attention_mask, head_mask, decoder_head_mask, cross_attn_head_mask, encoder_outputs, past_key_values, inputs_embeds, decoder_inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict)
1680 # Encode if needed (training, first prediction pass)
1681 if encoder_outputs is None:
1682 # Convert encoder inputs in embeddings if needed
-> 1683 encoder_outputs = self.encoder(
1684 input_ids=input_ids,
1685 attention_mask=attention_mask,
1686 inputs_embeds=inputs_embeds,
1687 head_mask=head_mask,
1688 output_attentions=output_attentions,
1689 output_hidden_states=output_hidden_states,
1690 return_dict=return_dict,
1691 )
1692 elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
1693 encoder_outputs = BaseModelOutput(
1694 last_hidden_state=encoder_outputs[0],
1695 hidden_states=encoder_outputs[1] if len(encoder_outputs) > 1 else None,
1696 attentions=encoder_outputs[2] if len(encoder_outputs) > 2 else None,
1697 )
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/transformers/models/t5/modeling_t5.py:988, in T5Stack.forward(self, input_ids, attention_mask, encoder_hidden_states, encoder_attention_mask, inputs_embeds, head_mask, cross_attn_head_mask, past_key_values, use_cache, output_attentions, output_hidden_states, return_dict)
986 if self.embed_tokens is None:
987 raise ValueError("You have to initialize the model with valid token embeddings")
--> 988 inputs_embeds = self.embed_tokens(input_ids)
990 batch_size, seq_length = input_shape
992 # required mask seq length can be calculated via length of past
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/modules/sparse.py:162, in Embedding.forward(self, input)
161 def forward(self, input: Tensor) -> Tensor:
--> 162 return F.embedding(
163 input, self.weight, self.padding_idx, self.max_norm,
164 self.norm_type, self.scale_grad_by_freq, self.sparse)
File ~/miniconda3/envs/NLG/lib/python3.10/site-packages/torch/nn/functional.py:2210, in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
2204 # Note [embedding_renorm set_grad_enabled]
2205 # XXX: equivalent to
2206 # with torch.no_grad():
2207 # torch.embedding_renorm_
2208 # remove once script supports set_grad_enabled
2209 _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
I have Python 3.10.10 with:
torch 2.0.1
ragas 0.0.5
datasets 2.12.0
Thanks a lot
We started ragas with ground-truth-free evaluations so that you didn't have to put significant upfront effort into building an ideal test set before running evaluations. Creating a test set needs substantial upfront investment in time, money, human hours and expertise to get right. It is also a continuous process as your product and ML model evolve to cater to diverse use cases. This is why we are exploring the possibilities of synthetic test set generation.
The whole focus of the ragas library is to help you build more reliable RAG applications, which is why with the next leg of ragas we'll be focusing a lot more on test set generation and continual learning of RAG pipelines. The goal is to leverage custom LLMs and Data-Centric AI techniques to get there.
There is a lot of work to be done, but with the v0.1 release of ragas we'll be releasing features in this direction. In the meantime, we would love to hear your opinions, expectations, suggestions and ideas about this too :)
Team Ragas
Citations.md refers to how others should cite ragas.
Hello team 👋
When I try to reproduce your llamaindex notebook, without modifying anything, I get an error with:
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
It says:
TypeError: evaluate() takes 3 positional arguments but 4 were given
Any idea on how to make it work? Thanks!
I got a faithfulness result of -1.5 for one of the rows of my dataset, even though its defined scale is 0 to 1. I am currently using ragas version 0.0.9.
This did not happen when I previously ran version 0.0.7 on the same dataset.
Following is the piece of code I used:
from datasets import Dataset
from langchain.chat_models import ChatOpenAI
from ragas import evaluate
from ragas.metrics import Faithfulness, context_relevancy, answer_relevancy
import pyarrow as pa  # imports added for completeness; df is my pandas DataFrame

gpt3 = ChatOpenAI()
faithfulness_gpt3 = Faithfulness(
    name="faithfulness_gpt3", llm=gpt3, batch_size=3
)
subset = df.iloc[:3]
hg_dataset_1 = Dataset(pa.Table.from_pandas(subset))
result = evaluate(hg_dataset_1, metrics=[faithfulness_gpt3, context_relevancy, answer_relevancy])
Azure OpenAI requires the special parameter deployment or deployment_id.
The langchain wrappers seem to mostly have been updated to accommodate this, but it doesn't seem to work with ragas.
I ended up getting faithfulness working by updating the generate method, from:
elif isinstance(llm, BaseChatModel):
ps = [p.format_messages() for p in prompts]
result = llm.generate(ps, callbacks=callbacks)
to
elif isinstance(llm, BaseChatModel):
ps = [p.format_messages() for p in prompts]
result = llm.generate(ps, callbacks=callbacks, deployment_id='<my_id>', api_version='<my_version>')
but with answer_relevancy I hit the same issue when it tries to run:
 91 def calculate_similarity(
 92     self: t.Self, question: str, generated_questions: list[str]
 93 ):
---> 94     question_vec = np.asarray(self.embedding.embed_query(question)).reshape(1, -1)
 95     gen_question_vec = np.asarray(
 96         self.embedding.embed_documents(generated_questions)
 97     )
Any ideas?
Currently, if the sequence length of the context exceeds max_length (512 tokens), it is truncated before scoring relevancy. Instead, such contexts should be chunked into sequences of fewer than 512 tokens before scoring, and the chunk scores then averaged.
Changes for this could be made here
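A rough sketch of the idea, assuming a tokenizer with a plain encode/decode interface (the helper names are illustrative, not the ragas implementation):

def chunk_by_tokens(text, tokenizer, max_tokens=512):
    # Split the context into pieces of at most max_tokens tokens each.
    ids = tokenizer.encode(text)
    return [tokenizer.decode(ids[i:i + max_tokens]) for i in range(0, len(ids), max_tokens)]

def chunked_relevancy(question, context, tokenizer, score_fn):
    # score_fn scores one (question, chunk) pair; average over chunks
    # instead of silently dropping everything past the first 512 tokens.
    chunks = chunk_by_tokens(context, tokenizer) or [context]
    return sum(score_fn(question, c) for c in chunks) / len(chunks)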
I am trying to run evaluate on outputs generated by GPT-4. I have the columns structured in the desired format; however, I am running into the following error:
Dataset feature "contexts" should be of type Sequence[string[, got <class 'datasets.features.features.Value'>
Any tips on how to resolve this? Thank you!
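This error usually means each row's "contexts" entry is a single string rather than a list of strings. A minimal sketch of the expected shape (the values are illustrative):

from datasets import Dataset

# "contexts" must be a list of strings per row (Sequence[string]),
# not a bare string.
data = {
    "question": ["What is ragas?"],
    "answer": ["An evaluation framework for RAG pipelines."],
    "contexts": [["ragas is an evaluation framework...", "It offers several metrics."]],
}
dataset = Dataset.from_dict(data)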
Ragas can only be used with a dataset that contains a fixed set of attributes. Any attributes beyond the required ones cause key errors. For example, here I used a dataset with column names [question, answer, contexts, ungrounded_answer]:
from datasets import load_dataset
from ragas.metrics import (
answer_relevancy,
faithfulness,
)
from ragas import evaluate
wikieval = load_dataset("explodinggradients/WikiEval")
wikieval = wikieval['train'].rename_columns({"grounded_answer":"answer","context_v1":"contexts"})
results = evaluate(dataset=wikieval, metrics=[faithfulness, answer_relevancy])
KeyError Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 results = evaluate(dataset=wikieval,metrics=[context_relevancy,faithfulness,answer_relevancy])
File ~/belar/src/ragas/evaluation.py:89, in evaluate(dataset, metrics, column_map)
86 metrics = [answer_relevancy, context_relevancy, faithfulness, context_recall]
88 # remap column names from the dataset
---> 89 dataset = remap_column_names(dataset, column_map)
91 # validation
92 validate_evaluation_modes(dataset, metrics)
File ~/belar/src/ragas/validation.py:14, in remap_column_names(dataset, column_map)
9 """
10 Remap the column names in case dataset uses different column names
11 """
12 inverse_column_map = {v: k for k, v in column_map.items()}
13 return dataset.from_dict(
---> 14 {inverse_column_map[name]: dataset[name] for name in dataset.column_names}
15 )
File ~/belar/src/ragas/validation.py:14, in <dictcomp>(.0)
9 """
10 Remap the column names in case dataset uses different column names
11 """
12 inverse_column_map = {v: k for k, v in column_map.items()}
13 return dataset.from_dict(
---> 14 {inverse_column_map[name]: dataset[name] for name in dataset.column_names}
15 )
KeyError: 'ungrounded_answer'
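Until extra columns are tolerated, a workaround is to drop them before calling evaluate, using the standard datasets API:

# remove_columns is a standard datasets.Dataset method
wikieval = wikieval.remove_columns(["ungrounded_answer"])
results = evaluate(dataset=wikieval, metrics=[faithfulness, answer_relevancy])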
Hi all,
I understand how RAGAS works for RAG systems.
I have a use case where I have fine-tuned a GPT-3.5 model on my data and am using this model for question answering.
I want to know whether I can use RAGAS to evaluate this fine-tuned model, as it does not have contexts/chunks to be passed into the RAGAS metrics function.
Can anyone help me with how to use RAGAS metrics for my GPT-3.5 fine-tuned model?
Improve the documentation of the metrics. Try to explain how the different metrics work in more depth.
I want to evaluate a single completion of my LLM.
Code:
from ragas import evaluate
from datasets import Dataset
import os
# prepare your huggingface dataset in the format
# Dataset({
# features: ['question','contexts','answer'],
# num_rows: 25
# })
data = {
"question": [query], # single query, string
"contexts": [sources], # single source document in string, I have tried sources[:3000], sources[:3500] to avoid this error
"answer": [answer] # single answer
}
# Create the Hugging Face dataset
dataset = Dataset.from_dict(data)
# Set the dataset format
dataset.set_format(
type="torch", columns=["question", "contexts", "answer"] # I have tried without type='torch'
)
# Print dataset information
print(dataset)
dataset: Dataset
results = evaluate(dataset)
Complete Output traceback:
Dataset({
features: ['question', 'contexts', 'answer'],
num_rows: 1
})
100%|██████████| 1/1 [00:00<00:00, 1.44it/s]
100%|██████████| 52/52 [03:09<00:00, 3.65s/it]
0%| | 0/1 [00:02<?, ?it/s]
---------------------------------------------------------------------------
InvalidRequestError Traceback (most recent call last)
<ipython-input-32-8200ac55342c> in <cell line: 38>()
36 dataset: Dataset
37
---> 38 results = evaluate(dataset)
9 frames
/usr/local/lib/python3.10/dist-packages/ragas/evaluation.py in evaluate(dataset, metrics)
86 scores = []
87 for metric in metrics:
---> 88 scores.append(metric.score(dataset).select_columns(metric.name))
89
90 return Result(scores=concatenate_datasets(scores, axis=1), dataset=dataset)
/usr/local/lib/python3.10/dist-packages/ragas/metrics/factual.py in score(self, dataset)
71 scores = []
72 for batch in tqdm(self.get_batches(len(dataset))):
---> 73 score = self._score_batch(dataset.select(batch))
74 scores.append(score)
75
/usr/local/lib/python3.10/dist-packages/ragas/metrics/factual.py in _score_batch(self, ds)
101 prompts.append(prompt)
102
--> 103 response = openai_completion(prompts)
104 outputs = response["choices"] # type: ignore
105
/usr/local/lib/python3.10/dist-packages/backoff/_sync.py in retry(*args, **kwargs)
103
104 try:
--> 105 ret = target(*args, **kwargs)
106 except exception as e:
107 max_tries_exceeded = (tries == max_tries_value)
/usr/local/lib/python3.10/dist-packages/ragas/metrics/llms.py in openai_completion(prompts, **kwargs)
24 - what happens when backoff fails?
25 """
---> 26 response = openai.Completion.create(
27 model=kwargs.get("model", "text-davinci-003"),
28 prompt=prompts,
/usr/local/lib/python3.10/dist-packages/openai/api_resources/completion.py in create(cls, *args, **kwargs)
23 while True:
24 try:
---> 25 return super().create(*args, **kwargs)
26 except TryAgain as e:
27 if timeout is not None and time.time() > start + timeout:
/usr/local/lib/python3.10/dist-packages/openai/api_resources/abstract/engine_api_resource.py in create(cls, api_key, api_base, api_type, request_id, api_version, organization, **params)
151 )
152
--> 153 response, _, api_key = requestor.request(
154 "post",
155 url,
/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py in request(self, method, url, params, headers, files, stream, request_id, request_timeout)
296 request_timeout=request_timeout,
297 )
--> 298 resp, got_stream = self._interpret_response(result, stream)
299 return resp, got_stream, self.api_key
300
/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py in _interpret_response(self, result, stream)
698 else:
699 return (
--> 700 self._interpret_response_line(
701 result.content.decode("utf-8"),
702 result.status_code,
/usr/local/lib/python3.10/dist-packages/openai/api_requestor.py in _interpret_response_line(self, rbody, rcode, rheaders, stream)
761 stream_error = stream and "error" in resp.data
762 if stream_error or not 200 <= rcode < 300:
--> 763 raise self.handle_error_response(
764 rbody, rcode, resp.data, rheaders, stream_error=stream_error
765 )
InvalidRequestError: This model's maximum context length is 4097 tokens, however you requested 4331 tokens (3831 in your prompt; 500 for the completion). Please reduce your prompt; or completion length.
Even after changing all the lengths inside my dataset, I get the same error every single time: 3831 tokens in my prompt, 500 for the completion.
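One likely culprit, for what it's worth: slicing by characters (sources[:3000]) does not bound the token count. A sketch of capping the context by tokens instead, using tiktoken (an assumption; any tokenizer matching the model works), since the 500-token completion budget is fixed by the metric:

import tiktoken  # assumption: tokenizer matching the OpenAI model in use

enc = tiktoken.encoding_for_model("text-davinci-003")

def truncate_to_tokens(text: str, max_tokens: int) -> str:
    # Hard-cap the context so prompt + 500-token completion fits in 4097.
    ids = enc.encode(text)
    return enc.decode(ids[:max_tokens])

sources_short = truncate_to_tokens(sources, 3000)  # sources as in the snippet above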
Hello, I read the quickstart document. ground_truths is only required if you are using context_recall. Since I do not need context_recall, I did not prepare it and did not add context_recall to my metrics, but something goes wrong anyway: "Column ground_truths not in the dataset. Current columns in the dataset: ['question', 'answer', 'contexts']"
DatasetDict({
baseline: Dataset({
features: ['question', 'ground_truths', 'answer', 'contexts'],
num_rows: 30
})
})
I wanted to know if the ground truth refers to the ground-truth passages that need to be retrieved, or to the final answer.
I was a bit confused as I saw
ground_truths: list[list[str]]
in the documentation. Should it not be ground_truths: list[str]?
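For illustration, the list[list[str]] shape means each row carries a list of ground-truth strings (a sketch of one plausible row layout, not an excerpt from the docs):

data = {
    "question": ["Who painted the Mona Lisa?"],
    "answer": ["Leonardo da Vinci."],
    "contexts": [["The Mona Lisa was painted by Leonardo da Vinci."]],
    # one inner list per row; a row may have several reference strings
    "ground_truths": [["Leonardo da Vinci painted the Mona Lisa."]],
}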
from ragas.metrics.answer_relevance import AnswerRelevancy, answer_relevancy
File "####\Python\Python311\Lib\site-packages\ragas\metrics\answer_relevance.py", line 10, in
from langchain.embeddings.base import Embeddings
ModuleNotFoundError: No module named 'langchain.embeddings.base'
My langchain version is 0.0.261.
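If the class has simply moved between langchain releases, an import shim like the following sometimes works as a stopgap (the fallback path is an assumption about newer langchain layouts; verify it against your installed version):

try:
    from langchain.embeddings.base import Embeddings
except ModuleNotFoundError:
    # newer langchain versions expose the base class elsewhere (assumed path)
    from langchain.schema.embeddings import Embeddings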
Make sure the output distribution from the entailment score is correct.
Hi team,
I'm new to RAGAS; however, I found this:
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy, context_recall
from ragas.langchain import RagasEvaluatorChain
# make eval chains
eval_chains = {
m.name: RagasEvaluatorChain(metric=m)
for m in [faithfulness, answer_relevancy, context_relevancy, context_recall]
}
for name, eval_chain in eval_chains.items():
score_name = f"{name}_score"
print(f"{score_name}: {eval_chain(result)[score_name]}")
Using the template code, I found that the score name for context_relevancy has a space in the middle: "context_ relevancy". Is it just me?
(Don't mind the len=3, I was using different code.)
At the moment, this metric uses OpenAIEmbedding by default.
Can we allow it to use a custom embedding instead?
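As a sketch of what that could look like, assuming the metric grew an embeddings parameter (hypothetical here) accepting any langchain Embeddings implementation:

from langchain.embeddings import HuggingFaceEmbeddings
from ragas.metrics.answer_relevance import AnswerRelevancy

# hypothetical parameter: swap the default OpenAI embedding for a local one
local_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
answer_relevancy_local = AnswerRelevancy(embeddings=local_embeddings)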
Hello! Thanks for the great work :)
I've run into a bug while trying to use a local LLM. I cannot compute either ContextRelevancy or AnswerRelevancy when using a langchain BaseLLM, due to this exception at line 36 in ragas.metrics.llms:
20 def generate(
21 prompts: list[ChatPromptTemplate],
22 llm: BaseLLM | BaseChatModel,
23 n: t.Optional[int] = None,
24 temperature: float = 0,
25 callbacks: t.Optional[Callbacks] = None,
26 ) -> LLMResult:
27 old_n = None
28 n_swapped = False
29 llm.temperature = temperature
30 if n is not None:
31 if isinstance(llm, OpenAI) or isinstance(llm, ChatOpenAI):
32 old_n = llm.n
33 llm.n = n
34 n_swapped = True
35 else:
---> 36 raise Exception(
37 f"n={n} was passed to generate but the LLM {llm} does not support it."
38 " Raise an issue if you want support for {llm}."
39 )
The issue arises because when ContextRelevancy and AnswerRelevancy call this function, they pass in n=self.strictness, e.g. in ragas.metrics.answer_relevance:
75 results = generate(
76 prompts,
77 self.llm,
78 n=self.strictness,
79 temperature=self.temperature,
80 callbacks=batch_group,
81 )
However, strictness must be an integer due to both classes' __post_init__:
54 def __post_init__(self: t.Self):
55 self.temperature = 0.2 if self.strictness > 0 else 0
Passing strictness=None would resolve the Exception, but yields another error due to the __post_init__ integer comparison.
Is there any scope to change the Exception to allow n=0 or similar, e.g. changing line 30 in ragas.metrics.llms to if n is not None and n > 0?
If not, we are currently unable to use these metrics with a non-OpenAI LLM, as far as I can tell.
Thank you!
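For concreteness, the proposed guard would look something like this (a sketch of the suggested change, not a merged patch):

# ragas.metrics.llms.generate, line 30 (proposed):
if n is not None and n > 0:  # was: if n is not None
    if isinstance(llm, OpenAI) or isinstance(llm, ChatOpenAI):
        old_n = llm.n
        llm.n = n
        n_swapped = True
    else:
        raise Exception(...)  # unchanged for n > 0 on unsupported LLMs
# metrics could then pass strictness=0 to skip the swap entirely,
# provided __post_init__ keeps treating strictness as an int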
Can we prevent printing the progress and other logs?
Issue in src/ragas/metrics/llms.py:
The result variable is not initialized outside the if/elif/else blocks.
results.keys() returns
dict_keys(['answer_relevancy', 'context_ relevancy', 'faithfulness', 'ragas_score'])
where there is a whitespace in 'context_ relevancy'.
This is due to ragas/src/ragas/metrics/context_relevance.py, line 110 in 9f54d01.
Ensure results are reproducible by using a Seed generator.
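A minimal sketch of what deterministic runs could involve (assuming the relevant generators are exposed; LLM-side determinism additionally needs temperature=0 or a provider-side seed):

import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    # seed the stdlib and numpy generators used anywhere in the pipeline
    random.seed(seed)
    np.random.seed(seed)

set_seed(42)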
Hi guys,
I am using Azure OpenAI, and the only metric that currently works is faithfulness. All the others fail with the same error: Exception: n=3 was passed to generate but the LLM AzureOpenAI Params: {'deployment_name': '.........', 'model_name': 'text-davinci-003', 'temperature': 0.2, 'max_tokens': 256, 'top_p': 1, 'frequency_penalty': 0, 'presence_penalty': 0, 'n': 1, 'request_timeout': None, 'logit_bias': {}} does not support it. Raise an issue if you want support for {llm}.
Hi guys - I'm currently adding Ragas to DeepEval (confident-ai/deepeval#101).
One thing I found odd, though, was that context_recall is done using prompts. Curious to hear if there are other experiments in this space to answer the question of whether the retriever answered the query. Potentially a mean Cross-Encoder QA score would be helpful here?
Curious to hear your thoughts!
Exception: n=1 was passed to generate but the LLM VLLM Params: {} does not support it. Raise an issue if you want support for {llm}.
Following the instructions from the exception, I am raising this issue.
HuggingFace is famous for moving fast and breaking things. It would be great to have exact dependency versions for all the Python libraries: something like "datasets==2.14.3" instead of just datasets.
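For instance, a pinned requirements.txt could look like this (the versions other than datasets==2.14.3 and langchain==0.0.261, both mentioned in this thread, are placeholders, not recommendations):

# requirements.txt -- pin exact versions so upstream releases can't break you
datasets==2.14.3
langchain==0.0.261
transformers==4.31.0  # placeholder version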
I created a small dataset of my own and observed that context_relevancy > 1.
Can I please know what this means?
Hi,
When running the quickstart.ipynb notebook, I got the following errors:
ragas/src/ragas/metrics/faithfulnes.py
Line 25 in 5cf4975