fullfact / health-misinfo-shared
Raphael health misinformation project, shared by Full Fact and Google
License: MIT License
Currently, the list of Previous Searches just shows the video ID and links to the list of extracted claims.
As a user demonstrating this tool,
I want to see the title of each processed video,
so I can easily find the one I'm looking for.
It's possible that videos may have the same or similar titles, so including the video ID would still be helpful.
If the title is very long, it could be truncated. (Not sure whether that ever happens, though.)
We have a script which lets us run promptfoo to compare prompts. This is good. However, we need to write the actual tests which are done on the prompt output.
We want to automatically run our evaluation code on each promptfoo evaluation output.
This will involve adapting code found in src/evaluation.py
.
For the test we'll need a Python script which matches a certain format. From the docs:

"This file will be called with an output string and an AssertContext object (see above). It expects that either a bool (pass/fail), float (score), or GradingResult will be returned."

The script must define a `get_assert` method, with the inputs/outputs specified in the documentation (see notes). Example of a script:
from typing import Any, Dict, Union

def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    print('Prompt:', context['prompt'])
    print('Vars:', context['vars']['topic'])
    # This return is an example GradingResult dict
    return {
        'pass': True,
        'score': 0.6,
        'reason': 'Looks good to me',
    }
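As a sketch of that adaptation - assuming src/evaluation.py exposes something like an evaluate_claims() helper (a hypothetical name) returning a score between 0 and 1 - a get_assert wrapper might look like:

from typing import Any, Dict, Union

# Hypothetical import: the real function in src/evaluation.py may have a
# different name and signature.
from src.evaluation import evaluate_claims

def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    # Run our own evaluation code over the model output for this test case.
    score = evaluate_claims(output, context['vars'])
    return {
        'pass': score >= 0.5,  # placeholder threshold
        'score': score,
        'reason': f'evaluation score {score:.2f}',
    }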
When I type in a video ID and click "get claims" there's no response. The browser tools suggest a CORS error.
Steps to reproduce the behaviour:
docker compose up -d
Cross-Origin Request Blocked: The Same Origin Policy disallows reading the remote resource at https://127.0.0.1:4000/api/transcripts/4WAFHXdTMbY. (Reason: CORS request did not succeed). Status code: (null).
2
Uncaught (in promise) TypeError: NetworkError when attempting to fetch resource.
The video should be processed and the claims listed.
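One possible fix - a sketch, not a confirmed diagnosis - is to enable CORS on the Flask app using the flask-cors package, so the frontend can call the API from a different origin:

from flask import Flask
from flask_cors import CORS  # third-party package: flask-cors

app = Flask(__name__)
# Allow all origins - acceptable for a local demo, too permissive for production.
CORS(app)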
Currently the prompt given to Gemini is sourced from src/raphael_backend_flask/prompts.py and so is hardcoded. If there were an input on the landing page for a custom prompt (pre-filled with the default), it would give data scientists (and select users?) the ability to iterate on results or demonstrate different prompts. Would this be a useful and/or desirable feature?
In order to demonstrate and see the value of this approach, we need a minimal working tool.
This issue will serve as an umbrella for defining and implementing the first pass at this MVP.
High-level product sketch
This code should be standalone (not depending on existing Full Fact code).
This code will be discarded at some point, so let's not worry about NFRs like scale, speed, robustness etc.
This API will provide a simple interface to store and retrieve video transcripts and claims extracted by the health model. It will interact with a simple on-disk sqlite3 database to persist data.
Requirements:
database API endpoints to store, retrieve and delete:
backend API to:
We don’t want to accidentally push .env.backend to version control, so we should either gitignore it, or rename it to .env instead (which is already gitignored).
Currently, if someone enters the video id and clicks analyse for a video that has already been analysed, the model will extract the same (or similar) claims again and add them to the list.
Instead, we should list the video twice (or more times) on the main page, each linking to its associated set of claims.
This is especially important as we update the prompts/models etc. and may want to compare the same video before and after a change.
Steps to reproduce the behaviour:
If the same ID occurs twice in the list of analysed videos, distinguish them with a version number. E.g.
etc.
For reference: in Live, if the same YouTube video is analysed twice, two versions are shown. This is what we want here.
We have recently asked Full Fact's health fact checkers to annotate some claims.
We now want to use that data for in-context learning (meaning the training data is put in the prompt for few-shot/many-shot learning). (We might later also use it to fine-tune a model and use that for inference.)
We'll also want to do some evaluation. The simplest approach might be to split the annotated set and use part for in-context learning and the rest for evaluation.
We'll also start with multiple CSV files for annotations - one per annotator. Probably best to keep these separate (e.g. so we can add more later), but merge into one big JSON file for use in the actual prompt.
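A minimal sketch of that merge, assuming each annotator CSV has claim and label columns (the real column names may differ):

import csv
import json
from pathlib import Path

def merge_annotations(csv_dir: str, out_path: str) -> None:
    # Keep one record per annotated claim, across all annotator files.
    examples = []
    for csv_file in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_file, newline="") as f:
            examples.extend(
                {"claim": row["claim"], "label": row["label"]}
                for row in csv.DictReader(f)
            )
    with open(out_path, "w") as f:
        json.dump(examples, f, indent=2)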
After researching the impact on results of adding an introduction to the prompt that tells the model which persona to adopt, a "medical fact checker" persona was found to perform best, including outperforming having no persona at all.
This should be added permanently to the code.
We'll need a high quality labelled data set for ongoing training & evaluation.
Very minor, but titles stored in metadata can include HTML entities. These should be unescaped before storing them.
(BeautifulSoup would handle this for us, if we wanted to use that for parsing HTML).
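For reference, the standard library can do the unescaping without pulling in BeautifulSoup:

import html

# html.unescape converts entities like &amp; back to plain characters.
title = html.unescape("Apple cider vinegar &amp; honey")
# -> "Apple cider vinegar & honey"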
`process_video` runs chunks of text through an LLM, and appends the results to a list which it eventually returns. For long videos, it can take a long time to complete. If there's an error along the way, this can mean none of the results get saved to the database.
Instead of appending to a big list and returning that, it would be preferable to yield results as it gets them.
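A sketch of the shape of that refactor (run_llm and save_claims_to_db are stand-ins for the real calls):

from typing import Iterable, Iterator, List

def run_llm(chunk: str) -> List[str]:
    """Stand-in for the real LLM call inside process_video."""
    ...

def process_video(chunks: Iterable[str]) -> Iterator[List[str]]:
    # Yield each chunk's claims as soon as they arrive, so the caller can
    # persist partial results instead of losing everything on a late error.
    for chunk in chunks:
        yield run_llm(chunk)

# Caller side:
# for claims in process_video(chunks):
#     save_claims_to_db(claims)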
The database initialisation script introduced in 39acff9 contains the wrong child key in both FOREIGN KEY constraints. The key is `id` instead of the correct key, `video_id`.
A simple database with API access will be created as discussed in the MVP doc.
This will provide a simple API to allow storage/retrieval of video transcripts; annotated claims for training; and extracted claims inferred by the final model.
Create three tables:
- video_transcripts: id, URL, metadata (title/date/topic etc.), transcript (a big block of text, e.g. 5k characters). Will retrieve via exact matches of URL or fields of metadata.
- training_claims: text of claim (roughly a sentence), any manually added label, reference to video ID, timestamp (& UUID?). Will retrieve via video ID.
- inferred_claims: text of claim, model-added label of harm, model_id (string), reference to video ID, timestamp (& UUID?). Will retrieve via video ID.
There are a number of tools out there that are designed to speed up prompt engineering, by allowing rapid evaluation or comparison of genAI models. We should investigate these to see if they could be useful.
Our prompt & model should identify claims worth checking from a video, but should also indicate how checkworthy it is. One form of the prompt (for example) labels each claim as one of:
"not worth checking", "worth checking", "may be worth checking"
These labels should be reflected in the front end by grouping the claims by that "summary" label so that the claims "worth checking" are shown at the top, clearly demarcated (a text label and/or maybe a background colour?); then the "may be worth checking" as a separate block below; and finally the "not worth checking" at the bottom.
The on-screen labels should reflect our caveats, e.g. in order: "most likely to be worth checking", "may be worth checking" and "less likely to be worth checking" for these three labels.
Later on, we might have more than 3 groups, possibly even a continuous score. But let's see how this crude 'traffic light' sorting works first.
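For illustration, the grouping itself is simple - a sketch assuming each claim dict carries a "label" field with one of the three values above:

LABEL_ORDER = ["worth checking", "may be worth checking", "not worth checking"]

def group_claims(claims: list[dict]) -> dict[str, list[dict]]:
    # Buckets come back in display order: "worth checking" first.
    return {
        label: [c for c in claims if c["label"] == label]
        for label in LABEL_ORDER
    }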
For clarity, the tool shows a paraphrased version of each extracted claim. The first thing any fact checker would want to do is read the actual claim made in context. (This is partly to check the model hasn't hallucinated any extra meaning into the claim.)
After a video has been processed and a list of claims is displayed, also display the entire transcript in a scrollable area to one side.
When a claim is clicked, the video should start playing from that point (as per issue 20) AND also the transcript display should jump to that point in the transcript and highlight the raw text of the claim.
In this mockup, the user has clicked on a claim on the left and can see the raw transcript on the right with the same claim highlighted. (The video should be playing somewhere too.)
I am not a UI expert! It may be that starting a video AND jumping the text at the same time is confusing or obscure. An alternative might be to add a little "jump to transcript" button beside each extracted claim, maybe with a second button saying "jump to video".
If someone makes an outrageous claim, but in the context of saying things like "But ask your doctor first" or "I'm not a doctor" or "this worked for me but YMMV" etc., then the fact checker should be told of the context as well as the claim, as this might affect their decision to scope the claim further.
This could be an extra variable extracted by our main prompt (i.e. alongside clarity, evidence, type of claim etc.). We can probably find quite a few examples to annotate.
This work depends on #11
Create a simple UI that allows users to:
UI will be based on the following API contract:

GET /transcripts
  200, [{
    id: string
    title: string
    url: string
  }]

POST /transcripts
  201, {
    id: string
  }

GET /transcripts/<string:id>
  200, {
    id: string
    title: string
    url: string
    metadata: string
    transcript: string
  }
  404, ""

DELETE /transcripts/<string:id>
  204, ""

GET /transcripts/<string:id>/status
  200, {
    status: string (done|processing|error)
  }
  404, ""

GET /inferred_claims/<string:id>
  200, [{
    id: number
    video_id: text
    claim: text
    label: text
    model: text
    offset_ms: number
  }]
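For illustration, a client exercising this contract might look like the sketch below (the base URL and the POST body shape are assumptions):

import time
import requests

BASE = "http://127.0.0.1:4000/api"  # assumed base URL

# Submit a video for processing (body shape is an assumption).
video_id = requests.post(
    f"{BASE}/transcripts",
    json={"url": "https://www.youtube.com/watch?v=4WAFHXdTMbY"},
).json()["id"]

# Poll until the transcript has been processed.
while requests.get(f"{BASE}/transcripts/{video_id}/status").json()["status"] == "processing":
    time.sleep(2)

# Fetch the claims extracted from the video.
claims = requests.get(f"{BASE}/inferred_claims/{video_id}").json()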
Currently, when we ask Gemini to identify and extract claims, it paraphrases them. This is good because it improves the readability of the claims, in contrast to the raw transcript. Part of this is to make the claims standalone without needing extra context.
However, Gemini also tries to be helpful by adding extra context that isn't in the transcript. In some cases, this can change the meaning quite significantly.
E.g. if a transcript says "i love carrots you know they're so crunchy carrots make you see in the dark", Gemini may summarise this as "Carrots are good for night vision because they're rich in vitamin A".
We need a way to evaluate a model by comparing its output to a gold-set of evaluation data.
Being generative, the output from models will vary. (If the temperature is set to 0, then the response to a fixed prompt should be constant; but switching between different models or changing the prompt even in a trivial way can break that.)
Given a model and a set of labelled evaluation data, return metrics (precision, recall, F1 or equivalent) in such a way that the same model gives the same metrics with repeated trials.
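As a starting point, a sketch of the metrics over exact-match claim sets (real evaluation will likely need fuzzy matching, since generative output rarely matches the gold set word for word):

def claim_metrics(predicted: set[str], gold: set[str]) -> dict[str, float]:
    # True positives: claims the model found that are also in the gold set.
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}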
We currently search for natural remedies for a bunch of health issues, but we could probably widen the search a bit.
By adding more conditions/topics and more search phrases we can find a wider range of content, and ensure that the tool works on a representative set of videos.
The current database schema (#62 (comment)) has a few gaps. We’d like to store the following things:
Here’s a proposed revised schema. This is WIP, and it may also be more complicated than we really need. But hopefully it should capture the things mentioned above. UPDATE: @dcorney, @ff-dh, @JamesMcMinn and @andylolz discussed and agreed the following:
erDiagram
  youtube_videos ||--o{ claim_extraction_runs : runs
  youtube_videos {
    text id
    text metadata
    text transcript
  }
  claim_extraction_runs ||--o{ inferred_claims : claims
  claim_extraction_runs {
    integer id PK
    text youtube_id FK
    text model
    text status
    integer timestamp
  }
  inferred_claims {
    integer id PK
    integer run_id FK
    text claim
    text raw_sentence_text
    text labels
    real offset_start_s
    real offset_end_s
  }
  training_claims {
    integer id PK
    text youtube_id
    text claim
    text labels
  }
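Transcribed as sqlite3 DDL, the agreed schema might look like this sketch (constraint details beyond the PK/FK markers above, and the database filename, are assumptions):

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS youtube_videos (
    id TEXT PRIMARY KEY,
    metadata TEXT,
    transcript TEXT
);
CREATE TABLE IF NOT EXISTS claim_extraction_runs (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT REFERENCES youtube_videos (id),
    model TEXT,
    status TEXT,
    timestamp INTEGER
);
CREATE TABLE IF NOT EXISTS inferred_claims (
    id INTEGER PRIMARY KEY,
    run_id INTEGER REFERENCES claim_extraction_runs (id),
    claim TEXT,
    raw_sentence_text TEXT,
    labels TEXT,
    offset_start_s REAL,
    offset_end_s REAL
);
CREATE TABLE IF NOT EXISTS training_claims (
    id INTEGER PRIMARY KEY,
    youtube_id TEXT,
    claim TEXT,
    labels TEXT
);
"""

sqlite3.connect("database.db").executescript(SCHEMA)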
Currently, all URLs of claims found within a video are identical. Instead, each claim should link to the point in the video where the claim is made.
Steps to reproduce the behaviour:
&t=0
Each link URL should end with the timestamp associated with the claim.
(Actually, we should probably link a few seconds earlier.)
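A sketch of the link construction, with a hypothetical rewind_s parameter for the "few seconds earlier" idea:

def claim_url(video_id: str, offset_s: float, rewind_s: int = 5) -> str:
    # Rewind a little so the viewer gets the run-up to the claim.
    t = max(0, int(offset_s) - rewind_s)
    return f"https://www.youtube.com/watch?v={video_id}&t={t}"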
We've done some research into some GenAI evaluation tools (#38), but the scripts written so far have been toy examples.
We should write a script that compares prompts on representative data, which we can reuse for evaluating new prompts and new models.
As @ff-dh points out, it's not necessary to run `python -m tools.db` – a database will be initialised on first use (which is very neat!)
In fact, the schema in tools/db/main.py appears to be out of date, so it's best not to use it. Potentially this should be removed.
For use in claim_extraction_runs.models (See: #71 (comment))
We have a script which lets us run promptfoo to compare prompts. This is good. However, we need to write the actual tests which are done on the prompt output.
We want our model to produce a paraphrased claim, alongside the direct quote from the text that this was paraphrased from.
We should write a test which checks if the quotes pulled out by the model actually exist in the text.
For the test we'll need a Python script which matches a certain format. From the docs:

"This file will be called with an output string and an AssertContext object (see above). It expects that either a bool (pass/fail), float (score), or GradingResult will be returned."

The script must define a `get_assert` method, with the inputs/outputs specified in the documentation (see notes). Example of a script:
from typing import Any, Dict, Union

def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    print('Prompt:', context['prompt'])
    print('Vars:', context['vars']['topic'])
    # This return is an example GradingResult dict
    return {
        'pass': True,
        'score': 0.6,
        'reason': 'Looks good to me',
    }
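A sketch of the quote-existence check itself, assuming the model output is a JSON list of objects with "claim" and "quote" fields, and that the source text is available as a "transcript" var (all assumed names):

import json
from typing import Any, Dict, Union

def get_assert(output: str, context) -> Union[bool, float, Dict[str, Any]]:
    transcript = context['vars']['transcript']  # assumed var name
    claims = json.loads(output)
    # A quote "exists" only if it appears verbatim in the source text.
    missing = [c for c in claims if c['quote'] not in transcript]
    return {
        'pass': not missing,
        'score': 1 - len(missing) / len(claims) if claims else 0.0,
        'reason': f'{len(missing)} quote(s) not found verbatim in the transcript',
    }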
`baseUrl` is hardcoded currently:
This is a problem when running the flask server locally. It would be useful to be able to set this from an environment variable.
Currently, the fine-tuning task is to take a chunk of text and return a list of checkworthy health claims, i.e. a simple list of strings. The training set therefore is a list of chunks (input) and checkworthy claims (target output).
However, we have expert knowledge about what makes some claims checkworthy (e.g. they are concrete claims, they refer to studies or science, they contradict clear scientific advice etc.) which we should include in our training. We can do this by making the target output richer and fine-tuning on that. Claims containing "hedge words", like "can help", "may be effective", "possibly useful" etc. are generally not worth checking.
`generate_training_set()` should be updated so that the CSV file it produces is easier to label, e.g. produce a separate file with just the claim field. When this is labelled in an external spreadsheet, we'll need to import it again. So... update `make_training_set()` so that it reads in the annotated CSV file and the corresponding rich JSON file and merges them to produce the fine-tuning dataset (see the sketch after the examples). Example records:

{"claim": "Fenugreek may help to reduce body fat especially around the abdominal area.", "label": "false", "reason": "may is a hedge word"}
{"claim": "A whole food plant-based diet can help fight cancer.", "label": "false", "reason": "can is a hedge word"}
{"claim": "HPV is associated with almost 100% of cervical cancers.", "label": "false", "reason": "claim is accurate"}
{"claim": "Cinnamon is one of the most effective cancer fighting food mainly because it has high antioxidant content and also because of its antibacterial property.", "label": "true", "reason": "high harm"}
{"claim": "In many Scandinavian countries, even if you get aggressive cancer, they still don't treat you, they'll just watch you.", "label": "true", "reason": "high harm"}
Running `docker compose` fails. The solution seems to be to remove `localhost` from both image: lines in the compose.yaml (though this should be confirmed).
Steps to reproduce the behaviour:
docker compose up -d
✘ raphael_frontend_react Error Get "http://localhost/v2/": dial tcp [::1]:80: connect: connection refused 0.2s
✘ raphael_backend_flask Error context canceled 0.2s
Error response from daemon: Get "http://localhost/v2/": dial tcp [::1]:80: connect: connection refused
The docker image should start running
See Slack thread.
We have training data from various sources currently sitting in various CSV files. We also have a (probably empty) db table called `training_claims`, and a script (`/tools/db`) that auto-populates the database with some sample data.

The application frontend and backend need some mode of deployment. After looking at the Kubernetes configuration, it's probably best this is done by deploying a new VM to the Machine Learning project on GCP instead - we can point a subdomain at the box, and stick nginx and Docker on it.
Requirements:
database.db is in version control, presumably added in error. It should be removed and gitignored.
Users should be able to copy/paste the complete URL of a YouTube video such as:
Or just the video id such as:
Stretch goal (only if it's not much work) - allow more complex URLs, such as:
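A sketch of the ID extraction covering these cases (the 11-character ID pattern is standard for YouTube, but treat the details as assumptions):

import re
from urllib.parse import urlparse, parse_qs

def extract_video_id(user_input: str) -> str | None:
    # Bare 11-character video ID.
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", user_input):
        return user_input
    parsed = urlparse(user_input)
    # Short youtu.be links carry the ID in the path.
    if parsed.hostname == "youtu.be":
        return parsed.path.lstrip("/")
    # Full watch URLs carry the ID in the ?v= query parameter.
    if parsed.hostname and parsed.hostname.endswith("youtube.com"):
        return parse_qs(parsed.query).get("v", [None])[0]
    return None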
We need a script or module to initialise a sqlite3 db that has the fields listed in #6. It shouldn't be hard to write this as a script with arguments that also provides some kind of initialisation function, in case we want to import it as a module.
Currently, we get and store the offset (in seconds) with each bit of text. But when we form chunks of text to pass to an LLM, we discard the offset.
Track the offset of each chunk.
If the chunks are long, the offset might be quite a long way before the claims within the chunk.
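A sketch of chunking that keeps each chunk's starting offset, assuming the transcript arrives as (offset_s, text) segments:

from typing import Iterable, Iterator, Tuple

def chunk_with_offsets(segments: Iterable[Tuple[float, str]],
                       max_chars: int = 4000) -> Iterator[Tuple[float, str]]:
    # Each yielded chunk carries the offset of its first segment.
    offset, parts, size = None, [], 0
    for seg_offset, text in segments:
        if offset is None:
            offset = seg_offset
        parts.append(text)
        size += len(text)
        if size >= max_chars:
            yield offset, " ".join(parts)
            offset, parts, size = None, [], 0
    if parts:
        yield offset, " ".join(parts)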
Currently, vertex.py and prompts.py are duplicated for use by the flask app. Ideally they’d instead be imported (or symlinked) from a canonical source.
The core of this system is a fine-tuned LLM that takes as input (part of) a video transcript, and as output returns a list of health-related claims, each with a label. These labels serve two purposes: first, they can be used to indicate which claims are worth showing to fact checkers as potentially worth checking; and second they allow us to provide domain knowledge to the model during the training process.
E.g. the claim
Conversely, the claim
This is a first pass at a set of labels. This was just created to test the workflow so needs reviewing/reworking.
An example set of claims, which were extracted by Gemini from a small sample of videos.
Bind for 0.0.0.0:4000 failed: port is already allocated
This actually appears to be a YouTube Player API bug, but one that we might want to work around.
If you are logged into YouTube (I think this includes being logged into any Google account) and watch a long (more than 20 minutes, I think) video on the YouTube website, YouTube attempts to remember your position in the video. If you close the tab and reopen, it will continue playing from the same point. (I am not exactly sure how this works / what the parameters are.)
If YouTube has a remembered last position in the video and you run the same video through Raphael, this messes up jumping to timestamp in the YouTube embed. Clicking a claim to jump to the timestamp causes the embedded player to jump to that timestamp and then immediately jump to the last played position. This means it’s not really possible to jump to the right point in the video using the embedded player.
Steps to reproduce the behaviour:
This problem doesn’t occur if:
I had an initial go at technical workarounds, but I’m afraid I wasn’t able to find one.
A common type of video promotes a particular remedy (e.g. an essential oil or some magic food extract) and is combined with a financial tie-in, so the video creator will make money from related product sales. While some of these will be perfectly legitimate commercial videos, some will be promoting health misinformation and profiting from it.
We have developed a prompt & model that generates 5 labels for each claim:
understandability, type_of_claim, type_of_medical_claim, support, harm
We should switch the model used in the MVP to use this version instead, as a precursor to fixing #61 as it will allow us to display and sort by the degree of checkworthiness.
We potentially have the option of using a fine-tuned Gemini model here, or using in-context learning (i.e. putting the training data in the prompt). The results should be broadly similar so let's deploy whatever is simplest to deploy!
- Point `vertex.py`/`generate_reponse()` to the new model
- Store results in the `inferred_claims` table - the output will need to be converted from JSON to a single string first

A simple database with API access will be created by #6. Once this exists, the rest of the code needs to be updated to make use of it.
Examine the code to find where data is stored in local files and update to use persistent db via the new API.
Examples include: