
StoryWeb

StoryWeb is a project that aims to extract networks of entities from journalistic reporting. The idea is to reverse-engineer stories into structured graphs of the people and companies involved, and to capture the relationships between them.

https://storyweb.opensanctions.org

StoryWeb consumes news articles as input data. Individual articles can be imported via the web interface, but articles can also be bulk-imported using the articledata micro-format. One producer of articledata files is mediacrawl, which can be used to crawl news websites and harvest all of their articles.

Installation

StoryWeb can be run as a Python web application from a developer's machine, or via a Docker container. We recommend using Docker for any production deployment, and as a quick means to get the application running if you don't intend to change its code.

Running in Docker mode

You can start up a Docker instance by running the following commands in an empty directory:

wget https://raw.githubusercontent.com/opensanctions/storyweb/main/docker-compose.yml
docker-compose up

This will make the storyweb user interface available on port 8000 of the host machine.

Running in development mode

Before installing storyweb on the host machine, we recommend setting up a Python virtual environment of some form (venv, virtualenv, etc.).

As a first step, let's install the spaCy models that are used to extract person and company names from the given articles:

pip install spacy
python3 -m spacy download en_core_web_sm
python3 -m spacy download de_core_news_sm
python3 -m spacy download xx_ent_wiki_sm
python3 -m spacy download ru_core_news_sm

Next, we'll install the application itself and its dependencies. Run the following command inside a git checkout of the storyweb repository:

pip install -e ".[dev]"

You also need to have a PostgreSQL server running somewhere (e.g. on the same machine, perhaps installed via Homebrew or apt). Create a fresh database on that server and point StoryWeb to it like this (adjust the username, password and hostname to match your setup; the hostname db below comes from the docker-compose configuration, so for a local server you would typically use localhost):

export STORYWEB_DB_URL=postgresql://storyweb:storyweb@db/storyweb
# Create the database tables:
storyweb init

You now have the application configured and you can explore the commands exposed by the storyweb command-line tool:

Usage: storyweb [OPTIONS] COMMAND [ARGS]...

  Storyweb CLI

Options:
  --help  Show this message and exit.

Commands:
  auto-merge  Automatically merge on fingerprints
  compute     Run backend computations
  graph       Export an entity graph
  import      Import articles into the DB
  import-url  Load a single news story by URL
  init        Initialize the database

The import command listed here will accept any data file in the articledata format, which is emitted by the mediacrawl tool.

Running the backend API

Finally, you can run the backend API using uvicorn:

uvicorn --reload --host 0.0.0.0 storyweb.server:app

This will boot up the API server on port 8000 of the local host and enable hot reloads whenever the code changes during development.

Installing and running the frontend

Once you have the API running, you can install and run the development server for the frontend. StoryWeb uses React and Redux Toolkit internally and will use a Webpack dev server to dynamically re-build the frontend during development.

cd frontend/
npm install 
npm run dev

Remember that you need to run npm run dev whenever you do frontend development.

License and credits

Thanks to Heleen Emanuel and Tobias Sterbak for their advice on the design and implementation of StoryWeb.

This project receives financial support from the German Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung, BMBF) under the grant identifier 01IS22S42. The full responsibility for the content of this publication remains with its authors.

The software is licensed under the MIT license, see LICENSE in this repository.


Issues

Designing a disambiguation model

Progress

StoryWeb can now load articles, run them through spaCy NER and store the extracted entity tags (e.g. John Doe) in a database. In that database, each tag is identified per article, i.e. (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags (A and B are the same, unrelated, or have some semantic link - e.g. family). Once two tags are connected by a same link, they are considered a cluster, i.e. they become essentially the same node in the graph.

There is also a small UI that lets users make those links manually - either between different tags in the same article, or between tags with the same surface form across different articles.

[Screenshot: the manual tag-linking UI (2022-10-17)]

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing this whole thing manually is intensely annoying and not even good enough for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, then show the rest to the user and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this would leave a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that's explainable and where we can even re-compute clustering proposals as and while the user is providing input (active learning).

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input to some very simple machine learning model and then get both a set of judgements and a confidence score for each, so that I can a) decide the ones the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and then c) re-train the model with these additional judgements.

Some things I've pondered:

  • Using tf/idf on the tags to score down the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing that, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin") the co-occurrence ends up generic as well.
  • Maybe this could be a simple Bayes classifier (given A, B, C, what's the likelihood of this being X?)?
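To make the tf/idf idea concrete, here's a minimal, hypothetical sketch (not StoryWeb code; the toy corpus and all names are illustrative) that scores a candidate merge of two same-surface-form tags by the idf-weighted overlap of their co-occurring tags:

```python
import math

# Toy corpus: article_id -> set of tag labels found in that article.
# Purely illustrative data, not from the real database.
articles = {
    "a": {"John Doe", "Jane Doe", "Italy"},
    "b": {"John Doe", "Jane Doe", "Rome"},
    "c": {"John Doe", "MegaCorp Ltd.", "State Prosecutor"},
    "d": {"Italy", "Rome"},
}

def idf(tag):
    # Down-weight tags that appear in many articles: a near-ubiquitous
    # tag like "Russia" carries almost no disambiguation signal.
    df = sum(1 for tags in articles.values() if tag in tags)
    return math.log(len(articles) / (1 + df)) + 1.0

def merge_score(art_a, art_b, tag):
    # Evidence that `tag` in art_a and `tag` in art_b denote the same
    # entity: the summed idf of the co-occurring tags they share.
    shared = (articles[art_a] - {tag}) & (articles[art_b] - {tag})
    return sum(idf(t) for t in shared)

# "John Doe" in articles a and b shares the informative co-tag
# "Jane Doe"; "John Doe" in a and c shares nothing, so it scores 0.
print(merge_score("a", "b", "John Doe"))
print(merge_score("a", "c", "John Doe"))
```

A score like this could feed simple thresholds (auto-merge above one, queue for the user between two), or serve as one feature in the Bayes classifier idea above.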

Stuff I want to avoid

Implement "stories" user journey

As a user I want to:

  • Create a story in the tool
  • Add a bunch of article links to it and have them crawled and parsed
  • Select articles from bulk scrapes to add to the story
  • Run the linker loop on the most mentioned pairs of entities in the corpus
  • Export a simple graph of the story entities and relationships

Nav flows for app

Landing page options:

  • Browse/filter list of clusters (api: /clusters)
  • Find an article to build entities from (sort by entity count? /articles?sort=count:desc)

Article page:

  • Go through all co-tags and build a network (api: /tags?article=xxx )
    • How does this pick pairs? By descending mention count?

Cluster page:

  • Go through similar tags and integrate (api: /tags/xxx/similar ? coref-query from research)
    • Merge all
    • Merge all by type
  • Go through co-tags and build a network (api /tags?cluster=xxxx)
    • Trigger the linkloom (/linkloom?anchor=xxxx )

Stuck on SQL query for pair generation

I've been stuck on this SQL query for a day now, so I'm throwing it up here and would appreciate any advice others can give.

This is the problem: I want to generate a set of pairs of tags (named entities from articles), a and b, ordered by how many articles they co-occur in. This is relatively simple. However, there's a twist: the query should also check another table, link, to see if there's already an existing link between both tags. A link is a directed edge, i.e. two tags could be connected either a->b or b->a.

As a minimum, I want to filter out all pairs where a and b are already connected - but a better implementation would allow me to return unfiltered pairs, with the type of the link wherever a link exists.

Here's the basic pair-generating query, which works as expected:

SELECT
   l.cluster AS left_id,
   l.cluster_type AS left_type,
   l.cluster_label AS left_label,
   r.cluster AS right_id,
   r.cluster_type AS right_type,
   r.cluster_label AS right_label,
   count(distinct(l.article)) AS articles
FROM tag AS l, tag AS r
WHERE
   l.cluster > r.cluster
   AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;

CTE-based approach

Here's a sort of solution to the sub-problem of getting all the pairs where a link exists:

WITH links AS (
  SELECT
    greatest(link.source_cluster, link.target_cluster) AS big,
    least(link.source_cluster, link.target_cluster) AS smol,
    link.type AS type
  FROM link AS link
)
SELECT l.cluster AS left_id, l.cluster_type AS left_type, l.cluster_label AS left_label, r.cluster AS right_id, r.cluster_type AS right_type, r.cluster_label AS right_label,
  count(distinct(l.article)) AS articles,
  array_agg(distinct(links.type)) AS link_types
FROM tag AS r, tag AS l
  JOIN links ON l.cluster = links.big
WHERE
  l.cluster > r.cluster
  AND l.article = r.article
  AND r.cluster = links.smol
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC

But this doesn't handle showing unlinked pairs, or showing both linked and unlinked pairs. Maybe there's some way of sub-querying the links CTE in the main query that would handle non-linked pairs?
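One way to get both linked and unlinked pairs in a single query is to LEFT JOIN the normalized links CTE onto the pair query, so that link_types comes back NULL where no link exists. Here's a runnable sketch of that idea, using SQLite and the example data below with the cluster hashes shortened for readability (in PostgreSQL, swap MAX(a, b)/MIN(a, b) for greatest/least and GROUP_CONCAT for array_agg):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tag (
        cluster TEXT, article TEXT, cluster_type TEXT, cluster_label TEXT);
    CREATE TABLE link (
        source_cluster TEXT, target_cluster TEXT, type TEXT);
""")
con.executemany("INSERT INTO tag VALUES (?, ?, ?, ?)", [
    ("fffcc", "a", "LOC", "Russia"), ("fffcc", "b", "LOC", "Russia"),
    ("fff03", "a", "PER", "Vladimir Putin"), ("fff03", "b", "PER", "Vladimir Putin"),
    ("fff03", "d", "PER", "Vladimir Putin"),
    ("ff9be", "a", "LOC", "Moscow"), ("ff9be", "b", "LOC", "Moscow"),
    ("ffeeb", "a", "LOC", "Latvia"),
    ("ffd36", "a", "ORG", "OCCRP"), ("ffd36", "d", "ORG", "OCCRP"),
    ("fef53", "a", "ORG", "Moldindconbank"), ("fef53", "c", "ORG", "Moldindconbank"),
    ("fe855", "a", "ORG", "KGB"), ("fe855", "b", "ORG", "KGB"),
    ("fe855", "d", "ORG", "KGB"),
    ("fff14", "a", "ORG", "Moldova"), ("fff14", "c", "ORG", "Moldova"),
])
con.executemany("INSERT INTO link VALUES (?, ?, ?)", [
    ("fff03", "fffcc", "LOCATED"),
    ("fe855", "fff03", "EMPLOYER"),
    ("fff14", "fef53", "LOCATED"),
])

# LEFT JOIN keeps every pair; link_types is NULL where no link exists.
rows = con.execute("""
    WITH links AS (
        SELECT MAX(source_cluster, target_cluster) AS big,
               MIN(source_cluster, target_cluster) AS smol,
               type
        FROM link
    )
    SELECT l.cluster_label AS left_label,
           r.cluster_label AS right_label,
           COUNT(DISTINCT l.article) AS articles,
           GROUP_CONCAT(DISTINCT links.type) AS link_types
    FROM tag AS l
    JOIN tag AS r
      ON l.article = r.article AND l.cluster > r.cluster
    LEFT JOIN links
      ON links.big = l.cluster AND links.smol = r.cluster
    GROUP BY l.cluster, l.cluster_label, r.cluster, r.cluster_label
    ORDER BY articles DESC
""").fetchall()
for row in rows:
    print(row)
```

With the LEFT JOIN in place, restricting the result to not-yet-linked pairs is just a matter of adding WHERE links.type IS NULL, while leaving the filter off returns every pair annotated with its link type.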

Table definitions

CREATE TABLE tag (
    cluster character varying(40),
    article character varying(255),
    cluster_type character varying(10),
    cluster_label character varying
);

CREATE TABLE link (
    source_cluster character varying(40),
    target_cluster character varying(40),
    type character varying(255)
);

Example data

tag:

"cluster","cluster_type","cluster_label","article"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","a"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","a"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","d"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","a"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","b"
"ffeeb6ebcdc1fe87a3a2b84d707e17bd716dd20b","LOC","Latvia","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","d"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","a"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","c"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","a"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","b"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","d"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","a"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","c"

link:

"source_cluster","target_cluster","type"
"fff03a54c98cf079d562998d511ef2823d1f1863","fffcc580c020f689e206fddbc32777f0d0866f23","LOCATED"
"fe855a808f535efa417f6d082f5e5b6581fb6835","fff03a54c98cf079d562998d511ef2823d1f1863","EMPLOYER"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","fef5381215b1dfded414f5e60469ce32f3334fdd","LOCATED"

Create topical source set

  1. create a GU subset
  • filter GU articles on pillar/news & content type article
  2. further filter down the GU subset
  • somehow decide on a keyword, or topic, or label to further narrow down the set
  • run spaCy on the news subset
  • look at the textcategorizer pipeline component's results
