
StoryWeb

StoryWeb is a project that aims to extract networks of entities from journalistic reporting. The idea is to reverse-engineer stories into structured graphs of the people and companies involved, and to capture the relationships between them.

https://storyweb.opensanctions.org

StoryWeb consumes news articles as input data. Individual articles can be imported via the web interface, but articles can also be bulk-imported using the articledata micro-format. One producer of articledata files is mediacrawl, which can be used to crawl news websites and harvest all of their articles.

Installation

StoryWeb can be run as a Python web application from a developer's machine, or via a Docker container. We recommend using Docker for any production deployment, and as a quick means to get the application running if you don't intend to change its code.

Running in Docker mode

You can start up a Docker instance by running the following commands in an empty directory:

wget https://raw.githubusercontent.com/opensanctions/storyweb/main/docker-compose.yml
docker-compose up

This will make the storyweb user interface available on port 8000 of the host machine.

Running in development mode

Before installing storyweb on the host machine, we recommend setting up a Python virtual environment of some form (venv, virtualenv, etc.).

As a first step, let's install the spaCy models that are used to extract person and company names from the given articles:

pip install spacy
python3 -m spacy download en_core_web_sm
python3 -m spacy download de_core_news_sm
python3 -m spacy download xx_ent_wiki_sm
python3 -m spacy download ru_core_news_sm

Next, we'll install the application itself and its dependencies. Run the following command inside a git checkout of the storyweb repository:

pip install -e ".[dev]"

You also need to have a PostgreSQL server running somewhere (e.g. on the same machine, perhaps installed via Homebrew or apt). Create a fresh database on that server and point StoryWeb to it like this (adjust the username, password and hostname to match your setup; the hostname db below comes from the docker-compose configuration, so for a local server you would typically use localhost):

export STORYWEB_DB_URL=postgresql://storyweb:storyweb@db/storyweb
# Create the database tables:
storyweb init

You now have the application configured and you can explore the commands exposed by the storyweb command-line tool:

Usage: storyweb [OPTIONS] COMMAND [ARGS]...

  Storyweb CLI

Options:
  --help  Show this message and exit.

Commands:
  auto-merge  Automatically merge on fingerprints
  compute     Run backend computations
  graph       Export an entity graph
  import      Import articles into the DB
  import-url  Load a single news story by URL
  init        Initialize the database

The import command listed here will accept any data file in the articledata format, which is emitted by the mediacrawl tool.

Running the backend API

Finally, you can run the backend API using uvicorn:

uvicorn --reload --host 0.0.0.0 storyweb.server:app

This will boot up the API server on port 8000 of the local host and enable hot reloads whenever the code changes during development.

Installing and running the frontend

Once you have the API running, you can install and run the development server for the frontend. StoryWeb uses React and Redux Toolkit internally and will use a Webpack dev server to dynamically re-build the frontend during development.

cd frontend/
npm install 
npm run dev

Remember that you need to run npm run dev whenever you do frontend development.

License and credits

Thanks to Heleen Emanuel and Tobias Sterbak for their advice on the design and implementation of StoryWeb.

This project receives financial support from the German Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung, BMBF) under the grant identifier 01IS22S42. The full responsibility for the content of this publication remains with its authors.

The software is licensed under the MIT license, see LICENSE in this repository.


Issues

Designing a disambiguation model

Progress

StoryWeb can now load articles, run them through spaCy NER and store the extracted entity tags (e.g. John Doe) in a database. In that database, each tag is identified per article, i.e. (article_id, tag, count_of_mentions, ...). There's also a database model that describes a link between two tags (A and B are the same, unrelated, or have some semantic link - e.g. family). Once two tags are connected by a same link, they are considered a cluster, i.e. they become essentially the same node in the graph.

There is also a small UI that lets users make those links manually - either between different tags in the same article, or between tags with the same surface form across different articles.

[Screenshot: the manual tag-linking UI (2022-10-17)]

The rationale for keeping tags constrained to one article is disambiguation: John Doe in article A may refer to a different individual than John Doe in article B.

Challenge

While disambiguation between different tags with the same surface form (e.g. two John Does) is needed, doing this whole thing manually is intensely annoying and not even good enough for a prototype. What I'd like to do is find a way to auto-decide the unambiguous cases, then show the rest to the user and refine further merges based on their input.

In my mind, the core evidence for making these decisions is co-occurrence: John Doe A co-occurs with Jane Doe and Italy; John Doe B co-occurs with MegaCorp Ltd. and State Prosecutor. I'm aware that this would leave a lot of signal - the body of the documents - on the table. This goes back to wanting to build an interactive, human-in-the-loop system for better precision: keep it down to something that's explainable and where we can even re-compute clustering proposals as and while the user is providing input (active learning).

But I'm kind of stuck on this: how do I take the co-occurrence sets, model them into an input to some very simple machine learning model and then get both a set of judgements and a confidence score for each, so that I can a) decide the ones the system is confident about, b) show the most informative uncertain ones to a user to judge by hand, and then c) re-train the model with these additional judgements.

Some things I've pondered:

  • Using tf/idf on the tags to score down the super common ones (Russia in my corpus is not signal, it's a stopword). I tried implementing that, and it leads to more interesting co-occurrence patterns - but not impressively so. Especially for common names (like "Vladimir Putin") the co-occurrence ends up generic as well.
  • Maybe this could be a simple Bayes classifier (given A, B, C, what's the likelihood of this being X?)?
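To make the tf/idf idea concrete, here's a minimal, hypothetical sketch (not StoryWeb code; the toy corpus and all names are illustrative) that scores a candidate merge of two same-surface-form tags by the idf-weighted overlap of their co-occurring tags:

```python
import math

# Toy corpus: article_id -> set of tag labels found in that article.
# Purely illustrative data, not from the real database.
articles = {
    "a": {"John Doe", "Jane Doe", "Italy"},
    "b": {"John Doe", "Jane Doe", "Rome"},
    "c": {"John Doe", "MegaCorp Ltd.", "State Prosecutor"},
    "d": {"Italy", "Rome"},
}

def idf(tag):
    # Down-weight tags that appear in many articles: a near-ubiquitous
    # tag like "Russia" carries almost no disambiguation signal.
    df = sum(1 for tags in articles.values() if tag in tags)
    return math.log(len(articles) / (1 + df)) + 1.0

def merge_score(art_a, art_b, tag):
    # Evidence that `tag` in art_a and `tag` in art_b denote the same
    # entity: the summed idf of the co-occurring tags they share.
    shared = (articles[art_a] - {tag}) & (articles[art_b] - {tag})
    return sum(idf(t) for t in shared)

# "John Doe" in articles a and b shares the informative co-tag
# "Jane Doe"; "John Doe" in a and c shares nothing, so it scores 0.
print(merge_score("a", "b", "John Doe"))
print(merge_score("a", "c", "John Doe"))
```

A score like this could feed simple thresholds (auto-merge above one, queue for the user between two), or serve as one feature in the Bayes classifier idea above.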

Stuff I want to avoid

Implement "stories" user journey

As a user I want to:

  • Create a story in the tool
  • Add a bunch of article links to it and have them crawled and parsed
  • Select articles from bulk scrapes to add to the story
  • Run the linker loop on the most mentioned pairs of entities in the corpus
  • Export a simple graph of the story entities and relationships

Nav flows for app

Landing page options:

  • Browse/filter list of clusters (api: /clusters)
  • Find an article to build entities from (sort by entity count? /articles?sort=count:desc)

Article page:

  • Go through all co-tags and build a network (api: /tags?article=xxx )
    • How does this pick pairs? By descending mention count?

Cluster page:

  • Go through similar tags and integrate (api: /tags/xxx/similar ? coref-query from research)
    • Merge all
    • Merge all by type
  • Go through co-tags and build a network (api /tags?cluster=xxxx)
    • Trigger the linkloom (/linkloom?anchor=xxxx )

Stuck on SQL query for pair generation

I've been stuck on this SQL query for a day now, so I'm throwing it up here and would appreciate any advice others can give.

This is the problem: I want to generate a set of pairs of tags (named entities from articles), a and b, ordered by how many articles they co-occur in. This is relatively simple. However, there's a twist: the query should also check another table, link, to see if there's already an existing link between both tags. A link is a directed edge, i.e. two tags could be connected either a->b or b->a.

As a minimum, I want to filter out all pairs where a and b are already connected - but a better implementation would allow me to return unfiltered pairs, with the type of the link wherever a link exists.

Here's the basic pair-generating query, which works as expected:

SELECT
   l.cluster AS left_id,
   l.cluster_type AS left_type,
   l.cluster_label AS left_label,
   r.cluster AS right_id,
   r.cluster_type AS right_type,
   r.cluster_label AS right_label,
   count(distinct(l.article)) AS articles
FROM tag AS l, tag AS r
WHERE
   l.cluster > r.cluster
   AND l.article = r.article
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC;

CTE-based approach

Here's a sort of solution to the sub-problem of getting all the pairs where a link exists:

WITH links AS (
  SELECT
    greatest(link.source_cluster, link.target_cluster) AS big,
    least(link.source_cluster, link.target_cluster) AS smol,
    link.type AS type
  FROM link AS link
)
SELECT l.cluster AS left_id, l.cluster_type AS left_type, l.cluster_label AS left_label, r.cluster AS right_id, r.cluster_type AS right_type, r.cluster_label AS right_label,
  count(distinct(l.article)) AS articles,
  array_agg(distinct(links.type)) AS link_types
FROM tag AS r, tag AS l
  JOIN links ON l.cluster = links.big
WHERE
  l.cluster > r.cluster
  AND l.article = r.article
  AND r.cluster = links.smol
GROUP BY l.cluster, l.cluster_label, l.cluster_type, r.cluster, r.cluster_label, r.cluster_type
ORDER BY count(distinct(l.article)) DESC

But this doesn't handle showing unlinked pairs, or showing both linked and unlinked pairs. Maybe there's some way of sub-querying the links CTE in the main query that would handle non-linked pairs?
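One way to get both linked and unlinked pairs in a single query is to LEFT JOIN the normalized links CTE onto the pair query, so that link_types comes back NULL where no link exists. Here's a runnable sketch of that idea, using SQLite and the example data below with the cluster hashes shortened for readability (in PostgreSQL, swap MAX(a, b)/MIN(a, b) for greatest/least and GROUP_CONCAT for array_agg):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE tag (
        cluster TEXT, article TEXT, cluster_type TEXT, cluster_label TEXT);
    CREATE TABLE link (
        source_cluster TEXT, target_cluster TEXT, type TEXT);
""")
con.executemany("INSERT INTO tag VALUES (?, ?, ?, ?)", [
    ("fffcc", "a", "LOC", "Russia"), ("fffcc", "b", "LOC", "Russia"),
    ("fff03", "a", "PER", "Vladimir Putin"), ("fff03", "b", "PER", "Vladimir Putin"),
    ("fff03", "d", "PER", "Vladimir Putin"),
    ("ff9be", "a", "LOC", "Moscow"), ("ff9be", "b", "LOC", "Moscow"),
    ("ffeeb", "a", "LOC", "Latvia"),
    ("ffd36", "a", "ORG", "OCCRP"), ("ffd36", "d", "ORG", "OCCRP"),
    ("fef53", "a", "ORG", "Moldindconbank"), ("fef53", "c", "ORG", "Moldindconbank"),
    ("fe855", "a", "ORG", "KGB"), ("fe855", "b", "ORG", "KGB"),
    ("fe855", "d", "ORG", "KGB"),
    ("fff14", "a", "ORG", "Moldova"), ("fff14", "c", "ORG", "Moldova"),
])
con.executemany("INSERT INTO link VALUES (?, ?, ?)", [
    ("fff03", "fffcc", "LOCATED"),
    ("fe855", "fff03", "EMPLOYER"),
    ("fff14", "fef53", "LOCATED"),
])

# LEFT JOIN keeps every pair; link_types is NULL where no link exists.
rows = con.execute("""
    WITH links AS (
        SELECT MAX(source_cluster, target_cluster) AS big,
               MIN(source_cluster, target_cluster) AS smol,
               type
        FROM link
    )
    SELECT l.cluster_label AS left_label,
           r.cluster_label AS right_label,
           COUNT(DISTINCT l.article) AS articles,
           GROUP_CONCAT(DISTINCT links.type) AS link_types
    FROM tag AS l
    JOIN tag AS r
      ON l.article = r.article AND l.cluster > r.cluster
    LEFT JOIN links
      ON links.big = l.cluster AND links.smol = r.cluster
    GROUP BY l.cluster, l.cluster_label, r.cluster, r.cluster_label
    ORDER BY articles DESC
""").fetchall()
for row in rows:
    print(row)
```

With the LEFT JOIN in place, restricting the result to not-yet-linked pairs is just a matter of adding WHERE links.type IS NULL, while leaving the filter off returns every pair annotated with its link type.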

Table definitions

CREATE TABLE tag (
    cluster character varying(40),
    article character varying(255),
    cluster_type character varying(10),
    cluster_label character varying
);

CREATE TABLE link (
    source_cluster character varying(40),
    target_cluster character varying(40),
    type character varying(255)
);

Example data

tag:

"cluster","cluster_type","cluster_label","article"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","a"
"fffcc580c020f689e206fddbc32777f0d0866f23","LOC","Russia","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","a"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","b"
"fff03a54c98cf079d562998d511ef2823d1f1863","PER","Vladimir Putin","d"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","a"
"ff9be8adf69cddee1b910e592b119478388e2194","LOC","Moscow","b"
"ffeeb6ebcdc1fe87a3a2b84d707e17bd716dd20b","LOC","Latvia","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","a"
"ffd364472a999c3d1001f5910398a53997ae0afe","ORG","OCCRP","d"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","a"
"fef5381215b1dfded414f5e60469ce32f3334fdd","ORG","Moldindconbank","c"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","a"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","b"
"fe855a808f535efa417f6d082f5e5b6581fb6835","ORG","KGB","d"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","a"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","ORG","Moldova","c"

link:

"source_cluster","target_cluster","type"
"fff03a54c98cf079d562998d511ef2823d1f1863","fffcc580c020f689e206fddbc32777f0d0866f23","LOCATED"
"fe855a808f535efa417f6d082f5e5b6581fb6835","fff03a54c98cf079d562998d511ef2823d1f1863","EMPLOYER"
"fff14a3c6d8f6d04f4a7f224b043380bb45cb57a","fef5381215b1dfded414f5e60469ce32f3334fdd","LOCATED"

Create topical source set

  1. create a GU subset
  • filter GU articles on pillar/news & content type article
  2. further filter down the GU subset
  • somehow decide on a keyword, or topic, or label to further narrow down the set
  • run spaCy on the news subset
  • look at the textcategorizer pipeline component's results
