
🥔 POTATO

POTATO is a human-in-the-loop XAI framework for extracting and evaluating interpretable graph features for any classification problem in Natural Language Processing.

Built systems

To get started with rule systems, we provide rule-based features prebuilt with POTATO on different datasets (e.g. for our paper Offensive text detection on English Twitter with deep learning models and rule-based systems for the HASOC2021 shared task). If you are interested, see the features/ directory for more info!

Install and Quick Start

Check out our quick demonstration (~2 min) video about the tool: https://youtu.be/PkQ71wUSeNU

There is a longer version with a detailed description of the method and the background research (~1 hour): https://youtu.be/6R_V1WfIjsU

Setup

The tool depends heavily on the tuw-nlp repository. You can install tuw-nlp with pip:

pip install tuw-nlp

Then follow its instructions to set up the package.

Then install POTATO from pip:

pip install xpotato

Or you can install it from source:

pip install -e .
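Installing from source assumes a local checkout of the repository; a typical sequence (assuming the code lives at github.com/adaamko/POTATO, adjust the URL if you use a fork) is:

git clone https://github.com/adaamko/POTATO.git
cd POTATO
pip install -e .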

Usage

  • POTATO is an IE tool that works on graphs; currently we support three graph types: AMR, UD and Fourlang.

  • The examples in this README are parsed into UD graphs; POTATO can also build fourlang semantic graphs, in which case make sure to follow the instructions in the tuw_nlp repo.

  • If you are interested in AMR graphs, go to the hasoc folder to get started with rule systems prebuilt with POTATO on the HASOC dataset (we also presented a paper, Offensive text detection on English Twitter with deep learning models and rule-based systems, for the HASOC2021 shared task).

  • We also provide experiments on the CrowdTruth medical relation extraction datasets with UD graphs; go to the crowdtruth folder for more info!

  • POTATO can also handle unlabeled or partially labeled data; see advanced mode to learn more.

For complete working examples, see the notebooks/ folder, which contains experiments on HASOC and on the SemEval relation extraction dataset.

First import packages from potato:

from xpotato.dataset.dataset import Dataset
from xpotato.models.trainer import GraphTrainer

We demonstrate POTATO's capabilities with a few sentences picked manually from the dataset.

Note that we replaced the two entities in question with XXX and YYY.

sentences = [("Governments and industries in nations around the world are pouring XXX into YYY.", "Entity-Destination(e1,e2)"),
            ("The scientists poured XXX into pint YYY.", "Entity-Destination(e1,e2)"),
            ("The suspect pushed the XXX into a deep YYY.", "Entity-Destination(e1,e2)"),
            ("The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports.", "Other"),
            ("The entity1 to buy papers is pushed into the next entity2.", "Entity-Destination(e1,e2)"),
            ("An unnamed XXX was pushed into the YYY.", "Entity-Destination(e1,e2)"),
            ("Since then, numerous independent feature XXX have journeyed into YYY.", "Other"),
            ("For some reason, the XXX was blinded from his own YYY about the incommensurability of time.", "Other"),
            ("Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.", "Other"),
            ("Olympics have already poured one XXX into the YYY.", "Entity-Destination(e1,e2)"),
            ("After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.", "Entity-Destination(e1,e2)"),
            ("I placed the XXX in a natural YYY, at the base of a part of the fallen arch.", "Entity-Destination(e1,e2)"),
            ("The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.", "Other"),
            ("The XXX leaked from every conceivable YYY.", "Other"),
            ("The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse.", "Entity-Destination(e1,e2)"),
            ("The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.", "Other"),
            ("Gaza XXX recover from three YYY of war.", "Other"),
            ("This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.", "Other")]

Initialize the dataset and provide a label encoding, then parse the sentences into graphs. Currently we support three graph formats: ud, fourlang and amr. Also specify the language you want to parse; currently we support English (en) and German (de).

dataset = Dataset(sentences, label_vocab={"Other":0, "Entity-Destination(e1,e2)": 1}, lang="en")
dataset.set_graphs(dataset.parse_graphs(graph_format="ud"))

Check the dataset:

df = dataset.to_dataframe()

We can also inspect any of the parsed graphs:

from xpotato.models.utils import to_dot
from graphviz import Source

Source(to_dot(df.iloc[0].graph))

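The call above displays the graph inline in a notebook. If you prefer to save the visualization to a file, graphviz can render it as well; a minimal sketch (the output filename is just an example):

from graphviz import Source
from xpotato.models.utils import to_dot

# Write the first parsed graph to sentence_0_graph.png (arbitrary example name).
Source(to_dot(df.iloc[0].graph)).render("sentence_0_graph", format="png")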

Rules

If the dataset is prepared and the graphs are parsed, we can write rules to match labels. We can write rules either manually or extract them automatically (POTATO also provides a frontend that tries to do both).

The simplest rule would be just a node in the graph:

# The syntax of the rules is List[List[rules that we want to match], List[rules that shouldn't be in the matched graphs], Label of the rule]
rule_to_match = [[["(u_1 / into)"], [], "Entity-Destination(e1,e2)"]]

Init the rule matcher:

from xpotato.graph_extractor.extract import FeatureEvaluator
evaluator = FeatureEvaluator()

Match the rules in the dataset:

#match single feature
df = dataset.to_dataframe()
evaluator.match_features(df, rule_to_match)
Sentence Predicted label Matched rule
0 Governments and industries in nations around the world are pouring XXX into YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
1 The scientists poured XXX into pint YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
2 The suspect pushed the XXX into a deep YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
3 The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
4 The entity1 to buy papers is pushed into the next entity2. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
5 An unnamed XXX was pushed into the YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
6 Since then, numerous independent feature XXX have journeyed into YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
7 For some reason, the XXX was blinded from his own YYY about the incommensurability of time.
8 Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.
9 Olympics have already poured one XXX into the YYY. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
10 After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.
11 I placed the XXX in a natural YYY, at the base of a part of the fallen arch.
12 The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.
13 The XXX leaked from every conceivable YYY.
14 The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse. Entity-Destination(e1,e2) [['(u_1 / into)'], [], 'Entity-Destination(e1,e2)']
15 The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.
16 Gaza XXX recover from three YYY of war.
17 This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.

You can see in the dataset that the rules only matched the instances where the "into" node was present.

One of the core features of our tool is that we are also able to match subgraphs. To describe a graph, we use the PENMAN notation.

E.g. the string (u_1 / into :1 (u_3 / pour)) would describe a graph with two nodes ("into" and "pour") and a single directed edge with the label "1" between them.
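To make the structure explicit, the same two-node graph can be built by hand; the sketch below uses networkx purely as an illustration of what the PENMAN string encodes, not necessarily how POTATO represents graphs internally:

import networkx as nx

# "(u_1 / into :1 (u_3 / pour))": two labeled nodes and one directed edge.
g = nx.DiGraph()
g.add_node("u_1", name="into")
g.add_node("u_3", name="pour")
g.add_edge("u_1", "u_3", label="1")  # edge "1" from "into" to "pour"

print(list(g.edges(data=True)))  # [('u_1', 'u_3', {'label': '1'})]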

#match a simple graph feature
evaluator.match_features(df, [[["(u_1 / into :1 (u_2 / pour) :2 (u_3 / YYY))"], [], "Entity-Destination(e1,e2)"]])

Describing a subgraph with the string "(u_1 / into :1 (u_2 / pour) :2 (u_3 / YYY))" returns only three examples instead of the nine we got with a single node as the feature.

Sentence Predicted label Matched rule
0 Governments and industries in nations around the world are pouring XXX into YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / pour) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
1 The scientists poured XXX into pint YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / pour) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
2 The suspect pushed the XXX into a deep YYY.
3 The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports.
4 The entity1 to buy papers is pushed into the next entity2.
5 An unnamed XXX was pushed into the YYY.
6 Since then, numerous independent feature XXX have journeyed into YYY.
7 For some reason, the XXX was blinded from his own YYY about the incommensurability of time.
8 Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.
9 Olympics have already poured one XXX into the YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / pour) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
10 After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.
11 I placed the XXX in a natural YYY, at the base of a part of the fallen arch.
12 The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.
13 The XXX leaked from every conceivable YYY.
14 The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse.
15 The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.
16 Gaza XXX recover from three YYY of war.
17 This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.

We can also add negated features that we don't want to match (e.g. the rule below won't match rows where 'pour' is present, such as the first one):

#match a simple graph feature
evaluator.match_features(df, [[["(u_1 / into :2 (u_3 / YYY))"], ["(u_2 / pour)"], "Entity-Destination(e1,e2)"]])
Sentence Predicted label Matched rule
0 Governments and industries in nations around the world are pouring XXX into YYY.
1 The scientists poured XXX into pint YYY.
2 The suspect pushed the XXX into a deep YYY. Entity-Destination(e1,e2) [['(u_1 / into :2 (u_3 / YYY))'], ['(u_2 / pour)'], 'Entity-Destination(e1,e2)']
3 The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports. Entity-Destination(e1,e2) [['(u_1 / into :2 (u_3 / YYY))'], ['(u_2 / pour)'], 'Entity-Destination(e1,e2)']
4 The entity1 to buy papers is pushed into the next entity2.
5 An unnamed XXX was pushed into the YYY. Entity-Destination(e1,e2) [['(u_1 / into :2 (u_3 / YYY))'], ['(u_2 / pour)'], 'Entity-Destination(e1,e2)']
6 Since then, numerous independent feature XXX have journeyed into YYY. Entity-Destination(e1,e2) [['(u_1 / into :2 (u_3 / YYY))'], ['(u_2 / pour)'], 'Entity-Destination(e1,e2)']
7 For some reason, the XXX was blinded from his own YYY about the incommensurability of time.
8 Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.
9 Olympics have already poured one XXX into the YYY.
10 After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.
11 I placed the XXX in a natural YYY, at the base of a part of the fallen arch.
12 The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.
13 The XXX leaked from every conceivable YYY.
14 The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse.
15 The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.
16 Gaza XXX recover from three YYY of war.
17 This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.

If we don't want to specify exact nodes, regular expressions can be used in place of node and edge names:

#regex can be used to match any node (this will match instances where 'into' is connected to any node with '1' edge)
evaluator.match_features(df, [[["(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))"], [], "Entity-Destination(e1,e2)"]])
Sentence Predicted label Matched rule
0 Governments and industries in nations around the world are pouring XXX into YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
1 The scientists poured XXX into pint YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
2 The suspect pushed the XXX into a deep YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
3 The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
4 The entity1 to buy papers is pushed into the next entity2.
5 An unnamed XXX was pushed into the YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
6 Since then, numerous independent feature XXX have journeyed into YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
7 For some reason, the XXX was blinded from his own YYY about the incommensurability of time.
8 Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.
9 Olympics have already poured one XXX into the YYY. Entity-Destination(e1,e2) [['(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))'], [], 'Entity-Destination(e1,e2)']
10 After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.
11 I placed the XXX in a natural YYY, at the base of a part of the fallen arch.
12 The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.
13 The XXX leaked from every conceivable YYY.
14 The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse.
15 The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.
16 Gaza XXX recover from three YYY of war.
17 This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.

We can also train regex rules from training data; this automatically replaces the '.*' regex with nodes that are statistically 'good enough' based on the provided dataframe.

evaluator.train_feature("Entity-Destination(e1,e2)", "(u_1 / into :1 (u_2 / .*) :2 (u_3 / YYY))", df)

This returns '(u_1 / into :1 (u_2 / push|pour) :2 (u_3 / YYY))', i.e. '.*' has been replaced with push|pour.
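The returned rule string can be plugged straight back into the matcher to check what it covers, using the same API as above:

trained_rule = "(u_1 / into :1 (u_2 / push|pour) :2 (u_3 / YYY))"
evaluator.match_features(df, [[[trained_rule], [], "Entity-Destination(e1,e2)"]])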

Learning rules

To extract rules automatically, train a model on the dataset's graph features and rank the features based on relevance:

df = dataset.to_dataframe()
trainer = GraphTrainer(df)
#extract features
features = trainer.prepare_and_train()

from xpotato.dataset.utils import save_dataframe
from sklearn.model_selection import train_test_split

train, val = train_test_split(df, test_size=0.2, random_state=1234)

#save train and validation, this is important for the frontend to work
save_dataframe(train, 'train.tsv')
save_dataframe(val, 'val.tsv')

import json

#also save the ranked features
with open("features.json", "w+") as f:
    json.dump(features, f)

You can also save the parsed graphs for evaluation or for caching:

import pickle
with open("graphs.pickle", "wb") as f:
    pickle.dump(val.graph, f)
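When you need the cached graphs later, load them back instead of re-parsing; a minimal sketch that re-reads the pickle and inspects one of the graphs:

import pickle
from graphviz import Source
from xpotato.models.utils import to_dot

# Load the graphs saved above (the val.graph column, one graph per row).
with open("graphs.pickle", "rb") as f:
    graphs = pickle.load(f)

Source(to_dot(graphs.iloc[0]))  # inspect the first cached graph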

Frontend

If the DataFrame is ready with the parsed graphs, the UI can be started to inspect and modify the extracted rules. The frontend is a streamlit app; the simplest way of starting it is the following (the training and the validation datasets must be provided):

streamlit run frontend/app.py -- -t notebooks/train.tsv -v notebooks/val.tsv -g ud

It can also be started with the extracted features:

streamlit run frontend/app.py -- -t notebooks/train.tsv -v notebooks/val.tsv -g ud -sr notebooks/features.json

If you have already used the UI and extracted features manually, you can load them by running:

streamlit run frontend/app.py -- -t notebooks/train.tsv -v notebooks/val.tsv -g ud -sr notebooks/features.json -hr notebooks/manual_features.json

Advanced mode

If labels are missing or only partially provided, the frontend can also be started in advanced mode, where the user annotates a few examples at the start and the system then gradually offers rules based on the provided examples.

A dataset without labels can be initialized with:

sentences = [("Governments and industries in nations around the world are pouring XXX into YYY.", ""),
            ("The scientists poured XXX into pint YYY.", ""),
            ("The suspect pushed the XXX into a deep YYY.", ""),
            ("The Nepalese government sets up a XXX to inquire into the alleged YYY of diplomatic passports.", ""),
            ("The entity1 to buy papers is pushed into the next entity2.", ""),
            ("An unnamed XXX was pushed into the YYY.", ""),
            ("Since then, numerous independent feature XXX have journeyed into YYY.", ""),
            ("For some reason, the XXX was blinded from his own YYY about the incommensurability of time.", ""),
            ("Sparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week.", ""),
            ("Olympics have already poured one XXX into the YYY.", ""),
            ("After wrapping him in a light blanket, they placed the XXX in the YYY his father had carved for him.", ""),
            ("I placed the XXX in a natural YYY, at the base of a part of the fallen arch.", ""),
            ("The XXX was delivered from the YYY of Lincoln Memorial on August 28, 1963 as part of his famous March on Washington.", ""),
            ("The XXX leaked from every conceivable YYY.", ""),
            ("The scientists placed the XXX in a tiny YYY which gets channelled into cancer cells, and is then unpacked with a laser impulse.", ""),
            ("The level surface closest to the MSS, known as the XXX, departs from an YYY by about 100 m in each direction.", ""),
            ("Gaza XXX recover from three YYY of war.", ""),
            ("This latest XXX from the animation YYY at Pixar is beautiful, masterly, inspired - and delivers a powerful ecological message.", "")]

Then, the frontend can be started:

streamlit run frontend/app.py -- -t notebooks/unsupervised_dataset.tsv -g ud -m advanced

Once the frontend starts up and you have defined the labels, you are presented with the annotation interface. You can search elements by clicking on the appropriate column name and applying the desired filter. You can annotate instances by ticking the checkbox at the beginning of the line; multiple checkboxes can be ticked at a time. Once you have selected the utterances you want to annotate, click the Annotate button. The annotated samples will appear in the lower table. You can clear the annotation of certain elements by selecting them in the second table and clicking Clear annotation.

Once you have some annotated data, you can train rules by clicking the Train! button. If you only have a few samples, it is recommended to set Rank features based on accuracy to True. You will get an interface similar to the supervised mode, where you can generate rule suggestions and write your own rules as usual. Once you are satisfied with the rules, select each of them and click Annotate based on selected. This process might take a while if you are working with large data. All rule matches will then be marked in the first and the second tables, and you can order the tables by each column to make checking easier. You will have to accept the annotations generated this way manually for them to appear in the second table.

  • You can read about the use of the advanced mode in the docs

Evaluate

If you have the features ready and you want to evaluate them on a test set, you can run:

python scripts/evaluate.py -t ud -f notebooks/features.json -d notebooks/val.tsv

The result will be a CSV file with the predicted labels and the matched rules.

Service

If your extracted features are ready and you want to use our package in production for inference (generating predictions for sentences), we also provide a REST API built on POTATO (based on FastAPI).

First install FastAPI and Uvicorn:

pip install fastapi
pip install "uvicorn[standard]"

To start the service, you should set the language, the graph format and the feature file for the service. This can be done through environment variables.

Example:

export FEATURE_PATH=/home/adaamko/projects/POTATO/features/semeval/test_features.json
export GRAPH_FORMAT=ud
export LANG=en

Then, start the REST API:

python services/main.py

It will start a service running on localhost on port 8000 (it will also initialize the correct models).

Then you can use any client to make POST requests:

curl -X POST localhost:8000 -H 'Content-Type: application/json' -d '{"text":"The suspect pushed the XXX into a deep YYY.\nSparky Anderson is making progress in his XXX from YYY and could return to managing the Detroit Tigers within a week."}'

The answer will be a list with the predicted labels (if none of the rules match, it will return "NONE"):

["Entity-Destination(e1,e2)","NONE"]

The streamlit frontend also has an inference mode, in which predictions can be generated with the implemented rule system. It can be started with:

streamlit run frontend/app.py -- -hr features/semeval/test_features.json -m inference

Contributing

We welcome all contributions! Please fork this repository and create a branch for your modifications. We suggest getting in touch with us first by opening an issue or by writing an email to Adam Kovacs or Gabor Recski at [email protected].

Citing

If you use the library, please cite our paper published in CIKM 2022:

@inproceedings{Kovacs:2022,
  author = {Kov\'{a}cs, \'{A}d\'{a}m and G\'{e}mes, Kinga and Ikl\'{o}di, Eszter and Recski, G\'{a}bor},
  title = {POTATO: ExPlainable InfOrmation ExTrAcTion FramewOrk},
  year = {2022},
  isbn = {9781450392365},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3511808.3557196},
  doi = {10.1145/3511808.3557196},
  booktitle = {Proceedings of the 31st ACM International Conference on Information \& Knowledge Management},
  pages = {4897--4901},
  numpages = {5},
  keywords = {explainability, explainable, hitl},
  location = {Atlanta, GA, USA},
  series = {CIKM '22}
}

License

MIT license

Contributors

adaamko, eszti, gkinga, kuma-rtin, recski


Issues

Unsupervised mode

A third panel listing all sentences. The user can "label" them by selecting positive examples, and rules can be trained based on those examples. The user can then see which other sentences these rules would trigger and can label those sentences as positive or negative examples (thereby increasing the amount of labeled data one sentence at a time). They can also choose to accept rules, reject rules, or "train" (i.e. refine) them, refining the rule system as they go along.

New scikit-criteria 0.5

Hi, I am the author of scikit-criteria, and I just published a new and improved version of the project (but it is not backward compatible).

So I am pasting an equivalent snippet to replace the current functionality in POTATO.

The import lines

from skcriteria import Data
from skcriteria.madm import simple

Must be replaced with

import skcriteria as skc
from skcriteria.madm import simple
from skcriteria.pipeline import mkpipe
from skcriteria.preprocessing import invert_objectives, scalers

Then the code

criteria_data = Data(
    stat_opt,
    [min, max],
    anames=val,
    cnames=stat_opt.columns,
    weights=[30, 70],
)
dm = simple.WeightedSum(mnorm="sum")
dec = dm.decide(criteria_data)

must be replaced with

criteria_data = skc.mkdm(
    matrix=stat_opt,
    objectives=[min, max],
    alternatives=val,
    criteria=stat_opt.columns,
    weights=[30, 70]
)

dm = mkpipe(
    invert_objectives.MinimizeToMaximize(), 
    scalers.SumScaler(target="both"), 
    simple.WeightedSumModel()
)

dec = dm.evaluate(criteria_data)

Also: the previous versions of skcriteria have a conceptual bug. In many cases it is incorrect to normalize before inverting the objectives, so here I split the logic; the steps must be assembled manually in a pipeline.

how hard to use this with a very large corpus?

Hi,

I'm really excited by this. Could I ask if it might be easy to use this with a very large corpus --- e.g., something on the scale of Common Crawl? Is it going to be easy to parallelize things on a SLURM cluster once we've extracted some rules we're happy with from a smaller training set? (I'm guessing I can just run the prediction mode with different machines over different files, but I wanted to ask just in case.)

Building rules for homophobia detection - issues

As we agreed in our previous meeting last Wednesday, I have compiled a few examples of rules that either (a) are hard for me to formulate or (b) don't work as expected (be it a problem with the parsed graphs, a problem with the formulation of the rule or something else).

A. Doesn't work as expected: "I am gay" and related structures

I have noticed that the parsed UD graphs seem to be a bit off sometimes. I want to exclude tweets that include some meaning of "I am gay" (similarly for the words dyke, lesbian, queer etc.), and therefore I formulated the following exception to a rule: (u_1 / gay :nsubj (u_4 / I)). Unfortunately some tweets that include this were still matched:

  1. "im gay i love fathers dm for dilfs"

Here the parser did not match "I" as a subject in the first part of the sentence.

  1. "when i say i only like seven men i mean i only love seven men bc im fucking gay"

Same as in 1.

  1. "I ' m super gay fuck"

This graph is not correct. The peculiar spelling of I'm and the f*** at the end of the sentence might have thrown the parser off.

  1. "yea im def a dyke cause pretty boys r so ugly to me n also all the rest of them"

All of these seem to be problems with "internet" spelling and/or internet slang.

B. Doesn't work as expected: Plural forms

UD graphs don't capture the number of nouns, so it is not possible to match only the plural forms of nouns. In my case this led to the following problem: the word queer in its plural form queers is often used in hate speech against LGBTQ+ people in the dataset. If I were able to match the plural form of the word queer, this would be a very easy and comprehensible rule. See the following examples:

  1. "and all this butthurt faggotry just makes me want to join twp to spite you queers"

  1. "queers in olympics are the same in the military a huge fucking mistake"

  1. "there shouldnt be an arguement against this but once the queers got a voice it was prolly too late"

I solved this by formulating the rule as follows: (u_1 / .* :obj|nsubj (u_2251 / queer)), because this plural form of queer mainly refers to the group of queer people and is therefore most often present as an object or a subject. Unfortunately, the following sentence is then matched (which wouldn't happen if I were able to make the rule apply only to plural forms):
"so then queer is not queer how about homosexuality"...

C. Formulation problem: words in the same sentence/clause

I would like to match tweets that include the word gay and some negative words like f***, sh**, kill, die etc. I want to do this in order to be able to capture all the different meanings these negative words can have (e.g. f*** gays, but also f***ing gays). The problem is that this rule only seems to be valid when gay and the negative word are in the same sentence or clause (which suggests that the negative word is somehow related to the word gay). Here are some examples that I would NOT want to match:

  1. "i feel like such a yt gay when i listen to kim petras but fuck i cant resist im so sorry"

  1. "it okay to be white black straight or gay but it is not okay for you to stop at a yellow light when we both could have fucking made it"

  1. "i miss the old queer eye where they d turn the straight guy into a slutty club gay with snake skin boots and frosted tips and instruct him go say shit like yo it fresh"

  1. "im the gay one nigger you the one fucking my ass"

This rule is really just an idea, and I am no longer sure that it is a good one. If its formulation is not too complicated, I would nevertheless like to try it out and see the results.

very good keyword not found by "suggest rules"

For "WidmungUndZwechbestimmung", the suggest rules function returns patterns with low precision that have nothing to do with the label, but it doesn't find something like this:
(u_514 / gewidmet): Precision: 1.000, Recall: 0.127, Fscore: 0.226, True positives: 20, False positives: 0

pattern matching bug (caused by self-loops?)

This pattern on the BRISE dataset, for the label GebauedeHoeheMax:

(u_482 / betragen :2 (u_96 / m :0 (u_734 / .* :0 (u_127 / hoechstens))) :1 (u_123 / Gebaeudehoehe))

Doesn't match many sentences, such as no. 283, 288 when using the "all" and "gold" datasets in the directory /home/grecski/potato_debug_data/

Can this help with mining names (of people or concepts or groups or ...)?

Sorry about all the questions. Suppose I want to find sentences in a corpus that involve the relation X aka Y, and variants thereof, and where I'm pre-specifying what Xes I'm looking for. It's clear that this software can be used to find those sentences. But can it also help with getting what that Y in the sentences is? That is, suppose I have sentences like the following:

  1. Donald Trump, aka @TheRealDonald, is a former president of the United States.
  2. Donald Trump (or, as we call him, the orange cheeto) sucks.

Given how this system is already parsing the sentences, can the information from the parse be used to help mine alternative names like @TheRealDonald and the orange cheeto? Or, is there any way I can use this system to help with that? Or would I just have to take the predictions from this system and do further processing (e.g. noun chunking) by myself?

Tests for Matchers

I think it would be great if there were test cases for the matchers, as we are beginning to support a lot of things now. There are already some tests for rules in the test folder; we could add more there, especially when we introduce new mechanisms.

Thoughts?

Frontend UI improvement

  • Set aside featurizing and training to optimize training time
  • option to select a subset of rules to evaluate (default is that all rules are checked)
  • "suggest new rules" should add another 5 or 10, and should be on the left side
  • option to "keep" selected rules and rerank features based on remaining false negatives

error in training when using all classes of brise data

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0.0
Traceback:
File "/home/recski/miniconda3/envs/brise/lib/python3.7/site-packages/streamlit/script_runner.py", line 354, in _run_script
    exec(code, module.__dict__)
File "/home/recski/projects/POTATO/frontend/app.py", line 1248, in <module>
    main(args)
File "/home/recski/projects/POTATO/frontend/app.py", line 1230, in main
    evaluator, data, val_data, graph_format, feature_path, hand_made_rules
File "/home/recski/projects/POTATO/frontend/app.py", line 270, in simple_mode
    data, st.session_state.min_edge
File "/home/recski/projects/POTATO/frontend/utils.py", line 254, in train_df
    features = trainer.prepare_and_train(min_edge=min_edge, rank=rank)
File "/home/recski/projects/POTATO/xpotato/models/trainer.py", line 60, in prepare_and_train
    return self.train(min_edge=min_edge, rank=rank)
File "/home/recski/projects/POTATO/xpotato/models/trainer.py", line 137, in train
    self.model.fit(train_X, train_Y)
File "/home/recski/miniconda3/envs/brise/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1558, in fit
    " class: %r" % classes_[0])

UI enhancement

TODOS until paper:

  • Implement a “predict” mode that takes a rule system and sentences and generates labels based on the rule system on the fly
  • When evaluated, graphs should always follow the same order
  • Have a field where you can enter a number indicating which graph you want, and sentences should have IDs
  • We should also have a dataset browser in supervised mode
  • Better error messages
  • Write TP, FN, FP next to precision, recall, fscore

Unsupervised

Some feature requests/issues in unsupervised mode:

  • The Train! button adds an unnecessary rule.
  • A graph viewer next to the examples would be nice.

KeyError: "Constituency parser not trained with tag 'GW'"

Hey there!

I was trying out POTATO and I have been coming across this error consistently. It is an error from stanza/spacy. They have updated their version to 1.4.0, but xpotato doesn't support 1.4.0. Do you have any quick recommendations?

Thank you!

sentence split in multiple 4lang graphs not merged

An example is sen. 235 in the dataset indicated in #31

Für die mit BB4 bezeichnete, der Errichtung von Anlagen zum Einstellen von Kraftfahrzeugen vorbehaltene Grundfläche wird bestimmt: Die Gebäudehöhe darf maximal 6,0 m betragen. (Roughly: "For the area designated BB4, reserved for the erection of facilities for parking motor vehicles, it is stipulated that the building height may not exceed 6.0 m.")
