This repo is for Task 2 of the relation extraction (RE) problem.
Literature review of RE (done)
Notes on concepts and coding
Main entry point of the code:
python main.py
Requires the stanfordcorenlp package.
- PDF files from Mendeley database
- Sentence splitting
- Tokenizing
- POS tagging
- Named entity recognition
- Parsing
- Relation extraction

So far, steps 1 and 2 were completed by a previous project. We apply the stanfordcorenlp API for steps 3-6 and focus on RE, using rule-based approaches with manually defined rules. For testing, all sentences from one PDF file in the database are saved in 'sen_pdf1.txt' in this repo. Parsing results from steps 1-6 are passed as input to step 7.
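The stanfordcorenlp wrapper's `dependency_parse` returns `(relation, governor_index, dependent_index)` tuples with 1-based token indices and 0 for ROOT. A small helper like the hypothetical one below (the sample sentence and triples are made up for illustration) can turn that output into word-level triples for the rule-based step 7:

```python
def dep_word_triples(tokens, dep_parse):
    """Convert index-based dependency triples to (governor, relation, dependent) words."""
    triples = []
    for rel, gov, dep in dep_parse:
        gov_word = "ROOT" if gov == 0 else tokens[gov - 1]
        triples.append((gov_word, rel, tokens[dep - 1]))
    return triples

# Hand-made example mimicking parser output for "ProteinA activates ProteinB".
tokens = ["ProteinA", "activates", "ProteinB"]
dep_parse = [("ROOT", 0, 2), ("nsubj", 2, 1), ("obj", 2, 3)]
print(dep_word_triples(tokens, dep_parse))
# → [('ROOT', 'ROOT', 'activates'), ('activates', 'nsubj', 'ProteinA'), ('activates', 'obj', 'ProteinB')]
```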
A relation is said to be negated if no node in the candidate relation contains a number.
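A minimal sketch of this check, assuming the candidate relation is available as a list of (word, POS) pairs and numbers are identified by the CD tag; the tagged examples are made up:

```python
def is_negated(relation_nodes):
    """A candidate relation is treated as negated when none of its
    nodes carries a number (CD part-of-speech tag)."""
    return not any(pos == "CD" for _, pos in relation_nodes)

# Made-up POS-tagged candidates for illustration.
with_number = [("expression", "NN"), ("increased", "VBD"), ("2-fold", "CD")]
without_number = [("expression", "NN"), ("increased", "VBD")]
print(is_negated(with_number))     # → False
print(is_negated(without_number))  # → True
```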
- Effector of the relation: the named entity appearing first in the extracted relation, i.e. the one with the smaller sentence position
- The roles are switched if some form of passive construction is detected
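The two rules above can be sketched together, assuming Stanford-style dependency labels where `nsubjpass`/`auxpass` signal a passive construction; the entity positions and dependency triples below are made up:

```python
def assign_roles(entities, dependencies):
    """Pick effector/target by sentence position, then swap the roles
    if the dependencies contain a passive marker (nsubjpass/auxpass)."""
    # entities: list of (name, token_position); the first-appearing entity is the effector.
    ordered = sorted(entities, key=lambda e: e[1])
    effector, target = ordered[0][0], ordered[-1][0]
    is_passive = any(rel in ("nsubjpass", "auxpass") for rel, _, _ in dependencies)
    if is_passive:
        effector, target = target, effector
    return effector, target

# "ProteinB is activated by ProteinA" — made-up dependency triples.
deps = [("nsubjpass", 3, 1), ("auxpass", 3, 2), ("obl", 3, 5)]
print(assign_roles([("ProteinB", 1), ("ProteinA", 5)], deps))  # → ('ProteinA', 'ProteinB')
```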
Noun phrase chunks connected to each other by an 'and', 'or', 'nn', 'det', or 'dep' dependency form an enumeration. If a noun phrase chunk contains more than one protein name, these are considered to describe alternative agents/targets.
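The enumeration rule amounts to grouping chunks that are connected by one of the listed dependency labels. A minimal sketch, where chunks are represented by their head words and the links are made-up triples:

```python
from collections import defaultdict

ENUM_RELS = {"and", "or", "nn", "det", "dep"}  # labels from the rule above

def enumeration_groups(chunks, links):
    """Group noun phrase chunks connected by enumeration-forming dependencies.
    chunks: list of chunk ids; links: (relation, chunk_a, chunk_b) tuples."""
    graph = defaultdict(set)
    for rel, a, b in links:
        if rel in ENUM_RELS:
            graph[a].add(b)
            graph[b].add(a)
    seen, groups = set(), []
    for c in chunks:
        if c in seen:
            continue
        stack, group = [c], set()
        while stack:  # collect the connected component containing c
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node])
        seen |= group
        groups.append(sorted(group))
    return groups

# Made-up chunks: "ProteinA and ProteinB" enumerated, "ProteinC" separate.
print(enumeration_groups(["ProteinA", "ProteinB", "ProteinC"],
                         [("and", "ProteinA", "ProteinB")]))
# → [['ProteinA', 'ProteinB'], ['ProteinC']]
```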
The words contained in candidate relations are checked against a set of relation restriction terms.
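One way to sketch this check; the restriction vocabulary below is hypothetical (the real list would come from the manually defined rules), and here a candidate is kept only if it contains a known term:

```python
# Hypothetical restriction vocabulary for illustration.
RESTRICTION_TERMS = {"activates", "inhibits", "binds", "regulates"}

def passes_restriction(relation_words):
    """Keep a candidate relation only if at least one of its words
    is a known relation restriction term."""
    return any(w.lower() in RESTRICTION_TERMS for w in relation_words)

print(passes_restriction(["ProteinA", "activates", "ProteinB"]))  # → True
print(passes_restriction(["ProteinA", "near", "ProteinB"]))       # → False
```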
- Focus-domain corpora will be generated by:
  - scanning our database and counting noun frequencies
  - cross-checking with public corpora in this field
- We can add a filter in NER
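The noun-frequency scan can be sketched as a simple count over POS-tagged sentences (NN* tags mark nouns in the Penn Treebank tagset); the tagged sentences below are made up:

```python
from collections import Counter

def noun_frequencies(tagged_sentences):
    """Count noun occurrences (NN* tags) across POS-tagged sentences,
    as a first step toward a focus-domain word list."""
    counts = Counter()
    for sentence in tagged_sentences:
        for word, pos in sentence:
            if pos.startswith("NN"):
                counts[word.lower()] += 1
    return counts

# Made-up tagged sentences for illustration.
tagged = [
    [("Protein", "NN"), ("expression", "NN"), ("increased", "VBD")],
    [("protein", "NN"), ("levels", "NNS"), ("fell", "VBD")],
]
print(noun_frequencies(tagged).most_common(1))  # → [('protein', 2)]
```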
- Negation check
- Find the triple {NN VB CD} in a relation
- Need more rules
- Corpora
- Try dependency paths
- Train the parser?
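For the dependency-path item above, a common starting point is the shortest path between two entities in the dependency graph. A minimal BFS sketch over an undirected view of the graph, with made-up token indices:

```python
from collections import defaultdict, deque

def shortest_dep_path(dependencies, start, end):
    """BFS over an undirected view of the dependency graph to find the
    shortest path of token indices between two tokens."""
    graph = defaultdict(set)
    for _, gov, dep in dependencies:
        graph[gov].add(dep)
        graph[dep].add(gov)
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in graph[path[-1]] - visited:
            visited.add(nxt)
            queue.append(path + [nxt])
    return None  # no path between the two tokens

# Made-up indices for "ProteinA(1) activates(2) ProteinB(3) strongly(4)".
deps = [("nsubj", 2, 1), ("obj", 2, 3), ("advmod", 2, 4)]
print(shortest_dep_path(deps, 1, 3))  # → [1, 2, 3]
```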