This task is divided into the following subtasks:
- Find a frequency table of the top 10 most frequent words in the text. These words are called keywords. Associate with each keyword a score directly proportional to its frequency.
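The first step can be sketched in plain Java. The lowercase, letters-only tokenization and the use of raw counts as scores are assumptions here; the project may tokenize differently:

```java
import java.util.*;
import java.util.stream.*;

// Build a word-frequency table and keep the top `limit` words as keywords.
// Each keyword's score is its raw frequency (proportional, as described above).
class KeywordExtractor {
    static Map<String, Integer> topKeywords(String text, int limit) {
        Map<String, Integer> freq = new HashMap<>();
        // Assumed tokenization: lowercase, split on any non-letter run.
        for (String w : text.toLowerCase().split("[^a-z]+")) {
            if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);
        }
        // Sort by frequency (descending) and keep the top `limit` entries.
        return freq.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(limit)
            .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                     (a, b) -> a, LinkedHashMap::new));
    }
}
```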
- Use ConceptNet to assign the same score to words similar to these keywords. These similar words are also used as keywords.
- Split the text into sentences.
- Assign a score to each sentence S using the following formula:
  Score(S) = (2.0*titleScore + 0.25*sentenceLengthScore + 1.0*sentencePositionScore + 2.0*DbsSbsScore) / (2.0 + 0.25 + 1.0 + 2.0)
- Choose the top 4 sentences with the highest scores.
count = number of common non-stop words between the query and the sentence
NOTE: words in the sentence that are conceptually similar to any word in the query are also counted (done with the help of the ConceptNet web API)
titleScore = count / number of words in the query
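A minimal sketch of titleScore under these definitions. The four-word stop set is a placeholder for the real stop-word list, and the ConceptNet similarity expansion is omitted:

```java
import java.util.*;

// titleScore: shared non-stop words between query and sentence,
// divided by the number of words in the query.
class TitleScore {
    // Placeholder stop-word set; the project uses a full list.
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "a", "of", "in"));

    static double titleScore(String[] query, String[] sentence) {
        Set<String> sentWords = new HashSet<>();
        for (String w : sentence) sentWords.add(w.toLowerCase());
        int count = 0;
        for (String q : query) {
            String w = q.toLowerCase();
            // ConceptNet expansion would also match conceptually similar words here.
            if (!STOP.contains(w) && sentWords.contains(w)) count++;
        }
        return (double) count / query.length;
    }
}
```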
sentenceLengthScore = 1.0 - abs(number of words in sentence - 20) / 20, where 20 is the ideal sentence length
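A direct translation of sentenceLengthScore; a 20-word sentence scores 1.0:

```java
// sentenceLengthScore: penalize deviation from the ideal length of 20 words.
// Note: as written, the formula goes negative for sentences over 40 words.
class LengthScore {
    static final int IDEAL_LENGTH = 20;

    static double sentenceLengthScore(int wordCount) {
        return 1.0 - Math.abs(wordCount - IDEAL_LENGTH) / (double) IDEAL_LENGTH;
    }
}
```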
sentencePositionScore: n = normalized position of the sentence, between 0 and 1. The values used are listed on page 3 of the following paper:
http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings3/NTCIR3-TSC-SekiY.pdf
score = sum of the scores (calculated in the first step) of all keywords contained in the sentence
summationBasedSelection = score / totalWordsInSentence
Both summationBasedSelection and densityBasedSelection are defined on page 3 of the following paper:
http://www3.ntu.edu.sg/home/axsun/paper/sun_cikm07s.pdf
DbsSbsScore = summationBasedSelection + densityBasedSelection
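The weighted combination from the formula above, as a direct translation. The four component scores are assumed to be precomputed:

```java
// Weighted sentence score: (2.0*title + 0.25*length + 1.0*position + 2.0*dbsSbs)
// normalized by the sum of the weights, exactly as in the formula above.
class SentenceScore {
    static double score(double titleScore, double lengthScore,
                        double positionScore, double dbsSbsScore) {
        return (2.0 * titleScore + 0.25 * lengthScore
                + 1.0 * positionScore + 2.0 * dbsSbsScore)
               / (2.0 + 0.25 + 1.0 + 2.0);
    }
}
```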
- Create a POS tree from the sentences using the OpenNLP tool.
- Store nouns, adjectives, pronouns, verbs, adverbs, conjunctions, and prepositions in separate ArrayLists.
- Create chunks from the sentences using ChunkerModel and ChunkerME of the OpenNLP tool, thereby removing unnecessary conjunctions.
- From the POS ArrayLists, remove any unnecessary POS entries from each sentence.
- Remove unnecessary articles and prepositions.
- Remove all the stop words from every chunk of words.
Stop word list used: http://www.ranks.nl/stopwords
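Stop-word removal over a chunk might look like this. The three-word stop set is a placeholder for the full list from ranks.nl:

```java
import java.util.*;
import java.util.stream.*;

// Filter stop words out of a chunk of words (case-insensitive).
class StopWordFilter {
    // Placeholder stop-word set; the project loads the full ranks.nl list.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "is", "a"));

    static List<String> removeStopWords(List<String> chunk) {
        return chunk.stream()
            .filter(w -> !STOP_WORDS.contains(w.toLowerCase()))
            .collect(Collectors.toList());
    }
}
```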
- Perform coreference resolution using the Stanford parser.
- Merge the new chunks and generate the output.
- ejml-0.19-nogui.jar
- ejml-0.19-src.zip
- englishPCFG.ser.gz
- joda-time.jar
- joda-time-2.1-sources.jar
- jollyday.jar
- jollyday-0.4.7-sources.jar
- json-20160212.jar
- wnl-1.3.3.jar
- opennlp-maxent-3.0.3.jar
- opennlp-tools-1.5.3.jar
- opennlp-uima-1.5.3.jar
- RiTaWN.jar
- SentenceSplitter.java
- stanford-corenlp-3.2.0.jar
- stanford-corenlp-3.2.0-javadoc.jar
- stanford-corenlp-3.2.0-models.jar
- stanford-corenlp-3.2.0-sources.jar
- supportWN.jar
- xom.jar
- xom-src-1.2.8.zip
- Use the above-mentioned libraries when executing the jar file.
- In addition, keep these files in the "lib" directory:
- ejml-0.19-src.zip
- en-chunker.bin
- englishPCFG.ser.gz
- en-parser-chunking.bin
- en-pos-maxent.bin
- xom-src-1.2.8.zip
- Provide at least 1 GB of RAM when executing the jar file (use the "-Xmx1024m" argument).
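A possible launch command; the jar name (Summarizer.jar), main class (Main), and input file are placeholders, since the source does not name them:

```shell
# Hypothetical invocation: "lib/*" puts every jar listed above on the
# classpath, and -Xmx1024m grants the 1 GB heap required above.
# (On Windows, use ";" instead of ":" as the classpath separator.)
java -Xmx1024m -cp "Summarizer.jar:lib/*" Main input.txt
```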
- Afif Ahmed - [email protected]
- Sushanto Halder - [email protected]
- Sourav Maji - [email protected]
- Anit Kumar - [email protected]
- Debraj Dutta - [email protected]