A combined framework for content extraction
This repository archives the Python code used in my thesis.
Most of the IPython notebook files are for testing only; the exception is "eval.ipynb", which runs the experiments reported in the thesis.
- BTE.py (Body Text Extraction)
- CCB.py (Content Code Blurring)
- CETD.py (Content Extraction via Text Density)
- CETR.py (Content Extraction via Text Ratio)
- CTTD.py (Compound Text-Tag Difference)
- ConEx_dom.py (Combines the DOM-based algorithms)
- ConEx_line.py (Combines the line-based algorithms)
- ConEx_token.py (Combines the token-based algorithms)
- ConEx.py (Combines the three parts above)
- process.py (Code for preprocessing and evaluation)
- convert_XXX.py (Processes the XXX dataset)
- kmeans.py (Custom k-means algorithm for CETR-2D)
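
To give a feel for the kind of line-based signal that a method such as CETR works with, here is a minimal sketch of a per-line text-to-tag ratio. It is not taken from CETR.py: the function name `text_to_tag_ratios`, the regex-based tag matching, and the sample HTML are all assumptions made for illustration, and the published method also smooths and clusters these ratios, which is omitted here.

```python
import re

# Hypothetical helper, not part of this repo: a naive tag matcher that
# treats anything between "<" and ">" as one HTML tag.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html):
    """Return one text-to-tag ratio per line of `html`.

    The ratio is the number of non-tag characters divided by the number
    of tags on the line. When a line has no tags, the text length itself
    is used, so tag-free text lines score high.
    """
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_length = len(TAG_RE.sub("", line).strip())
        if tag_count:
            ratios.append(text_length / tag_count)
        else:
            ratios.append(float(text_length))
    return ratios


if __name__ == "__main__":
    sample = '<div><a href="/">Home</a></div>\nabc def\n<p>Short</p>'
    for ratio in text_to_tag_ratios(sample):
        print(ratio)
```

Lines dominated by markup (navigation, menus) get low ratios, while lines that are mostly text get high ones; a threshold or a clustering step would then separate the two groups.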