Code used to preprocess the entirety of English Wikipedia into a section-level dataset for document recommendation, built for the course project of UCLA CS 247 - Advanced Data Mining.
- Ensure the `enwiki-latest-pages-articles.xml` file is present in the same directory as the `run.sh` file.
- Follow the instructions in the `run.sh` file and make changes if necessary.
- On your terminal, run `bash run.sh` to complete dataset creation.

Output files and directories:
- `label2id.json`: as the name implies, a mapping of titles to ids.
- `id2label.json`: as the name implies, a mapping of ids to titles.
- `enc_labs.npz`: npz file containing the label ids.
- `enc_txt.npz`: npz file containing the tokenized text.
- `data/`: directory containing a json file for each article.
- `index.txt`: index of the `data/` directory that lets you traverse its contents easily.
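For convenience, here is a minimal sketch of loading these outputs in Python. The array keys inside the `.npz` archives are not documented here, so the sketch inspects them rather than assuming names:

```python
import json

import numpy as np

# Title <-> id mappings produced by the pipeline.
with open("label2id.json") as f:
    label2id = json.load(f)
with open("id2label.json") as f:
    id2label = json.load(f)

# Encoded label ids and tokenized text. The array names inside the
# archives are not specified above, so list them before indexing.
enc_labs = np.load("enc_labs.npz")
enc_txt = np.load("enc_txt.npz")
print(enc_labs.files, enc_txt.files)
```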
Kindly follow the instructions in the `load_data.py` file and use the starter code provided there.
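As a rough sketch of reaching one article through `index.txt`, assuming each non-empty line of the index is the path of a json file under `data/` (an assumption; the starter code in `load_data.py` is authoritative):

```python
import json

# Assumption: each non-empty line of index.txt is the path of one
# article json file inside data/. Verify against load_data.py.
with open("index.txt") as f:
    article_paths = [line.strip() for line in f if line.strip()]

# Load the first article; its structure follows the schema below.
with open(article_paths[0]) as f:
    article = json.load(f)

print(article["title"], "has", len(article["sections"]), "sections")
```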
Here's the schema:

```json
{
  "id": 123,
  "sections": [
    {
      "content": "Section contents (Cleaned)",
      "links": ["Page Title Linked", "Page Title Linked"],
      "title": "Section Title: Subsection Title"
    }
  ],
  "title": "Article Title"
}
```
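To make the schema concrete, here is a small sketch that walks an article dict of this shape and emits (article, section, linked page) triples, e.g. as edges for a recommendation graph. The article literal below just reuses the schema's example values:

```python
# Example article following the schema above.
article = {
    "id": 123,
    "sections": [
        {
            "content": "Section contents (Cleaned)",
            "links": ["Page Title Linked", "Page Title Linked"],
            "title": "Section Title: Subsection Title",
        }
    ],
    "title": "Article Title",
}

# Pair each section with its outgoing links; each triple can serve
# as one edge in a section-level link graph.
for section in article["sections"]:
    for linked_title in section["links"]:
        print(article["title"], "/", section["title"], "->", linked_title)
```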
Kindly follow the README file in the `PlainTextWikipedia` folder.
Kindly use the `encoding.py` file, with the file paths updated for your setup, to create the tokenized embeddings.
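For orientation, a minimal sketch of what such an encoding step might look like. The Hugging Face model name and the `.npz` key names are assumptions here, not necessarily what `encoding.py` uses:

```python
import numpy as np
from transformers import AutoTokenizer

# Assumption: bert-base-uncased as the tokenizer; encoding.py may
# use a different model, so adjust to match.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical inputs: section texts and their label ids.
texts = ["Section contents (Cleaned)"]
label_ids = [123]

encoded = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="np",
)

# Save in the same spirit as enc_txt.npz / enc_labs.npz; the key
# names used here are assumptions.
np.savez_compressed("enc_txt.npz", input_ids=encoded["input_ids"])
np.savez_compressed("enc_labs.npz", labels=np.array(label_ids))
```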