Schools is one of the domains of the IndicWiki Project.
The aim of this domain is to create a large number of articles (about 1 lakh) about schools across the states of Andhra Pradesh and Telangana. Information about schools helps parents make informed decisions about the education of their children. Hence, we are generating data-rich articles in Telugu for about 1 lakh schools and uploading them to Wikipedia, so that people who can read only in their native language (here, Telugu) can benefit from this information.
Create a virtual environment in the project folder using the following commands. Python 3.7 must already be installed on the system, since the interpreter itself cannot be installed with pip.
$ pip install virtualenv
$ virtualenv -p python3.7 schEnv
After the virtual environment (venv) has been created successfully, clone the repository or download the zip folder of the project and extract it. Run the following commands from the schools (root) directory to activate the virtual environment and install all necessary dependencies.
$ source schEnv/bin/activate
$ pip install -r requirements.txt
The requirements.txt file is included in the project directory.
- Clone the repository into the local system.
- Enter the 'curDir' folder inside the cloned repo.
- Execute the file zeroMain.py with the following command:
$ python3 zeroMain.py <code_start_index> <code_end_index>
Here, code_start_index and code_end_index correspond to the row numbers of the first and last desired articles (only serial order is possible). For example, python3 zeroMain.py 30 50 would generate the XML dump for schools in rows 30-50 (inclusive). By default, when no command line arguments are provided (python3 zeroMain.py), the dump is generated for all schools.
- On executing the above command, the different parts of each article are written to articleParts.csv (as a dataframe), the XML dump is written to onePage.xml, the titles of the given set of schools are written as a dataframe to title.csv, and any errors encountered are logged in error.txt. If command line arguments were passed, the names of these files include the given range (for example, onePage30-50.xml).
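The argument handling and file naming described above can be sketched roughly as follows. This is only an illustration and not the code in zeroMain.py: the function names are hypothetical, rows are assumed to be addressed by 0-based position, and the range suffix is assumed to be applied to every output file in the same way as the onePage30-50.xml example.

```python
def parse_range(args):
    """Return (start, end) row numbers, or None when no range was given."""
    if len(args) == 0:
        return None  # no arguments: process every school
    if len(args) != 2:
        raise ValueError("expected either no arguments or <start> <end>")
    start, end = int(args[0]), int(args[1])
    if start > end:
        raise ValueError("start index must not exceed end index")
    return start, end


def output_name(stem, extension, row_range):
    """Build a file name, adding the 'start-end' suffix when a range is set."""
    suffix = "" if row_range is None else f"{row_range[0]}-{row_range[1]}"
    return f"{stem}{suffix}.{extension}"


def select_rows(rows, row_range):
    """Pick the rows in the inclusive range (assumption: 0-based positions)."""
    if row_range is None:
        return list(rows)
    start, end = row_range
    return list(rows[start:end + 1])


# Equivalent of: python3 zeroMain.py 30 50
chosen = parse_range(["30", "50"])
for stem, ext in [("articleParts", "csv"), ("onePage", "xml"),
                  ("title", "csv"), ("error", "txt")]:
    print(output_name(stem, ext, chosen))
```

Keeping the range parsing separate from the file naming means the "all schools" case is simply the one where no range is present.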
The folder curDir contains the entire implementation of article generation. It holds the codebase for the new model, which converts the knowledge base into an intermediate structured article (a dataframe of labeled article sentences) that users can modify and update, and then converts this data into an XML page that can be imported into MediaWiki.
All the files and folders explained below are inside this directory.
Github folder Link: https://github.com/indicwiki-iiit/schools/tree/main/curDir/data
- This folder contains various datasets (final and intermediate), and the implementations as to how they were obtained. The python scripts in this folder are described as follows:
- mergeDuplicates.py -> The number of rows was found to be greater than the number of unique schools, because a school with multiple instruction mediums appears once per medium. This script removes all duplicate rows with the same school code and replaces the value of the 'Medium' column with the list of all mediums associated with the respective school code.
- unify_pkls.py -> This script unifies all the pickles given in a folder into one pickle file. (Assumes the given folder contains a pickle named DISTRICTS, which is a pickle of a list of all file names in the given folder.)
- process_xlsx.py -> This script reads all the excel sheets in scraped_xlsx, filters and normalises them, concatenates them to allScraped.pkl, and finally computes the intersection of the new data with halfDB.pkl and adds this intersection to oneKB.pkl.
- readFreqTokens.py -> This script cleans up frequently occurring English tokens and their corresponding Telugu translations, and stores them.
- readyCodes.py -> This script selects the school UDISE codes which do not have duplicate school names, i.e., it selects the codes for which articles can be generated.
- The pickle files and datasets in this folder are described as follows:
- halfDF.pkl -> This is the pickle file of a dataframe consisting of all records from the excel sheet provided by Praveen Sir (the unified version of the pickle files in the 'excelDB_pkls' folder).
- allScraped.pkl -> This is the pickle file of a dataframe consisting of all records from the excel sheets scraped from the UDISE website (the unified version of the records in the 'scraped_xlsx' folder).
- errors.pkl -> This pickle file corresponds to the dataframe of all schools which have to be manually checked, because of some discrepancies in their entries.
- oneKB.pkl -> This pickle file corresponds to the dataframe consisting of all error-free records from both halfDB.pkl and allScraped.pkl. It is updated every time new records are added to allScraped.pkl.
- readyCodes.pkl -> This pickle file consists of school udise codes which do not have duplicate school names (articles can be generated for these).
- freqTokens.csv -> This file consists of the dataframe which holds the translation and frequency of every English token in the dataset.
- freqTokens.pkl -> This pickle file consists of the same dataframe mentioned above, in .pkl format for easy loading.
- tokens.csv -> This file consists of the various English tokens and their corresponding frequencies in csv format (for machine generated translation).
- The sub-folders present inside this folder are explained below:
- epoch -> This folder contains article parts, xml dump and titles for all 21k school articles generated based on oneKB.pkl.
- trials -> This folder was used just for testing purposes, generating articles for few schools to check functioning of template and xml generation implementation.
- excelDB_pkls -> This folder consists of all pickles of dataframes, one per district, extracted from the excel sheet given by Praveen Sir. One pickle file named 'DISTRICTS', which contains the list of all districts pertaining to the different school records, is also present in this folder.
- processed_xlsx -> This folder consists of excel sheets which are already cleaned and concatenated to allScraped.pkl.
- scraped_xlsx -> This folder consists of raw scraped data in form of excel sheets (district-wise).
- titles_translation -> This folder consists of the implementation for creating a proper one-one tokenwise mapping from English to Telugu, given a reviewed dataframe of English school titles and their corresponding Telugu translations. The code can be found in create_title_dict.py, and the English token - Telugu token translation can be found in the titleTokens files. Some edge cases for which a one-one mapping cannot be achieved are logged in the edgeCases-titleTokens files.
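As an illustration of the duplicate handling described for mergeDuplicates.py above, the following minimal sketch merges rows that share a school code and collects their mediums into a list. It works on plain dictionaries rather than a dataframe, and the column name 'School Code' and the sample codes are assumptions made only for this example.

```python
def merge_duplicates(rows, code_key="School Code", medium_key="Medium"):
    """Collapse rows with the same school code into one row per school.

    The first row seen for a code is kept, and its medium value is replaced
    by the list of all distinct mediums found for that code.
    """
    merged = {}  # dicts keep insertion order, so the first-seen order is kept
    for row in rows:
        code = row[code_key]
        medium = row[medium_key]
        if code not in merged:
            first = dict(row)              # copy, so the input is not mutated
            first[medium_key] = [medium]
            merged[code] = first
        elif medium not in merged[code][medium_key]:
            merged[code][medium_key].append(medium)
    return list(merged.values())


# Placeholder records; the codes and names are made up for the example.
records = [
    {"School Code": "CODE-001", "Name": "School A", "Medium": "Telugu"},
    {"School Code": "CODE-001", "Name": "School A", "Medium": "English"},
    {"School Code": "CODE-002", "Name": "School B", "Medium": "Telugu"},
]
for school in merge_duplicates(records):
    print(school["School Code"], school["Medium"])
```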
Github folder Link: https://github.com/indicwiki-iiit/schools/tree/main/curDir/scrape_new_data
- This folder contains implementation and datasets corresponding to newly scraped data from schools.org.in. The python scripts in this folder are described as follows:
- data_cleaning.py -> This file contains implementation which handles basic data cleaning, such as making missing values uniform, identifying edge cases and removing unwanted attributes.
- extract_info.py -> This file contains the actual scraping implementation, which collects the complete data for all schools in Andhra Pradesh and Telangana. Similar to zeroMain.py, this file accepts command line arguments for scraping data for a particular subsegment of schools. For example, python3 extract_info.py 30 50 would scrape data for the schools in rows 30-50 (inclusive) of the intermediate dataframe schools_id_name_url_dataframe.pkl. By default, when no command line arguments are provided (python3 extract_info.py), data is scraped for all schools.
- extract_urls.py -> This file contains the implementation for web crawling, which obtains and stores the URL of each school from the two states, so as to minimize overhead while actually scraping content from them.
- The pickle files and datasets of this folder are described as follows:
- school_urls_dict.pkl -> This file consists of a dictionary, where school UDISE codes are the keys and that school's url in the above website is the corresponding value.
- schools_id_name_url_dataframe.pkl -> This pickle file consists of an intermediate dataframe, where each row contains a school's UDISE code, its corresponding name, and its url in schools.org.in.
- schools_id_name_url_dataframe.xlsx -> This excel file consists of the same intermediate dataframe above, in excel format.
- schools_org_data.xlsx -> This excel file contains the complete dataset for all the data scraped from all the schools.
- schools_org_data_part_i.pkl -> These pickle files contain the same information as the excel file above, split into 3 parts due to size constraints.
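The kind of basic cleaning attributed to data_cleaning.py above (making missing values uniform and removing unwanted attributes) might look roughly like the sketch below. The set of markers treated as missing, the choice of None as the uniform missing value, and the attribute names are all assumptions for illustration and are not taken from the repository.

```python
import math

# Assumed spellings of "no value" that scraped pages commonly contain.
MISSING_MARKERS = {"", "na", "n/a", "nan", "none", "null", "-", "--"}


def normalise_value(value):
    """Map every recognised 'missing' spelling to None; trim other strings."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, str):
        stripped = value.strip()
        if stripped.lower() in MISSING_MARKERS:
            return None
        return stripped
    return value


def clean_record(record, unwanted=()):
    """Drop unwanted attributes and normalise the remaining values."""
    return {key: normalise_value(value)
            for key, value in record.items()
            if key not in unwanted}


# Hypothetical scraped record.
raw = {"name": "  ABC School ", "phone": "N/A", "banner": "x", "year": 1990}
print(clean_record(raw, unwanted=("banner",)))
```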
Github folder Link: https://github.com/indicwiki-iiit/schools/tree/main/curDir/template
- This folder contains the jinja2 templates used to generate the wikitext and the title of a school's article.
- new_teluguDFtext.j2 -> This file consists of the improved template to generate wikitext from a school's data, corresponding to the latest dataset (from schools.org.in).
- teluguDFtext.j2 -> This file consists of the template to generate the wikitext from a given school's data (based on the old dataset, oneKB).
- teluguTitle.j2 -> This jinja2 template file generates the title for a given school.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/createMgnt.py
- This file contains the implementation which generates a dataframe where each row contains a school's management details in English and the corresponding translation in Telugu.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/getJsData.py
- This file contains implementation which stores the data for each school (its title, code) in .js format in provided path.
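One simple way to store such data in .js format is to serialise it as JSON and assign it to a JavaScript variable. The sketch below assumes that layout; the function name, the variable name schoolData, and the sample record are hypothetical, and the actual output of getJsData.py may be structured differently.

```python
import json
import os
import tempfile


def write_js_data(schools, path, variable="schoolData"):
    """Write the records to `path` as a single JavaScript assignment."""
    # ensure_ascii=False keeps Telugu titles readable in the output file.
    payload = json.dumps(schools, ensure_ascii=False, indent=2)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(f"const {variable} = {payload};\n")


# Usage with a placeholder record, written to a temporary directory.
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "schools.js")
    write_js_data([{"code": "CODE-001", "title": "School A"}], target)
    with open(target, encoding="utf-8") as handle:
        print(handle.read())
```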
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/help.py
- This file contains the implementation of pre-processing and more convenient translation for frequently occurring words, phrases, numbers, etc.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/onePage.xml
- This file shows what a sample XML dump looks like for a single school.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/schoolXML5k_1.xml
- This file shows what a sample XML dump looks like when multiple school articles are placed in the same XML file.
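To give a rough idea of how several articles can share one dump, the sketch below builds a heavily simplified MediaWiki-style document with one page element per article. The element set is an assumption reduced to the bare minimum (real dumps also carry a schema namespace, site information and revision metadata), so onePage.xml and schoolXML5k_1.xml remain the reference for the actual layout. The function name and sample articles are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_dump(articles):
    """Return an XML string holding one <page> per (title, wikitext) pair."""
    root = ET.Element("mediawiki")
    for title, wikitext in articles:
        page = ET.SubElement(root, "page")
        ET.SubElement(page, "title").text = title
        ET.SubElement(page, "ns").text = "0"  # main article namespace
        revision = ET.SubElement(page, "revision")
        # ElementTree escapes characters such as & and < on serialisation.
        ET.SubElement(revision, "text").text = wikitext
    return ET.tostring(root, encoding="unicode")


# Placeholder articles; real titles and wikitext come from the templates.
dump = build_dump([
    ("School A", "'''School A''' is a school."),
    ("School B", "'''School B''' is a school."),
])
print(dump)
```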
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/template_functions.py
- This file contains various python functions which are used in edge case handling (different sentences based on attributes available) while rendering the article from template based on data provided.
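A helper of this kind typically picks a sentence variant depending on which attributes are actually present. The sketch below shows the pattern with English sentences standing in for the Telugu ones; the function names, attribute names and wording are hypothetical and are not taken from template_functions.py.

```python
def is_present(value):
    """Treat None and blank strings as missing."""
    return value is not None and str(value).strip() != ""


def establishment_sentence(school):
    """Choose a sentence variant based on the attributes that are available."""
    name = school.get("name")
    year = school.get("year")
    village = school.get("village")
    if not is_present(name):
        return ""  # nothing sensible can be said without a name
    if is_present(year) and is_present(village):
        return f"{name} was established in {year} in {village}."
    if is_present(year):
        return f"{name} was established in {year}."
    if is_present(village):
        return f"{name} is located in {village}."
    return ""  # neither attribute is known: omit the sentence entirely


print(establishment_sentence({"name": "School A", "year": 1985}))
```

Returning an empty string for the "nothing known" case lets a template include the helper's result unconditionally without producing a half-filled sentence.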
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/trans.py
- This file contains implementation which deals with translation, transliteration of attributes along with string cleaning (with the help of regular expressions and libraries like anuvaad, deeptranslit) and exception handling.
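The combination of string cleaning and exception handling mentioned above can be sketched as follows. The translator is passed in as an ordinary callable so that no particular library interface is assumed; the function names, the cleaning rules and the fallback behaviour (returning the cleaned input when translation fails) are illustrative assumptions and not the behaviour of trans.py.

```python
import re


def clean_text(text):
    """Collapse whitespace and trim stray separators around a value."""
    text = re.sub(r"\s+", " ", text)              # runs of whitespace
    text = re.sub(r"\s+([,.;:])", r"\1", text)    # no space before punctuation
    return text.strip(" ,;:-")


def safe_translate(text, translator, errors=None):
    """Clean the text, then translate it; fall back to the cleaned text."""
    cleaned = clean_text(text)
    if not cleaned:
        return cleaned
    try:
        return translator(cleaned)
    except Exception as exc:  # any translator failure is recorded, not raised
        if errors is not None:
            errors.append((cleaned, repr(exc)))
        return cleaned


# Stand-in translator for the example; a real one would return Telugu text.
problems = []
print(safe_translate("  GOVT   HIGH SCHOOL ,  NELLORE -- ", str.title, problems))
```

Collecting failures in a list instead of raising keeps a long batch run going, and the list can be written to an error log afterwards.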
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/transNNP.py
- This file acts as a helper file for trans.py and assists in the translation of English tokens.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/write.py
- This file contains implementation which generates an article and its title by rendering the jinja templates, given a particular school's data.
Github file Link: https://github.com/indicwiki-iiit/schools/blob/main/curDir/zeroMain.py
- This file constructs article(s) for a given list of school code(s) and then segments and structures each article. It constructs an xml file for the given school code(s). (More details are explained in the section 'Guide to generate XML dump')
You can find the sample article here