This Python3 development project sets up an environment for data processing and file manipulation. It follows the PyCharm project structure, imports the MASSIVE dataset, and accomplishes three main tasks: it creates language-specific Excel files, per-language JSONL files, and a combined translation dataset, with clean file management and GitHub integration.
- Build a Python3 project with the structure of projects in PyCharm.
- Import the MASSIVE Dataset mentioned in the Data File above. In this dataset, the pivot language is English. Given that the ids match across all languages, generate an en-xx.xlsx file for each language using the id, utt, and annot_utt fields.
- For English (en), Swahili (sw), and German (de), generate separate jsonl files for the test, train, and dev datasets.
- Generate one large json file showing all the translations from English (en) to xx with id and utt for all the train sets. Pretty print your json file structure.
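The second and third tasks above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names are hypothetical, and it assumes each dataset record is a JSON object with the id, utt, and annot_utt fields named above plus a partition field holding train, dev, or test.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def write_jsonl(records, path):
    """Write one JSON object per line, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as handle:
        for rec in records:
            handle.write(json.dumps(rec, ensure_ascii=False) + "\n")


def pair_with_english(en_records, xx_records):
    """Join target-language records to the English pivot on the shared id."""
    english_by_id = {rec["id"]: rec for rec in en_records}
    rows = []
    for rec in xx_records:
        pivot = english_by_id.get(rec["id"])
        if pivot is None:
            # No English counterpart for this id, so the row is skipped.
            continue
        rows.append({
            "id": rec["id"],
            "en_utt": pivot["utt"],
            "utt": rec["utt"],
            "annot_utt": rec["annot_utt"],
        })
    return rows


def split_by_partition(records):
    """Group records by their partition field (assumed: train, dev, test)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["partition"]].append(rec)
    return dict(groups)
```

The rows returned by pair_with_english could then be written to an en-xx.xlsx file, for example with the pandas DataFrame.to_excel method (which requires an Excel engine such as openpyxl), and each group returned by split_by_partition could be passed to write_jsonl.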
- functions.py: Contains the functions that generate the output files, from Excel to JSONL.
- main.py: The main program file that loads, processes, and analyzes the data.
- LANGUAGE_SPECIFIC_FILES: The output Excel files that contain the translations for all languages.
- JSONL_FILES: The jsonl files containing the pretty-printed output for each filtered file.
- COMBINED_TRAANSLATION.jsonl: A large jsonl file showing all the translations from en to xx with id and utt for all the train sets.
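One way the combined translation file described above might be assembled is sketched below. The function names and the output layout (an "en" key and a "translations" mapping per entry) are assumptions made for illustration, as is the partition field used to select the train set; the repository's own layout may differ.

```python
import json


def build_train_translations(en_records, records_by_locale):
    """Collect, for each English train utterance, its translations by locale."""
    combined = {}
    for rec in en_records:
        if rec["partition"] == "train":
            combined[rec["id"]] = {
                "id": rec["id"],
                "en": rec["utt"],
                "translations": {},
            }
    for locale, records in records_by_locale.items():
        for rec in records:
            entry = combined.get(rec["id"])
            # Keep only train records whose id also exists in the English pivot.
            if entry is not None and rec["partition"] == "train":
                entry["translations"][locale] = rec["utt"]
    return list(combined.values())


def to_pretty_json(entries):
    """Serialize with indentation; ensure_ascii=False keeps native scripts."""
    return json.dumps(entries, ensure_ascii=False, indent=4)
```

Setting ensure_ascii=False is a design choice here: it leaves utterances in non-Latin scripts readable in the output file instead of escaping them, provided the file is written with UTF-8 encoding.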
To set up and run the project, follow these steps:
- Check Python Version: Ensure you have Python 3.x installed on your system. You can check your Python version by running the following command:
    python --version
- Create a Virtual Environment: It is good practice to create a Python virtual environment to isolate project dependencies. You can create and activate one using the following commands:
  - On macOS and Linux:
        python -m venv venv
        source venv/bin/activate
  - On Windows:
        python -m venv venv
        .\venv\Scripts\activate
- Install Dependencies: Clone the repository and navigate to the project directory in your terminal. Then install the required dependencies by running the following command:
    pip install -r requirements.txt
- Run the Generator Script (Windows using WSL, or a Linux terminal): Execute the generator.sh shell script to generate the project files. Depending on your platform, use one of the following methods:
  - Windows Subsystem for Linux (WSL):
        bash generator.sh
  - Linux Terminal:
        ./generator.sh
- Check Output: After running the script, you can find the following logs and directories in the project directory:
  - generator.log: This log file contains information about the generated files.
  - files_count.log: This log file contains information about the count of generated files.
  - language_specific_files/: This directory contains the language-specific Excel files generated by the script.
  - jsonl_files/: This directory contains the generated JSONL files.
- Deactivate the Virtual Environment: When you are done with the project, deactivate the virtual environment using the command:
    deactivate
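For the Check Output step above, the contents of the output directories can also be sanity-checked from Python. The helper below is a hypothetical sketch and is not part of generator.sh; the directory names come from the list above, while the file suffixes are assumptions.

```python
from pathlib import Path


def count_files(directory, suffix):
    """Count regular files with the given suffix directly inside a directory."""
    root = Path(directory)
    if not root.is_dir():
        # A missing directory simply means nothing has been generated yet.
        return 0
    return sum(1 for path in root.iterdir() if path.is_file() and path.suffix == suffix)


if __name__ == "__main__":
    # Suffixes are assumed: .xlsx for the Excel output, .jsonl for JSONL output.
    print("Excel files:", count_files("language_specific_files", ".xlsx"))
    print("JSONL files:", count_files("jsonl_files", ".jsonl"))
```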
- 137192 Eddy Bogonko
- 137938 Martin Mwangi
- 136603 Jane Daisy
- 146013 Amanda Karani
- 139991 Glen Musa
bogonkoEd/revisitPython is licensed under the MIT License.