This is an assessment for a Python Developer in QA position at Veeam.
Please implement a program that synchronizes two folders: source and replica. The program should maintain a full, identical copy of source folder at replica folder. Solve the test task by writing a program in one of these programming languages:
- Python
- C/C++
- C#
- Synchronization must be one-way: after the synchronization content of the replica folder should be modified to exactly match content of the source folder;
- Synchronization should be performed periodically.
- File creation/copying/removal operations should be logged to a file and to the console output;
- Folder paths, synchronization interval and log file path should be provided using the command line arguments;
- It is undesirable to use third-party libraries that implement folder synchronization;
- It is allowed (and recommended) to use external libraries implementing other well-known algorithms. For example, there is no point in implementing yet another function that calculates MD5 if you need it for the task - it is perfectly acceptable to use a third-party (or built-in) library.
In my final approach I decided to implement my own version of the main functionality I need from filecmp.
For that I created a Comparer class that compares two paths and produces a list for each comparison scenario, which I can in turn use to sync according to that scenario.
- source_only
  - List of the content names that are only in the source folder.
- replica_only
  - List of the content names that are only in the replica folder.
- common_dirs
  - List of directory names that are common between source and replica.
- diff_files
  - List of file names that are common between source and replica and have changed.
  - Compares file name, last modified time and file size first, to avoid the costly hashing function whenever possible.
  - Falls back to the md5 hashing function from hashlib if any of the previous checks fail.
  - This comparison approach trades security for performance. A production-grade tool might have to focus more on security, or provide a flag that lets the user choose the approach that suits them best.
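As a rough illustration of the comparison described above, here is a minimal sketch. It is not the repository's actual Comparer class: the function names `md5_of`, `files_differ` and `compare_dirs` are hypothetical, and the sketch assumes that a name present on both sides has the same type (file or folder) in source and replica.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in fixed-size chunks so it is never fully loaded in memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_differ(source_file: Path, replica_file: Path) -> bool:
    """Cheap metadata check first; hash only when that check fails."""
    src, rep = source_file.stat(), replica_file.stat()
    if src.st_size == rep.st_size and src.st_mtime == rep.st_mtime:
        # Same size and mtime: assumed unchanged (the performance tradeoff).
        return False
    return md5_of(source_file) != md5_of(replica_file)


def compare_dirs(source: Path, replica: Path) -> dict[str, list[str]]:
    """Compare only the top level of two folders (hypothetical helper)."""
    source_names = {entry.name for entry in source.iterdir()}
    replica_names = {entry.name for entry in replica.iterdir()}
    common = source_names & replica_names
    # Assumption: entries whose type differs between the two sides are skipped.
    common_dirs = sorted(
        n for n in common if (source / n).is_dir() and (replica / n).is_dir()
    )
    common_files = sorted(
        n for n in common if (source / n).is_file() and (replica / n).is_file()
    )
    return {
        "source_only": sorted(source_names - replica_names),
        "replica_only": sorted(replica_names - source_names),
        "common_dirs": common_dirs,
        "diff_files": [
            n for n in common_files if files_differ(source / n, replica / n)
        ],
    }
```

Hashing in chunks keeps memory use flat for large files. The metadata shortcut only helps if copies preserve the modification time, which `shutil.copy2` attempts to do.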
With that I can implement the Synchronizer class the same way as I did with the filecmp lib in the naive solution.
class Synchronizer:
    """
    Synchronizer class to sync the source and replica folders

    @param source: pathlib.Path
        Path to the source folder
    @param replica: pathlib.Path
        Path to the replica folder
    @param logger: logging.Logger
        Logger object responsible for logging the actions to a file and to stdout

    All methods operate only at the root level of the source and replica folders; that is why search_child_folders makes a recursive call to the synchronize method.

    @method add_missing_in_replica: Search for files and folders not present in replica but present in source and copy them to replica
    @method remove_extra_in_replica: Search for files and folders not present in source but present in replica and remove them from replica
    @method sync_changed_files: Search for files that have been changed and sync them to replica
    @method search_child_folders: Recursively search common folders between source and replica
    @method synchronize: Main method to synchronize the source and replica folders
    """
This class implements four methods that together satisfy the requirements of the challenge.
- add_missing_in_replica
  - Checks the source_only list for files or folders that are only present in the source folder and copies them to the replica folder.
- remove_extra_in_replica
  - Checks the replica_only list for files or folders that are not present in the source and removes them from the replica folder.
- sync_changed_files
  - Checks the diff_files list for items that are present in both folders but have different contents, then copies them from the source folder to the replica folder.
- search_child_folders
  - Checks for folders present in both source and replica and recurses: it creates a new Synchronizer object for each pair of child folders and runs the same sync process on them, until the two parents have no more child folders in common.
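The four steps above can be sketched as a single recursive function. This is only an illustration under stated assumptions: it uses `filecmp.dircmp` from the standard library (as in the naive solution) instead of the custom Comparer, and it flattens the class into a hypothetical `synchronize` function, so it is not the repository's actual code.

```python
import filecmp
import logging
import shutil
from pathlib import Path


def synchronize(source: Path, replica: Path, logger: logging.Logger) -> None:
    """One pass of one-way sync at this level, then recurse into common folders."""
    # ignore=[] disables dircmp's default ignore list (e.g. "CVS", ".git"),
    # so every entry is considered.
    comparison = filecmp.dircmp(str(source), str(replica), ignore=[])

    # add_missing_in_replica: copy whatever exists only in source.
    for name in comparison.left_only:
        src, dst = source / name, replica / name
        if src.is_dir():
            shutil.copytree(src, dst)
        else:
            shutil.copy2(src, dst)  # copy2 also tries to preserve mtime
        logger.info("Copied %s to %s", src, dst)

    # remove_extra_in_replica: delete whatever exists only in replica.
    for name in comparison.right_only:
        extra = replica / name
        if extra.is_dir():
            shutil.rmtree(extra)
        else:
            extra.unlink()
        logger.info("Removed %s", extra)

    # sync_changed_files: overwrite files present on both sides that differ.
    for name in comparison.diff_files:
        shutil.copy2(source / name, replica / name)
        logger.info("Updated %s", replica / name)

    # search_child_folders: repeat the same process one level down.
    for name in comparison.common_dirs:
        synchronize(source / name, replica / name, logger)
```

Because each call only looks at one level, the recursion stops naturally once a pair of folders has no subfolder in common. Folders that exist only in source are copied whole by `shutil.copytree`, so they need no recursion.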
Python v3.12
pip v23.3.2
- Clone the repository
git clone https://github.com/Desgue/Veeam_Python_Developer_Test.git
python main.py --source <path_to_source_folder> --replica <path_to_replica_folder> --log <path_to_log_file> [--interval <interval_number_in_seconds>]
- Description: Show the help menu that explains what each argument does and how to use it.
- Usage: -h or --help
- Description: Absolute path to the source folder.
- Usage: --source <absolute_path_to_source_folder> or -s <absolute_path_to_source_folder>
- Required: True
- Type: String
- Description: Absolute path to the replica folder.
- Usage: --replica <absolute_path_to_replica_folder> or -r <absolute_path_to_replica_folder>
- Required: True
- Type: String
- Description: Absolute path to the .log file. If the file does not exist, it will be created.
- Usage: --l <absolute_path_to_log_file> or --log <absolute_path_to_log_file>
- Required: True
- Type: String
- Description: Interval, in seconds, at which the program performs the synchronization task. Default is 60 seconds.
- Usage: -i <interval_number_in_seconds> or --interval <interval_number_in_seconds>
- Default: 60s
- Type: Integer
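For reference, the arguments listed above could be declared roughly as follows. This sketch assumes `argparse` and the standard `logging` module. The helper names `build_parser` and `build_logger`, the logger name and the log format are hypothetical and may differ from what main.py actually does.

```python
import argparse
import logging
import sys


def build_parser() -> argparse.ArgumentParser:
    """Declare the command line arguments described above (assumed layout)."""
    parser = argparse.ArgumentParser(description="One-way folder synchronization")
    parser.add_argument("-s", "--source", required=True,
                        help="Absolute path to the source folder")
    parser.add_argument("-r", "--replica", required=True,
                        help="Absolute path to the replica folder")
    # argparse accepts unambiguous prefixes of long options by default,
    # so "--l" resolves to "--log" here.
    parser.add_argument("--log", required=True,
                        help="Absolute path to the .log file")
    parser.add_argument("-i", "--interval", type=int, default=60,
                        help="Synchronization interval in seconds")
    return parser


def build_logger(log_path: str) -> logging.Logger:
    """Log every message both to the log file and to the console."""
    logger = logging.getLogger("sync")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    # FileHandler opens in append mode and creates the file if it is missing.
    for handler in (logging.FileHandler(log_path), logging.StreamHandler(sys.stdout)):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A typical entry point would call `args = build_parser().parse_args()` and then `build_logger(args.log)` before starting the sync loop.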
- Error handling could be improved. For that I need to read the docs of each lib I am using and understand which exceptions can be raised.
- Even though I performed manual testing to ensure all behaviors function as expected, an automated test script could be created to check them thoroughly.
- Better handling of the terminal interface, to allow a more graceful shutdown instead of using ctrl+c to stop the script, which would also make it possible to log the end of the script session for further analysis.
Total time spent on this project was about 8 hours, spread between reading about folder synchronization, searching for and reading the docs of the libraries I decided to use, and actually implementing and refactoring the code.
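One possible direction for the graceful shutdown is sketched below. It is an idea rather than something implemented in the repository: `run_periodically` and `install_signal_handlers` are hypothetical names, and the sketch assumes the sync loop runs in the main thread, since `signal.signal` can only be called there.

```python
import logging
import signal
import threading
from typing import Callable


def run_periodically(task: Callable[[], None], interval: float,
                     stop_event: threading.Event,
                     logger: logging.Logger) -> int:
    """Run task every `interval` seconds until stop_event is set.

    Returns the number of completed runs.
    """
    runs = 0
    while not stop_event.is_set():
        task()
        runs += 1
        # Unlike time.sleep(), wait() returns as soon as the event is set.
        stop_event.wait(interval)
    logger.info("Synchronization session ended after %d run(s)", runs)
    return runs


def install_signal_handlers(stop_event: threading.Event) -> None:
    """Turn Ctrl+C (SIGINT) and SIGTERM into a stop request, not a traceback."""
    for signum in (signal.SIGINT, signal.SIGTERM):
        signal.signal(signum, lambda _signum, _frame: stop_event.set())
```

With this shape, the current synchronization pass always finishes before the loop exits, and the end of the session is logged like any other event.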