Based on the task that Mr. Zsolt Csapi from Sophos in Hungary sent me, I was asked to develop an application that finds the cheapest meal of each day containing "csirkemell" (chicken breast) on the menu at "https://www.teletal.hu/etlap/24", which is the week 24 menu.
So I developed a somewhat more dynamic application that can search for different components, e.g., "baconos", "gomba", or "marha". It can also crawl other weeks' menus, e.g., "https://www.teletal.hu/etlap/22", "https://www.teletal.hu/etlap/23", or any future generated menu.
P.S.: The search term should be in Hungarian!
There are a couple of challenges related to different areas: scraping, data processing, and the menu's layout concept.
Lazy loading is an Ajax/JS technique that makes a website faster by loading content only when the user reaches a specific position on the page by scrolling down or clicking. The mechanism works by rendering a simpler HTML version when the user first opens the website; when the user scrolls down or clicks on some object, JavaScript code retrieves the data and overwrites the current HTML. So, in this case, we cannot simply grab the HTML of the website, because it would be the pre-lazy-loading content, which contains no useful information!
If you look at the other tables, the concept is horizontal, meaning there is a row for every column (each column represents a day of the week). But there is an exception! A menu called "Full Day Menu" is vertical, meaning all the rows of a column belong to one meal, which can only be bought as a pack! The screenshot attached below clears everything up (click on the image for a clear view).
It seems that on some days the restaurant is closed, or does not serve specific food on specific days. This can cause an empty cell (hole) in the menu. The screenshots attached below clear everything up (click on the image for a clear view).
The application consists of 3 parts, each described in its own section below!
As mentioned in the challenges section, there are two operational solutions for this part.
In this case, we should find the URL from which the JavaScript tries to load the content, which in this case is as below:
https://www.teletal.hu/ajax/szekcio?ev=<>&het=<>&ewid=<>&varname=<>
All the values can be retrieved from the section tags in the primary HTML. For instance:
<section class="uk-section uk-section-xsmall uk-section-default teletal-fozelekek" ev="2022" het="23" ewid="162900767" section="Főzelék" lang="hu">
P.S.: After loading the web page to get the values from the section tags, we should save the PHPSESSID cookie and include it in our request headers, because without this cookie, the web server returns an error!
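As a sketch of this approach, the ajax URL can be built from the attributes of a `section` tag and fetched while reusing the PHPSESSID cookie. The helper names and the `requests` usage below are illustrative, not the project's exact code:

```python
from urllib.parse import urlencode

AJAX_URL = "https://www.teletal.hu/ajax/szekcio"

def build_ajax_url(ev, het, ewid, varname):
    """Build the ajax URL from the attributes of a <section> tag."""
    return AJAX_URL + "?" + urlencode(
        {"ev": ev, "het": het, "ewid": ewid, "varname": varname}
    )

def fetch_section(ev, het, ewid, varname):
    """Hypothetical usage: load the menu page first so the session stores
    the PHPSESSID cookie, then request one section's table HTML.
    Not called here, since it performs real network requests."""
    import requests

    session = requests.Session()
    session.get("https://www.teletal.hu/etlap/24")  # sets the PHPSESSID cookie
    return session.get(build_ajax_url(ev, het, ewid, varname)).text
```

With the example section tag above, `build_ajax_url("2022", "23", "162900767", "Főzelék")` yields the fully encoded query string that the page's JavaScript would otherwise issue.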
Pros:
- Faster scraping
- Less resource usage
- More reliable (usually no need to change the code if the site's style changes)
Cons:
- Not always easy to find the way.
- The calling URL or API request format might change.
- The retrieved data might be encoded, and decoding it can mean digging through the obfuscating JavaScript!
Using the Selenium library and a web driver lets us open a browser (in this case, in the background) and simulate scrolling down. After reaching the bottom of the page, with all JavaScript executed by the browser's JavaScript engine, it returns the final HTML source code.
Pros:
- The easiest way to scrape
- Guaranteed to get the final HTML source
Cons:
- More time needed to scrape and execute JavaScript.
- More resources needed to run the web driver.
- Needs redeveloping if the website style changes.
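The scrolling step described above can be sketched as follows. This is a minimal, hypothetical helper: the 0.7 s pause and the loop bound are assumptions, the driver setup is omitted, and any Selenium-compatible driver object (one exposing `execute_script` and `page_source`) works:

```python
import time

def scroll_to_bottom(driver, pause=0.7, max_rounds=50):
    """Keep scrolling until the page height stops growing, i.e. all
    lazy-loaded content has been rendered by the browser."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the lazy-load JavaScript time to run
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # no new content appeared; we are at the real bottom
        last_height = new_height
    return driver.page_source  # the final, fully rendered HTML
```

Checking the document height after each scroll is what tells us the lazy loader has nothing left to inject.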
There is another way to retrieve the data: an XML file downloadable on the menu page. The problem that makes this method impracticable is that some information is missing from it, including the prices of some meals or menus, for instance, the "Full Day Menu". Moreover, 95% of websites don't support this kind of export, so this method is not workable in most cases!
After scraping the data with the methods mentioned above, since the scraped data is in HTML table format (as displayed below), we should first extract the raw data.
<table>
<tr>
<th>...</th>
<th>...</th>
<th>...</th>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
</table>
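Extracting the raw cell data from such a table can be sketched with `BeautifulSoup` (the project's HTML parser). The sample HTML below is made up for illustration:

```python
from bs4 import BeautifulSoup

SAMPLE_TABLE = """
<table>
  <tr><th>Hétfő</th><th>Kedd</th><th>Szerda</th></tr>
  <tr><td>Csirkemell rizzsel</td><td>Gulyásleves</td><td></td></tr>
</table>
"""

def extract_rows(table_html):
    """Return the table as a list of rows, each row a list of cell texts.
    An empty string marks a hole in the menu (challenge #3)."""
    soup = BeautifulSoup(table_html, "html.parser")
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in soup.find_all("tr")
    ]
```

Keeping empty cells as empty strings (instead of dropping them) preserves the column alignment, which matters when days of the week are identified by position.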
The next step is putting this raw data into a proper data structure; in this case, I use a nested dictionary. A sample of the imported data is shown below. The attached screenshot shows how I chose the key names (click on the image for a clear view).
{
    "Table_Name": {
        "Row_Name": {
            "Food_Type": {
                "Days_of_week": [
                    {
                        "food_description": "<>",
                        "food_price": <>
                    },
                    .....
                ]
            },
            .....
        },
        .....
    },
    .....
}
Then, with a recursive method, we search the dictionary for food descriptions containing the given search term. If a branch contains no description matching the search term, that sub-root and root are removed. After filtering the dictionary, we must get the minimum price for a specific day. Another method runs the mentioned method for each day of the week and stores the results in a new dictionary.
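The filtering and minimum-price steps can be sketched like this. The layout assumes day names are the keys under each food type, matching the nested-dictionary sample above; all names and the data shape are illustrative:

```python
def filter_dict(node, term):
    """Recursively keep only the branches whose food_description
    contains the (case-insensitive) search term; prune everything else."""
    if isinstance(node, dict):
        if "food_description" in node:  # leaf: a single meal entry
            return node if term.lower() in node["food_description"].lower() else None
        kept = {key: filter_dict(value, term) for key, value in node.items()}
        kept = {key: value for key, value in kept.items() if value}
        return kept or None  # drop roots/sub-roots that became empty
    if isinstance(node, list):
        kept = [filter_dict(item, term) for item in node]
        kept = [item for item in kept if item]
        return kept or None
    return None

def min_price_for_day(filtered, day):
    """Walk the filtered tree and return the cheapest meal for one day."""
    best = None
    for table in filtered.values():
        for row in table.values():
            for food_type in row.values():
                for meal in food_type.get(day, []):
                    if best is None or meal["food_price"] < best["food_price"]:
                        best = meal
    return best
```

Running `min_price_for_day` once per day name then assembles the weekly result dictionary.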
The result dictionary can be exported in 3 formats:
- Text-based table (print in CLI or store in a file) - Default Exporting Method
- Save the table in HTML format and show it in the browser
- Export the data in JSON format (print in CLI or store in a file)
There are 5 files, as described below:
- `reverse_enginning.py`: contains the `ReverseEnginning` class for scraping with the reverse engineering method.
- `selenium_scrap.py`: contains the `SeleniumScrap` class for scraping with the Selenium method.
- `table_process.py`: contains the `TableProcess` class for processing the scraped data: filtering and sorting.
- `output.py`: contains the `Output` class for creating the formatted result (table, HTML, JSON) as output.
- `main.py`: handles CLI arguments and executes the application based on the other objects and classes.
Attributes:
- `__url` (attribute, protected, str): set from the `url` variable at class initiation.
- `__ajax_url` (attribute, protected, str): set to the ajax URL statically.
- `__cookies` (attribute, protected, dict): set as an empty dictionary at class initiation.
- `__scraped_data` (attribute, protected, None): set as None at class initiation.
- `__none_tables` (attribute, protected, list): set as an empty list at class initiation.
- `__tables_content` (attribute, protected, dict): set as an empty dictionary at class initiation.
Methods:
- `scrape` (method, public, no arguments, returns a dictionary object): executes `requests` to get the primary HTML content, parses it with the `BeautifulSoup` HTML parser, and assigns the result to the `__scraped_data` attribute; also assigns the cookie to the `__cookies` attribute. Then executes the `__get_tables_ajax` method. At the end, returns `__tables_content`.
- `__get_tables_ajax` (method, protected, no arguments, no return): gets the related `section` tags and grabs the values of the tags' attributes. Iterates over the values and sends a request with the saved cookie to the ajax URL. Assigns the retrieved data as the dictionary value and the table name as the key in `__tables_content`. Appends the names of tables that could not be retrieved to `__none_tables`. Then executes `__get_tables_website`.
- `__get_tables_website` (method, protected, no arguments, no return): tries to retrieve the content of the tables not found via Ajax from the primary HTML source, then joins it with `__tables_content`.
Attributes:
- `__url` (attribute, protected, str): set from the `url` variable at class initiation.
- `SCROLL_PAUSE_TIME` (attribute, public, float): set to 0.7 s for pausing between page scrolls, at class initiation.
- `__driver` (attribute, protected, object): set as the Chrome web driver at class initiation.
- `__scraped_data` (attribute, protected, None): set as None at class initiation.
- `__tables_content` (attribute, protected, dict): set as an empty dictionary at class initiation.
Methods:
- `scrape` (method, public, no arguments, returns a dictionary object): scrolls down with the web driver and retrieves the full HTML content with Selenium, then parses it with the `BeautifulSoup` HTML parser and assigns the result to the `__scraped_data` attribute. Then executes the `__get_tables` method. At the end, returns `__tables_content`.
- `__get_tables` (method, protected, no arguments, no return): gets the related `section` tags and data. Assigns the retrieved data as the dictionary value and the table name as the key in `__tables_content`.
Attributes:
- `__data` (attribute, protected, object): set from the `scrape_data` variable at class initiation.
- `__all_tables_dict` (attribute, protected, dict): set as an empty dictionary at class initiation.
- `__week_days` (attribute, protected, list): set as the list of days of the week (in Hungarian) at class initiation.
- `__filtered_dict` (attribute, protected, dict): set as an empty dictionary at class initiation.
Methods:
- `create_dict` (method, public, no arguments, no return): extracts the raw data with `BeautifulSoup` and creates a nested dictionary, also dealing with challenges #2 and #3. Then assigns the final result to `__all_tables_dict`.
- `filter_dict` (method, public, with argument, no return): takes the search term as an argument and recursively searches for it in `__all_tables_dict`. Eliminates all roots and sub-roots that don't contain the search term. Then assigns the final result to `__filtered_dict`.
- `__get_min_price_by_day` (method, protected, with argument, returns a dictionary object): takes the day name (in Hungarian) as an argument and returns the minimum-price food information for that specific day.
- `get_cheapest_food_week` (method, public, no arguments, returns a dictionary object): calls `__get_min_price_by_day` with each day in `__week_days` and returns a dictionary of every day's result.
Attributes:
- `__result` (attribute, protected, dict): set from the `result` variable at class initiation.
- `__df` (attribute, protected, object): set to a `pandas` DataFrame built from the given `result`.
Methods:
- `to_html` (method, public, with argument, no return): takes a string as the HTML `filename` to save the exported HTML, and opens it in the browser. The default `filename` value is `result.html`.
- `to_table` (method, public, with arguments, conditional return): takes a string as the text `filename` to save the table-formatted data in a text file. Takes `print_result` to either save it to a file or print it to the CLI console. The default `filename` value is `result.txt`, and the default `print_result` value is True.
- `to_json` (method, public, with arguments, conditional return): takes a string as the JSON `filename` to save the JSON-formatted data in a JSON file. Takes `print_result` to either save it to a file or print it to the CLI console. The default `filename` value is `result.json`, and the default `print_result` value is True.
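A rough sketch of why a `pandas` DataFrame makes all three exports cheap (the data and key names below are illustrative, not the project's exact schema):

```python
import pandas as pd

# One cheapest meal per day, as the weekly result might look (illustrative data)
result = {
    "Hétfő": {"food_description": "Csirkemell rizzsel", "food_price": 1500},
    "Kedd": {"food_description": "Csirkemell tésztával", "food_price": 1200},
}

df = pd.DataFrame(result).T  # transpose so days become rows

table_text = df.to_string()  # "table" output (CLI or result.txt)
html_text = df.to_html()     # "html" output (result.html)
json_text = df.to_json(orient="index", force_ascii=False)  # "json" output
```

Once the result dictionary is in a DataFrame, each export format is a single method call.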
A. Create a directory (Folder)
B. Download the source code files and move them into the directory created in step A
C. Open a Terminal / Command Prompt / Powershell (based on your OS)
D. Change the directory to the newly created directory in step A
E. Create a venv environment with the below command
python -m venv .
F. Activate venv with the below command (based on your OS)
### For Command Prompt
.\Scripts\activate.bat
### For Powershell
.\Scripts\Activate.ps1
### For Linux Terminal
source ./bin/activate
Run the following command to install dependencies and required libraries automatically (based on your OS).
### For Windows
pip install -r .\requirements.txt
### For Linux
pip install -r ./requirements.txt
Now it is time to run the application with the desired arguments. A list of arguments and their descriptions is below.
Usage:
> python main.py [-h] [--scraping-method [<Scraping-Method>]] [--search-term [<Search-Term>]] [--link [<Teletal-Menu-Link>]] [--output [<Result-Output>]]
Options:
-h, --help show this help message and exit
--scraping-method [<Scraping-Method>]
There are two types of scraping methods for this application. Use "selenium" for scraping the web page using the Selenium library, or use "reverse" for scraping the
web page using methods that I found by reverse engineering! default is "reverse"
--search-term [<Search-Term>]
Enter the desired meal name as a search term. This will search the entire menu to find meals containing that component.
--link [<Teletal-Menu-Link>]
Specify the link to the menu on the Teletal website. e.g., "https://www.teletal.hu/etlap/24"
--output [<Result-Output>]
Use "html" for exporting data to an HTML file and show in the browser, or use "table" to show in tabled (markdown) format in a text file, or use "json" to show in
JSON format in CLI! default is "table".
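The argument handling in main.py might look roughly like this `argparse` sketch; the defaults for `--search-term` and `--link` are assumptions for illustration, not confirmed behavior:

```python
import argparse

def build_parser():
    """CLI parser mirroring the options listed above (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Find the cheapest matching meal for each day of the week."
    )
    parser.add_argument("--scraping-method", nargs="?", default="reverse",
                        choices=["reverse", "selenium"],
                        help='Scraping method; default is "reverse".')
    parser.add_argument("--search-term", nargs="?", default="csirkemell",
                        help="Meal component to search for (in Hungarian).")
    parser.add_argument("--link", nargs="?", default="https://www.teletal.hu/etlap/24",
                        help="Link to the weekly menu on the Teletal website.")
    parser.add_argument("--output", nargs="?", default="table",
                        choices=["table", "html", "json"],
                        help='Output format; default is "table".')
    return parser

# Example: parse a known argument list instead of sys.argv
args = build_parser().parse_args(["--search-term", "gomba", "--output", "json"])
```

`argparse` turns `--search-term` into `args.search_term`, and `choices` rejects unknown scraping methods or output formats up front.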
For instance, the following command will execute with the "selenium" scraping method, "csirkemell" as the search term, "https://www.teletal.hu/etlap/22" as the menu link, and "html" as the output format.
python main.py --scraping-method "selenium" --search-term "csirkemell" --link "https://www.teletal.hu/etlap/22" --output "html"