This repository contains the practical component of my undergraduate thesis, the culmination of my bachelor's degree. By decision of the Attestation Commission dated June 19, 2023 (protocol #202), the thesis was awarded an outstanding grade of 97 out of 100 (A+).
Files:
- alakv-6.xlsx - the dataset;
- alakvanalysis1.ipynb - the data analysis Jupyter notebook;
- app.py - the Flask application that integrates the ML model;
- geocoder.py - geocoding coordinates to locations and vice versa.
Development of an information system for predicting real estate prices using machine learning and neural networks. The following tools and technologies were used:
- Web Scraper Google Chrome extension for scraping data;
- Python programming language for data analysis and web development;
- Jupyter Notebook for machine learning model development;
- NumPy, Pandas, Scikit-learn, Keras, and Pickle Python libraries for data analysis and model training/testing;
- PyCharm IDE;
- HTML, CSS, JS, and Bootstrap for front-end development;
- Flask framework for machine learning model deployment;
- the "Geocode by Awesome Table" Google Sheets extension for geocoding the latitude and longitude of each property.
The first and most crucial stage of machine learning is data collection: the process of gathering and measuring information on variables of interest in a systematic way that makes it possible to answer research questions, test hypotheses, and evaluate results. Data collection is common to all fields of study, including the physical and social sciences, the humanities, and business. Regardless of the field, and regardless of whether the data is quantitative or qualitative, accurate collection is essential to maintaining the integrity of the study. Selecting appropriate data collection tools (existing, modified, or newly developed) and defining clear guidelines for their use also reduces the likelihood of errors.
Incorrectly collected data can distort the analysis and invalidate its conclusions. During data collection it is therefore very important to anticipate possible measurement and coding errors, to record values in the database in uniform units of measurement, and to ensure correct entry. The most important purpose of data collection is to obtain rich, reliable information for analysis: the quality of the forecasting model is directly related to the quality of the data.
The website Krisha.kz was chosen as the main data provider for this project. Krisha.kz is a real estate website in Kazakhstan that provides information and services for those interested in buying, selling, or renting properties in the country. It offers a platform for real estate professionals and private individuals to list properties and reach potential buyers or renters. The website also provides a variety of tools and resources for real estate research and analysis, including property listings, property valuations, real estate news, market trends, and more. Krisha.kz aims to make buying, selling, or renting real estate in Kazakhstan easier, more efficient, and more accessible for all users.
Extracting data from Krisha.kz is possible using a web-scraping tool.
First, the web scraper is given one or more URLs to load before scraping. Then the scraper loads the entire HTML code of the page in question; more advanced scrapers render the whole website, including CSS and JavaScript elements. Most web scrapers output the data to a CSV or Excel spreadsheet.
The web scraping tool used is called WebScraper; it is distributed as a Chrome extension. The data preparation steps are as follows:
- installing the WebScraper Chrome extension;
- opening Krisha.kz and selecting apartments/buildings/land/offices in the city of Almaty in the search bar;
- opening WebScraper in the browser's Developer Tools;
- creating a sitemap named alakv;
- defining the pagination selector, shown in figure 1, which identifies each webpage the data will be extracted from;
- creating an element card with child elements for the name, price, and address of each property, as shown in figure 2;
- exporting the data and downloading it as an XLSX file.
Figure 1. Choosing pagination
Pagination is navigating through multiple pages to extract data.
Figure 2. Selecting element cards
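The point-and-click steps above could also be scripted directly in Python. Below is a minimal, self-contained sketch using only the standard library; the HTML structure and class names are hypothetical, since the actual extraction was done with the WebScraper extension:

```python
from html.parser import HTMLParser

class ListingParser(HTMLParser):
    """Collects (name, price, address) records from listing-card markup.
    The 'card-*' class names below are hypothetical placeholders."""
    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None      # which field the next text node belongs to
        self._current = {}

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        for field in ("name", "price", "address"):
            if f"card-{field}" in classes:
                self._field = field

    def handle_data(self, data):
        if self._field:
            self._current[self._field] = data.strip()
            self._field = None
            if len(self._current) == 3:   # one card is complete
                self.listings.append(self._current)
                self._current = {}

# Example markup resembling one scraped listing card
sample_html = """
<div class="card">
  <span class="card-name">2-room apartment, 60 m2</span>
  <span class="card-price">35 000 000 KZT</span>
  <span class="card-address">Almaty, Bostandyk district</span>
</div>
"""
parser = ListingParser()
parser.feed(sample_html)
print(parser.listings)
```

In practice each results page would be fetched over HTTP and fed to the parser, and the accumulated rows written out to CSV or XLSX, mirroring what the extension does.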
When the data had been fully scraped and exported as an XLSX file, it was cleaned in Excel. The dataset before and after cleaning is shown in figures 3 and 4 respectively. Before cleaning, the data contained unnecessary columns and values, missing values, etc.
Figure 3. Before data cleaning process in Excel
After the data cleaning process and feature engineering, new columns were added and the columns' data types were changed.
Figure 4. After data cleaning process in Excel
To add the features needed for data preprocessing, a Google Sheets extension called "Geocode by Awesome Table" was used; it geocoded every address to its longitude and latitude.
Data analysis is a set of methods and tools for extracting information from organized data for decision making. Analysis is not just the processing of information after it has been collected; it is a tool for testing hypotheses. The goal of any data analysis is to fully understand the situation under study: to identify trends (including negative deviations from the plan), make predictions, and give recommendations. To achieve this goal, the data must first be processed and prepared for analysis.
All data contains important information, but for different questions; every dataset must be processed to extract the information that is useful for the specific situation. During processing, the data were prepared for analysis and adjusted to the requirements determined by the specifics of the problem being solved.
Pre-processing is an important stage of data mining: if it is skipped, analytical algorithms may be hindered in the subsequent analysis, or their results may be incorrect. In other words, the GIGO principle applies: garbage in, garbage out.
Data processing involves cleaning and optimizing the data to eliminate the factors that diminish data quality and impede analytical algorithms: duplicates, conflicts, false values, missing values, noise, and outliers. During cleaning, the structure, completeness, integrity, and correct formatting of the data are restored. Data preprocessing and cleaning are important tasks that must be performed before a dataset is used to train a model.
Raw data are often skewed and unreliable, and may contain missing values.
Using such data in modeling can lead to incorrect results.
Real-world data is collected from a variety of sources and processes and may therefore contain errors and corrupted values that negatively affect the quality of the dataset; these are typical data quality problems.
The general view of the data after the cleaning process is shown in figure 5. There are 6,700 rows of data.
Figure 5. General view of the dataset
The next step is preprocessing the collected data. The process started by loading the dataset into a Jupyter Notebook. The information in figure 6 shows the number and types of the available parameters.
Figure 6. Information about the data frame
wallmaterial is a column containing the type of wall material; the value counts for each type are shown in figure 7.
Figure 7. Information about wallmaterial
Missing (NaN) values were removed and refilled, as shown in figure 8. NaN, standing for Not a Number, is a special value of a numeric data type that is undefined or unrepresentable, such as the result of zero divided by zero.
Figure 8. Filling missing values in wallmaterial
floorNumber is the number of the floor the apartment is located on; floorTotal is the total number of floors in the building; totalArea is the total area of the apartment. Fewer than 1% of the values in these columns were missing, so they were filled with average values. state is the state (condition) of the apartment; the types of states are shown in figure 9.
Figure 9. Information about state
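The filling steps described above can be sketched in pandas. The DataFrame below is a small synthetic stand-in, and the fill strategies (most frequent value for the categorical wallmaterial column, column mean for the numeric columns) follow the text:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the scraped dataset
df = pd.DataFrame({
    "wallmaterial": ["panel", None, "brick", "panel"],
    "floorNumber":  [3, np.nan, 5, 2],
    "floorTotal":   [9, 9, np.nan, 5],
    "totalArea":    [46.5, 60.0, np.nan, 75.0],
})

# Categorical column: fill missing values with the most frequent value
df["wallmaterial"] = df["wallmaterial"].fillna(df["wallmaterial"].mode()[0])

# Numeric columns had fewer than 1% missing values, so fill with the mean
for col in ["floorNumber", "floorTotal", "totalArea"]:
    df[col] = df[col].fillna(df[col].mean())

print(df.isna().sum().sum())  # 0 - no missing values remain
```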
The coordinates of the apartments, i.e. the longitude and latitude data, are shown in figure 10.
Figure 10. Information about longitude and latitude
The price column's descriptive statistics are shown in figure 11. The mean price of all apartments is about 50 million KZT; the statistics also show that one apartment is listed for 1 billion KZT.
Figure 11. Information about price
The year column's descriptive statistics are shown in figure 12. The oldest listed building was built in 1932, and the median construction year is 2011 (50% of the buildings were built in 2011 or earlier).
Figure 12. Information about year
Feature engineering is the process of selecting, transforming, and creating features or variables from raw data in order to improve the performance of a machine learning model. This involves identifying the relevant variables, removing irrelevant or redundant ones, transforming variables to improve their quality, and creating new variables that may be useful for the model. Feature engineering is a crucial step in the data analysis process as it can significantly impact the accuracy and generalizability of a model. It requires a deep understanding of the data and the problem domain, as well as creativity and knowledge of various techniques and tools for feature selection and transformation.
The following features were added:
- priceMetr is the price of the apartment per square meter;
- distance is the distance from the apartment to the city center;
- azimuth is the angle relative to the north direction.
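The distance and azimuth features can be derived from the coordinates with standard spherical formulas. A sketch in pure Python follows; the Almaty city-centre coordinate used as the reference point is an assumption:

```python
import math

# Assumed reference point: approximate centre of Almaty
CENTER_LAT, CENTER_LON = 43.238949, 76.889709

def distance_km(lat, lon):
    """Great-circle (haversine) distance from the city centre, in km."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(CENTER_LAT), math.radians(lat)
    dphi = math.radians(lat - CENTER_LAT)
    dlmb = math.radians(lon - CENTER_LON)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def azimuth_deg(lat, lon):
    """Initial bearing from the city centre to the point, degrees from north."""
    p1, p2 = math.radians(CENTER_LAT), math.radians(lat)
    dlmb = math.radians(lon - CENTER_LON)
    x = math.sin(dlmb) * math.cos(p2)
    y = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dlmb)
    return (math.degrees(math.atan2(x, y)) + 360) % 360

# Example: an apartment slightly north-east of the centre
print(round(distance_km(43.25, 76.95), 2), round(azimuth_deg(43.25, 76.95), 1))
```

Applying these two functions to every row of the dataset yields the distance and azimuth columns.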
As can be seen in figure 13, two categorical parameters remained after filling in the incomplete data.
Figure 13. Parameters with categorical values
To work with these parameters, they were converted to numeric values, as shown in figure 14: 3 codes were assigned to the wallmaterial column and 5 to the state column.
Figure 14. Converting categorical parameters to numeric
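The conversion can be done with a simple dictionary mapping in pandas. The category labels and numeric codes below are illustrative assumptions, not the exact ones used:

```python
import pandas as pd

df = pd.DataFrame({
    "wallmaterial": ["panel", "brick", "monolith", "panel"],
    "state": ["good", "needs repair", "free layout", "good"],
})

# Illustrative code assignments: 3 wall-material codes, 5 state codes
wall_codes = {"panel": 0, "brick": 1, "monolith": 2}
state_codes = {"good": 0, "needs repair": 1, "free layout": 2,
               "rough finish": 3, "renovated": 4}

df["wallmaterial"] = df["wallmaterial"].map(wall_codes)
df["state"] = df["state"].map(state_codes)
print(df)
```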
The resulting general view of the data is shown in figure 15. It has 12 columns, meaning the dataset has 12 features.
Figure 15. Data after pre-processing
The last stage of data preparation is selecting the target variable for prediction: the price of an apartment per square meter. The next step is to select the parameters that will take part in training the model and to form a new X dataset with no price indication, as in figure 16.
Figure 16. Machine learning parameters
The train_test_split() function automatically divides X and y into 4 groups: training and validation features and labels. This makes it possible to check the quality of the model on unseen data.
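A typical call looks like the following sketch, with synthetic stand-ins for X and y; the 80/20 split ratio and random_state are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix X and target y
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Four groups: training/validation features and training/validation labels
train_X, val_X, train_y, val_y = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(train_X.shape, val_X.shape)  # (40, 2) (10, 2)
```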
Model training and testing is a critical step in the development of a machine learning model. During training, the model is exposed to a set of labeled data, which it uses to learn the patterns and relationships between input features and output labels. The goal is to minimize the difference between the predicted output and the true output for each input. Once training is complete, the model is evaluated on a separate set of data, known as the test set, to assess its performance and generalizability. The results of the evaluation are then used to make any necessary adjustments to the model or to choose a different model altogether.
For the real estate price prediction task, several machine learning algorithms were used and compared.
Development and training of the neural network model with the Keras library has several stages:
- the necessary modules from TensorFlow and Keras are imported;
- a custom loss function called root_mean_squared_error_keras is defined; it calculates the root mean squared error between the true labels (y_true) and the predicted labels (y_pred);
- a sequential model is created using the Sequential class from Keras; it consists of multiple layers defined with the Dense class, each representing a fully connected layer, where the specified number of units determines the dimensionality of the layer's output. The relu (Rectified Linear Unit) activation function is used for the intermediate layers, while the final layer is left without an activation function;
- the model is compiled by specifying the optimizer and loss function: in this case the RMSprop optimizer and the custom root_mean_squared_error_keras loss;
- the fit function is called to train the model. It takes the training data train_X and the corresponding labels train_y as input; the validation_data parameter is used to evaluate the model's performance on validation data during training, batch_size specifies the number of samples used in each gradient update, and epochs determines how many times the training data is iterated over.
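These stages can be sketched as follows; the layer sizes, batch size, and epoch count are illustrative assumptions rather than the exact values used:

```python
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Custom loss: root mean squared error between true and predicted labels
def root_mean_squared_error_keras(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_pred - y_true)))

# Sequential model of fully connected Dense layers; sizes are illustrative
model = keras.Sequential([
    keras.Input(shape=(8,)),              # 8 input features (assumed)
    layers.Dense(64, activation="relu"),  # intermediate layers use relu
    layers.Dense(32, activation="relu"),
    layers.Dense(1),                      # final layer: no activation
])
model.compile(optimizer="rmsprop", loss=root_mean_squared_error_keras)

# Tiny synthetic data, just to demonstrate the training call
train_X, train_y = np.random.rand(32, 8), np.random.rand(32)
val_X, val_y = np.random.rand(8, 8), np.random.rand(8)
history = model.fit(train_X, train_y,
                    validation_data=(val_X, val_y),
                    batch_size=8, epochs=2, verbose=0)
print(len(history.history["loss"]))  # one loss value per epoch
```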
The metrics for evaluating the models are:
Mean absolute error measures the average absolute difference between the predicted values and the true values. It provides a measure of the average magnitude of the errors.
Median absolute error is similar to MAE, but instead of taking the mean of the absolute differences, it takes the median. The median represents the middle value in a sorted list of values. MedAE is less sensitive to outliers compared to MAE and provides a robust measure of error.
Mean squared error measures the average of the squared differences between the predicted values and the true values. It gives higher weight to large errors since errors are squared before averaging.
Root mean squared error penalizes large errors more compared to MAE since it involves the square of the differences. It provides a measure of the typical magnitude of errors and is sensitive to outliers.
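All four metrics are available in scikit-learn, but they are simple enough to express directly in NumPy; a sketch with a small worked example:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the errors."""
    return np.mean(np.abs(y_true - y_pred))

def medae(y_true, y_pred):
    """Median absolute error: robust to outliers."""
    return np.median(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error: gives higher weight to large errors."""
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    """Root mean squared error: same units as the target."""
    return np.sqrt(mse(y_true, y_pred))

y_true = np.array([100., 200., 300., 400.])
y_pred = np.array([110., 190., 330., 400.])
print(mae(y_true, y_pred), medae(y_true, y_pred),
      mse(y_true, y_pred), rmse(y_true, y_pred))
```

Note how the single large error (30) pulls MSE and RMSE up much more than it affects the median absolute error.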
The trained models' performances and scores are shown in figure 17.
Figure 17. Models’ results
Based on the results, the random forest was chosen as the main model for the web application.
To train the random forest algorithm, a RandomForestRegressor object was created and saved as rf_model. It takes a number of parameters (i.e. hyperparameters).
The model was launched using the fit method. The results of the random forest model are shown in figure 18.
Figure 18. Results of random forest model
The predict(val_X) method generates predictions for the validation subset of X, and the print_metrics() function takes the predicted and true values and prints the metric values.
Figure 19. Important features ranking
When building a random forest model, there is a way to see the importance of each feature. Figure 19 shows the ranking of the important features in the following way:
- distance (0.339830);
- azimuth (0.186889);
- totalArea (0.179689);
- year (0.107004);
- floorsTotal (0.082462);
- floorNumber (0.054648);
- state (0.027602);
- wallmaterial (0.021874).
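Putting the random forest steps together, a minimal sketch on synthetic data (the actual hyperparameters of rf_model are not reproduced here):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: 200 samples, 8 features named as in the dataset
feature_names = ["distance", "azimuth", "totalArea", "year",
                 "floorsTotal", "floorNumber", "state", "wallmaterial"]
rng = np.random.default_rng(0)
X = rng.random((200, 8))
# Make the synthetic price depend mostly on distance, then totalArea
y = 3 * X[:, 0] + X[:, 2] + rng.normal(0, 0.05, 200)

rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X, y)
preds = rf_model.predict(X[:5])

# Rank features by importance, as in figure 19
ranking = sorted(zip(feature_names, rf_model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
print(ranking[0][0])  # the dominant feature in this synthetic setup
```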
To check the performance of the model after training, the price of an apartment put up for sale on the website https://krisha.kz was predicted. The features of the apartment are shown in figure 21.
Figure 21. Parameters of an apartment
To do this, a dataframe describing the parameters of this apartment was created, as shown in figure 22; 10 features of the apartment were entered.
Figure 22. Creating dataframe
Excess columns were removed from the DataFrame using the drop function, and the missing parameters were filled in from the available ones. The offer price was then predicted with the trained rf_model.
Figure 23. Prediction result
The price of the apartment predicted by the model, shown in figure 23, is 33,881,000 tenge. As an 8-10% error was expected, the result is good: the apartment's real price was only 2.6% lower than the predicted price.
Several machine learning models were developed; they made numerical predictions for this test, the results were verified, and all of this was done autonomously. Generating predictions is only one part of a machine learning project, but here it was the most important part.
A DFD (Data Flow Diagram) is a graphical representation that illustrates the flow of data within a system: how data is input, processed, and output by its different components. The DFD is shown in figure 4.6.1.
Figure 4.6.1. Model as a service for users
The web application is built on top of the machine learning model. After training, the model was saved so that it could be used without retraining: a few lines were added to save the model as a .PKL file for further use, and the model load function was used to load it in the web app.
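Saving and loading with Pickle can be sketched as follows; the file name is an assumption, and a plain dictionary stands in for the trained model to keep the example dependency-free:

```python
import pickle

# Stand-in for the trained estimator; any picklable object works the same way
rf_model = {"kind": "RandomForestRegressor", "n_estimators": 100}

# Save the trained model to a .pkl file after training ...
with open("rf_model.pkl", "wb") as f:
    pickle.dump(rf_model, f)

# ... and load it in the web application without retraining
with open("rf_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model == rf_model)  # True
```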
The application was launched as a single module. A new Flask instance was created with the __name__ argument so that Flask can find the HTML templates in the same directory (templates). Next, the @app route decorator was used, which binds the index function to a URL. The POST method was used to transfer data to the server. Within the predict function, the parameters of the apartment are read from the submitted form [18]. The model takes the new values entered by the user and uses them to predict the price of the apartment. To predict the price of an apartment, the user fills out the form and clicks the predict button.
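A minimal sketch of such a Flask application is shown below; the route names, form fields, and the dummy model are illustrative assumptions, not the exact contents of app.py:

```python
from flask import Flask, request

app = Flask(__name__)

# Stand-in for the loaded rf_model; the real app unpickles the trained model
class DummyModel:
    def predict(self, rows):
        return [30_000_000 for _ in rows]  # constant placeholder price, KZT

model = DummyModel()

@app.route("/")
def index():
    # The real app renders an HTML template containing the input form
    return "Apartment price prediction form"

@app.route("/predict", methods=["POST"])
def predict():
    # Read illustrative features from the submitted form
    features = [float(request.form["totalArea"]),
                float(request.form["floorNumber"])]
    price = model.predict([features])[0]
    return {"predicted_price": price}

# app.run(debug=True) would serve the app at http://127.0.0.1:5000
```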
The front and main page of the web application is shown in figure 4.6.2. The website runs locally at 127.0.0.1:5000. It has a menu bar and a landing page.
Figure 4.6.2. Main page
The form for inputting apartment details is shown in figure 4.6.3. It allows the user to input 10 features of an apartment.
Figure 4.6.3. Apartment details inputting form
The apartment whose details were used for price prediction is shown in figure 4.6.4. It is listed on krisha.kz: a 2-bedroom apartment of 46.5 sq m, on sale for 29 million KZT, located on the 3rd floor of a 4-floor building. The building was built in 1068 and has panel wall material.
Figure 4.6.4. Apartment details for price prediction
The prediction process is shown in figures 4.6.5 and 4.6.6 respectively. After entering the details, the user clicks the "Submit" button.
Figure 4.6.5. Apartment details inputting for price prediction
Within a second, the machine learning model takes apartment details as an input and calculates the price of an apartment.
Figure 4.6.6. Apartment price prediction
The initial price of the apartment was 29,000,000 KZT, whereas the model gave an output of 28,600,000 KZT [20]. As expected of the machine learning model, the error was well within the 8-12% range.