
uf_research

Report

Introduction

In this project, we use street-level photographs and machine learning methods to investigate sociological questions. For example, is the number of cars on a street related to the median property value in that area? We also consider other hypotheses, such as whether the percentage of Japanese cars on a street is related to the average education level in that area.

Timeline

Gather Information: Weeks 1-2

Data Collection and Analysis: Weeks 3-4

Model Modification and Training: Weeks 4-5

Conclusion: Week 6

Methodology

Data Collection

Intro: FIPS place codes are used to identify cities, towns, and villages in the United States. These codes are essential for describing and analyzing geographic location information in census data.

Image source: Google Street View API

Number of pictures: 20,000

Number of FIPS areas: 200 (randomly selected)

Number of images per FIPS area: 100
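The collection step can be sketched as follows. This is a minimal illustration, assuming a helper file fips_points.csv of candidate (FIPS, lat, lng) sampling points and a Google Street View Static API key; the file name, column names, and sampling logic are assumptions rather than the exact pipeline used.

```python
import os
import random

import pandas as pd
import requests

# Hypothetical input: one candidate sampling point (FIPS, lat, lng) per row.
points = pd.read_csv("fips_points.csv")          # assumed file name and layout
fips_sample = random.sample(sorted(points["FIPS"].unique()), 200)

API_KEY = "YOUR_GOOGLE_API_KEY"
URL = "https://maps.googleapis.com/maps/api/streetview"  # Street View Static API

os.makedirs("images", exist_ok=True)
for fips in fips_sample:
    # Draw 100 sampling points within this FIPS area (with replacement if few exist).
    rows = points[points["FIPS"] == fips].sample(100, replace=True)
    for i, row in enumerate(rows.itertuples()):
        params = {"size": "640x640", "location": f"{row.lat},{row.lng}", "key": API_KEY}
        resp = requests.get(URL, params=params, timeout=30)
        if resp.ok:
            with open(f"images/{fips}_{i}.jpg", "wb") as f:
                f.write(resp.content)
```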

Data Analysis and Cleaning

Intro: For all the downloaded street images, we use an API to recognize vehicles, including their number, brand, type, and series. The recognized data is stored in a database and, after relevant statistical operations (such as counting the number of vehicles in each FIPS area), exported to a CSV table.

Step 1: Use the recognition API

API: Tencent Cloud API (vehicle recognition)

Step 2: Store the data in a database

Database: MySQL

Step 3: Perform statistical aggregation, e.g., count the vehicles and compute series/type breakdowns for each FIPS area

Step 4: Output the aggregated statistics to a CSV table (see the sketch below)
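A minimal sketch of Steps 3 and 4, assuming a MySQL table named vehicles with one row per recognized vehicle and columns FIPS, brand, series, and type; the connection string, table name, and column names are assumptions about the schema, not the exact one used.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed connection string and schema: one row per recognized vehicle.
engine = create_engine("mysql+pymysql://user:password@localhost/streetview")
vehicles = pd.read_sql("SELECT FIPS, brand, series, type FROM vehicles", engine)

# Step 3: aggregate per-FIPS counts (total vehicles plus series/type breakdowns).
stats = vehicles.groupby("FIPS").agg(
    num_vehicles=("brand", "size"),
    num_japanese=("series", lambda s: (s == "Japanese").sum()),
    num_american=("series", lambda s: (s == "American").sum()),
    num_pickups=("type", lambda t: (t == "pickup").sum()),
    num_suvs=("type", lambda t: (t == "SUV").sum()),
    num_sedans=("type", lambda t: (t == "sedan").sum()),
)

# Step 4: export the per-FIPS statistics to CSV for the modeling stage.
stats.reset_index().to_csv("data.csv", index=False)
```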

Model Modification and Training

After collecting and analyzing the dataset, we use both regression models and classifier models to discover the relationship between the amount and type of vehicles and the median property value in a given area.

Regression Model

For the regression models, we used a baseline regression model and regression models with multiple independent variables. The baseline regression model was used to examine the relationship between the number of vehicles and the median property value in a given area. The regression models with multiple independent variables were used to examine the relationship between vehicle series and the median property value, and between vehicle types and the median property value. Here, vehicle series refers to Japanese, American, and other series; vehicle types refer to pickups, SUVs, sedans, and others.

The Baseline Regression Model

The num_vehicles Model

The total number of vehicles in the region is the independent variable, and the median property value in the region is the dependent variable. The result is shown below:

[Figure: num_vehicles regression results]
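A minimal sketch of the baseline num_vehicles regression, assuming the per-FIPS vehicle statistics (data.csv) are joined with the census table (Florida_ct.csv) on the FIPS column; statsmodels is used here for illustration, and the column names num_vehicles and property_value_median are assumptions consistent with the classifier description below.

```python
import pandas as pd
import statsmodels.api as sm

# Join the per-FIPS vehicle statistics with the census property values.
df = pd.read_csv("data.csv").merge(pd.read_csv("Florida_ct.csv"), on="FIPS")

X = sm.add_constant(df[["num_vehicles"]])   # independent variable
y = df["property_value_median"]             # dependent variable (assumed column name)

model = sm.OLS(y, X).fit()
print(model.summary())   # slope, R^2, and p-values for the baseline model
```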

The Regression Model with Multiple Independent Variables

The series_of_vehicles Model

The two independent variables are the proportion of Japanese-series vehicles and the proportion of American-series vehicles, and the dependent variable is the median property value. The result is shown below:

[Figure: series_of_vehicles regression results]
The type_of_vehicles Model

The independent variables are the proportions of pickups, SUVs, and sedans, and the dependent variable is the median property value. The result is shown below:

[Figure: type_of_vehicles regression results]
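A combined sketch of the two multi-variable models (series_of_vehicles and type_of_vehicles), again using statsmodels for illustration; the count columns used to form the proportions are assumed names.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("data.csv").merge(pd.read_csv("Florida_ct.csv"), on="FIPS")
y = df["property_value_median"]

# series_of_vehicles model: proportions of Japanese- and American-series vehicles.
X_series = sm.add_constant(pd.DataFrame({
    "japanese_ratio": df["num_japanese"] / df["num_vehicles"],
    "american_ratio": df["num_american"] / df["num_vehicles"],
}))

# type_of_vehicles model: proportions of pickups, SUVs, and sedans.
X_type = sm.add_constant(pd.DataFrame({
    "pickup_ratio": df["num_pickups"] / df["num_vehicles"],
    "suv_ratio": df["num_suvs"] / df["num_vehicles"],
    "sedan_ratio": df["num_sedans"] / df["num_vehicles"],
}))

for name, X in [("series_of_vehicles", X_series), ("type_of_vehicles", X_type)]:
    result = sm.OLS(y, X).fit()
    print(name, round(result.rsquared, 3), result.pvalues.round(3).to_dict())
```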

Classifier Model

Data Loading and Preprocessing:

The script loads two CSV files (data.csv and Florida_ct.csv) using pandas. The data from the two DataFrames are merged based on the 'FIPS' column. Additional columns are created based on calculations involving existing columns. The relevant columns for the analysis are selected and stored in the test_df DataFrame. A new column 'property_value_discrete' is created based on a threshold value for 'property_value_median'.
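A minimal sketch of the loading and preprocessing step, following the description above; the derived column names and the use of the median as the discretization threshold are assumptions, not the exact threshold used.

```python
import pandas as pd

# Load the per-FIPS vehicle statistics and the Florida census-tract table, then merge.
data = pd.read_csv("data.csv")
florida = pd.read_csv("Florida_ct.csv")
merged = data.merge(florida, on="FIPS")

# Derived proportion columns (the count column names are assumptions about the schema).
for col in ["japanese", "american", "pickups", "suvs", "sedans"]:
    merged[f"{col}_ratio"] = merged[f"num_{col}"] / merged["num_vehicles"]

# Keep the modeling columns and discretize the target with a threshold
# (the median is used here as a stand-in for the report's threshold value).
feature_cols = ["num_vehicles", "japanese_ratio", "american_ratio",
                "pickups_ratio", "suvs_ratio", "sedans_ratio"]
test_df = merged[feature_cols + ["property_value_median"]].dropna()
threshold = test_df["property_value_median"].median()
test_df["property_value_discrete"] = (test_df["property_value_median"] > threshold).astype(int)
```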

Data Splitting

The data is split into features (X) and the target variable (y). The feature set consists of selected columns, and the target variable is 'property_value_discrete'.
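A sketch of the split, continuing from the preprocessing sketch above; the 80/20 ratio and random seed are assumptions.

```python
from sklearn.model_selection import train_test_split

X = test_df[feature_cols]               # selected feature columns from the step above
y = test_df["property_value_discrete"]  # binary target

# Hold out a test set; the 80/20 ratio and random seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```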

Model Training and Evaluation

A logistic regression model is instantiated, trained on the training data, and evaluated on both the training and testing sets. A list of different classifier instances is then created, adding three new models: XGBoost, LightGBM, and CatBoost. A loop iterates through each classifier, fits it to the training data, and evaluates its performance on both the training and testing sets. Metrics such as accuracy and log loss are calculated for each classifier.
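A sketch of the training and evaluation loop, continuing from the split above; apart from logistic regression and the three added models named in the report, the exact classifier list and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "xgboost": XGBClassifier(),
    "lightgbm": LGBMClassifier(),
    "catboost": CatBoostClassifier(verbose=0),
}

results = []
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    for split, X_s, y_s in [("train", X_train, y_train), ("test", X_test, y_test)]:
        results.append({
            "classifier": name,
            "split": split,
            "accuracy": accuracy_score(y_s, clf.predict(X_s)),
            "log_loss": log_loss(y_s, clf.predict_proba(X_s)),
        })

results_df = pd.DataFrame(results)
print(results_df)
```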

Visualization

The model uses seaborn and matplotlib to create bar plots for visualizing the performance of each classifier. Two plots are generated: one showing the test accuracy of each classifier and another showing the test log loss.
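A sketch of the visualization step with seaborn and matplotlib, continuing from the results table above; figure sizes and titles are illustrative.

```python
import matplotlib.pyplot as plt
import seaborn as sns

test_results = results_df[results_df["split"] == "test"]

# Bar plot of test accuracy per classifier.
plt.figure(figsize=(8, 4))
sns.barplot(data=test_results, x="classifier", y="accuracy")
plt.title("Test accuracy by classifier")
plt.tight_layout()
plt.show()

# Bar plot of test log loss per classifier.
plt.figure(figsize=(8, 4))
sns.barplot(data=test_results, x="classifier", y="log_loss")
plt.title("Test log loss by classifier")
plt.tight_layout()
plt.show()
```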

Output

The model outputs the training and testing accuracy, as well as the training and testing log loss, for each classifier. It also generates visualizations of classifier performance in terms of accuracy and log loss.

Results

The test accuracy:

[Figure: test accuracy of each classifier]

The log loss:

[Figure: test log loss of each classifier]

Conclusion

  1. According to the results of the regression models, the number of vehicles does not appear to have a simple linear relationship with the median property value. Likewise, the relationships between vehicle series and the median property value, and between vehicle types and the median property value, are not simply linear.

  2. Based on the results of the classifier models (highest test-set accuracy of 71%), there appears to be some correspondence between the independent variables (vehicle series and vehicle types in the region) and the dependent variable (median property value in the region). This supports the feasibility of using these independent variables to estimate regional property conditions.

  3. Due to the limited time of this summer study and the difficulty of obtaining pre-processed data (transforming the data is time-consuming), the amount of final training data was small, which is likely the key factor limiting model performance. In subsequent tests, if the amount of training data is gradually increased, the relationship between the independent and dependent variables should become clearer and test accuracy should improve.

Reference

  1. Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States
  2. Combining satellite imagery and machine learning to predict poverty
  3. Deep hybrid models with urban imagery
  4. Learning representations of satellite imagery by leveraging point of interests

Contribution

Rongfei Zheng: data collection, information extraction, and modeling

Jingcheng Wang: Model modification and training

Junxi Wu: Model modification and training
