How We Used Machine Learning to Predict Real Estate Prices

It’s hard to surprise anyone with artificial intelligence and machine learning nowadays. Although the technology is young, its models and algorithms already handle tasks ranging from highly personalized customer service to creating sophisticated paintings. AI- and ML-powered solutions can benefit virtually any industry, and real estate is one of them.

There are plenty of tasks in real estate where machine learning algorithms come in handy. For example, you can use them to estimate how the price of a particular property will change in the future, which helps brokers understand the market and plan their strategy based on real data. We were lucky to work on such a solution. Here is our experience of building a machine learning model for real estate price prediction.

Real Estate Industry: Numbers To Pay Attention To

Let us share some interesting statistics that show what real estate looks like right now so we can see the big picture:

New York, San Francisco, and Chicago are among the least competitive housing markets in the United States. (Zillow)
Prices for recently listed homes have risen 13.5% compared to March of last year and 26.5% compared to March 2020. (Realtor)
The median home price has reached $357,300 for 2022. (National Association of Realtors)
The number of days that houses stay on the market is decreasing: for March 2022, it was 38 days, compared to 49 days the previous year. (Realtor)
Finally, the most interesting one: Millennials are not the biggest group of homebuyers on the market. Gen X is, accounting for 24% of all people who buy new homes. (National Association of Realtors)

All these stats and trends play a part when you estimate and predict the price of a house. Now let's move on to our recent machine learning case, where these stats came in handy.

Real Estate Price Prediction Model: Why Did We Create It?

It's hard to estimate the price of a particular house and predict how it will change, even when the prediction covers only a short period. Too many factors can influence the final numbers. The two most influential are:

the prices of the recently sold houses with similar parameters
dominating market trends

As the previous section shows, both of these can change drastically over time, and some measurements are seasonal. Besides, many characteristics of the house itself also count: the number of bedrooms, the building’s age, the overall condition, the quality of the neighborhood, proximity to shops, schools, and entertainment, and many more.

It’s also worth mentioning that a professional opinion on price can take several forms:

Comparative Market Analysis

Real estate professionals look for comparable homes in the area and define the value of a property based on how those houses behaved on the market. Comparable homes are chosen by size, number of rooms, style, and recent sales price.

Broker Price Opinion

A Broker Price Opinion (BPO) is another way to get a professional opinion on a house. It is usually prepared by a professional broker who knows the local market. This approach is common for short sales, foreclosures, and setting a listing price for buyers and sellers.
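
The idea behind a Comparative Market Analysis can be sketched in a few lines of Python. This is only a rough illustration, not the procedure a broker actually follows: the helper name estimate_by_comps, the field names, the 20% size tolerance, and the sample sales are all hypothetical.

```python
from statistics import median


def estimate_by_comps(subject, sales, area_tolerance=0.2):
    """Estimate a price from comparable recent sales (hypothetical helper)."""
    # Keep only sales that match the subject's rooms and style
    # and whose size is within the tolerance of the subject's size.
    comps = [
        s for s in sales
        if s["rooms"] == subject["rooms"]
        and s["style"] == subject["style"]
        and abs(s["area"] - subject["area"]) <= area_tolerance * subject["area"]
    ]
    if not comps:
        return None  # no comparable sales found
    # Use the median price per unit of area so one outlier cannot dominate.
    price_per_unit = median(s["price"] / s["area"] for s in comps)
    return price_per_unit * subject["area"]


# Made-up recent sales, for illustration only.
recent_sales = [
    {"area": 100, "rooms": 3, "style": "ranch", "price": 300_000},
    {"area": 110, "rooms": 3, "style": "ranch", "price": 341_000},
    {"area": 95, "rooms": 3, "style": "ranch", "price": 275_500},
    {"area": 200, "rooms": 5, "style": "colonial", "price": 700_000},
]
subject = {"area": 105, "rooms": 3, "style": "ranch"}

estimate = estimate_by_comps(subject, recent_sales)
print(estimate)  # 315000.0
```

Even this toy version shows where manual estimates get fragile: the result depends entirely on which sales are accepted as comparable.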

The amount of data a broker must keep in mind while predicting a house price can get overwhelming. The task can be tedious even for the most experienced specialists, and there’s always the possibility of simple human error.

That was exactly why a client came to us with this project: to automate house price prediction and minimize the influence of human error. Our main goal was to build a machine learning model that predicts a house's price one month ahead with an accuracy of around 85 to 90 percent.

How Did We Create It?

Now for the main part: the steps we took to create a machine learning model for price prediction, the tech stack we used, and the main challenge we faced along the way. Let’s start with the overall strategy. The process went as follows:

Gathering data. Our first data source was the client, who provided us with several datasets. However, they weren’t enough for model training. To solve this issue, we researched other sources of real estate data. We used several available sources of information on the U.S. real estate market, as well as data on the country’s economic conditions, to build a more representative dataset.
Feature engineering. To predict the price, we chose the following features:

historical change in real estate price
property location
type of house
neighbors
presence or absence of a pool
other nontraditional variables

Hyperparameter tuning. Hyperparameters control how the model learns. They are external to the model: they are set before training, and the model cannot change them while it learns. They govern the training process only and are not among the parameters the model learns from the data. We tuned them, validating the results of each configuration, to control the model's behavior and maximize its performance.
Studying variables. Throughout model training, we kept reassessing the influence and relevance of each variable to increase the accuracy of the final solution.
Iterating. When we finished the first version of the model, we started the process anew to polish the model and make sure its results were as accurate as possible.
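
The hyperparameter tuning step can be illustrated with a small grid search. The sketch below is not the project's actual configuration: the data is synthetic, the three features and the grid values are assumptions chosen for illustration, and the code falls back to scikit-learn's gradient boosting if XGBoost is not installed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV

try:
    # XGBoost is the library named in this article.
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Assumption: scikit-learn's gradient boosting as a stand-in.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic stand-in data; the feature names are illustrative only.
rng = np.random.default_rng(0)
n = 300
area = rng.uniform(50, 300, n)            # living area
has_pool = rng.integers(0, 2, n)          # presence or absence of a pool
past_change = rng.normal(0.05, 0.02, n)   # historical change in price
X = np.column_stack([area, has_pool, past_change])
y = (1500 * area + 20000 * has_pool + 400000 * past_change
     + rng.normal(0, 10000, n))

# Hyperparameters are fixed before training; the search tries each combination.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 4],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    Booster(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes scores, hence "neg"
    cv=3,                               # 3-fold cross-validation
)
search.fit(X, y)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a plain error
print(best_params, round(best_mae))
```

Each combination is scored on data the model did not see during fitting, which is what makes it possible to compare configurations fairly.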

Technology Stack

Our main tool was XGBoost, an open-source gradient-boosted decision tree library for machine learning. We used it for the regression model. Other items on the list included:

Python 3.7: programming language
Pandas: data cleaning and analysis
Scikit-learn: classification and predictive analytics

This toolset helped us reach the desired accuracy.

The Main Challenge

Nothing is perfect in this world, and neither was our machine learning development process. For the most part, it was quite predictable and smooth, but at one point we ran into underfitting. Because the initial dataset was small and not of the best quality, it was hard for the algorithm to find hidden trends and produce accurate results.

As we already mentioned, the solution was found in third-party data sources. The information we found in public sources helped us get back on track and train the model correctly.
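
Underfitting typically shows up as high error on both the training data and the validation data. A rough way to tell it apart from overfitting is sketched below. The function name diagnose_fit and the thresholds are purely illustrative assumptions, and errors are expressed as fractions (0.10 means 10%).

```python
def diagnose_fit(train_error, val_error, target_error, gap_ratio=1.5):
    """Classify a model from its training and validation error (hypothetical heuristic)."""
    if train_error > target_error:
        # The model cannot even fit the data it has already seen.
        return "underfitting"
    if val_error > gap_ratio * train_error:
        # The model fits the training data but fails to generalize.
        return "overfitting"
    return "ok"


print(diagnose_fit(0.22, 0.24, target_error=0.10))  # underfitting
print(diagnose_fit(0.03, 0.15, target_error=0.10))  # overfitting
print(diagnose_fit(0.07, 0.09, target_error=0.10))  # ok
```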

The Results

The results were even better than we expected. The client expected accuracy of around 85-90% compared to the real prices. We were able to achieve 91%. It's not a big difference, but given the circumstances, we couldn’t have hoped for a better outcome.
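
One common way to express accuracy "compared to the real prices" is 100% minus the mean absolute percentage error (MAPE). The sketch below assumes that definition, which may differ from the exact metric used in the project. The function name accuracy_percent and the sample prices are made up for illustration.

```python
def accuracy_percent(actual, predicted):
    """Return 100 minus the mean absolute percentage error (assumed metric)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal in length")
    # Absolute error of each prediction as a fraction of the real price.
    errors = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return 100.0 * (1.0 - sum(errors) / len(errors))


# Made-up prices: the predictions are off by 10%, 8%, and 9%.
actual_prices = [200_000, 400_000, 500_000]
predicted_prices = [180_000, 432_000, 455_000]

score = accuracy_percent(actual_prices, predicted_prices)
print(round(score, 2))  # 91.0
```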

Was It Successful?

In a nutshell, yes, it was. The model was definitely not perfect, the initial data was not the best, and the underfitting issue played its part, but it’s a good start. We saw that machine learning is a viable technology in real estate and that we can continue working on more cases. Besides, price prediction is not the only application: machine learning can also help with home hunting and property evaluation.