NYC Airbnb Machine Learning Price Prediction Model
Overview
For this project, I used a large open-source dataset of Airbnb listings in New York City to create a price prediction model using machine learning techniques. The main techniques I implemented are matrix factorization and regularization.
The dataset contains around 50,000 unique observations of individual Airbnb listings and their price points for 2019. It is available on Kaggle at: https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
The dataset contains 16 variables, summarized below, but the main variables I use for this analysis are Neighborhood, Room Type, and Price. I evaluated the overall performance of each model by calculating its Root Mean Squared Error (RMSE).
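As a minimal sketch of the metric described above: RMSE is the square root of the mean squared difference between actual and predicted prices. The function name `rmse` and the sample prices are illustrative assumptions, not taken from the project code.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences of prices."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices, for illustration only
example_actual = [100.0, 150.0, 200.0]
example_predicted = [110.0, 140.0, 200.0]
example_rmse = rmse(example_actual, example_predicted)
```

Because the errors are squared before averaging, a few badly mispredicted expensive listings can dominate the score, which is one reason the price outliers matter so much in this dataset.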
All of the code for this analysis is available on my GitHub, linked here: https://github.com/bkphillips/NYC_Airbnb_Price_Predict
Analysis
Looking first at the price distribution, there are some very large outliers that skew the data, most likely due to holidays or events. I decided to look at the price on a log scale, where the average price of $152 becomes more apparent and the distribution looks somewhat normal.
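To illustrate why a log scale helps here, the sketch below log-transforms a handful of made-up prices. The helper `log_prices` and the sample values are assumptions for illustration; $0 listings are skipped because the logarithm is undefined there.

```python
import math

def log_prices(prices):
    """Return log10 of each strictly positive price; $0 entries are skipped."""
    return [math.log10(p) for p in prices if p > 0]

# Made-up prices with one extreme outlier
sample = [0, 50, 100, 150, 200, 10000]
logged = log_prices(sample)

raw_spread = max(sample) - min(p for p in sample if p > 0)  # in dollars
log_spread = max(logged) - min(logged)                      # in orders of magnitude
```

On the raw scale the outlier sits thousands of dollars away from the other listings, while on the log scale the whole sample spans fewer than three orders of magnitude, so the bulk of the distribution is no longer squeezed against the left edge of the plot.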
The other key descriptive variables are Neighborhood Group, Neighborhood, and Room Type.
There are 221 unique neighborhoods, so for the purpose of describing the dataset I will mostly show the 5 main neighborhood groups into which they fall. There are also 3 main room types: entire home/apt., private room, and shared room. Below you can see a count of the room types in the different areas of NYC. The majority of listings are in Manhattan and Brooklyn, and most are entire homes/apts. or private rooms.
Modeling and Results
To create my training and testing datasets, I removed listings with a price of $0 or above $500, based on the outliers seen in the initial density plots. I then partitioned the data into 70% for the training set and 30% for the testing set.
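The filtering and 70/30 partition described above could be sketched as follows. This is an illustration under stated assumptions rather than the project's actual code: listings are represented as dicts with a `"price"` key, and the function name `filter_and_split`, the seed, and the toy prices are all hypothetical.

```python
import random

def filter_and_split(listings, max_price=500, test_fraction=0.3, seed=1):
    """Drop $0 and over-max_price listings, then split the rest at random.

    `listings` is assumed to be a list of dicts with a numeric "price" key.
    Returns a (train, test) tuple.
    """
    kept = [row for row in listings if 0 < row["price"] <= max_price]
    shuffled = kept[:]                      # copy so the input is not mutated
    random.Random(seed).shuffle(shuffled)   # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Made-up prices: the $0 and $900 listings are removed, leaving 10
toy = [{"price": p} for p in [0, 40, 60, 80, 100, 150, 200, 250, 300, 400, 500, 900]]
train, test = filter_and_split(toy)
```

Fixing the seed keeps the same listings in the test set across runs, so RMSE values from different models stay comparable.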
I first tested a baseline model that predicts the average price for every listing, which gave an RMSE of 85.4. Adding the neighborhood group effects (b_g) brought this down to 80.6. The neighborhood effects (b_n) performed much better, with an RMSE of 74.6, so I decided to keep only b_n. The fourth model then added the room type effects (b_t), which brought the RMSE down significantly to 64.5. I then tried regularizing the effects, because I figured that neighborhoods with more listings probably have more trustworthy price estimates. This only brought my RMSE down to 64.1.
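The sequence of models above can be read as an additive effects model: predicted price = mu + b_n + b_t, where mu is the overall average and each effect is the average residual for its group. The sketch below is one way to write that, assuming listings are dicts with hypothetical keys `"neighborhood"`, `"room_type"`, and `"price"`. Applying the penalty `lam` to both effects is also an assumption, since the write-up does not say which effects were regularized. With `lam=0` the effects are plain group averages; a larger `lam` shrinks the effects of small groups toward zero.

```python
from collections import defaultdict

def fit_effects(train, lam=0.0):
    """Fit mu, neighborhood effects b_n, and room type effects b_t.

    Each effect is sum(residuals) / (group size + lam), so lam = 0 gives
    the unregularized group mean of the residuals.
    """
    mu = sum(r["price"] for r in train) / len(train)

    # Neighborhood effect: regularized mean of (price - mu) per neighborhood
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train:
        sums[r["neighborhood"]] += r["price"] - mu
        counts[r["neighborhood"]] += 1
    b_n = {k: sums[k] / (counts[k] + lam) for k in sums}

    # Room type effect: regularized mean of what is left after mu and b_n
    sums_t, counts_t = defaultdict(float), defaultdict(int)
    for r in train:
        sums_t[r["room_type"]] += r["price"] - mu - b_n[r["neighborhood"]]
        counts_t[r["room_type"]] += 1
    b_t = {k: sums_t[k] / (counts_t[k] + lam) for k in sums_t}
    return mu, b_n, b_t

def predict(model, row):
    """Predict a price; groups unseen in training contribute no effect."""
    mu, b_n, b_t = model
    return mu + b_n.get(row["neighborhood"], 0.0) + b_t.get(row["room_type"], 0.0)

# Made-up listings, for illustration only
toy_train = [
    {"neighborhood": "A", "room_type": "entire", "price": 200},
    {"neighborhood": "A", "room_type": "private", "price": 100},
    {"neighborhood": "B", "room_type": "entire", "price": 120},
    {"neighborhood": "B", "room_type": "private", "price": 60},
]
model = fit_effects(toy_train)
toy_prediction = predict(model, {"neighborhood": "A", "room_type": "entire"})
```

Fitting b_t on the residuals left after b_n, rather than on raw prices, keeps the two effects from counting the same price difference twice.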
Below is my final model applied to the test set, along with the plot of the RMSEs that were used to fine-tune the lambdas for the regularization technique. You can see that the lambda that minimizes the RMSE is around 40:
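The lambda tuning described above amounts to a grid search: fit the regularized effects for each candidate lambda, score each fit by RMSE on held-out data, and keep the lambda with the lowest score. The sketch below shows this for the neighborhood effect only, to keep it short. The function names, the dict keys, the candidate grid, and the toy data are all assumptions for illustration, so the best lambda it finds has no bearing on the value of about 40 reported here.

```python
import math
from collections import defaultdict

def fit_neighborhood_effects(train, lam):
    """Return (mu, b_n) with b_n = sum(price - mu) / (n + lam) per neighborhood."""
    mu = sum(r["price"] for r in train) / len(train)
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train:
        sums[r["neighborhood"]] += r["price"] - mu
        counts[r["neighborhood"]] += 1
    return mu, {k: sums[k] / (counts[k] + lam) for k in sums}

def rmse_for_lambda(train, holdout, lam):
    """RMSE on the holdout rows for a model fit with penalty lam."""
    mu, b_n = fit_neighborhood_effects(train, lam)
    squared = [(r["price"] - (mu + b_n.get(r["neighborhood"], 0.0))) ** 2
               for r in holdout]
    return math.sqrt(sum(squared) / len(squared))

def tune_lambda(train, holdout, lambdas):
    """Return (best_lambda, scores) where scores maps each lambda to its RMSE."""
    scores = {lam: rmse_for_lambda(train, holdout, lam) for lam in lambdas}
    return min(scores, key=scores.get), scores

# Made-up data: "small" has a single, unusually expensive training listing
toy_train = ([{"neighborhood": "big", "price": 100}] * 4
             + [{"neighborhood": "small", "price": 200}])
toy_holdout = [{"neighborhood": "big", "price": 100},
               {"neighborhood": "small", "price": 120}]
best_lam, scores = tune_lambda(toy_train, toy_holdout, [0, 1, 4, 10, 20, 100])
```

In the toy data the single-listing neighborhood gets an inflated effect when lambda is 0, and a moderate penalty pulls it back toward the overall average, which is the intuition behind trusting neighborhoods with more listings. Tuning lambda on a validation split carved out of the training data, rather than on the final test set, keeps the reported test RMSE from being optimistic.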
Conclusion
Using matrix factorization of the key descriptive factors of each listing, I was able to predict the price of each Airbnb more accurately. I was surprised to see that regularization did not improve the price prediction by much. The large price outliers are a challenging aspect of this dataset. It would be helpful to have further information about each listing that conveys other aspects leading to a higher or lower price, such as the quality of the space, amenities, or walking score. Such information might help predict these outliers. I would also like to add a confidence interval that would likely cover the majority of the given prices, as suggested by the random sample in which the majority of predictions are within $40 of the actual price.