Feature engineering is critical for obtaining good results from ML models; unprocessed data can be extremely difficult for algorithms to learn from. The most important feature engineering step in this application is handling the price distribution.
As the data visualization above shows, the price distribution is heavily skewed, with a few very high-priced listings that act as outliers. Such outliers make it harder for algorithms to find patterns
in the data. Applying a logarithmic transformation to the price, as seen in the figure below, brings the distribution closer to a normal (Gaussian) distribution, which is typically easier for algorithms to learn.
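
The effect of the transformation can be illustrated with a minimal sketch. The prices below are made-up values, not taken from the dataset, and the use of `np.log1p` (rather than a plain logarithm) is an assumption made here so that a price of 0 stays defined; the notebook may use a different variant.

```python
import numpy as np

# Hypothetical nightly prices; the last value mimics a high-priced outlier.
prices = np.array([40.0, 65.0, 80.0, 120.0, 150.0, 10000.0])

# log1p computes log(1 + x), so a price of 0 would not produce -inf.
# (Assumption: the project may use a plain log instead.)
log_prices = np.log1p(prices)

# Values predicted in log space can be mapped back to prices with expm1.
recovered = np.expm1(log_prices)

# Compare how far the largest value sits from the median, before and after.
ratio_before = prices.max() / np.median(prices)
ratio_after = log_prices.max() / np.median(log_prices)
print(ratio_before, ratio_after)
```

After the transformation the outlier sits much closer to the bulk of the values, which is the property the paragraph above relies on.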
Other feature engineering steps in this project include converting features such as neighbourhood group into categorical features and adding an extra availability feature that flags listings available
year-round.
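
A rough sketch of these two steps with pandas is shown here. The column names `neighbourhood_group` and `availability_365`, the group labels, the new column name `available_all_year`, and the threshold of 365 days are all assumptions chosen for illustration; the notebook may define them differently.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "neighbourhood_group": ["North", "South", "East", "South"],
    "availability_365": [365, 120, 0, 365],
})

# Store the text column as a pandas categorical feature.
df["neighbourhood_group"] = df["neighbourhood_group"].astype("category")

# Assumed definition: a listing counts as year-round if it is
# available on all 365 days.
df["available_all_year"] = (df["availability_365"] == 365).astype(int)

# One-hot encode the categorical column for models that need numeric input.
encoded = pd.get_dummies(df, columns=["neighbourhood_group"])
print(encoded.columns.tolist())
```

One-hot encoding is only one option; tree-based models can often work with integer category codes instead.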
Data cleaning can also improve performance. It includes removing uninformative features such as id and host name, as well as removing samples that lack enough information and might mislead the model. Samples that
still lie outside the bulk of the price distribution after the logarithmic transformation were also removed to help the model better recognize the overall pattern. The full procedure of feature engineering and data cleaning, along with
the code, can be found in the
data processing notebook.
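
For orientation, the cleaning steps described above might look roughly like the following sketch. It is not the notebook's code: the column names, the sample rows, and in particular the 1.5 × IQR rule used to decide which log-prices count as outliers are assumptions, since the text does not state the exact criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical listings; one row has a missing value, one an extreme price.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6, 7],
    "host_name": ["a", "b", "c", "d", "e", "f", "g"],
    "room_type": ["x", "y", None, "x", "y", "x", "y"],
    "price": [50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 10000.0],
})

# Drop identifier-like columns that carry no predictive signal.
cleaned = df.drop(columns=["id", "host_name"])

# Drop samples with missing information.
cleaned = cleaned.dropna()

# Log-transform the price (log1p is an assumption, see above).
cleaned = cleaned.assign(log_price=np.log1p(cleaned["price"]))

# Assumed outlier rule: keep log-prices within 1.5 * IQR of the quartiles.
q1, q3 = cleaned["log_price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = cleaned["log_price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = cleaned[in_range].reset_index(drop=True)
print(cleaned)
```

Filtering in log space rather than on the raw price keeps moderately expensive listings while still discarding the most extreme ones.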