Predicting Evictions in New York State Using R Supervised Learning Models
Introduction
Renter households in the U.S., especially those with low incomes, face the threat of eviction if they fail to make timely rent payments, which exacerbates existing inequities in educational access, health care, employment opportunity, and more. Evictions have particularly disastrous effects for communities of color, who already face daunting barriers in the wider economy and the housing market due to ongoing discrimination and a history of marginalization and disenfranchisement. Understanding the factors that lead to evictions within a given geography is crucial to supporting policymakers' ability to direct support and resources to the most overlooked communities, both to prevent evictions and to aid evicted families.
The goal of this project is to firstly build and refine a model that can predict eviction rates across counties in New York, with the lowest RMSE serving as the metric that will select the best-performing model, and secondly discuss the variables that emerge as most influential in driving eviction rates, as well as any variable relationships that are revealed. For the sake of scope, I have narrowed the current analysis to New York in order to preserve computational resources.
To achieve this goal, I will first provide a brief background on the relevant existing literature and contextualize my analysis within it; then discuss the data and variables I utilize, along with the data challenges encountered; then describe the methodology and machine learning techniques considered and chosen; then summarize my results; and close with a discussion of the interpretations we can draw from those results and an assessment of the project's success against my stated goals.
Problem Statement and Background
The eviction crisis has long plagued renters in the U.S., with the number of evictions nationwide between 2000 and 2016 estimated at 1 in 40 renter households (The Eviction Lab, 2018). The crisis has received heightened attention over the last year of a pandemic that has had people relying on shelter more than ever while unemployment rates have skyrocketed. Even with federal, state, and local eviction moratoria in place, an estimated 30-40 million people were at risk of eviction at the onset of the pandemic in March 2020 (Benfer et al., 2020). New York in particular has faced a dire eviction crisis long predating the COVID-19 pandemic; affordable housing has long been in short supply (the average rent tripled in less than 20 years), with gentrification and displacement threatening the already-unstable housing market (Gilson, 2021).
Prior research has analyzed the disproportionate impact of evictions on low-income communities of color. Compared to homeowners, renters tend to have lower incomes and spend a greater share of their income on housing costs (Goodman & Ganesh, 2017). Moreover, filing and eviction rates for Black renters are more than twice those of white renters (Hepburn, Louis, & Desmond, 2020). People of color (particularly Black and Latino populations) make up about 80% of those evicted (Hartman & Robinson, 2003). It is evident that the system is under significant strain, and that a deeper understanding of the factors driving evictions is needed if policymakers are to formulate effective solutions that avoid perpetuating the modern renter-eviction crisis.
Building on this foundation, this project aims to take a predictive approach to evictions in New York by employing a broader cross-cutting perspective that draws in the intersection of race and economic characteristics. I aim to utilize machine learning techniques to build a model that allows us to understand what variables are most important in driving evictions so that we can successfully predict future eviction patterns, allowing us to put policies in place to prevent or mitigate those harms.
Data
To construct my working dataset, I merged data from the Eviction Lab at Princeton University with the Affirmatively Furthering Fair Housing (AFFH) dataset from the U.S. Department of Housing and Urban Development, accessed through the Urban Institute, using the common variable county FIPS code, and filtered the result down to New York counties only, with county-year as the unit of analysis. Created in 2018 by aggregating public eviction records, the Eviction Lab dataset contains eviction- and renter-related information per geography (such as county or tract) from 2000 to 2016 (Desmond et al., 2018). The AFFH dataset, compiled in 2015 from sources such as the 2009-2013 American Community Surveys and the 2010 Census, contains economic variables across racial demographics for each tract (HUD, 2020). I then split my working dataset into training and testing sets, where 75% of the data was used to train my machine learning models and 25% was held out, unseen, for testing the models' success on future predictions.
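The assembly steps above can be sketched as follows. This is a minimal sketch, not my exact code: the raw objects `eviction_df` and `affh_df`, the join column `county_fips`, and the `state` filter column are hypothetical names, and the paper does not specify the splitting tool (rsample is one common choice).

```r
library(dplyr)
library(rsample)

# Join the two sources on the shared county FIPS code and keep NY counties.
# Object and column names here are placeholders for the real schemas.
ny_df <- eviction_df %>%
  inner_join(affh_df, by = "county_fips") %>%
  filter(state == "New York")          # unit of analysis: county-year

# Hold out 25% of observations, unseen, for final testing.
set.seed(123)
split_obj <- initial_split(ny_df, prop = 0.75)
train_df  <- training(split_obj)
test_df   <- testing(split_obj)
```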
Outcome Variable
The outcome variable of interest, eviction rate, is the number of renter-occupied households in an area that received an eviction judgement ordering them to leave, expressed as a percentage of all renter-occupied households in that area (an eviction judgement results from an eviction filing, submitted by a landlord, that a court adjudicates); it takes a value between 0 and 100. The variable was pre-populated by the Eviction Lab. In their methodology report, the Lab notes that New York is a unique case among states: records are kept as "abstracted judgements" that are only included in the public record if the plaintiff pays to have them included, so New York's eviction numbers may be undercounted compared to reality. This fact would soon emerge as an extremely important characteristic influencing my data, and is discussed in the following section on data challenges. The two visualizations below represent the outcome variable in the training dataset, first aggregated and then disaggregated by year. The data shows a significant skew towards values of zero, which I also discuss in my challenges section.
Predictor Variables
To predict the outcome variable, I use the following 55 predictor variables from the AFFH dataset. HUD compiled these measures from the 2009-2013 American Community Surveys and the 2010 Census to help local governments understand housing need and racial disparities in their jurisdictions. All are numeric measures except the year variable, which I converted into a dummy variable in later pre-processing so it could function as a predictor without imposing an ordinal ranking. To see how the numeric predictor variables are distributed, please refer to the Appendix; given the high number of predictors, including these plots in the main body would disrupt the flow of this paper.
Challenges with the Data
Initial analysis revealed several important issues with the data. Consider the geospatial visualization below, which displays the variation in eviction rates across New York counties in the working dataset (the value shown is for 2016, the year with the least missingness of all years, though missingness remains substantial; counties in grey have no eviction rate recorded for any year). It is immediately visible that a majority of counties are missing eviction rate values for the dataset's entire time period.
Indeed, 29 of the 62 counties in the dataset are missing all 17 yearly values of the eviction rate variable between 2000 and 2016, and 20 more counties are missing between 4 and 14 values; in total, 682 of the 1,054 observations (65%) lacked eviction rate values. Additionally, recall Figures 1 and 2, which revealed a skew towards zero even among the eviction rates that were recorded. This means that the available data itself undercounts reality (even where some landlords in a county paid for public records, they did not do so for every eviction they litigated, and not all landlords in the county did so). Most importantly, it means that the statistical learning models I employed would capture this skew; in fact, they would even be predicting it.
It became apparent that New York's "abstracted judgements" practice was in fact a quite significant feature. I concluded that endogenous factors defined the remaining 13 counties that did have data. If eviction records are only publicly available when the plaintiff (landlord) has paid for their inclusion, at least two significant characteristics likely distinguish the counties with recorded eviction information: plaintiffs there have more money or resources to pay this fee, and/or political push factors within the county motivate plaintiffs to take this step. Both point to a systematic difference dictating which counties were or were not included in my analysis. For example, the racial demographics of the missing and non-missing counties were very likely different: majority-white counties (which have more stable housing and socioeconomic resources) were probably overrepresented in my sample, while the one county with eviction rates closer to 6% was majority-Black and Latinx.
I addressed these issues in the following way (while not comprehensive, this strategy was necessary given the timeline and scope of this project): I imputed the missing values with the median eviction rate and ran my prediction models on this adjusted sample. I chose the median rather than the mean so that the imputation would not be unduly biased by the few higher eviction rates.
Analysis
To prepare the training data for statistical learning, I pre-processed it with the recipes package. As previously mentioned, the year variable was converted to dummy variables, and all missing eviction rates were imputed with the median. Additionally, several variables (including the eviction rate) were log-transformed to balance out their skewed distributions, with an offset of 1 to handle instances where a variable's value was zero. The final pre-processing step was to normalize all numeric variables to the same scale with the package's step_range() function, which constrains each numeric variable to a minimum of 0 and a maximum of 1 while proportionally retaining the spacing of values relative to each other. I applied these transformations to both the training and testing datasets so that all values remained consistent and proportional between the two samples; any analysis undertaken with the processed training data would function the same on the testing data.
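These steps can be sketched as a recipes pipeline. This is a sketch, not my exact specification: `train_df`, `test_df`, and the outcome column `eviction_rate` are placeholder names, the log transform is shown only for the outcome (several other variables received it as well), and `step_medianimpute()` is the median-imputation step as named in the recipes 0.1.x series cited below.

```r
library(recipes)

# Assumes train_df / test_df from the 75/25 split; names are placeholders.
train_df$year <- factor(train_df$year)   # so step_dummy() can expand it
test_df$year  <- factor(test_df$year, levels = levels(train_df$year))

evict_rec <- recipe(eviction_rate ~ ., data = train_df) %>%
  step_dummy(year) %>%                          # year indicators, no ordinal ranking
  step_medianimpute(eviction_rate) %>%          # fill missing rates with the median
  step_log(eviction_rate, offset = 1) %>%       # log(x + 1) handles zero values
  step_range(all_numeric(), min = 0, max = 1)   # rescale everything to [0, 1]

# Estimate the transformations on the training data, then apply to both splits
# so values stay consistent and proportional between the two samples.
prepped  <- prep(evict_rec, training = train_df)
train_pp <- bake(prepped, new_data = train_df)
test_pp  <- bake(prepped, new_data = test_df)
```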
Before analysis, a precaution was taken against over-fitting by employing k-fold cross-validation with the caret package. The training data was divided into 5 folds; each model is trained on four folds and validated on the held-out fifth, rotating through all five combinations. This controlled introduction of randomness during the training stage helps strike an appropriate balance between bias (too much leads to rigidity and under-fitting) and variance (too much leads to over-fitting and poor future prediction).
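In caret, this resampling scheme is declared once and reused across model fits; the seed value here is illustrative.

```r
library(caret)

# 5-fold cross-validation: train on four folds, validate on the held-out
# fifth, rotating through all five combinations.
set.seed(123)
ctrl <- trainControl(method = "cv", number = 5)
```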
Methods
For this analysis, supervised learning regression models were selected because eviction rate is a continuous quantitative outcome. Four models were run to achieve prediction with the least error: Linear Regression, K-Nearest Neighbors (KNN), Decision Tree (CART: Classification and Regression Tree), and Random Forest. The Linear Regression model approximates predictions with a line of best fit that minimizes the distance between predicted and actual outcomes. K-Nearest Neighbors predicts each observation's value as the average of its "K" nearest neighbors (my analysis tested 4 different values of "K"). The Decision Tree sorts observations by comparing predictor values against learned thresholds chosen to minimize error (I ran both a shallow and a deep tree). Finally, Random Forest creates many decision trees from different subsets of the data and averages across those trees to produce a predicted value (my tests experimented with three different numbers of predictors supplied to the algorithm). Because of its bootstrap aggregation, which resamples the data with replacement, the Random Forest model has stronger predictive power. However, this comes with the downside of decreased interpretability, unlike models such as the decision tree and the linear model, which let us understand the methods used and the importance of various variables in the model's prediction.
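The four fits can be sketched with caret's train() as below. This assumes a pre-processed training frame `train_pp` with outcome `eviction_rate` (placeholder names) and the 5-fold scheme described above; the tuning grids are illustrative, not the exact values I searched.

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

fit_lm   <- train(eviction_rate ~ ., data = train_pp, method = "lm",
                  trControl = ctrl)
fit_knn  <- train(eviction_rate ~ ., data = train_pp, method = "knn",
                  trControl = ctrl, tuneLength = 4)        # tries 4 values of K
fit_cart <- train(eviction_rate ~ ., data = train_pp, method = "rpart",
                  trControl = ctrl)                        # cp governs tree depth
fit_rf   <- train(eviction_rate ~ ., data = train_pp, method = "rf",
                  trControl = ctrl,
                  tuneGrid = expand.grid(mtry = c(5, 15, 30)))  # 3 predictor counts

# Compare cross-validated RMSE across the candidate models.
summary(resamples(list(lm = fit_lm, knn = fit_knn,
                       cart = fit_cart, rf = fit_rf)))
```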
Results
Predictive Performance
The four models' performance was evaluated based on RMSE (Root Mean Square Error, the standard deviation of the residuals between predicted and actual outcome values). From worst-performing to best-performing, the Linear Model returned an RMSE of 0.195, the deep Decision Tree 0.128, the shallow Decision Tree 0.127, the Random Forest 0.111, and finally KNN returned an RMSE of 0.109. The final number of neighbors used in KNN was five, which performed best of the four values tested. Because I had so many predictor variables (55), I was not surprised that I could not achieve a lower RMSE (below 0.1, for example); my independent variables of interest likely generated a great deal of noise.
When the best-performing model, KNN, was employed on the reserved test data, it predicted eviction rates with an RMSE of 0.081 (and an R-squared of 0.371). I was surprised that KNN's predictions were slightly more accurate on the test data, but I concluded that this was likely due to the patterns of missingness in my working dataset. Because the test set was 25% of a dataset so strongly skewed towards zero, a majority of the outcome values in the test set were likely close to zero as well. Therefore, a model that was semi-successful on the larger, 'noisier' training dataset performed even better on the smaller test dataset.
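Scoring the chosen model on the held-out split follows the standard caret pattern (`fit_knn` and `test_pp` are the placeholder names used in the earlier sketches):

```r
library(caret)

# Predict on the unseen 25%, then report test-set RMSE, R-squared, and MAE.
knn_preds <- predict(fit_knn, newdata = test_pp)
postResample(pred = knn_preds, obs = test_pp$eviction_rate)
```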
Variable Importance
Three additional methods were used to gain insight into how KNN made its predictions. Variable permutation, which scrambles the values of one variable at a time to assess its effect on predictive performance, identified the following ten variables as the most influential in correctly predicting the outcome. These results reinforced prior concerns that the models' predictions were biased by New York's unique record-keeping methods and by factors that made some demographic groups, counties, or years more likely to be recorded. The variables in this plot also supported my intuition that counties with Black and Latinx households were underrepresented in my sample.
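One way to produce such a permutation-importance plot for a caret fit is the vip package. This is a sketch only: argument names have shifted across vip versions, and `fit_knn` and `train_pp` are the placeholder objects from earlier sketches.

```r
library(vip)

# Permute one predictor at a time and measure the resulting drop in
# predictive accuracy (here scored by RMSE).
vip(fit_knn,
    method       = "permute",
    train        = train_pp,
    target       = "eviction_rate",
    metric       = "rmse",
    pred_wrapper = function(object, newdata) predict(object, newdata),
    num_features = 10)   # plot the ten most influential predictors
```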
Marginal Effects
Partial Dependency Plots were used to visualize each of the top five most important variables’ marginal effects on the outcome variable. For the percentage of white households with severe cost burden, the 2014 year indicator, the 2006 year indicator, and the percentage of white households with housing problems, higher values generally drove higher predicted eviction rates. On the other hand, evictions (or records thereof) seemed more likely to be lower in 2012. It appeared that given how low eviction incidence was across the board in my data, the prevalence of economic disadvantage of white households was actually the strongest predictor of high eviction rates.
The Individual Conditional Expectation curves below graph the expected y-hat for each observation in the data as a single predictor variable is varied, which can reveal interactions between individual variables. All five variables exhibited divergent behavior, implying heterogeneity: there are likely external factors and interactions at work, which is unsurprising given the biases discussed at length throughout this paper.
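Both plot types can be generated with the pdp package. A sketch, with a hypothetical predictor column name and the placeholder objects from earlier sketches:

```r
library(pdp)

# ice = TRUE keeps one curve per observation instead of averaging them away;
# averaging those curves recovers the partial dependence plot.
pd <- partial(fit_knn, pred.var = "pct_white_severe_cost_burden",
              train = train_pp, ice = TRUE)
plotPartial(pd)   # lattice plot; autoplot(pd) gives a ggplot2 version
```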
Discussion
When I set out to complete this project, I expected to produce results corroborating the racial disparities and disadvantages communities face in the economy and the housing market. What I instead produced was a case study in the patterns of eviction record-keeping in New York. With such a limited sample size and so little variation in the outcome variable, my machine learning models ended up picking up the factors that made a county likely to have eviction rates recorded at all. My Variable Importance and Marginal Effects analyses implied that counties with higher percentages of white households are more likely to have recorded eviction rates; eviction rates turned out to be predicted by the prevalence of white households' housing struggles (i.e., the worse that white households in a county fared, the more likely eviction rates were recorded in that county). Furthermore, the appearance of year dummy variables among the five most important variables suggests that exogenous political or economic factors specific to those years made eviction rates more or less likely to be recorded in New York. If more time were available, I would experiment with another set of models in which, instead of imputing missing values with the median eviction rate, I removed them entirely; I would be interested to see how the supervised learning methods perform with only the data actually recorded.
My original definition of success for this project was to build a model that minimizes error and to strengthen my understanding of how to employ supervised learning on a real-world topic of interest; I feel confident in saying that I have met those goals even in the face of difficulties with the data. This project turned out to be an exercise in interpretation - it challenged me to confront unexpected results and draw appropriate conclusions from their context - and in problem-solving when faced with difficult and messy data, a skill crucial to the practice of the social sciences.
References
“Affirmatively Furthering Fair Housing (AFFH) Data Documentation.” (2016). U.S. Department of Housing and Urban Development, Office of Policy Development & Research (PD&R). https://www.huduser.gov/publications/pdf/FR-5173-P-01_AFFH_data_documentation.pdf
Benfer, Emily, et al. (2020). “The COVID-19 Eviction Crisis: an Estimated 30-40 Million People in America Are at Risk.” Aspen Institute. https://www.aspeninstitute.org/blog-posts/the-covid-19-eviction-crisis-an-estimated-30-40-million-people-in-america-are-at-risk/
“Data and Tools for Fair Housing Planning.” (2020). Urban Institute, https://www.datacatalog.urban.org/dataset/data-and-tools-fair-housing-planning
Desmond, Matthew, et al. (2018). “Eviction Lab National Database: Version 1.0.” Princeton: Princeton University, https://www.evictionlab.org.
Desmond, Matthew, et al. (2018). “Eviction Lab Methodology Report: Version 1.0.” Princeton: Princeton University, https://www.evictionlab.org/methods.
Gilson, Roger Hannigan. (2021). “New York Banned Evictions. The Housing Crisis Got Worse.” https://therivernewsroom.com/new-york-eviction-ban-housing-crisis/
Goodman, Laurie & Ganesh, Bhargavi. (2017). “Low-income homeowners are as burdened by housing costs as renters.” Urban Institute. https://www.urban.org/urban-wire/low-income-homeowners-are-burdened-housing-costs-renters
Greenberg, Gershenson, & Desmond. (2016). “Discrimination in Evictions: Empirical Evidence and Legal Challenges.” Harvard Civil Rights, Civil Liberties Law Review. https://scholar.harvard.edu/files/mdesmond/files/greenberg_et_al._.pdf
Hartman, Chester & Robinson, David. (2003). “Evictions: The Hidden Housing Problem.” Housing Policy Debate, 14(4). https://www.innovations.harvard.edu/sites/default/files/10950.pdf
Hepburn, Louis, & Desmond. (2020). “Racial and Gender Disparities among Evicted Americans.” The Eviction Lab at Princeton University. https://evictionlab.org/demographics-of-eviction/
Kuhn, Max. (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret
Kuhn, Max and Hadley Wickham (2021). recipes: Preprocessing Tools to Create Design Matrices. R package version 0.1.16. https://CRAN.R-project.org/package=recipes