Understanding and predicting compliance for property maintenance fines in Detroit

Back when I first tackled this project, I didn’t really understand, or rather didn’t pay attention to, the actual business problem being addressed. Like a lot of learners who are just getting into data science or machine learning, I dove straight into the modeling part.

Now, as a (hopefully) more mature practitioner of solving problems with data, I decided to redo this project with:

  • A more holistic approach
  • A clearer understanding of the business problem at hand
  • More data exploration and analysis

The project was originally created as a real data science challenge where the Michigan Data Science Team[1] (part of the University of Michigan) worked with the Student Symposium for Interdisciplinary Statistical Sciences and the City of Detroit to predict blight compliance and understand why the rate of compliance was so low. The dataset was extracted from the Detroit Open Data Portal and is publicly available.

This is a relatively older project of mine but one that I decided to revisit recently. It was part of the capstone project for the Machine Learning course of the Applied Data Science with Python Specialization by the University of Michigan on Coursera.

Let’s dig into some of the background and motivation for this project.

Detroit’s blight problem

What exactly is blight?

Blight is a term used to signify property or land that is in poor condition due to a lack of upkeep and maintenance. As a result, the property eventually becomes unsafe or uninhabitable.

Property blight affects more than 20% of properties in Detroit, which is an astonishing statistic. To tackle this problem, both the state and federal governments started an initiative back in 2013 to remedy the situation by issuing maintenance fines to residents and owners of the affected properties. Owners would be served a fine, or blight ticket, if their property was deemed blighted, as a measure to promote upkeep, cleanliness and safety.

Sounds like an easy solution, right? Not quite.

It turned out that less than 10% of blight tickets were paid, leaving more than $70,000,000 in unpaid fines. To make things worse, the cost of professionally removing the blight was estimated at around $2,000,000,000 over a period of five years, making the prevention of blight a top priority as opposed to fixing it later.

How did the situation get so bad?

As we all may know, Detroit was the hotbed for the automotive industry in the U.S. in the 1900s with large manufacturers such as General Motors (GM), Ford and Chrysler thriving during the period. With the decline of these automakers, there was a subsequent loss of jobs and suburbanization of the city’s metro area.

This led to the mass abandonment of properties and lots in the more urban areas, which persisted for decades. The population reportedly declined from 1,800,000 in 1950 to 700,000 in 2017, leaving a large number of properties and lots abandoned.

There have been a reported 65,000 mortgage foreclosures caused by high-interest-rate sub-prime mortgages since 2005, and 56% of those properties ended up blighted. A survey conducted during the 2013-2014 period revealed that around 84,461 properties had been deemed blighted. In 2013, the Obama Administration established the Detroit Blight Task Force to battle the blight issue, but it has continued to be a challenge.

This is where the application of data comes in.

Solving the problem with data

For this project, we had the following three goals in mind:

  • Predict if a person is going to comply with their property maintenance fine or not
  • Understand why people failed to comply
  • Make sure that fines are paid on time

By identifying the reasons for blight non-compliance, the Detroit government could better understand how to approach the people responsible for maintaining their properties. If this can be done, it would help in solving the blight issue in two ways:

  • Enforce fine payment which would add to funds required to remove blight
  • Help property owners with their maintenance issues and improve upkeep, which as discussed earlier is a significantly cheaper alternative

Data description

The data used for this analysis was provided by the course and can be found on my GitHub profile:

  • Training set:
    • Time span: 2004-2011
    • 250,306 rows
    • 34 variables
  • Test set:
    • Time span: 2012-2016
    • 61,001 rows
  • Addresses:
    • Mapping from ticket ID to addresses AND
    • Mapping from addresses to latitude and longitude coordinates

Each row in the train and test data corresponded to a single blight ticket, and included information about when, why, and to whom each ticket was issued. The target variable was compliance, which was True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.

Since we were trying to predict whether a blight fine would be paid or not, this was a binary classification problem with the following target classes:

  • Class 0: Non-compliance
  • Class 1: Compliance

Note that certain variables, such as payment amount and balance due, were only included in the training data for informational purposes. Since this information was not available in the test set, we could not use it in the modeling process.

The description for each variable in the train and test sets can be found here.

Data cleaning and wrangling

Criteria for initial feature selection

This part was fairly simple after we took a cursory look at the variables as well as any missing values. The following criteria were used for removing the initial set of variables; a pandas sketch of this step follows the list below.

  • Variables that provided payment information, as that could lead to data leakage – Ex. Late fee
  • Variables with too many missing/null values – Ex. Violation zip code
  • Variables that contained too many categories
  • Variables that obviously did not seem to be good predictors of the target variable. Ex. State (since there was only one state, Michigan)
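
As a rough illustration, here is a minimal pandas sketch of this pruning step. It assumes the training data is loaded into a DataFrame called `train`; the file path and the exact list of dropped columns are illustrative, based on the examples above.

```python
import pandas as pd

# Load the raw training data (path and encoding are illustrative)
train = pd.read_csv("train.csv", encoding="ISO-8859-1", low_memory=False)

# Drop variables per the criteria above: payment-related columns that would
# leak the outcome, mostly-null columns, and single-valued columns
drop_cols = [
    "late_fee", "payment_amount", "balance_due",   # payment information (leakage)
    "violation_zip_code",                          # mostly missing
    "state",                                       # single value (Michigan)
]
train = train.drop(columns=drop_cols, errors="ignore")
```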

Dropping categorical variables with too many unique values OR categories

The number of unique categories for each remaining categorical variable is shown in the bar chart below, which was helpful in further feature selection. In some cases, a lot of unique categories could have been grouped into just a few categories for easier analysis. However, variables that provided information such as names or IDs were too unique to extract any useful information out of them.

Variables that contained too many unique values to be useful for prediction:

  • violator_name: ~120,000
  • violation_street_number: ~20,000
  • mailing_address_str_number: ~15,000
  • mailing_address_str_name: ~40,000
  • violation_street_name: 1,791
  • violation_code: 235
  • violation_description: 258
  • zip_code: 5,643

We kept ticket_issued_date since it was used later for feature engineering. The rest of the aforementioned variables were dropped.
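
The counts themselves can be obtained with a one-liner; the sketch below assumes the same `train` DataFrame and then drops the high-cardinality columns listed above, keeping ticket_issued_date for later.

```python
# Count unique values per column to spot identifier-like,
# high-cardinality variables
print(train.nunique().sort_values(ascending=False))

# Drop the high-cardinality columns; ticket_issued_date is kept
# for feature engineering later on
high_card_cols = [
    "violator_name", "violation_street_number", "violation_street_name",
    "mailing_address_str_number", "mailing_address_str_name",
    "violation_code", "violation_description", "zip_code",
]
train = train.drop(columns=high_card_cols, errors="ignore")
```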

Filtering the dataset to only include violations in Detroit

For this step, all values for the ‘city’ variable were first changed to upper case before filtering by ‘DETROIT’. Following this step, the city variable was dropped since all the values were now the same.

This left us with 96,001 non-null instances of the compliance variable, which indicated that the other locations also accounted for a good share of the compliance values. The observations for the other variables were reduced to the same set of rows as the compliance variable to form the cleaned dataset.
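
A minimal sketch of this filtering step, again assuming the `train` DataFrame and the original column names:

```python
# Normalize the city names, keep only Detroit tickets, then drop the
# now-constant city column and the rows where the target itself is null
train["city"] = train["city"].str.upper().str.strip()
train = train[train["city"] == "DETROIT"].drop(columns="city")
train = train[train["compliance"].notnull()]
```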

Exploratory data analysis

From the visualization of the agency name variable with compliance, it was tricky to determine clearly if a particular agency or agencies tended to favor a particular compliance value. This also had to do with the number of non-compliances (class 0) being far greater than the number of compliances (class 1). Hence, we had a class imbalance issue.

However, two categories that stood out were the Health Department and Buildings, Safety Engineering & Env Department which had slightly higher proportions of non-compliance instances than the others.

For the disposition variable, the Responsible by Default category had a significantly greater proportion of non-compliance values than the other categories, which could be exploited to distinguish between the classes of the target variable.

Plotting compliance across the fine_amount variable showed somewhat of a distinction where the median amount for compliance was lower than that of non-compliance.

Barely any distinction could be made across compliance for the discount_amount variable.

The judgement_amount variable showed significantly higher mean values for the non-compliance class which could be a distinguishing factor.
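
Plots of this kind can be produced along the following lines; a minimal seaborn sketch over the cleaned `train` DataFrame, with the specific plot choices being illustrative:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Compliance counts per disposition category
sns.countplot(data=train, x="disposition", hue="compliance")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

# Distribution of judgement_amount for each compliance class
sns.boxplot(data=train, x="compliance", y="judgement_amount", showfliers=False)
plt.show()
```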

Data preparation

Feature selection

Based on the EDA, variables such as admin_fee, state_fee and clean_up_cost were removed since they had the same value for all observations and hence were not useful for distinguishing between the target variable categories. A snippet of the resulting dataframe is shown below.

Feature engineering

A new variable called time_to_hearing was created which captured the time difference in days between the hearing date and the date on which the ticket was issued. This variable turned out to be an important one for the model.

Below, we can see that the mean time to hearing for the compliance category is slightly lower than the non-compliance category which creates a slight distinction.
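
A minimal sketch of this feature, assuming the two date columns are still raw strings in the `train` DataFrame:

```python
# time_to_hearing: days between the hearing date and the ticket issue date
train["ticket_issued_date"] = pd.to_datetime(train["ticket_issued_date"])
train["hearing_date"] = pd.to_datetime(train["hearing_date"])
train["time_to_hearing"] = (train["hearing_date"]
                            - train["ticket_issued_date"]).dt.days
```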

Handling missing values

Next, we dropped any observations with missing values from the dataset. Other ways to handle missing values could be imputation or rule-based prediction; however, since the number of missing values in this case was low, it was safe to drop those rows. We also dropped the ticket_issued_date and hearing_date variables, as we no longer needed them.
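
In pandas this reduces to a couple of calls on the assumed `train` DataFrame:

```python
# Drop the few rows with missing values and the raw date columns,
# which are no longer needed after engineering time_to_hearing
train = train.dropna()
train = train.drop(columns=["ticket_issued_date", "hearing_date"])
```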

Pre-processing

We used label encoding to convert the categorical variables into discrete numerical values to be used in the baseline model. Another set of data was created using one-hot encoding (OHE) to be used in an alternate model. For linear models, the data was further scaled using the min/max scaler so that all variables were weighted equally during the modeling process.

The training data was split into the explanatory and target variables, and then further split into train and test sets using the train_test_split() function of the scikit-learn library.
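
A sketch of this preparation using scikit-learn; the one-hot column choices, random seed and default split size are assumptions rather than the project’s exact settings:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Label-encoded copy for the baseline and tree-based models
encoded = train.copy()
for col in encoded.select_dtypes(include="object").columns:
    encoded[col] = LabelEncoder().fit_transform(encoded[col].astype(str))

# One-hot-encoded copy for the linear models
ohe = pd.get_dummies(train, columns=["agency_name", "disposition"])

# Split into explanatory and target variables, then into train/test sets
X = encoded.drop(columns="compliance")
y = encoded["compliance"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Min/max scaling for the linear models so features share a common range
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```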

Modeling

To come up with the best model for our prediction, a few different algorithms, as well as variations of those algorithms, were explored. A model creation and diagnostic pipeline was created, both for tree-based and non-tree-based models, consisting of the following key steps (a minimal sketch of this pipeline follows the list):

  1. Modeling
  2. Prediction
  3. Calculating prediction probabilities
  4. Determining scoring metrics such as accuracy and precision
  5. Visualizing ROC AUC
  6. Visualizing precision/recall curve
  7. Visualizing confusion matrix
  8. Displaying most important predictor variables (only for tree-based models)
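
A minimal sketch of such a diagnostic helper is shown below, built on scikit-learn; the exact reporting and plotting choices in the original pipeline may have differed:

```python
from sklearn.metrics import (RocCurveDisplay, PrecisionRecallDisplay,
                             accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

def diagnose(model, X_train, X_test, y_train, y_test, tree_based=False):
    """Fit a model, then report the metrics and plots used in this project."""
    model.fit(X_train, y_train)                    # 1. modeling
    y_pred = model.predict(X_test)                 # 2. prediction
    y_proba = model.predict_proba(X_test)[:, 1]    # 3. prediction probabilities

    # 4. scoring metrics
    print("Accuracy :", accuracy_score(y_test, y_pred))
    print("ROC AUC  :", roc_auc_score(y_test, y_proba))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall   :", recall_score(y_test, y_pred))
    print("F1 score :", f1_score(y_test, y_pred))

    # 5-6. ROC and precision/recall curves
    RocCurveDisplay.from_predictions(y_test, y_proba)
    PrecisionRecallDisplay.from_predictions(y_test, y_proba)

    # 7. confusion matrix
    print(confusion_matrix(y_test, y_pred))

    # 8. most important predictors (tree-based models only)
    if tree_based:
        print(sorted(zip(model.feature_importances_, X_train.columns),
                     reverse=True)[:10])
    return model
```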

Baseline model

A simple logistic regression model was created as a baseline with the following variations:

  • With scaling and preprocessing (one hot encoding)
  • With no scaling or preprocessing
  • With one hot encoding but no scaling

We discovered that the model with OHE but no scaling yielded better accuracy, ROC AUC and recall scores than the model with scaling. Below is a table showing the comparison, followed by a sketch of these baseline variations. This baseline model was optimized using grid search, but no significant improvements were noticed.

| Scoring metric | Model with OHE but no scaling | Model with OHE and scaling |
|----------------|-------------------------------|----------------------------|
| Accuracy       | 0.939                         | 0.931                      |
| ROC AUC        | 0.796                         | 0.774                      |
| Precision      | 0.891                         | 0.977                      |
| Recall         | 0.147                         | 0.026                      |
| F1 Score       | 0.252                         | 0.051                      |
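
A hedged sketch of these baseline variations, reusing the diagnose() helper from above; the one-hot matrices (X_train_ohe, X_train_ohe_scaled and so on) and the grid of C values are assumed names and settings, not the project’s exact ones:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Baseline 1: one-hot encoding, no scaling
diagnose(LogisticRegression(max_iter=1000),
         X_train_ohe, X_test_ohe, y_train, y_test)

# Baseline 2: one-hot encoding plus min/max scaling
diagnose(LogisticRegression(max_iter=1000),
         X_train_ohe_scaled, X_test_ohe_scaled, y_train, y_test)

# Grid search over the regularization strength (illustrative grid)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]},
                    scoring="roc_auc", cv=5)
grid.fit(X_train_ohe, y_train)
print(grid.best_params_, grid.best_score_)
```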

Tree-based models

The following tree-based models, along with their optimized versions, were tested:

  • Decision trees
  • Random forest
  • Gradient boosting trees
  • LightGBM (LGBM)

Most of the tree-based models performed slightly better than the baseline model in terms of the ROC AUC score which was the most important metric we took into consideration for the model selection process. Once the model was selected, we placed more emphasis on the specificity metric as described later.
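
A sketch of this comparison, again run through diagnose(); the hyperparameters shown are placeholders, not the tuned values:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

tree_models = {
    "Decision tree": DecisionTreeClassifier(max_depth=8, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "LightGBM": LGBMClassifier(random_state=0),
}

for name, model in tree_models.items():
    print(f"--- {name} ---")
    diagnose(model, X_train, X_test, y_train, y_test, tree_based=True)
```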

Oversampling with minority class

Out of the aforementioned models, an optimized LGBM model seemed to perform the best. However, since we were dealing with class imbalance, it was crucial to oversample the minority class so that both classes carried equal weight during training.

Why is handling class imbalance important? Because a model trained on an imbalanced sample tends to report an inflated accuracy score, since most instances get predicted as the majority class. For example, given a training set of 100 instances with 95 negative and 5 positive classes, a model that predicted everything as the negative class would be 95% accurate without correctly identifying a single positive instance.

We used the Synthetic Minority Oversampling Technique (SMOTE) to create new class 1 samples to match the number of class 0 observations.
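
A minimal sketch using the imbalanced-learn library; note that oversampling is applied to the training split only, so the evaluation data keeps the real class balance:

```python
from imblearn.over_sampling import SMOTE

# Generate synthetic class-1 (compliant) samples until both classes
# have the same number of training observations
smote = SMOTE(random_state=0)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())      # before: heavily imbalanced
print(y_train_res.value_counts())  # after: balanced classes
```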

Picking the right metric

For our application of predicting which individuals were likely to be compliant and which were not, we needed to determine what the cost of a wrong prediction was. In other words, would false negatives be more costly or false positives?

Since it’s a tedious and expensive process for the authorities to rectify unpaid blight fines, the idea is to identify individuals who may not pay the fines on time and proactively take early steps to ensure that fines are paid or engage with them to understand why they are not able to maintain their properties.

As the costs of enforcing the fines early are far less than doing so after a fine goes unpaid, it would be better to identify as many people as possible who may be non-compliant, which would be the true negatives in this case. Hence, we wanted to reduce the number of false positives as much as possible, which in turn would mean a greater specificity score.

Specificity = TN/(TN+FP)

Hence, recall was not that important, whereas specificity was the metric that needed to be optimized, since it focuses on reducing false positives and increasing true negatives. These considerations were the most critical for our application of identifying individuals who may fail to pay fines on time.
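
Since scikit-learn does not ship a dedicated specificity scorer, a small helper can be derived from the confusion matrix (a sketch, assuming binary 0/1 labels):

```python
from sklearn.metrics import confusion_matrix

def specificity_score(y_true, y_pred):
    """Specificity = TN / (TN + FP); here class 0 (non-compliance) is negative."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return tn / (tn + fp)
```

Equivalently, specificity is the recall of the negative class, so recall_score(y_true, y_pred, pos_label=0) gives the same number.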

Final model

Next we aggregated all the models that we tested to compare scores for each metric. The results are shown in the table below.

The plot below provides a visual representation of the above comparison, where we can clearly see that the biggest fluctuations occur for precision and recall, whereas metrics such as accuracy, ROC AUC and specificity are fairly consistent across models, with only slight shifts.

Based on the reasoning above, there were several models that could work for our application. However, we picked Model 19 which was the oversampled and optimized LGBM model. A few reasons for picking this model were:

  • It provided a relatively high ROC AUC score of 0.788
  • It provided high accuracy and specificity scores of 0.892 and 0.918
  • It would generalize to unseen data better since it was oversampled. Some of the other models, where the minority class was not oversampled, tended to have an unusually high specificity score greater than 0.99, which indicated that they were possibly predicting a large number of instances as non-compliant. Since that was the majority class, the bias was not completely apparent. Therefore, it was best to go with an oversampled model even though the resulting specificity was not the highest.
  • Since it was a tree-based model, we could inspect the best predictors
  • LGBM is a faster variant of gradient boosting and uses less memory

Model 19 diagnosis

The ROC AUC and precision-recall curves are visualized below. In case anyone needed a refresher, the points on the curve represent scores at different decision thresholds. For example, if we move down the precision-recall curve, the recall would increase but the precision would decrease. Ideally, we would like both of these metrics to be closer to 1 but that’s not a realistic case.

Below, we see the confusion matrix for model 19, where we can notice that a large number of non-compliance labels were accurately predicted (bottom-right quadrant). This is due to non-compliance being by far the majority label, as discussed earlier. We could possibly optimize these results further, as discussed later.

In the bar chart below, we can see that time_to_hearing and disposition are the most important predictors, followed by agency_name, judgement_amount, fine_amount and discount_amount. Recall that we needed to perform some feature engineering to create the time_to_hearing variable, which turned out to be crucial for our analysis.
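
A chart like this can be built directly from the fitted model’s feature_importances_ attribute; final_model is an assumed variable name for the chosen LGBM model:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rank the predictors by the importance assigned by the tree-based model
importances = pd.Series(final_model.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh()
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```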

Results and further action

Predicting unseen data

We tested the final model on a “hold-out” test set containing 61,001 observations. This dataset went through the same cleaning and pre-processing steps as the training set so that the predictions would be consistent with what the model was trained on.

Based on the predictions, 11.48% of the instances were identified as compliant, whereas 88.52% were identified as non-compliant. These statistics are similar to what we observed while diagnosing the final LGBM model, where 11.62% of the population was identified as compliant, so the numbers are fairly close. The training set, on the other hand, differed slightly, with 7.06% of instances being compliant.

Below, scatter plots of the predicted probabilities are shown for both the positive (compliant) and negative (non-compliant) cases, where we can see that the bulk of the negative-class population lies above the 0.5 decision threshold. As discussed above, if we wanted to improve the specificity further, the threshold could be adjusted to a higher value in order to reduce the number of false positives.
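
As a sketch of how that threshold shift could be applied to the hold-out predictions (final_model, X_holdout and the 0.7 cutoff are illustrative assumptions):

```python
# Probability of the positive (compliant) class for each hold-out ticket
proba_compliant = final_model.predict_proba(X_holdout)[:, 1]

# Raise the decision threshold above the default 0.5 so fewer borderline
# tickets are labeled compliant, trading recall for higher specificity
threshold = 0.7
predicted_compliance = (proba_compliant >= threshold).astype(int)
```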

The histograms below provide an alternate view of the probability distributions.

Actionable items

So we have determined that a massive 88.52% of the ticketed individuals in Detroit are non-compliant. What actions can we take?

  • We can work with the City of Detroit to reach out to these 53,998 people to understand why they have not been able to pay their fines. There are a few options from there:
    1. Make them pay the fines, which would then help in restoring the blighted properties
    2. If they are in a difficult financial situation, provide them with the assistance needed to improve their situation.
    3. Have them engage in community service or help the government restore the blighted properties in exchange for pardoning the fines
    4. Help or train them in good property maintenance practices to prevent such occurrences in the future
  • Try to understand what specific factors cause these people to be non-compliant. From our analysis, the following crucial variables could be used to take action:
    1. Time to hearing: In general, it seems that people whose time to hearing is greater tend to be non-compliant, but only by a small margin.
    2. Disposition: People who are responsible by default seem to have a distinctly higher percentage of non-compliance (~96%). Therefore, it may be good to target such individuals early to make sure they are aware of the fine and that they have access to the correct resources to be able to pay the fines.
    3. Agency name: Two agencies that stood out as having slightly higher rates of non-compliant individuals were Building, Safety, Engineering and Environmental Department and Health Department. The city could work closely with these agencies to enforce the fines or engage with the appropriate individuals to help them.
    4. Judgement amount: It turned out that non-compliant individuals tend to have a significantly higher judgement amount, with a mean of ~$394 compared to ~$256 for compliant individuals. This goes back to the financial hardships these people may be facing, and it may also be the reason why they are not able to keep up with the maintenance of their properties. One way to tackle the situation early would be to identify lower-income individuals or families who are fined in order to help them with their financial situation.

References

[1] Michigan Data Science Team – Detroit Blight Analysis (mdst.club)