Analyzing collision events in Seattle and predicting severe occurrences

Why do we care about severe collisions?

The safety of drivers, passengers, pedestrians and property is an ever-growing concern when it comes to the operation of vehicles on the road. With 6,452,000 motor vehicle crashes and 37,133 crash-related deaths in the US in 2017 alone, collisions impose significant costs in terms of life, money, and property every year.



To make things more complicated, the number of vehicles in the US keeps growing every year, creating a need for greater safety measures on the road. Insurance companies play a significant role in minimizing and handling automobile collisions through the services they provide in exchange for insurance premiums. These insurance rates could be predicted more accurately based on factors such as location, weather, road dimensions, speed of traffic, date/time, whether the driver was under the influence (DUI), etc., most of which depend on location.

In this report, we use collision data from the city of Seattle to implement a machine learning model to predict the severity of collisions using the above variables and others. Furthermore, we explore the variables in detail to find patterns and insights related to the geography of the area, trends across time and traffic variables that affect collisions.


Audience

Some of the biggest insurers in the US, such as State Farm, Esurance and Allstate, have stated location as one of their top criteria for determining rates. These and other insurance companies could use this model to determine more accurate rates for their customers. They could also give clients useful guidance, such as flagging that buying an expensive car in a more accident-prone location would likely increase their insurance premiums.

Another audience for this model could be rideshare companies as well as companies such as Google and Apple that provide mapping apps. Furthermore, traffic control departments could collect this data firsthand and make it available to the aforementioned companies.


Data description

The data was acquired from the City of Seattle Open Data Portal and consists of 212,760 instances of vehicle collisions in Seattle spanning 2004-2018. Covering mostly the Seattle area, the data was extracted using the ArcGIS REST API in .csv format and is made available on the Seattle GeoData page.

Collision event features such as date of collision, location of collision, type of collision (sideswipe, rear-end, parked car etc.), number of people involved, number of fatalities, junction type (intersection, driveway junction, mid-block etc.), weather conditions, etc. are described and analyzed in this post.

In total, 40 variables are included in the dataset but this number is later reduced to include only the ones crucial for building the model. As location played an important role in the analysis, addresses for each instance were extracted using Python’s reverse_geocoder library and added to the dataset. This library was used with latitudes and longitudes as inputs to retrieve the neighborhood for each instance, stored in a variable called Neighborhood in the dataframe.
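As a rough illustration, here is a minimal sketch of this lookup, assuming Latitude and Longitude column names (the actual column names in the collision dataset may differ):

```python
import pandas as pd
import reverse_geocoder as rg

def add_neighborhood(df: pd.DataFrame) -> pd.DataFrame:
    """Append a Neighborhood column using offline reverse geocoding."""
    # Column names are assumptions for illustration.
    coords = list(zip(df["Latitude"], df["Longitude"]))
    # rg.search() returns one record per coordinate; the 'name' field
    # holds the nearest populated place, used here as the neighborhood.
    results = rg.search(coords)
    df["Neighborhood"] = [r["name"] for r in results]
    return df
```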

The TomTom API was used to retrieve the free-flow traffic speed on the road closest to each coordinate. This data was retrieved as a JSON response, which was then parsed using the requests and json libraries in Python and finally stored as Speed in the dataframe. Additionally, the HERE API was used to extract road length and road congestion data. Both of these were estimated using the road closest to the provided coordinate and stored as Road Length and Road Congestion respectively in the main dataset.
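The sketch below shows the general requests/json pattern for the speed lookup. The TomTom flow segment endpoint and the freeFlowSpeed field are assumptions based on TomTom’s public documentation and should be verified against the current API reference; the HERE calls for road length and congestion would follow the same pattern.

```python
import json
import requests

TOMTOM_KEY = "YOUR_TOMTOM_API_KEY"  # placeholder, not a real key

def free_flow_speed(lat, lon):
    """Return the free-flow speed on the road closest to (lat, lon).

    Endpoint and field names are assumptions and may need adjusting."""
    url = (
        "https://api.tomtom.com/traffic/services/4/flowSegmentData/"
        f"absolute/10/json?point={lat},{lon}&key={TOMTOM_KEY}"
    )
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = json.loads(response.text)  # parse the JSON payload
    return data["flowSegmentData"]["freeFlowSpeed"]

# Example usage (hypothetical column names):
# df["Speed"] = [free_flow_speed(lat, lon)
#                for lat, lon in zip(df["Latitude"], df["Longitude"])]
```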


Data wrangling and cleaning

The original dataset called Collisions.csv was loaded as a dataframe into a Jupyter Notebook, hosted by the SageMaker machine learning (ML) platform on Amazon Web Services (AWS). The shape of the dataset was confirmed to be 212,760 rows by 40 columns.
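A minimal sketch of this loading step:

```python
import pandas as pd

# Load the raw collision data and confirm its shape.
df = pd.read_csv("Collisions.csv", low_memory=False)
print(df.shape)  # expected: (212760, 40)
```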



Shown below is a flowchart that highlights the data cleaning process. For details on the initial variable selection, the handling of missing values and the other cleaning steps, take a look at the following notebooks (links added):



Exploratory data analysis

a. Geographical analysis

In this section, the coordinates are used to visualize the crash locations across Seattle in order to get an idea of the areas of high collision density as well as areas of greater severity.



The figure above shows all neighborhoods in Seattle as blue markers, created using Python’s Folium library. It is instantly noticeable that the neighborhood right at the center of the map (Seattle) contains the majority of the collision instances, with the other neighborhoods lying at the fringes of the city.
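A minimal Folium sketch of how such a marker map can be built, assuming the Neighborhood, Latitude and Longitude column names used earlier (one marker per neighborhood, placed at the mean coordinate of its collisions):

```python
import folium

# Center the map on Seattle.
seattle_map = folium.Map(location=[47.6062, -122.3321], zoom_start=11)

# One blue marker per neighborhood at the mean collision coordinate.
centers = df.groupby("Neighborhood")[["Latitude", "Longitude"]].mean()
for name, row in centers.iterrows():
    folium.Marker(
        location=[row["Latitude"], row["Longitude"]],
        popup=name,
        icon=folium.Icon(color="blue"),
    ).add_to(seattle_map)

seattle_map  # renders inline in a Jupyter notebook
```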




The visualization above provides a clearer idea of all the collisions as a scatter plot of the coordinates. Due to the fairly large number of instances, the plot essentially maps out the entire Seattle area. On the map, the green points denote non-severe collisions while the red points denote severe cases. As expected, the green instances are much more prevalent. Despite the sparse nature of the severe points, they can still be seen across the map.

The severe and non-severe incident locations are plotted separately below. The first plot, in green, displays the non-severe cases. The second plot shows the severe cases, which appear sparse due to the low frequency of the severe class. The third plot is the same as the second but with the transparency increased so that the points can be visualized better.
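A minimal matplotlib sketch of how such severity-split scatter plots can be drawn, assuming a 0/1 Severe label and the coordinate columns used earlier:

```python
import matplotlib.pyplot as plt

non_severe = df[df["Severe"] == 0]
severe = df[df["Severe"] == 1]

fig, axes = plt.subplots(1, 3, figsize=(18, 6), sharex=True, sharey=True)
axes[0].scatter(non_severe["Longitude"], non_severe["Latitude"],
                s=1, c="green", alpha=0.1)
axes[0].set_title("Non-severe collisions")
axes[1].scatter(severe["Longitude"], severe["Latitude"], s=1, c="red")
axes[1].set_title("Severe collisions")
axes[2].scatter(severe["Longitude"], severe["Latitude"],
                s=1, c="red", alpha=0.3)
axes[2].set_title("Severe collisions (adjusted transparency)")
for ax in axes:
    ax.set_xlabel("Longitude")
axes[0].set_ylabel("Latitude")
plt.show()
```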




In the first plot, the center of Seattle sees the highest density of collisions, with several streets towards the north also showing dense regions. The density is relatively lower in the south.

Here are some key areas where the density of severe collisions is relatively higher:

  • Center of the city: highest density of crashes
  • Aurora Ave North: road going north
  • Rainier Ave South: road going south-east
  • 15th Ave Northwest: road going north-west
  • Lake City Way Northeast: road going north-east
  • 24th Ave East: towards the east

Most of these areas are also dense for the non-severe cases, but the non-severe plot has additional regions of high density, whereas the severe plot mostly emphasizes just a few areas.

Distributions of severe and non-severe cases across all the neighborhoods are shown in the figure below. For both severity cases, Seattle is by far the most prevalent neighborhood. This makes sense as most of the other neighborhoods are at the edges of the city as we observed from the folium map earlier.




The next two most frequent neighborhoods are Shoreline and White Center for both severity categories. After that, neighborhoods such as Lake Forest Park, Bryn Mawr-Skyway and Riverton follow. An important point to note here is that the distributions of neighborhoods for the two severity cases are almost identical. Hence, there is not much information here to differentiate non-severe and severe instances.


b. Analyzing weather-related variables

For both severity cases, ‘Clear or Partly Cloudy’ conditions are the most prevalent, possibly because there is more traffic on the roads under those conditions, which in turn causes more collisions.

To normalize for this effect, we would need to know the rate of traffic flow under each condition and then divide the count for each category by the corresponding traffic rate. Unfortunately, we don’t have access to such data at the moment. When it comes to the other weather categories, the distribution is once again similar for both severity cases.





The case is the same for Light Condition as most of the collisions are during the daytime and there is nothing to indicate an obvious difference between severe and non-severe cases.





Road Condition follows the same trend, as seen below. The ‘Dry’ category is the most predominant, possibly because most driving happens under these conditions, which in turn leads to more collisions. For both severity classes, wet conditions are less frequent while icy conditions are rare.





c. Analyzing address-related variables

For the Address Type variable, the ‘Intersection’ category contains significantly fewer instances than the ‘Block’ category, as shown below.



The plot below shows that ‘Intersection’ has a slightly higher frequency than ‘Block’ for the severe class, whereas the non-severe class has roughly double the number of ‘Block’ instances compared to ‘Intersection’, which also matches the overall trend for Address Type. Hence, this could be a telling distinction for our model.



For Junction Type, once again ‘Intersection’ collisions are more frequent for the severe class whereas ‘Mid-Block’ is more prevalent for non-severe cases. We are mainly considering the two most frequent classes for each severity as they cover most of the distribution.




d. Analyzing variables related to the road

For the speed variable, the histograms for the two severity cases as well as the overall speed distribution are almost identical, as shown below. Hence, not much information can be gathered from this data. However, the boxplot reveals that the severe cases have a slightly higher median: ~41 mph for the severe class versus ~38 mph for the non-severe class.




From the plots below, we can see that the distributions of road congestion across severity are almost identical. The median for non-severe instances (0.50823) is somewhat lower than that for severe instances (0.63665), as shown in the boxplots below. Note that for this metric, values closer to 1 indicate higher congestion while values closer to 0 indicate lower congestion.




Similar to road congestion, the distributions of road length are also very close across the severity classes, with medians that differ by only ~0.04. Overall, despite the fact that speed, road congestion and road length don’t provide a clear distinction between the severity levels, they could still prove useful for our model.





e. Time series analysis

The plot below displays the frequency of collisions across years, showing an overall downward trend. A closer look reveals a decline from 2006 to 2011 followed by an upward trend until 2016, and then another decline from 2017 to 2019.



Looking at the average severe collision rate over the years, there is a sharp increase from 2005 to 2006 followed by a slump until 2008. After another increase until 2009, there was a downward trend until 2011, a one-year increase until 2012, yet another decline until 2015, and then a continuous increase until 2019. Overall, there is a downward trend from 2006 to 2015 followed by an upward trend since.



The plot below shows the average severity rate across months, where July and August clearly have a higher rate than the other months. April through June, as well as the winter months of February and December, see the lowest rates. The monthly and weekly collision frequencies are also plotted below, showing the respective trends across time from 2004 to 2019.





Shown above are the average yearly, monthly and weekly collision severity rates across time. Note that the severe and non-severe class labels had to be converted to integers (1 for severe and 0 for non-severe). Hence, points that are higher in value represent a larger share of severe collision instances. All points have values much closer to 0 due to class imbalance.

Due to the much larger proportion of non-severe instances, an average of points at a certain point in time is likely to be skewed towards 0.
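A minimal sketch of how these time-averaged severity rates can be computed with pandas, assuming a datetime column named Incident Date and the 0/1 Severe label:

```python
import pandas as pd

df["Incident Date"] = pd.to_datetime(df["Incident Date"])
ts = df.set_index("Incident Date")["Severe"]

# The mean of a 0/1 label over a period is the fraction of collisions
# in that period that were severe.
yearly_rate = ts.resample("Y").mean()
monthly_rate = ts.resample("M").mean()
weekly_rate = ts.resample("W").mean()
```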



The hourly average severity rates are shown above, with ‘0’ indicating 12am and ‘23’ indicating 11pm. Some trends clearly stand out. The hours between 1am and 6am see a higher rate of severe collisions. The case is similar for 4pm-6pm, 9pm and 11pm; overall, between 4pm and 11pm there is an increasing trend in the severe collision rate.

Surprisingly, the rate is at its lowest at 12am. The hours between 8am and 3pm also see low severe collision rates, which is somewhat expected since light conditions tend to be at their best during this period.



From the plots below, the downward trend across the days of the month is much more pronounced for the non-severe case, with a higher moving average and a smoother curve. For severe collisions, we can observe a general downward trend, but with a dip and then a rise between days 5 and 10 of the month. One way to differentiate between the severity levels is to note that the early days of the month see fewer severe collisions.




The trends across days of the week are very similar for the severe and non-severe cases, as shown above. In both cases, Friday (4) sees the highest number of collisions while Sunday (6) sees the lowest. These numbers make sense: as we progress through the week, the number of collisions increases gradually, peaking on Friday.


Modeling

The overall process for modeling, optimization and evaluation is shown in the flowchart below.




a. Feature Selection

Variables that only provide information about the collision incident once it has occurred were dropped from the dataset as they could not be used for predicting the severity of future collisions. These are variables such as Number of People Involved and SDOT Collision Description.

Weather, Light Condition, Road Condition and Neighborhood were also dropped following initial modeling as they did not necessarily improve the model. In fact, removing the Neighborhood feature improved the model performance. The following features were retained:

  • Address Type
  • Junction Type
  • Month
  • Hour
  • Day of the Week
  • Speed
  • Road Congestion
  • Road Length
  • Severity Description
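A minimal sketch of this selection step; the column names follow the feature list above and may differ slightly from those in the raw dataset:

```python
# Keep only the retained features plus the target variable.
selected_columns = [
    "Address Type", "Junction Type", "Month", "Hour", "Day of the Week",
    "Speed", "Road Congestion", "Road Length", "Severity Description",
]
model_df = df[selected_columns].copy()
```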


b. Pre-processing

The categorical variables were converted to discrete integers using the LabelEncoder function in Python. Features that were encoded are Address Type, Junction Type, Month, Hour, Day of the Week and Severity Description. The encoded dataframe is shown below.

Note that feature scaling was not implemented initially since our selected model, LightGBM, does not require scaled features, unlike an algorithm such as Logistic Regression, which is a linear model and therefore requires scaling for optimum performance.
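A minimal sketch of the encoding step (column names as above):

```python
from sklearn.preprocessing import LabelEncoder

# Convert each categorical column to discrete integer codes.
categorical_cols = ["Address Type", "Junction Type", "Month",
                    "Hour", "Day of the Week", "Severity Description"]
for col in categorical_cols:
    model_df[col] = LabelEncoder().fit_transform(model_df[col].astype(str))
```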


c. Train test split

The dataset is then split into train and test samples, with the train sample taking 70% of the data. During the optimization phase, the test sample is treated as a validation sample: the train sample is used for parameter tuning and cross-validation, where it is split into several random train and test folds in order to determine the best average score.
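A minimal sketch of the split; the stratify argument (keeping the class ratio similar in both samples) and the random seed are assumptions, not details stated above:

```python
from sklearn.model_selection import train_test_split

X = model_df.drop(columns=["Severity Description"])
y = model_df["Severity Description"]  # target label (assumed to encode severe as 1)

x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42, stratify=y)
```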


d. Oversampling of minority class

In the case where the weather and neighborhood variables were included, the Synthetic Minority Oversampling Technique (SMOTE) improved the model performance (ROC AUC score). SMOTE creates synthetic data points that resemble the minority instances, making the representation of the two classes nearly equal and thereby mitigating the imbalance to an extent.



However, oversampling was not used for the final baseline model as it decreased the scoring metric. This is not surprising, as oversampling only works well when the initial model is reasonably robust to begin with. In our case, the class imbalance was severe, making oversampling ineffective.
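For reference, a minimal SMOTE sketch using imbalanced-learn, applied to the training sample only so that synthetic points never leak into the test sample (again, this step was not kept in the final model):

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
x_train_res, y_train_res = smote.fit_resample(x_train, y_train)

print(y_train.value_counts())      # original, heavily imbalanced counts
print(y_train_res.value_counts())  # classes now roughly balanced
```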


e. Modeling, prediction and evaluation

We fit the training data with the LightGBM classifier and tested the model by making predictions on the x_test sample. For evaluating the model, accuracy, precision, recall, and ROC AUC were used as the main metrics, all imported from sklearn.metrics. A first look at the results reveals an accuracy score of 0.9842, with the precision and recall scores taking the same value.
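A minimal sketch of the fit-and-evaluate step; the weighted averaging for precision and recall is an assumption about how the scores were computed:

```python
import lightgbm as lgb
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

clf = lgb.LGBMClassifier(random_state=42)
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)               # hard 0/1 predictions
y_proba = clf.predict_proba(x_test)[:, 1]  # probability of the severe class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("ROC AUC  :", roc_auc_score(y_test, y_proba))
```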



Another way of quickly visualizing these values is the classification_report function, although it does not show the accuracy and roc_auc scores. Looking at the classification report reveals the bigger picture.



The table above shows that the model is only predicting the majority class. This can be observed from the individual precision and recall values for the ‘1’ (severe) class, which are 0. In fact, if the predicted values are categorized into severe and non-severe, we notice that only one case was predicted as ‘severe’ (‘1’), and that prediction was wrong. This can be analyzed further using a confusion matrix, as shown below.



The confusion matrix shows that 60,502 instances were correctly predicted as being non-severe (true negative), 970 were incorrectly predicted as non-severe (false negative), 1 was incorrectly predicted as severe (false positive) and none were predicted correctly as severe (true positive).
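A minimal sketch of how the report and confusion matrix can be produced, continuing with y_test and y_pred from the previous step:

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```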

In order to explore this behavior further, we looked at the ROC AUC curve, shown below. The calculated ROC AUC score for this model was 0.654. This value is low mostly due to the large class imbalance in the data.



The figure displays the ROC curves for both the severe and non-severe classes as well as the combined curve. Moving along a curve corresponds to changing the threshold value for the predictions. Therefore, depending on the application, different thresholds can be set in order to meet a particular specification.


f. Model selection and optimization

Grid search was performed in order to tune the hyperparameters of the model. Cross-validation was used within the grid search algorithm with 5 folds in order to determine the best average ROC AUC score. This ensured that the optimization would not cause the obtained parameters to overfit the model.

The maximum depth, number of leaves, learning rate, and minimum data in each leaf were tuned to values of 5, 10, 0.1 and 20 respectively, with a best cross-validated score of 0.6397. Applying these parameters to the LightGBM model improved the baseline score from 0.654 to 0.659, which was further improved to 0.66 after scaling the features with sklearn’s MinMaxScaler. The resulting ROC AUC curve is shown below.
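A minimal sketch of the grid search; in LightGBM’s scikit-learn API the minimum data in each leaf is exposed as min_child_samples, and the exact parameter ranges searched are not listed above, so the grid below is illustrative:

```python
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [3, 5, 7, 10],
    "num_leaves": [10, 20, 31, 50],
    "learning_rate": [0.01, 0.05, 0.1],
    "min_child_samples": [10, 20, 50],
}

grid = GridSearchCV(
    estimator=lgb.LGBMClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",  # optimize the mean cross-validated ROC AUC
    cv=5,               # 5-fold cross-validation on the train sample
    n_jobs=-1,
)
grid.fit(x_train, y_train)

print(grid.best_params_)  # best hyperparameter combination
print(grid.best_score_)   # best mean cross-validated ROC AUC
```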



The confusion matrix below shows that, even now, none of the severe cases are being predicted due to the class imbalance. Hence, the classification threshold was tuned to increase the number of true positives (TP) and decrease the number of false negatives (FN).



The plots below show a sweep of threshold values from 0 to 1 with a step size of 0.01 to visualize the change in the above metrics.



Looking at the plots above, the number of TPs increases and the number of FNs decreases as the threshold decreases. We only start to see significant increases in TP counts at thresholds below 0.05, and at about 0.025 the TP and FN counts are equal.

The trend is analogous for TN and FP. The interesting thing to note here is that as we decrease the threshold, the accuracy decreases while the AUC remains the same, since we are simply moving along the ROC curve.

As the threshold decreases, instances move from TN to FP, decreasing the accuracy and increasing the misclassification error, as observed below. Similarly, instances move from FN to TP as the threshold decreases. In our application, where detecting severe collisions is the most important criterion, this is a fair tradeoff to consider.
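A minimal sketch of the threshold sweep described above, continuing with y_proba and y_test:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Sweep the decision threshold from 0 to 1 in steps of 0.01 and record
# the confusion-matrix counts and accuracy at each threshold.
results = []
for t in np.arange(0.0, 1.01, 0.01):
    preds = (y_proba >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds, labels=[0, 1]).ravel()
    results.append({"threshold": t, "TP": tp, "FN": fn, "FP": fp, "TN": tn,
                    "accuracy": (tp + tn) / (tp + tn + fp + fn)})
```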



The most important features for the model, based on the training data, are shown below. Road Length and Speed are the most important features, followed closely by Hour. This makes sense, as we had observed some association between these variables and severity during EDA. The least important features are Day of the Week and Address Type.
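A minimal sketch of how these importances can be read off the trained model (here the tuned estimator from the grid search):

```python
import pandas as pd
import lightgbm as lgb

best_clf = grid.best_estimator_  # tuned model from the grid search above

# Split-based feature importances, one value per input column.
importances = pd.Series(best_clf.feature_importances_, index=x_train.columns)
print(importances.sort_values(ascending=False))

# lgb.plot_importance(best_clf) draws the same information as a bar chart.
```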



For practical purposes, the importance numbers are relative rather than absolute. For example, Address Type is not necessarily a bad feature; it simply has less predictive power than a variable like Road Length.

The plot below shows ROC AUC scores for the train and test samples across the number of leaves and maximum depth hyperparameters. Since the number of leaves is the outer loop, the points at 0, 10, 20 and so on represent different numbers of leaves, while the points between 0-10, 10-20 and so on represent different maximum depth values.

The test sample performance remains fairly flat compared to the train sample, so higher train performance largely indicates greater overfitting. Based on this, points between 10 and 20 offer the best model in terms of overfitting. From point 20 onward, each increase in test performance comes with a proportionally larger increase in train performance, with the ideal points lying between 10 and 40; anything above that causes progressively more overfitting.




g. Other models

Other baseline models that were considered were Logistic Regression and Random Forest, which yielded ROC AUC scores of 0.629 and 0.545 respectively. For the Logistic Regression model, the crossover point between the TP and FN counts during the threshold sweep occurred much earlier. Therefore, it would also be a reasonable model for collision severity prediction, as the threshold would only need to be adjusted by a small amount.
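A minimal sketch of these baselines; scaling is applied only for the linear model, and the default hyperparameters shown are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score

# Scale features for Logistic Regression; tree ensembles don't need it.
scaler = MinMaxScaler().fit(x_train)
x_train_s, x_test_s = scaler.transform(x_train), scaler.transform(x_test)

log_reg = LogisticRegression(max_iter=1000).fit(x_train_s, y_train)
rf = RandomForestClassifier(random_state=42).fit(x_train, y_train)

print("Logistic Regression ROC AUC:",
      roc_auc_score(y_test, log_reg.predict_proba(x_test_s)[:, 1]))
print("Random Forest ROC AUC      :",
      roc_auc_score(y_test, rf.predict_proba(x_test)[:, 1]))
```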


Conclusion

We learned that certain areas of the city are more prone to severe collisions than others with the center of the city being the most at risk. Additionally, routes along major roads going north, south and east also appeared to have a relatively higher density of severe incidents.

Class imbalance made prediction of the severe class difficult. Since not many of the features explained the variance in severity classes well, the final model was not robust enough in terms of accuracy, ROC AUC, precision and recall scores. Hence, oversampling methods such as SMOTE were not effective in dealing with the imbalance problem.

Surprisingly, location-based features such as Weather, Road Condition, Light Condition and Neighborhood did not provide much useful information to differentiate between the two severity categories. The more important variables turned out to be the ones related to traffic flow, road dimensions, date and time.

An optimized LightGBM model provided an ROC AUC score of 0.66 for the severe class. However, the default threshold of 0.5 prevented the model from predicting any severe instances, so the threshold had to be lowered below 0.1 to obtain correctly predicted severe cases. This decreased the false negatives, but at the expense of more false positives and fewer true negatives.

A good threshold to use is 0.02, which yields 493 true positives, 474 false negatives, 17,697 false positives and 42,771 true negatives. At this threshold, the accuracy is reduced to 0.7042, with a corresponding misclassification error of 0.2957. Since we are primarily interested in catching severe collisions, an even lower threshold would also work, generating more true positives. Despite a higher false positive rate, many false alarms would be acceptable if they allowed us to improve the true positive rate.

The optimized parameters provided the least overfit model. This was observed when the train and test scores were plotted for different values of the maximum depth and number of leaves parameters, which are often used to regularize the model.

Since the data covered collisions only in the Seattle area and the majority of the features were location-based, the model’s results for predicting severe collisions cannot be generalized to the US as a whole. However, similar analytical techniques can be adopted for different cities and states across the country, as well as other countries, to create useful models and gain insights through exploratory data analysis (EDA). For the purposes of any statistical analysis, it can be assumed that random sampling was adopted during the retrieval of this data.

Resources