What exactly is an ETL process?

We see the acronym ETL thrown around a lot in the context of data science and analytics. This buzzword is generously sprayed across job descriptions for analytics roles and across online courses on platforms such as Coursera, Databricks, Datacamp etc.

And there is good reason for that. Almost everyone who has dealt with collecting, wrangling and storing data has been involved in the process at a small or large scale.

So what exactly is an ETL process?

ETL stands for extract, transform and load, which are crucial steps in any data processing task. It is the process of extracting data from various types of sources, wrangling/cleaning it and then storing it in a formatted manner within a data warehouse, database or dataframe for easy querying and analysis. Ideally, the process should also be automated.

There are a few subtle differences between a data warehouse and a database, the main one being that the data in a warehouse is ready-made for analytics while the data in a database primarily exists for read/write purposes. In that respect, a warehouse is more similar to a dataframe. Hence we will use the dataframe in this post as a way to describe the ETL process, with examples in Python.

At a small scale, the ETL process could resemble this:

  • Importing a CSV dataset using Python’s Pandas library
  • Cleaning/wrangling the data to handle missing values or formatting mislabeled instances
  • Loading the cleaned data back into a dataframe for efficient querying and analysis
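To make this concrete, here is a minimal sketch of that small-scale flow. The file name and column names (sales.csv, date, region, revenue) are hypothetical placeholders rather than data from any of the projects below:

```python
import pandas as pd

# Extract: read the raw CSV file (hypothetical file and columns)
raw = pd.read_csv("sales.csv")

# Transform: fix types, handle missing values, standardize labels
raw["date"] = pd.to_datetime(raw["date"], errors="coerce")
raw["revenue"] = raw["revenue"].fillna(0)
raw["region"] = raw["region"].str.strip().str.title()

# Load: keep the cleaned result in a dataframe ready for querying and analysis
clean_df = raw.dropna(subset=["date"]).reset_index(drop=True)
print(clean_df.head())
```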

On a large scale, the process could be part of an extensive data pipeline residing in a cloud infrastructure such as Amazon Web Services (AWS), Microsoft Azure or Google Cloud Platform (GCP). In such scenarios, the process could look something like this:

  • Extracting the data from various APIs or a software product (also called data ingestion) and into a data lake
  • Taking the raw data from the data lake and transforming it into a more usable format through various cleaning steps
  • Storing the cleanly formatted data into a data warehouse such as RedShift (AWS) or BigQuery (GCP)
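For the load step in such a pipeline, the cleaned data is typically written into a warehouse table. Below is a rough sketch using the google-cloud-bigquery client; the project, dataset and table names are placeholders, the dataframe stands in for the transformed data, and the code assumes GCP credentials are already configured in the environment:

```python
import pandas as pd
from google.cloud import bigquery  # requires GCP credentials and pyarrow installed

# Placeholder dataframe standing in for the transformed data from the pipeline
clean_df = pd.DataFrame({"indicator": ["unemployment_rate"], "value": [3.9]})

client = bigquery.Client()  # picks up the project and credentials from the environment
table_id = "my_project.my_dataset.indicators"  # hypothetical destination table

# Load: append the dataframe into the warehouse table and wait for the job to finish
job = client.load_table_from_dataframe(clean_df, table_id)
job.result()
```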

Regardless of the scale of the process, both examples described above have the potential to be automated.

Storing the formatted data inside a data warehouse or dataframe is a critical step in a data pipeline, as it makes the subsequent analytics or machine learning work significantly more convenient and efficient. Therefore, ETL is a process that is likely used, whether consciously or subconsciously, by most people dealing with data at any level.

Why is this concept important to understand?

To better understand the context in which ETL is used, it’s important to first understand where it lies within the overall data science workflow. It generally sits somewhere between the business question formulation and the modeling/analytics steps. Hence, it forms the crux of a data science problem and is personally my favorite aspect of a project since I really get to be hands-on with the data.

It’s important to distinguish between a data pipeline and a data science workflow. The former represents the actual flow of data, while the latter is a concept or methodology that is much larger in scope and encompasses the pipeline, which deals directly with the data itself.

The overall data science methodology can be summarized by the following key steps:

  1. Formulating the business problem – the business question
  2. Converting the business question to a data problem – the data question
  3. Deciding what kind of data is required to answer the question
  4. Figuring out the various sources of data required and the tools needed to extract the data
  5. Making sure that the collected data is representative of the problem being solved
  6. Cleaning/processing the data to transform it into a usable format
  7. Analyzing and visualizing the data
    • Exploratory data analysis (EDA)
    • Analytics or dashboards
  8. Feature engineering, pre-processing and modeling
  9. Automating and productionizing the model for usage in a practical sense
  10. Deriving meaningful insights from the model or analysis to answer the initial questions

The ETL portion of the above process is captured in points 4-6. So you can see that it forms a central part of the whole sequence and might need to be iterative to get the most meaning out of the data.

Three simple ETL examples using Python

Now that we have established at a high level what the process looks like, let’s delve into a few examples of what it actually involves. These examples are from projects that I have done and are all in Python.

Visualizing U.S. unemployment data using web-scraping

This first example is effective in portraying a quick and simple ETL process used to scrape, clean and present U.S. unemployment data. The overall project was to create a dashboard with several economic and financial indicators, with the Jupyter Notebook containing other similar ETL processes for the other indicators.

For now, we will focus on the unemployment data which was scraped from the Bureau of Labor Statistics website. A snippet of the code for the process is shown below.

The data was extracted in XML format using Python’s Beautiful Soup and Requests libraries, then parsed and stored in a dataframe. A snapshot of the raw XML data is shown below.

Since the date was not explicitly returned as part of the raw data, some date manipulation needed to be done to map the unemployment rates to the year and month. The final cleaned data was stored in a dataframe as shown below.
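To give a rough idea of what that extract-and-clean step could look like, here is a sketch; the feed URL and the XML tag names (item, year, period, value) are hypothetical placeholders rather than the exact ones used against the BLS site:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

# Extract: request the XML feed (hypothetical URL standing in for the BLS endpoint)
url = "https://www.bls.gov/feed/unemployment.xml"
response = requests.get(url, headers={"User-Agent": "etl-example"})

# Parse the XML with Beautiful Soup (requires the lxml parser; tag names are assumed)
soup = BeautifulSoup(response.content, "xml")
records = []
for item in soup.find_all("item"):
    records.append({
        "year": item.find("year").text,
        "period": item.find("period").text,  # e.g. "M01" for January
        "rate": float(item.find("value").text),
    })

# Transform: build a proper date column from the year and month codes
df = pd.DataFrame(records)
df["month"] = df["period"].str.lstrip("M")
df["date"] = pd.to_datetime(df["year"] + "-" + df["month"] + "-01")

# Load: keep the cleaned series in a dataframe for plotting
unemployment_df = df[["date", "rate"]].sort_values("date").reset_index(drop=True)
```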

Storing the formatted data in a dataframe (similar to a database) allowed for easy analysis and plotting to visualize the trend over time.

Storing neighborhood coordinate and venue information from a CSV file and APIs

Yes, that’s right. The data extraction, cleaning and storage process for this project was especially messy and even involved some manual steps. However, this is something one could expect in real-life scenarios where raw data from various sources needs to be collected and combined in a manner where the sources complement each other. And frankly, I enjoy this process immensely as I get to be more hands-on with the data.

To provide a brief overview of the project, I was trying to determine the best neighborhood in the city of Atlanta to open a retail store. All the project information can be found on Github. The data collection for this project involved the following key steps:

  1. Collect all the neighborhood names and their population
  2. Extract the coordinates for each neighborhood
  3. Extract venue information for each neighborhood using the coordinates

The neighborhood names and population were first copied from Wikipedia into a CSV file, which then allowed the data to be directly imported into a dataframe in Python. Once we had this information, steps 2 and 3 could be done exclusively in the Jupyter environment.

A snippet of the initial neighborhood information as part of a dataframe is shown below.

In order to extract the neighborhood coordinates, the Nominatim geocoder from the geopy library was used, which returns the latitude and longitude when the neighborhood name is provided as an input. The example below shows the process for one neighborhood – Whittier Mill Village Atlanta.
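A minimal sketch of that single lookup could look like this; the user_agent string and the exact query wording are assumptions:

```python
from geopy.geocoders import Nominatim

# Nominatim asks for a descriptive user_agent to identify the application
geolocator = Nominatim(user_agent="atlanta-neighborhood-etl")

# Geocode a single neighborhood by name
location = geolocator.geocode("Whittier Mill Village, Atlanta, GA")
print(location.latitude, location.longitude)
```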

This process was then scaled to extract the coordinates for all neighborhoods with the code shown below. Since the neighborhood names were so specific, there were instances where the appropriate coordinates were not returned for the inputs provided. An error handling mechanism was therefore included, which assigned dummy values of 998 and 999 for missing latitudes and longitudes respectively. This allowed me to surgically add the correct values in later.
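A sketch of what that scaled-up loop could look like is shown below. The neighborhoods_df dataframe and its column names are stand-ins for the dataframe built from the CSV import:

```python
import time
import pandas as pd
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderServiceError

# Stand-in for the dataframe imported from the CSV file
neighborhoods_df = pd.DataFrame({"Neighborhood": ["Whittier Mill Village", "Midtown"]})

geolocator = Nominatim(user_agent="atlanta-neighborhood-etl")
latitudes, longitudes = [], []

for name in neighborhoods_df["Neighborhood"]:
    try:
        location = geolocator.geocode(f"{name}, Atlanta, GA")
        if location is None:
            raise ValueError("no match returned")
        latitudes.append(location.latitude)
        longitudes.append(location.longitude)
    except (ValueError, GeocoderServiceError):
        # Dummy values flag the misses so they can be corrected manually later
        latitudes.append(998)
        longitudes.append(999)
    time.sleep(1)  # be polite to the free Nominatim service

neighborhoods_df["Latitude"] = latitudes
neighborhoods_df["Longitude"] = longitudes
```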

Unfortunately, the coordinates were not returned in 27 cases and another 4 neighborhoods came back with the wrong coordinates, which meant that 130 neighborhoods were still successfully geocoded. To save time and effort, the remaining 31 coordinates were manually added from a quick online search and used to replace the dummy values. Some more data manipulation had to be done to fit the missing/wrong coordinates into the main dataframe.

Our dataset was starting to take shape.

Step 3 involved extracting all venues from each neighborhood using the Foursquare API. Some sample code and the returned raw JSON data are shown below. As you can see, you would first need to create the client ID and client secret credentials to get started with the API, then use a specifically formatted URL to fetch the venue information, followed by some JSON parsing.
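A rough sketch of such a request is shown below. The credentials and coordinates are placeholders, and the response parsing follows Foursquare’s legacy v2 ‘explore’ endpoint, which may differ from the current version of the API:

```python
import requests

# Placeholder credentials -- substitute your own Foursquare keys
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20200101"  # API version date

lat, lng = 33.7490, -84.3880  # placeholder coordinates for one neighborhood
url = (
    "https://api.foursquare.com/v2/venues/explore"
    f"?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}"
    f"&ll={lat},{lng}&radius=500&limit=100"
)

# Fetch and parse the JSON response (structure assumed from the legacy v2 API)
results = requests.get(url).json()
items = results["response"]["groups"][0]["items"]
for item in items:
    venue = item["venue"]
    print(venue["name"], "-", venue["categories"][0]["name"])
```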

In the example above, the returned venue was called ‘Jamrock Jerk Center’, classified as a Caribbean Restaurant (the venue type). The most important piece of information we needed from this data was the venue type, which allowed us to compare neighborhoods and group them into clusters based on their most common venue types.

The rest of the extraction process was automated using the code below.
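That automation could resemble the sketch below, which wraps the single request in a helper function and applies it to every neighborhood. The credentials, the neighborhoods_df stand-in and its column names are again placeholders:

```python
import requests
import pandas as pd

# Placeholder credentials -- substitute your own Foursquare keys
CLIENT_ID = "YOUR_CLIENT_ID"
CLIENT_SECRET = "YOUR_CLIENT_SECRET"
VERSION = "20200101"

def get_nearby_venues(name, lat, lng, radius=500, limit=100):
    """Return (neighborhood, venue, venue type) tuples for one set of coordinates."""
    url = (
        "https://api.foursquare.com/v2/venues/explore"
        f"?client_id={CLIENT_ID}&client_secret={CLIENT_SECRET}&v={VERSION}"
        f"&ll={lat},{lng}&radius={radius}&limit={limit}"
    )
    items = requests.get(url).json()["response"]["groups"][0]["items"]
    return [
        (name, item["venue"]["name"], item["venue"]["categories"][0]["name"])
        for item in items
    ]

# Stand-in for the dataframe of neighborhood names and coordinates built earlier
neighborhoods_df = pd.DataFrame({
    "Neighborhood": ["Example Neighborhood"],
    "Latitude": [33.7490],
    "Longitude": [-84.3880],
})

rows = []
for _, row in neighborhoods_df.iterrows():
    rows.extend(get_nearby_venues(row["Neighborhood"], row["Latitude"], row["Longitude"]))

venues_df = pd.DataFrame(rows, columns=["Neighborhood", "Venue", "Venue Type"])
```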

At last, the dataset below is the final product.

Below are a few interesting insights and visualizations generated by analyzing the above dataset.

Analyzing world happiness data from a Google Sheet

Frankly, this analysis was done purely for fun as opposed to solving a specific problem. It’s also arguably the simplest example in this post. I hope to make it useful in the future, but it’s just for exploration and practice at the moment.

Originally, the data was available as a CSV file but I transferred it to a Google Sheet in order to prevent the need for uploading an external file. This was done to avoid mounting a virtual drive on Google Colab. The notebook is available here. In case you would like to use the CSV file, that section is still in there but has been commented out along with the virtual drive mounting.

Here is some documentation on working with the gspread Python library to interact with Google Sheets on Colab, as well as a Medium article with a step-by-step walkthrough.

This one’s going to be fairly quick, with the key step being connecting to the Google Sheet. The entire process is shown in the snippet below.
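On Colab, that connection step could look roughly like this; the sheet name is a placeholder, and the snippet assumes you authenticate with your own Google account when prompted:

```python
# Authenticate inside Google Colab and authorize gspread
from google.colab import auth
auth.authenticate_user()

import gspread
import pandas as pd
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)

# Open the sheet by title (placeholder name) and grab the first worksheet
worksheet = gc.open("world_happiness").sheet1

# Pull every row and convert to a dataframe, using the first row as the header
rows = worksheet.get_all_values()
happiness_df = pd.DataFrame(rows[1:], columns=rows[0])
```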

The resulting dataframe is shown below along with a visualized correlation matrix for the numerical variables.
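The correlation matrix itself only takes a few lines, assuming the happiness_df dataframe loaded above (gspread returns everything as strings, so the columns are coerced to numeric first):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Coerce columns to numeric and drop the ones that are entirely non-numeric
numeric_df = happiness_df.apply(pd.to_numeric, errors="coerce").dropna(axis=1, how="all")

# Plot the correlation matrix as a heatmap
sns.heatmap(numeric_df.corr(), annot=True, cmap="coolwarm")
plt.title("Correlation matrix of the numerical happiness variables")
plt.show()
```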

The above examples show some of the different aspects and challenges involved in the ETL process. They also portray the variety in processes for different types of projects. Despite having been done outside a complex cloud data infrastructure, they are a close approximation to the larger process.

I hope these examples helped shed some light on the ETL process and illustrated what it means at a conceptual level, so that it can be adapted to different scenarios.