My journey to Google Colab through various cloud platforms

Google Colab is one of the best and most convenient ways to run Jupyter Notebooks. However, it took me a while to stumble upon this platform and slightly longer to truly appreciate what it had to offer.

A lot of my early work with Python was done on a local computer using the Python interpreter and tools that come installed with the widely popular data science platform Anaconda.

At that point, I hadn’t explored Jupyter Notebooks and was wildly under-utilizing the Anaconda suite of tools and libraries. I was first exposed to IPython/Jupyter Notebooks while doing an online data science specialization on Coursera where they provided their own interactive notebooks.

This paved the way for me to work on Kaggle competitions. Kaggle also offers a markdown-based script option (similar to the R Markdown files I had previously used in RStudio), but I was more comfortable with notebooks and adopted them even more after completing a couple of competitions on Kaggle.

IBM Cloud – Watson Studio

My first experience with a cloud-based platform was during yet another online certificate, this time the IBM Data Science Professional Certificate. As the name would suggest, the course promoted the use of IBM’s fairly capable cloud platform and its machine learning branch, IBM Watson Studio.

I completed a couple of projects using Watson’s Jupyter Notebook service, one involving clustering and the other statistical inference, and the speed constraint was not apparent right away. The greatest limitation I encountered was that there was no free option that didn’t quickly hit the memory cap.

The notebook option for Watson measured usage in terms of “capacity units per hour” with higher RAM options costing more. The image below shows all the available kernels on this platform. I would generally select one of the regular Python 3.6 options but as you can see, options for R, Spark and Scala also exist. In some cases, a combination of languages could be used interchangeably such as Spark/Python or Spark/R but that’s a topic for a different time.

Notebooks in the Watson environment can be created and handled without much fuss, except that you may need to ‘unlock’ a notebook from time to time before you can execute it, which I found odd. Other than that, adding datasets and using them within the notebooks was pretty straightforward.

The actual notebook interface is fairly user-friendly and sophisticated in terms of looks and functionality. However, I have to say that I found both the functionality and the ease of use to be slightly better on the next few cloud platforms I came across.

Having said that, if you are only working with IPython Notebooks for data analysis and minimal machine learning tasks, Watson Studio is more than capable.

Microsoft – Azure Notebooks

For my project dealing with the prediction of severe collisions in Seattle, I picked Azure for three main reasons.

  1. I wanted to try a new cloud platform.
  2. Azure was and continues to be known as one of the best providers of cloud services along with Amazon Web Services.
  3. I wanted to use a service where dealing with datasets and notebooks was relatively effortless.

Based on the above criteria, I decided to explore Azure as it was a well-known player in the battle for cloud services and one that would keep adding useful features and support for years to come. The available kernels are shown below and, as we can see, there are far fewer options than on its IBM counterpart.

However, keep in mind that this is just the Microsoft Azure Notebooks service, which is not the same as Microsoft’s Azure ML product; the latter, I believe, provides the capability to run fully automated machine learning pipelines (including Jupyter Notebooks) and handle data at a larger scale.

Getting to the page where you can create new notebook instances or access existing ones is really simple on Azure, which probably has to do with the Azure Notebooks service being built specifically around notebooks. I bring this up because that was one of my main concerns while choosing a cloud platform (see criterion #3 above).

Once inside the ‘My Projects’ section, you can run your notebooks and also terminate their execution. This area has a GitHub-esque feel, where you can clone your directory and create a ‘readme’ markdown file for documentation.

When a notebook instance is opened, it takes you to the familiar Jupyter interface where you can easily handle notebooks, datasets and other files. It also provides the option of choosing the type of kernel that you want to run.

The actual notebook interface is also very similar to the one we are used to while using Anaconda on a local computer.

At this point, I preferred using Azure due to the simplicity and ease with which you could just log in to your account and start editing your notebook within a few clicks. That’s where the IBM platform fell short, in my opinion.

And you can run Azure Notebooks for FREE, albeit possibly at a lower speed. But again, if you are looking to do anything up to classical machine learning with reasonably sized datasets, it’s great. It was also faster than my local computer, which was my primary reason for switching to a cloud service in the first place.

However, we have two more platforms to talk about which means that I did in fact run into one of the limitations described above – speed. The task that I had at hand was to fetch traffic speed and road class data based on location coordinates and I had to do it for ~200,000 points. I used the TomTom API for this task.

So, I started off by handling some erroneous returns from the API and monitored the amount of time it would take for 1000 transactions before I kicked off a large process. I then extrapolated that value for 200,000 points and estimated the amount of time it would take. To my shock at the time, the process would have taken days! And at that point I knew that I would need to explore more cloud services.
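As a rough sketch of that sanity check (the fetch_fn argument, get_speed_data and the coords list below are placeholders, not my original code), timing a small batch and extrapolating looks something like this:

    import time

    def estimate_total_runtime(fetch_fn, sample_points, total_points):
        """Time a small batch of API calls and extrapolate to the full dataset."""
        start = time.perf_counter()
        for point in sample_points:
            fetch_fn(point)  # one API transaction per coordinate pair
        elapsed = time.perf_counter() - start

        per_call = elapsed / len(sample_points)
        estimated_hours = per_call * total_points / 3600
        print(f"~{per_call:.2f} s per call, ~{estimated_hours:.1f} hours for {total_points:,} points")
        return estimated_hours

    # Example: time the first 1,000 coordinates, then extrapolate to ~200,000
    # estimate_total_runtime(get_speed_data, coords[:1000], 200_000)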

Amazon Web Services (AWS) – SageMaker

One thing I have to mention here is that I actually did try AWS before Azure but what stopped me from proceeding further was the specific AWS service that I had initially selected.

A common service for running notebooks or other processes on the AWS platform is EC2, where you create instances and run your workloads, including notebooks, on virtual machines. The main advantage of using these virtual machines is the processing speed you can get.

However, I found the setup process challenging and was unable to successfully get anything going. At the time, my computational requirements were not considerable, allowing me to explore other options.

Fast forward a few weeks and I found myself looking back through the massive AWS catalog which is when I came across their machine learning environment called SageMaker. This is the equivalent of the IBM Watson Studio and the Azure ML platforms, providing a host of analytical tools and services.

Accessing notebooks through SageMaker is relatively simple, but there is a cost once you exceed a certain RAM threshold. More importantly, it served the purpose for which I had switched: the estimated time for the API transactions I was trying to execute was almost halved! There was no looking back after that.

Once again, the notebook environment was similar to Azure as both were built on the Jupyter framework. Therefore, handling notebooks as well as datasets was simple and almost as convenient as working on your local computer.

The functions I created for this safely extracted the speed and road class data I was interested in, falling back to empty values whenever the API misbehaved.
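A minimal sketch of that idea, assuming TomTom’s Flow Segment Data endpoint and its response fields (currentSpeed, freeFlowSpeed and frc) as documented at the time, might look like this:

    import requests

    FLOW_URL = "https://api.tomtom.com/traffic/services/4/flowSegmentData/absolute/10/json"

    def get_speed_and_road_class(lat, lon, api_key):
        """Query the TomTom Flow Segment Data API for one coordinate pair.

        Returns (current_speed, free_flow_speed, road_class), or None values
        if the request fails or the response is missing the expected fields.
        """
        try:
            resp = requests.get(
                FLOW_URL,
                params={"point": f"{lat},{lon}", "key": api_key},
                timeout=10,
            )
            resp.raise_for_status()
            data = resp.json().get("flowSegmentData", {})
            return data.get("currentSpeed"), data.get("freeFlowSpeed"), data.get("frc")
        except (requests.RequestException, ValueError):
            # Erroneous returns (timeouts, bad coordinates, malformed JSON) are
            # swallowed here so a single bad point doesn't kill a 200,000-row run.
            return None, None, None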

SageMaker served me well for a while but there were a couple of drawbacks which led me to explore more options.

  1. It’s not completely free (as far as I know)
  2. Computational power (again)

Once again, I was faced with the challenge of executing code without the kernel crashing. This time it was a Natural Language Processing (NLP) project in which I was trying to format and clean more than 400,000 documents’ worth of text data. I was able to get through most of the formatting but hit a roadblock while trying to tokenize the text into words for topic modeling.
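For context, the tokenization step itself is conceptually simple; a bare-bones version using gensim’s simple_preprocess (not my exact pipeline) would be something like:

    from gensim.utils import simple_preprocess

    def tokenize_documents(documents):
        """Lowercase, strip punctuation and split each document into word tokens."""
        return [simple_preprocess(doc, deacc=True) for doc in documents]

    # With ~400,000 documents, holding every token list in memory at once is
    # exactly the kind of step that can exhaust RAM on a small notebook instance.
    # tokenized = tokenize_documents(raw_documents)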

Google Cloud – Colaboratory (Colab)

There were some positive and negative aspects of Google Colab that were immediately apparent. For any new platform, the first thing I tend to look at is how easily I can import and export data. Once you have the data or the information, you can do anything with it using Python.

Colab provided the least user-friendly method for importing and re-using datasets. If you have a Google account (which most people do), a directory is automatically allocated to you for Colab purposes. However, by default, anything you save or load in this space disappears when your session ends.

To prevent this from happening, you need to mount your Google Drive as a virtual drive using your Google credentials, and the process has to be repeated each time you start a new Colab session. This may be helpful in terms of security or for keeping the workspace small, but it can be rather exasperating, especially when switching from something like AWS, Azure or even Anaconda.
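That said, the mounting step itself is only a couple of lines in a Colab cell:

    # Run inside a Colab notebook cell
    from google.colab import drive

    # Mount your Google Drive at /content/drive; at the time, Colab printed a URL,
    # asked you to sign in and paste back an authorization code before completing.
    drive.mount('/content/drive')

    # After mounting, files persist across sessions under 'My Drive', e.g.:
    # data_path = '/content/drive/My Drive/my_project/data.csv'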

In the mounting snippet shown above, you specify a mount point, in this case ‘/content/drive’, where the ‘content’ folder exists by default in the root directory. Once the code is run, you need to follow the URL, authenticate your Google profile and copy/paste the generated ‘authorization code’ into the provided space. The resulting directory hierarchy is shown below.

The notebook usability, on the other hand, is fantastic, not to mention the clean aesthetics. It allows you to have collapsible sections once you split the notebook into a numbered hierarchy using markdown headings. Additionally, the code cells have an auto-complete feature that fills in Python functions as you type and helps recall functions or attributes you may have forgotten.

Finally, the biggest reason to switch to Colab was the incredible speed which was available through the use of GPUs at no additional cost. Yes, it was FREE! This allowed me to successfully parse through all the text documents for my project and significantly reduce the total runtime for all other functions as well.
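The GPU is enabled through the Runtime > Change runtime type menu, and you can quickly confirm the notebook actually sees it; a simple check with PyTorch (pre-installed on Colab at the time of writing) looks like this:

    import torch

    # Verify that the free Colab GPU is visible to the runtime
    if torch.cuda.is_available():
        print("GPU available:", torch.cuda.get_device_name(0))
    else:
        print("No GPU detected - check Runtime > Change runtime type")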

Given the free GPU support, the inconvenience of having to mount a virtual drive was a minor one. It’s safe to say that I would probably be coming back to Colab for the foreseeable future.

Quantifying the comparisons

I have provided overall scores for each platform based on the features discussed above (see table below). Colab receives the highest overall score, mostly due to its speed and cost. The local computer came in second place with a score of 50, which surprised me at first. But it actually makes sense: if I am looking to do simple data analysis, my local machine is probably capable enough and is also the most convenient.

If I absolutely needed more speed at the same cost (free), I would simply switch to Colab and not bother with the other platforms.

Keep in mind that this scoring is based solely on my experience and my requirements at the time of using these services. If you work at a company whose entire data analysis framework is built on Azure or AWS, there’s a good chance you will be using services from those platforms. Furthermore, both Azure and AWS are widely regarded as leading cloud platforms for AI and analytics because of their extensive portfolios for automating and productionizing data processing, data analysis and machine learning models.

In my case, I was not working with data at that scale, and the scoring purely reflects that.

Let me know in the comments which platforms you have used to analyze data or build models, or if you have had similar experiences to mine.