Three overlooked but powerful Python functions to replace the ‘replace’ function for hacking dataframes

If you are a frequent Python user, you are probably aware of the powerful Pandas library, the bread and butter of data analysis in this language. Furthermore, you might also be aware of some of the key functions that make it easier to work with dataframes such as:

  • describe
  • isnull
  • merge
  • to_datetime
  • value_counts
  • str
  • dropna
  • groupby
  • sort_values
  • concat
  • pivot
  • plot
  • apply

The list goes on and on.

This is by no means an exhaustive list of dataframe-related functions available in Python. Also, note that some of these are Pandas functions (pd.concat, pd.to_datetime, etc.) while the rest are dataframe methods (df.dropna, df.pivot, etc.). However, they can all be used to slice and dice dataframes.

Another key dataframe function is df.replace, which we are going to cover in greater detail in this article. For many users, this function has been a cornerstone of both data cleaning and feature engineering in the machine learning workflow.

Let’s take an example where you have four entries labeled ‘Finland’ but one entry mislabeled as ‘Vinland’. This is where the replace function comes in handy as a quick data cleaning solution. This is what the syntax would look like:

df_world = df_world.replace('Vinland', 'Finland')

In the above example, every instance of ‘Vinland’ in the dataframe would be replaced with the string ‘Finland’. The code can be customized to just include a single column in the dataframe as well using the following code: 

df_world['Country or region'] = df_world['Country or region'].replace('Vinland', 'Finland')

The function also works well for replacing a list of strings with a single string (or with a corresponding list of replacements), as in the example below where 'Vinland', 'Denmark' and 'Norway' are all grouped together into a single category called 'Scandinavia'.
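As a sketch, with a small hypothetical dataframe standing in for the article's dataset, that list-to-single-string grouping could look like this:

```python
import pandas as pd

# Hypothetical sample data standing in for the article's dataset
df_world = pd.DataFrame(
    {"Country or region": ["Vinland", "Denmark", "Norway", "Iceland"]}
)

# Every value in the list is replaced with the single label 'Scandinavia'
df_world["Country or region"] = df_world["Country or region"].replace(
    ["Vinland", "Denmark", "Norway"], "Scandinavia"
)
```

Passing two lists of equal length (one of targets, one of replacements) works as well.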

Speaking of grouping, oftentimes during feature engineering there is a need to group continuous values into discrete categories to help the algorithm interpret that feature more effectively. In the example below, all scores above 7.5 are labeled as '1' and the rest labeled as '0'.

However, it took two lines of code to produce this solution, and a separate column called 'Score 1' had to be created. Things get even more complicated with the replace function when a condition is two-sided.
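The article's notebook isn't reproduced here, but one plausible two-line version of that grouping (using boolean indexing with .loc, since replace itself cannot test a numeric condition, and hypothetical scores) might look like:

```python
import pandas as pd

# Hypothetical scores standing in for the article's dataset
df_world = pd.DataFrame({"Score": [7.8, 7.6, 7.4, 7.2]})

# Two lines and a new 'Score 1' column: scores above 7.5 become 1, the rest 0
df_world.loc[df_world["Score"] > 7.5, "Score 1"] = 1
df_world.loc[df_world["Score"] <= 7.5, "Score 1"] = 0
```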

This is where three other powerful functions come into play that can prove to be better alternatives to the replace function.

np.where()

Note that this function is part of the Numpy library and not the Pandas library.

The greatest advantage of the np.where function is that it allows an if-else condition on a range of values in a single line of code. It is equally useful for categorical variables but it does not have a significant advantage over the replace function in this case.

A sample syntax is shown below:

df_world['Score 1'] = np.where(df_world['Score'] > 7.5, 1, 0)

In the example above, any values in the df_world['Score'] series that are greater than 7.5 are converted to '1' and all other values are set to '0'. The result is stored in a new series called 'Score 1'. Results for the above sample are displayed below, using the same dataset as shown earlier.

As we can see, the code performed the same function as the first example using replace but in fewer lines.
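A self-contained version of that one-liner, with a hypothetical 'Score' column standing in for the article's dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for the article's dataset
df_world = pd.DataFrame({"Score": [7.8, 7.6, 7.4, 7.2]})

# Single line: 1 where the condition holds, 0 everywhere else
df_world["Score 1"] = np.where(df_world["Score"] > 7.5, 1, 0)
```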

df.between()

This function serves as a complementary function to np.where.

A key advantage of the between function is that it can accept a range of continuous values bounded on both sides. In other words, it can accommodate both a greater-than and a less-than condition, while np.where can only accept one of the two conditions at a time.

Sample syntax:

df_world['Score 1'] = np.where(df_world['Score'].between(7.5, 7.6), 1, 0)

The example below implements the above code where it converts all ‘Score’ values between 7.5 and 7.6 to a ‘1’ and the rest to a ‘0’. Yes, it is a little bit more involved in terms of the logic and code but it’s more powerful.
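One detail worth noting: between is inclusive on both endpoints by default, so scores of exactly 7.5 or 7.6 also map to '1'. A runnable sketch with hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical scores standing in for the article's dataset
df_world = pd.DataFrame({"Score": [7.8, 7.6, 7.5, 7.2]})

# between(7.5, 7.6) is inclusive on both ends by default,
# so exactly 7.5 and 7.6 are also labeled 1
df_world["Score 1"] = np.where(df_world["Score"].between(7.5, 7.6), 1, 0)
```

The `inclusive` parameter of between can change this behavior if open bounds are needed.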

df.mask()

Finally, we get to the mask function, which is not as powerful as the between function but is still a good alternative to the replace function when it comes to substituting a range of values. Note that it does not allow an if-else type condition, which is its main limitation.

The syntax is as follows:

df_world['Score 1'] = df_world['Score'].mask(df_world['Score'] < 7.5, 0)

Shown below is an implementation of the above code with the mask function, which performs a similar task to the code shown as part of the np.where section.

As you can see, two separate lines of code (and conditions) had to be executed to derive the intended results.
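Those two mask calls could be sketched as follows, again with hypothetical data: mask replaces values where the condition is True, so covering both sides of the 7.5 threshold takes two separate conditions.

```python
import pandas as pd

# Hypothetical scores standing in for the article's dataset
df_world = pd.DataFrame({"Score": [7.8, 7.6, 7.4, 7.2]})

# mask replaces values WHERE the condition is True, so two
# separate conditions are needed to cover both sides of 7.5
df_world["Score 1"] = df_world["Score"].mask(df_world["Score"] < 7.5, 0)
df_world["Score 1"] = df_world["Score 1"].mask(df_world["Score"] >= 7.5, 1)
```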

Conclusion

To recap, replace is a great function for smaller scale substitution operations which do not require too many conditions. However, the where, between and mask functions provide greater flexibility when dealing with:

  • A long range of continuous values
  • Multiple conditions

A link to the code (used for the examples) and dataset is provided here. The Jupyter Notebook contains additional examples of how the aforementioned functions work, providing a comparison to the replace function.

That’s the end of this short-but-sweet article. Hope this is helpful! Let me know in the comments which of these functions you tend to use frequently and what some of the most complicated ‘replace’ tasks you have dealt with are. Also, feel free to share which other underrated Pandas functions you tend to use often. Cheers!