Dear Analyst #52: Extracting text from the middle of a cell using World Happiness data
Podcast: Play in new window | Download
Subscribe: Spotify | TuneIn | RSS
Sometimes you have a long list of data in a column and you want to pull out a specific string of characters in the cell and put the output in another column. I’ve done this countless times and was reminded of this task after watching this meetup on Power Query vs. Excel formulas from Danielle Stein Fairhurst (I talked at length about Power Query vs. formulas in the previous episode about custom data types). This episode dives into the nested formula and other solutions to extract text from the middle of a cell. You can copy the Google Sheets from this episode or watch the tutorial below.
World Happiness Report data
Before I get into the solution, wanted to talk a bit about the data set used in this episode: The World Happiness Report.
This report aims to measure the state of happiness around the world. The first report was published in 2012. Five years later, the report ranked 155 countries by their happiness levels and was released at an event celebrating the International Day of Happiness (March 20th) at the United Nations.
What I find interesting about the actual poll is that the “main life evaluation” question asked to respondents uses a mechanism called the Cantril Ladder:
Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time? (ladder-present) On which step do you think you will stand about five years from now? (ladder-future)
I think it’s a report with an audacious goal of quantifying and standardizing happiness levels around the world. The Sustainable Development Solutions Network has published the raw CSV data from 2015-2019 on Kaggle.
The Problem
I edited the data set from 2019 a little to demonstrate the problem we need to solve. In the screenshot below, you’ll notice in the Data
column the Overall Rank
, Country
, and Score
values are all combined into one value:
Our goal is to extract “Finland” from cell B2, “Denmark” from cell B3, etc. We want to put the country into an empty column, and then we can hide the Data
column or completely delete it assuming we copy and pasted the values in our final output column.
Detecting the pattern
You’ll notice the pattern with how the data is formatted. The country is contained within opening and closing parentheses. The rank of the country precedes the country, and the overall happiness score is at the end of the value preceded by a dash. We can combine the MID()
and FIND()
functions to get just the country in between the two parentheses.
Using MID() and FIND() to extract text from a cell
I created a new column next to column B called Extract Country
and put this formula in this column to extract the country:
=mid(B2,find("(",B2)+1,find(")",B2)-find("(",B2)-1)
To better why the formula works, I’ll break down the formula into numbered parts. Read below to see how each numbered parts works within the broader context of the formula:
In plain English, this formula looks for the first opening parentheses “(“, finds the number of characters to “pull” from the cell by subtracting the position of the opening parentheses “(” from the position of the closing parentheses “)”, and then gives you the result.
1. MID() function
The main function that makes this whole formula work is the MID()
function. Take a look at the MID()
function definition:
MID(string, starting_at, extract_length)
Anytime I think about extracting a string of characters of text from a cell, I always think about using the MID()
function first. All you need to provide to the MID()
function is the position of where you want to start “extracting,” and then the number of positions to return. For instance, let’s say you wanted to extract the word “Green” from the movie title The Green Mile:
There are 14 total “positions” in the value “The Green Mile.” You have to count any spaces in between the words too. The word “Green” starts at position 5, so we know this will be the first argument in our MID()
function.
How many “positions” does the word “Green” take up? The answer is 5 (from position 5 to 9). So the number 5 will be the second argument we need to make the MID()
function work. Assume “The Green Mile” is in cell A1 and your MID()
function is in cell B1. The formula would look like this to get the word “Green” into cell B1:
With all the countries in our World Happiness data, we don’t know the starting position of each country in the Data
column. We do know, however, that each country is contained within parentheses “()”, so we can use the FIND()
function as a “hack” to find the first position in the cell.
2. First FIND() function
The FIND()
function returns the position of a character in your value. The function takes two arguments (with an optional third one):
FIND(search_for, text_to_search, [starting_at])
For our purposes, we only use the first two arguments: search_for
and text_to_search
. For instance, if we wanted to find the position of the “r” in the word “Green,” we would just do a formula like this:
The result of the formula in B1 is the number 2. Why? The letter “r” is in the second position of the word “Green.” The “n” letter would be position 5.
Going back to our full formula, why do we write this FIND()
function to get the starting position for the country?
find("(",B2)+1
We are “finding” the first opening parentheses in cell B2, but we need to add a “1” to the result. If we don’t add the “1”, then we’ll incorrectly start extracting data starting with the opening parentheses instead of the first letter of the country. Let’s take cell B2 as an example. It currently contains the value “1 (Finland) – 7.769”. If we visualize each position like we did with “The Green Mile,” it looks like this:
Guess what happens if we just do the formula find("(",B2)
? It will output the number 3 since that’s where the first opening parentheses appears. We really want position number 4 since that’s where the “F” in “Finland” starts. By simply adding a “1” to this part of the formula, we guarantee we’ll always get the first letter of the country after the opening parentheses.
3. Second FIND() function to “find” the last letter of the country
The only thing left to do is tell the formula how many positions to output on our spreadsheet. The tricky part about this step is that the number of characters for each country varies. We can’t simply plug a number into the formula because the number of positions in the country “Finland” is different from the country “Costa Rica.”
For “Finland,” we would want to plug in the number 7 into the MID()
function but for “Costa Rica,” that number is 10.
How do we account for theses varying lengths?
The first step is to find the position of the closing parentheses:
This part of the formula looks like this:
find(")",B2)
It looks similar to step 2, but we’re looking for a closing parentheses “)” instead. This will give us the number 11 as shown above. Once we have this number, we can subtract the position of the first opening parentheses (which we already have!) to get the number of total positions to get back from the formula. This leads to the final step:
4. Last FIND() function
We know we are starting from position 4 in the “Finland” example, and we know the closing parentheses is position 11. What if we simply subtract the two numbers to get the total length of the country name? That’s exactly why we add the final FIND()
function below:
find(")",B2)-find("(",B2)-1
The first FIND()
function above is from step 3, and the second FIND()
function is the new thing we’re adding in this step. Why do we have that minus “1” at the end? Take a look at the positions in our “Finland” example here:
If we didn’t have the minus “1”, the result of the two FIND()
functions would be 8. This isn’t right because the country “Finland” only has 7 positions.
So if we take position 11 – position 3 – 1, we’ll get the number 7 which is the correct number of positions to return back for our MID()
function. The formula could’ve been changed to add a “1” on the second FIND()
function so it looks similar to the function in step 2, but then you have to put parentheses around that FIND()
function (which I think makes the formula look more complicated:
find(")",B2)-(find("(",B2)+1)
Checking your work
It’s worth scrolling through your results or doing some filtering on the new column of data you created to see if there are any errors. For instance, you’ll see on row 104 we have an error:
We didn’t account for countries that have other values in parentheses in them. In this case, you have to decide if you want to include “(Brazzaville”) with the country “Congo.” You could re-write the formula to account for one-off cases like this, but from scrolling through the data set it doesn’t look like we have any other countries that have something else in parentheses.
This is the step in the data cleaning process that requires you to make decisions around manually checking your spreadsheet versus spending the time writing a formula to account for these edge cases.
Luckily, we only have 150 or so rows in this spreadsheet so it’s easy enough to scroll through and see if we have other issues like “Congo.” Imagine if your spreadsheet is 150,000 rows. By quickly scrolling through the spreadsheet, you may miss some examples of errors. So it would make sense to re-write the formula to account for these edge cases. You might consider cleaning your data with Power Query or OpenRefine too to make your life a bit easier.
Other methods of getting the country
Like many data cleaning tasks in Excel or Google Sheets, there are multiple ways of doing things. The most obvious alternative solution is to do Text-to-Columns.
Text-to-Columns
I received this feedback in the past on using Text-to-Columns for simple scenarios like this. If the data set is small like the one we have here, then using Text-to-Columns means we don’t have to play around with this messy formula. Here’s how you would do the first step with Text-to-Columns:
From there, you can select column C and do Text-to-Columns again and use the closing parentheses “)” as the delimiter.
Power Query
This solution is only available in Excel, and requires you to leave the Excel interface and load up Power Query to clean your data. These scenarios where you need to extract data from a cell are actually what Power Query was made for. The argument for using Power Query gets larger when you are working with bigger data sets. I would watch the video linked at the top of this post to see how you would extract the text using Power Query.
Pick a solution depending on your data pipeline
For a simple scenario like this (~150 rows), Text-to-Columns is probably the fastest way to get the task done, but is it the most scalable?
There are a variety of factors you should consider before picking a solution mentioned in this episode:
- Is this one-off or recurring analysis?
- Will you get new data every day, week, month?
- Will you be collaborating with others on this spreadsheet? Will someone else inherit your spreadsheet?
- How big is the data set? Will it increase every day, week, month? And by how much?
- Is the data getting pulled from a database?
- Does your team have licenses for data cleaning tools? If not, will they be purchasing licenses in the future?
As I explained in the previous episode, I prefer “seeing” how my data is transformed right in my spreadsheet. This means I defer to using formulas assuming that the data set isn’t too large (sub 10,000 rows) and it’s a one-off analysis. This means that other analysts who inherit the spreadsheet can see how you transformed the data. If you did Text-to-Columns which results in hard-coded values, it will be more difficult for your colleague to figure out how you transformed the data (that you pulled from a database, perhaps) to the cleaned data set.
There are advantages with Power Query, of course. Aside from the ease of use, you can see the “steps” with your data transformation. This allows you to move forwards and backwards between the different stages of your data cleaning process. Your organization might already have data connections set up between your SQL database and Power Query, so it’s a no-brainer to do as much data cleaning in Power Query first before the data gets into Excel.
The big caveat is that you (and most likely your colleagues) will need to know how to use Power Query and your organization needs to be a Microsoft shop to use Excel and Power Query in the first place. I’m a big proponent of free and open-source tools because I believe they can solve 80% of the data cleaning and analysis you need to do on a daily basis. This is why I defer to Google Sheets (using Google products may not be “free” depending on your stances on data privacy :)) and OpenRefine for data transformation/cleaning.
There are many other solutions not mentioned in this post that might make sense for your use case and organization. To wrap up, think long term about how your organization will use, collaborate with, and act upon your data.
Other Podcasts & Blog Posts
In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
Trackbacks/Pingbacks
[…] in Excel or Google Sheets to get data to look exactly the way you like (see previous episode on extracting text from a middle of a cell), playing with these arrays of data should be pretty straightforward for […]
[…] episode 52, I talked about how to extract text from a cell. You might use this formula to extract a certain value from a cell when your database or CSV export […]