# Dear Analyst #52: Extracting text from the middle of a cell using World Happiness data

Sometimes you have a long list of data in a column and you want to pull out a specific string of characters in the cell and put the output in another column. I’ve done this countless times and was reminded of this task after watching this meetup on Power Query vs. Excel formulas from Danielle Stein Fairhurst (I talked at length about Power Query vs. formulas in the previous episode about custom data types). This episode dives into the nested formula and other solutions to extract text from the middle of a cell. You can copy the Google Sheets from this episode or watch the tutorial below.

## World Happiness Report data

Before I get into the solution, wanted to talk a bit about the data set used in this episode: The World Happiness Report.

This report aims to measure the state of happiness around the world. The first report was published in 2012. Five years later, the report ranked 155 countries by their happiness levels and was released at an event celebrating the International Day of Happiness (March 20th) at the United Nations.

What I find interesting about the actual poll is that the “main life evaluation” question asked to respondents uses a mechanism called the Cantril Ladder:

Please imagine a ladder with steps numbered from zero at the bottom to 10 at the top. The top of the ladder represents the best possible life for you and the bottom of the ladder represents the worst possible life for you. On which step of the ladder would you say you personally feel you stand at this time? (ladder-present) On which step do you think you will stand about five years from now? (ladder-future)

I think it’s a report with an audacious goal of quantifying and standardizing happiness levels around the world. The Sustainable Development Solutions Network has published the raw CSV data from 2015-2019 on Kaggle.

## The Problem

I edited the data set from 2019 a little to demonstrate the problem we need to solve. In the screenshot below, you’ll notice in the `Data` column the `Overall Rank`, `Country`, and `Score` values are all combined into one value:

Our goal is to extract “Finland” from cell B2, “Denmark” from cell B3, etc. We want to put the country into an empty column, and then we can hide the `Data` column or completely delete it assuming we copy and pasted the values in our final output column.

### Detecting the pattern

You’ll notice the pattern with how the data is formatted. The country is contained within opening and closing parentheses. The rank of the country precedes the country, and the overall happiness score is at the end of the value preceded by a dash. We can combine the `MID()` and `FIND()` functions to get just the country in between the two parentheses.

## Using MID() and FIND() to extract text from a cell

I created a new column next to column B called `Extract Country` and put this formula in this column to extract the country:

`=mid(B2,find("(",B2)+1,find(")",B2)-find("(",B2)-1)`

To better why the formula works, I’ll break down the formula into numbered parts. Read below to see how each numbered parts works within the broader context of the formula:

In plain English, this formula looks for the first opening parentheses “(“, finds the number of characters to “pull” from the cell by subtracting the position of the opening parentheses “(” from the position of the closing parentheses “)”, and then gives you the result.

## 1. MID() function

The main function that makes this whole formula work is the `MID()` function. Take a look at the `MID()` function definition:

`MID(string, starting_at, extract_length)`

Anytime I think about extracting a string of characters of text from a cell, I always think about using the `MID()` function first. All you need to provide to the `MID()` function is the position of where you want to start “extracting,” and then the number of positions to return. For instance, let’s say you wanted to extract the word “Green” from the movie title The Green Mile:

There are 14 total “positions” in the value “The Green Mile.” You have to count any spaces in between the words too. The word “Green” starts at position 5, so we know this will be the first argument in our `MID()` function.

How many “positions” does the word “Green” take up? The answer is 5 (from position 5 to 9). So the number 5 will be the second argument we need to make the `MID()` function work. Assume “The Green Mile” is in cell A1 and your `MID()` function is in cell B1. The formula would look like this to get the word “Green” into cell B1:

With all the countries in our World Happiness data, we don’t know the starting position of each country in the `Data` column. We do know, however, that each country is contained within parentheses “()”, so we can use the `FIND()` function as a “hack” to find the first position in the cell.

## 2. First FIND() function

The `FIND()` function returns the position of a character in your value. The function takes two arguments (with an optional third one):

`FIND(search_for, text_to_search, [starting_at])`

For our purposes, we only use the first two arguments: `search_for` and `text_to_search`. For instance, if we wanted to find the position of the “r” in the word “Green,” we would just do a formula like this:

The result of the formula in B1 is the number 2. Why? The letter “r” is in the second position of the word “Green.” The “n” letter would be position 5.

Going back to our full formula, why do we write this `FIND()` function to get the starting position for the country?

`find("(",B2)+1`

We are “finding” the first opening parentheses in cell B2, but we need to add a “1” to the result. If we don’t add the “1”, then we’ll incorrectly start extracting data starting with the opening parentheses instead of the first letter of the country. Let’s take cell B2 as an example. It currently contains the value “1 (Finland) – 7.769”. If we visualize each position like we did with “The Green Mile,” it looks like this:

Guess what happens if we just do the formula `find("(",B2)`? It will output the number 3 since that’s where the first opening parentheses appears. We really want position number 4 since that’s where the “F” in “Finland” starts. By simply adding a “1” to this part of the formula, we guarantee we’ll always get the first letter of the country after the opening parentheses.

## 3. Second FIND() function to “find” the last letter of the country

The only thing left to do is tell the formula how many positions to output on our spreadsheet. The tricky part about this step is that the number of characters for each country varies. We can’t simply plug a number into the formula because the number of positions in the country “Finland” is different from the country “Costa Rica.”

For “Finland,” we would want to plug in the number 7 into the `MID()` function but for “Costa Rica,” that number is 10.

How do we account for theses varying lengths?

The first step is to find the position of the closing parentheses:

This part of the formula looks like this:

`find(")",B2)`

It looks similar to step 2, but we’re looking for a closing parentheses “)” instead. This will give us the number 11 as shown above. Once we have this number, we can subtract the position of the first opening parentheses (which we already have!) to get the number of total positions to get back from the formula. This leads to the final step:

## 4. Last FIND() function

We know we are starting from position 4 in the “Finland” example, and we know the closing parentheses is position 11. What if we simply subtract the two numbers to get the total length of the country name? That’s exactly why we add the final `FIND()` function below:

`find(")",B2)-find("(",B2)-1`

The first `FIND()` function above is from step 3, and the second `FIND()` function is the new thing we’re adding in this step. Why do we have that minus “1” at the end? Take a look at the positions in our “Finland” example here:

If we didn’t have the minus “1”, the result of the two `FIND()` functions would be 8. This isn’t right because the country “Finland” only has 7 positions.

So if we take position 11position 31, we’ll get the number 7 which is the correct number of positions to return back for our `MID()` function. The formula could’ve been changed to add a “1” on the second `FIND()` function so it looks similar to the function in step 2, but then you have to put parentheses around that `FIND()` function (which I think makes the formula look more complicated:

` find(")",B2)-(find("(",B2)+1)`

It’s worth scrolling through your results or doing some filtering on the new column of data you created to see if there are any errors. For instance, you’ll see on row 104 we have an error:

We didn’t account for countries that have other values in parentheses in them. In this case, you have to decide if you want to include “(Brazzaville”) with the country “Congo.” You could re-write the formula to account for one-off cases like this, but from scrolling through the data set it doesn’t look like we have any other countries that have something else in parentheses.

This is the step in the data cleaning process that requires you to make decisions around manually checking your spreadsheet versus spending the time writing a formula to account for these edge cases.

Luckily, we only have 150 or so rows in this spreadsheet so it’s easy enough to scroll through and see if we have other issues like “Congo.” Imagine if your spreadsheet is 150,000 rows. By quickly scrolling through the spreadsheet, you may miss some examples of errors. So it would make sense to re-write the formula to account for these edge cases. You might consider cleaning your data with Power Query or OpenRefine too to make your life a bit easier.

## Other methods of getting the country

Like many data cleaning tasks in Excel or Google Sheets, there are multiple ways of doing things. The most obvious alternative solution is to do Text-to-Columns.

### Text-to-Columns

I received this feedback in the past on using Text-to-Columns for simple scenarios like this. If the data set is small like the one we have here, then using Text-to-Columns means we don’t have to play around with this messy formula. Here’s how you would do the first step with Text-to-Columns:

From there, you can select column C and do Text-to-Columns again and use the closing parentheses “)” as the delimiter.

### Power Query

This solution is only available in Excel, and requires you to leave the Excel interface and load up Power Query to clean your data. These scenarios where you need to extract data from a cell are actually what Power Query was made for. The argument for using Power Query gets larger when you are working with bigger data sets. I would watch the video linked at the top of this post to see how you would extract the text using Power Query.

## Pick a solution depending on your data pipeline

For a simple scenario like this (~150 rows), Text-to-Columns is probably the fastest way to get the task done, but is it the most scalable?

There are a variety of factors you should consider before picking a solution mentioned in this episode:

• Is this one-off or recurring analysis?
• Will you get new data every day, week, month?
• How big is the data set? Will it increase every day, week, month? And by how much?
• Is the data getting pulled from a database?