Dear Analyst #37: Text manipulation functions to extract domain names from email addresses
Podcast: Play in new window | Download
Subscribe: Spotify | TuneIn | RSS
In Excel or Google Sheets, text manipulation is usually associated with data cleaning, data cleansing, and data transformation. Sometimes your data is “dirty” and needs to be categorized in a different way or you need to “extract” a piece of text from a another piece of text. In this example, we use a combination of the FIND
, RIGHT
, and LEN
functions to extract the domain name from an email address (e.g. the “tesla.com” from “john.smith@tesla.com”). Here’s the Google Sheet if you want to make a copy for yourself to follow along.
Start with finding the @
The first step is to use the FIND
function to find the location of the “@” symbol in the email address. The FIND
function takes two required arguments and one optional argument. You’re basically find the index location of where that characters or string exists within the cell:
In the case of “john.smith@amazon.com,” the FIND
function would return 11 since the “@” symbol starts at the 11th position within the email address. Pretty simple right?
Nesting LEN inside the RIGHT function
The next part is a little trickier. Now that we know the position of the “@” symbol, we want all the characters after the “@” symbol to get the domain of the email address. There are multiple ways of doing this, but I chose to use the RIGHT
and LEN
functions. To make this more clear, I could have put the LEN
function in its own column, but decided to next it within the RIGHT
function:
The RIGHT
function takes two arguments and simply returns the number of characters from the “right” of the text you give it (in this case the email address). Since we don’t know how many characters to pull from each e-mail address, we use the result of the LEN(A2) - B2
formula which tells us how many characters to pull from the right of the email address.
LEN(A2)
gives us the length of the entire text (for “john.smith@amazon.com” it’s 21). If we subtract the index position of the “@” symbol from that length, we’ll get the exact number of characters to pull for each unique email address. Pretty nifty.
Note: The “Position of @” column also could’ve been nested in the 3rd column (and replaced the current cell reference of B2
).
I typically use a combination of FIND
, LEN
, and MID
to extract the text I need from a longer piece of text. Once you master these few functions, you’ll be able to to pull anything you want out of a long piece of text to get “clean” data.
Other Podcasts & Blog Posts
In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
- The Tim Ferriss Show #444: Hugh Jackman on Best Decisions, Daily Routines, The 85% Rule, Favorite Exercises, Mind Training, and Much More
- EconTalk: Robert Lerman on Apprenticeships
No comments yet.