Dear Analyst #74: Quick hack to count the number of words in a cell with LEN and SUBSTITUTE

While this little Excel/Google Sheets trick is a pretty straightforward hack, it led me to think about how we use our tools, how we stretch the capabilities of our tools, and think outside the box to come to arrive at a solution. Seems like a lot for a formula trick on counting the number of words in a cell to bring to the table. Perhaps I’m looking into it too much. Perhaps it’s just me rambling and opining on something meaningless. Or perhaps, it might cause you to stop, think, and reflect for even a minute about something as trivial as counting words in a cell. Come on this journey with me and learn how a stupid formula trick triggered my synapses to fire in a million directions. Link to the Google Sheet with the formula is here.

Video tutorial of the formula to count the number of words in a cell:

Why would you want to count words in a cell?

It’s a great question. Maybe you need to see how many words are in a paragraph before you submit some online form that only allows you to submit an answer with 150 characters or 50 words or less. I’ve probably had to count the number of words in a cell a handful of times and it was probably for cleaning data purposes (more on this later). What may be more common is counting the number of characters in a cell to detect anomalies. In any event, the formula you use for counting the number of words in a cell is similar to how you might count the number of characters in a cell. If you want to skip straight to the answer (or doing a search on Google and maybe this will be highlighted in yellow):

=len(A2)-len(substitute(A2," ",""))+1

This formula should work in Excel and Google Sheets and A2 contains the cell with the words you are trying to count. Let’s break this down a bit more, because when you break things down that’s where the real learning takes place and it may spur other ideas you can incorporate into your spreadsheets.

Counting and substituting stuff

Composability gives formulas some pretty amazing capabilities. Greater than the sum of its parts kind of thing. One their own, the LEN() and SUBSTITUTE() functions do pretty standard things. The LEN function simply counts the number of characters (including spaces) in a cell:

SUBSTITUTE acts as you might expect. Turn all the “A”s in a cell into “X”s. Turn all the 5s into 9s. The first argument is the cell that contains the data, the second argument is what you are searching for in the value to replace, and the third argument is what everything in the second argument should be replaced with. In the example below, We are looking for all the spaces in cell A5 and replacing them with an empty string. Note the syntax here. A space (what we are looking for) is denoted by two double quotes with a space in between: ” “. An empty string is two double quotes next to each other with no space in between them: “”. The result, in this case, are sentences with no spaces in between them:

Composability is where the magic happens

What happens when you combine the two formulas together? You may be an Excel or Google Sheets guru and have built advanced nested formulas before. When I step back and see how these formulas–when combined–create interesting results that you wouldn’t have expected. People say we’re just number crunchers and just know when and how to use formulas correctly.

I’d say the composability of formulas is what inspires creativity and makes building a model a creative endeavor. When you deconstruct someone else’s deeply nested formula, you learn something new about the formulas and a use case for the formulas you wouldn’t have otherwise figured out on your own (unless you’re Googling stuff, of course). Interestingly, the top result on Google right now for “composability” is an article on DeFi (decentralized finance). This is ahead of links to the dictionary definition, Wikipedia, and HP Enterprise. I only bring this up because I’m a believer and user of various DeFi platforms and I dig into this stuff in the 2nd half of episode 72.

Ok back to the magic. We have this random text with no spaces in it. Let’s put this formula inside the LEN function and see what happens. We get the length of the sentence or paragraph with no spaces in it:

Why is this important? Well, if we have the length of the sentence with spaces, and the length of the sentence without spaces, we can subtract the latter from the former and get pretty close to the actual count of words in the sentence:

If you look at the text in cell A2, there are 11 words in “It’s easy to forget that as recently as six days ago.” There are 10 spaces in between those words. That’s the result of the formula in column D. It’s one less than what we need, so we just add a 1 to the end of the formula and we’re done:

Are we counting spaces or counting words?

The answer is: does it really matter? Just because the goal is to count words, does that mean mean you can’t count spaces to get to the solution?

During WWII, the Germans had a bomber plane called the Dornier 17. Most of these bomber planes have been destroyed during the war, but divers found one at the bottom of the North Sea off the UK coast in 2008. A team from Imperial College London was tasked to salvage the plane from further corrosion so that the plane could be showcased at a museum. The team tried all sorts of solutions and methods, but found out that spraying the plane with lemon water would not only prevent the corrosion, but would also clean the plane as well. This was a long-winded example to show that there are many ways to get to a solution :).

Within the first 5 seconds of being told you need to count the number of words in a cell, did you think about what Excel or Google Sheets function would actually count the words in a sentence? Maybe you thought of a creative use of the COUNTA() function or maybe there’s some sort of loop you can push all the words from the cell into and set some sort of counter every time the loop found a word.

Like many other hacks in Google Sheets and Excel, the solution involves detecting a pattern in the data and then figuring out which functions allow you to manipulate that pattern for your use case. The first step is just thinking that counting spaces may be a more reliable solution than counting the actual words. The next step is knowing that you can count spaces and characters with functions we have at our disposal in Google Sheets.

The best tool for the JTBD (job to be done)

Going back to the original question: why would you need to count the number of words in a cell? Let’s assume you have a business reason for counting the words in the cell because you need to take that number and use it somewhere else in your spreadsheet. From a tool perspective, I’d argue that Google Sheets is not the right tool for counting the number of words in your sentence or paragraph.

The formula itself seems a bit inefficient too right? You have to find the length of a paragraph by its characters, do this whole substitution thing, and then count that long string of characters again. Isn’t this what tools like Word and various online word counters are built to do? Even as I’m typing this post in WordPress, I can see the number of words as new words are typed.

Even if Google Sheets and Excel are not the right tools for counting the words in your paragraph(s), I think it shows how much we are willing to stretch our tools to find a solution to our problem. For diehard spreadsheet users, coming up with a formula for unconventional use cases is what makes using spreadsheets “fun” and creative like I stated earlier. There’s joy in knowing you’re exploring the frontiers of your tools and have discovered new treasures and lands that others may not have found.

Other word count algorithms

If Microsoft Word or online word counters are indeed the “right” tools for counting the words in your paragraph, how do they count their words? Maybe they also count spaces? They must use some advanced algorithm that is 10X better than this LEN and SUBSTITUTE solution.

The answer is I don’t really know. After some quick research, it looks like the algorithm in Microsoft Word also counts spaces, so perhaps its algorithm is not too far from our formula! Word counting is actually a pretty challenging problem because handling special characters and other edge cases can throw off the count. Microsoft even shows inconsistent word counts depending on what parts of the UI in Windows and Word you are using.

Some people have built their own algorithms (like this one in C#) that mimic the word count algorithm in Word. This page actually walks through the algorithm for counting words step-by-step. I liked this pseudo code the author wrote in Java to describe the algorithm:

public class Word count {
  static char[] separators = {' ', '.', ','};
  static boolean state = true;
  static int countWords(String str) {
  boolean state = true;
  int word count = 0;

  for (int i = 0; i < str.length(); i++) {
    if (SeparatorArrayContains(str.charAt(i)) || str.charAt(i) == '\n' || str.charAt(i) == '\t') {
      state = true;
    }
    else if (state == true) {
      state = false;
      word count ++;
    }
  }
  return word count ;
}

static boolean SeparatorArrayContains(char c) {
  boolean found = false;
  for (int k = 0; k < separators.length; k++) {
    if (separators[k] == c) {
      found = true;
    }
  }
  return found;
}

At a high-level, the algorithm is also looking at the separators and delimiters in your text to count the number of words. This means this solution to count the number of spaces to get the number of words is most likely similar to how other applications count words too! A simple one line nested formula with LEN and SUBSTITUTE can have the same power as this pseudo Java code is pretty darn cool.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting: