Dear Analyst #32: How to use the QUERY function in Google Sheets on COVID-19 data

The QUERY() function in Google Sheets gives you the ability to quickly filter and sort your data similar to how you might get data from a database. If you write SQL queries, the QUERY() function feels easy and natural to use. There are a few caveats as I discuss in this episode. If you want to follow along with the exercises I discuss in this episode, make a copy of this Google Sheet which contains the QUERY() functions I mention in the episode.

Basic query to find confirmed cases greater than 50,000

Our data set is from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. The data shows confirmed cases, deaths, and recovered cases by country (188 countries) on May 1st:

The first query simply pulls back the list of countries and confirmed cases where the number of confirmed cases is greater than 50,000. Notice how you reference the column letter name versus the actual name of the column in the header row:

The first parameter is covid_data which is a named range in Google Sheets. In this case, it references cells A1:E188 in our data set.

More SQL-like commands

You can do many database-like commands with the QUERY() function. The next example shows how you can use the ORDER BY command to find countries with deaths between 0 and 5 and the resulting list is sorted in descending order:

Check out Ben Collins’ blog post about the QUERY() function to see some of the other SQL commands you can use.

Adding in new calculated columns

In the third query, we get a little more advanced and use the LABEL command to create a new “column” called Case Fatality Rate. This calculation is simply Confirmed / Deaths. Unlike SQL, you put the LABEL at the end of the command instead of in the beginning of the SELECT statement:

Coming from SQL, you’ll need to account for the difference in the order of commands in the query in order for it to work correctly.

Inability to select column names

You’ll notice that you don’t put the actual names of the columns in your header row in the query. This can be a pro or con of the QUERY() function depending on how your underlying data set is structured.

Columns are changing a lot

If you underlying data is constantly “shuffling” where columns are moving around and the structure of the data is not set in stone, the QUERY() function will most likely break because you’re referencing the column letter instead of the column name like in a traditional SQL query.

Columns are fixed

If your columns are not shuffling around a lot, this syntax of selecting the column letter may actually be easier for you. This is because you don’t have to type out the long column name in the QUERY() function. If data is simply getting appended to the bottom of your data set, then the QUERY() function should work fine for you because the letters of the columns will always reference the correct columns of data.

PivotTables vs. the QUERY() function

One of the reasons I don’t use the QUERY() function too often is because I find PivotTables to be easy enough to use to filter, sort, and aggregate my data to do my analysis. Additionally, your columns can move around in the underlying data set and the PivotTable will still work since it’s not referencing columns by letter but rather by the name in your header row.

Plotting trend lines for COVID-19

One of the articles I discuss in this episode is this Vox article about how the Council of Economic Advisers may have applied a stock trendline in Excel to “forecast” deaths as a result of COVID-19. The article discusses the concept of “smoothing out” volatile data versus prescribing a forecast, and that line between these two concepts is a bit blurry. This is the cubic chart in Excel which you can easily build from the trendline features in Excel:

And then this is the chart from a CEA Tweet that appears to show the cubic trendline as a potential forecast:

To better visualize observed data, we also continually update a curve-fitting exercise to summarize COVID-19's observed trajectory. Particularly with irregular data, curve fitting can improve data visualization. As shown, IHME's mortality curves have matched the data fairly well. pic.twitter.com/NtJcOdA98R
— CEA45 Archived (@WhiteHouseCEA45) May 5, 2020

SUM by David Eagleman

A book I discuss at the end of this episode is SUM: Tales from the Afterlives by David Eagleman. I read a chapter from the book called Incentive and how it relates to some recent shows I’ve been watching like Westworld and Devs. Highly recommend checking out the book.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

The Trump administration’s “cubic model” of coronavirus deaths, explained by Matthew Yglesias
Jocko Podcast #222: Life is a Challenge. Life is Suffering. So Live With Fortitude. With Dan Crenshaw
SUM: Forty tales from the afterlives by David Eagleman

Trackbacks/Pingbacks

Dear Analyst #69: Import data from another Google Sheet and filter the results to show just what you need • - May 10, 2021
[…] QUERY() function is a relatively advanced function in Google Sheets. Episode 32 was all about how to use the QUERY() function. The reason why it’s not used as much is because it requires you to know a little bit of SQL. […]