The post Dear Analyst #43: Setting up workflows that scale – from spreadsheets to tools & applications appeared first on .

**Summary of presentation**

- The skills you’ve learned in Excel/Google Sheets — including data structuring — translate to building workflows for any part of your business
- Thinking beyond spreadsheets as a way to do data analysis or “number crunching”
- Any tool that helps automate or solve some workflow at your company can be built with spreadsheets
- Why learning spreadsheets can set you up well for learning “no-code” tools

**Spreadsheet examples from presentation**

During the presentation, I showed actual spreadsheets (Excel and Google Sheets) I’ve built in the past for freelance clients and friends. The main concept I’m trying to convey is that each of these spreadsheets looks and feels more like an *application* rather than a *model* that forecasts out certain values. Each of these examples consists of three core elements:

- **Database** – A place to store information
- **User Input** – Fields and forms for someone to fill out
- **Calculations/Display** – Formulas (e.g. “business logic”) to make the spreadsheet output something for you (the administrator) or the user

*My 2 cents*: When you’re building an application in a spreadsheet, you’re extending the original purpose and audience Excel and Google Sheets were meant to serve: financial models for accountants. *But this is what makes the spreadsheet so versatile.* The fact that an analyst can string together formulas to make a spreadsheet look and feel like an application is what gives the spreadsheet power. This innovation also pushes Microsoft, Google, and other platforms to release new features that give analysts the ability to build tools, not just models.

I’ve written extensively about this subject in the past, so I’ll leave my soliloquy at that. On to the examples.

**Bachelorette planning Google Sheet**

The first example I discuss is this bachelorette party planning Google Sheet I built for a friend. This spreadsheet has been duplicated quite a few times by friends of friends, and all it does is help a bride-to-be figure out which weekend works best to have a bachelorette party.

The key insight is that the *database* is everything from column B onwards and row 3 and below. All the availability for each person is captured in each of these cells and there’s some conditional formatting to give the bride a visual indicator to see when a weekend is available.

The *user input* is the ability for each friend the Google Sheet is shared with to edit the cells. “Yes,” “No,” and “Maybe” are the only inputs that matter for this Google Sheet. Finally, the *calculations* are in rows 31-33, which tally up the user inputs for each weekend so the bride can see which weekend is the “most free” for her friends.
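As a sketch of what those tally rows could look like (the exact ranges here are assumptions, not copied from the shared Sheet), a `COUNTIF` per weekend column does the counting:

```
Row 31 (“Yes” count):   =COUNTIF(B3:B30, "Yes")
Row 32 (“Maybe” count): =COUNTIF(B3:B30, "Maybe")
Row 33 (“No” count):    =COUNTIF(B3:B30, "No")
```

Filling these across the weekend columns gives the bride a one-row-per-answer summary she can scan to find the freest weekend.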

There are countless iPhone and Android apps you can download to do this exact same thing, but this spreadsheet just does one thing and one thing well: help brides figure out which weekend to plan a bachelorette party.

This splitting costs with friends blog post is by far the most popular post on my blog since I published it in 2014 (thanks Google search!). Every day I still get requests to give people edit access to the Google Sheet (please just make a copy of it instead of requesting edit access). Here’s the Google Sheet if you want to make a copy for yourself.

Similar to the previous example, the *database* is all the items, costs, and who participated in the cost from rows 2 and down. The *user input* is the cells themselves, but the most important part of the Google Sheet is the 1s and 0s from column C onward. Those 1s and 0s represent whether a friend or family member “participated” in the cost. This allows the spreadsheet to do some basic calculations to figure out who owes what.

Rows 26-28 are the *calculations* that the trip organizer can see at a glance to see who is owed or who owes money. Again, there are numerous apps and custom tools you can pay for or download to split costs with friends, and this Google Sheet mimics the features of those apps in a more bare-bones way.
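A sketch of how those calculations could work (the column letters and row ranges here are assumptions about the layout, not the exact shared Sheet): if column B holds each item’s cost and columns C onward hold the 1s and 0s per person, each person’s share of an item is the cost divided by the number of participants, and a `SUMPRODUCT` totals that per person:

```
Column K (participants per item):          =SUM(C2:J2)
Row 26 (what the person in column C owes): =SUMPRODUCT(C2:C24, $B$2:$B$24 / $K$2:$K$24)
```

The participation flags double as weights, so a person with a 0 contributes nothing for that item and a person with a 1 picks up an equal share.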

This example shows a spreadsheet truly extended beyond what it was intended to do. This was for one of my consulting clients who needed a new CRM system for managing new patients at their clinic.

The Excel file basically lets the operations manager at the clinic quickly “move” new patients from one spreadsheet to another using a VBA macro. To mimic the look and feel of an application, I drew these blue and green buttons using the shape feature in Excel and tied a macro to each button. The *database *consists of patient details, the *user input* is simply each row of data, and the *calculations* involve these macros that move data from one spreadsheet to another.

This gets into an important concept that an Excel file or Google Sheet is not that great for: *workflows*. Since everything is usually calculated in real-time in a spreadsheet, it can be difficult to do an *if-this-then-that* type of workflow without using a macro or script (see my last post on automating a tedious filling-values-down task).

The rest of the presentation includes tools and tips for building applications with other no-code tools. Slides are below:

Original talk from Webflow’s No-Code Conference in 2019:

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

*No other podcasts for this episode given how long this episode is!*


The post Dear Analyst #42: Filling values down into empty cells programmatically with Google Apps Script & VBA tutorial appeared first on .

*See the video below if you want to jump straight to the tutorial:*

https://www.youtube.com/watch?v=t-32QkyjKVE&feature=youtu.be

**Why is this data structure a problem?**

You’ve inherited a spreadsheet and the data structure looks like this:

It’s a list of data but there are empty cells in column A. This is usually a *category* or *dimension* in your data set that needs to be “filled down” so that the data set is complete. In the Google Sheet, each row represents one person that is associated with a given SPAC, but the `SPAC Ticker` column is incomplete. You’ll usually get this type of data structure through the following:

- Data was manually created by someone who didn’t fill down the values in column A since they thought it was a “category”
- You are working with a data set that originally came from a PivotTable but you only have the “values” from the PivotTable, not the PivotTable itself

This data structure is a problem because if you want to do any type of analysis on this data, it will be extremely difficult since you have *missing values* in column A. Sorting, filtering, and PivotTables are all out of the question if your data set looks like that screenshot.

**Solving this with keyboard shortcuts**

Totally doable for this Google Sheet. This is what you could do:

All I’m doing above is the following (on PC):

- **SHIFT+CONTROL+DOWN ARROW** – Select all the empty cells from the current cell with a value up until the next cell with a value
- **SHIFT+UP ARROW** – Reduce the selection by one row
- **CONTROL+D** – Fill the value from the first cell in the selection down
- **CONTROL+DOWN ARROW** – Skip to the next value that needs to be filled down

The obvious tradeoff here is time vs. human error. Every time I have to do this task on a spreadsheet, I think about whether it’s worth filling the values down “manually” using keyboard shortcuts or using a VBA script (in Excel) to do this programmatically. It really depends on the number of rows. For the example SPAC Google Sheet, doing this with keyboard shortcuts takes 10 seconds tops. If this spreadsheet was 1,000,000 rows, then we have a problem.

Don’t worry, I got you. Here’s the script you can use to do this programmatically.

**Using Google Apps Script in Google Sheets**

First off, here’s the script you can use for Google Sheets (gist here). Just 14 lines of code and you’re good to go:

Never used macros or Google Apps Script before? It’s super simple. First go to **Tools** then **Script Editor**:

You may be asked to authenticate your Google account so just hit Yes to all those screens. Copy/paste the script into the editor:

Go to **File** and **Save** in order to save the script into the Google Apps Script project. Go back to Google Sheets and go to **Tools**, **Macros**, and click **Import** to import the `fillValuesDown` function into Google Sheets. Now you can use this function as a macro in your Google Sheet:

You can close out the Google Apps Script editor and now click on **Tools**, **Macros**, and click on **fillValuesDown** to run the script on your dataset:

The script utilizes the Spreadsheet service for Google Apps Script to access the data object for your Google Sheet (more on that below). The script is really only 12 lines long, and does the following in sequential order:

- Sets the `spreadsheet` variable so that we can use the active worksheet you’re on
- Sets the `currentRange` variable to start from A2 to the last row in the table
- Two more variables are set: `newRange` to store the new range of values we want to put into column A, and `newFillValue` which is kind of like an intermediate variable used in the loop
- The script goes through all values in `currentRange` (including the blank ones) and adds all the correct values to the `newRange` array
- The `currentRange` is then set equal to `newRange` to get all the “correct” values into column A

On the backend, the `currentRange` array looks like this:

`[['HZAC'], [], ['FST'], [], [], []...]`

The purpose of `newRange` is to create a new array that is a complete list of values:

`[['HZAC'], ['HZAC'], ['FST'], ['FST'], ['FST'], ['FST']...]`
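The gist itself isn’t reproduced here, but the loop described above can be sketched in Google Apps Script. The pure `fillDown` helper mirrors the `currentRange` → `newRange` transformation, and the `fillValuesDown` wrapper shows one assumed way to wire it to the Spreadsheet service (the exact ranges and variable names in the original gist may differ):

```javascript
// Pure fill-down logic: walks a column of values (a 2D array, one cell per
// row) and replaces each blank cell with the last non-blank value above it.
function fillDown(values) {
  var newRange = [];
  var newFillValue = ""; // intermediate variable used in the loop
  for (var i = 0; i < values.length; i++) {
    var cell = values[i].length ? values[i][0] : "";
    if (cell !== "" && cell !== null && cell !== undefined) {
      newFillValue = cell; // remember the latest category value
    }
    newRange.push([newFillValue]);
  }
  return newRange;
}

// Assumed wiring (only runs inside Google Sheets): read A2 down to the last
// row, fill the blanks, and write the completed column back in one call.
function fillValuesDown() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getActiveSheet();
  var currentRange = sheet.getRange(2, 1, sheet.getLastRow() - 1, 1);
  currentRange.setValues(fillDown(currentRange.getValues()));
}
```

Because the transformation happens in a plain array and is written back with a single `setValues` call, the script touches the sheet only twice no matter how many rows there are.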

When I first started learning macros, the first thing I did was record my keystrokes and break down what the backend “code” looked like. Here’s what recording a macro looks like:

When you open up the script editor, you’ll see this:

There are a lot of `activate()` and `getCurrentCell()` functions being called. You can then deconstruct all these *keystrokes* to build a script that accomplishes the task. But here’s the key difference between recording keystrokes versus working with the data object:

You are programming keystrokes instead of the Google Sheets application.

Other advantages of programming the *application* instead of the *keystrokes*:

- Utilizes less compute resources and runs faster
- Easier to debug
- Easier to adapt to more scenarios and use cases

In the *keystroke* world, you are literally telling Google Sheets to select cells, select ranges, and move the cursor around, which doesn’t seem like a big deal. When you are working with hundreds of thousands of rows, though, this could cause serious performance issues. Since Google Apps Script runs in the cloud, you may not see these performance deficiencies, but you’ll definitely see them in your Excel workbooks.

Speaking of Excel workbooks…

The structure of the VBA script is pretty similar to the Google Apps Script, just with slightly different syntax. I’m not going to walk through the tutorial of how to set this up since it’s pretty similar to Google Sheets. In the VBA script, you do end up doing some “cell selection,” as in line 8. Most of the script, however, works with the Excel data object model, so the script should run pretty quickly regardless of the size of your Excel file.

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

- Developer Love #3: Developer Experience Teams with Peggy Rayzis of Apollo


The post Dear Analyst #41: How to do a VLOOKUP to the “left” without INDEX/MATCH with TikTok data appeared first on .

This episode is about doing a `VLOOKUP` to the “left,” e.g. when your lookup column is not the first column in your table. The `INDEX/MATCH` strategy is the most commonly-cited strategy when good ‘ol `VLOOKUP` is not at your disposal. In this episode I walk through a strategy that still allows you to use `VLOOKUP`: array formulas. Skip to strategy #3 below if you want to see the answer. Associated Google Sheet for this episode if you want to follow along.

*See the video below if you want to jump straight to the tutorial:*

https://www.youtube.com/watch?v=6JluR45VJl4

**Why the VLOOKUP won’t work**

If you are new to why `VLOOKUP` won’t work in this scenario (see Google Sheet), take a look at the data structure below:

We have `ID` in column A and we want to find `Company Name` and `Market Cap` in columns C and D, respectively, for these `ID`s. The `ID` in column A is the unique identifier for the row, and we need to do a lookup to `Company ID` in column I.

While you can eyeball the result for the first row (“Triller” is the company for `ID` 3), we want to find a scalable solution using formulas.

As you start writing the `VLOOKUP` formula in column C, you’ll start to notice the problem: the `Company ID` column is not the *first* column in your table to lookup the `ID` value in column A:

Here are a few strategies for solving this problem (#3 is probably the one you haven’t seen before).

**Strategy #1: Move the lookup column to the first column position**

This is not the most ideal solution, but you could simply cut and paste the `Company ID` column and move it to the left-most “first” column of your lookup table. In Excel you would have to do a cut and paste, but in Google Sheets you can just drag and drop the column into the proper position:

Now the `VLOOKUP` for `Company Name` will work correctly since `Company ID` is the first column in your lookup table:

I don’t like this strategy because it involves some manual cutting and pasting of columns. If your lookup table isn’t static (e.g. might be sales data that gets added daily), then you might be ruining the “structure” of your data on subsequent updates. Let’s see what else we can do.

**Strategy #2: Make copies of the columns to the right of the lookup column**

Also not an ideal solution, but it works in one-off cases where your data is static and you don’t care about showing your back-end work to a colleague. It looks like data is duplicated, but you’re basically referencing existing columns in your table so that those columns appear to the “right” of your lookup column:

Now you can do a `VLOOKUP` for columns I to K to get the `Company Name` and `Market Cap` values to show up in columns C and D:

**Strategy #3: Build your own lookup “table” with an array formula**

A relatively unknown feature in Google Sheets is that you can create your own “tables” using array formulas. An array is simply a range of cells, and you can combine different ranges of cells: a comma places ranges side by side, while a semicolon stacks them on top of each other. To create an array, you put curly brackets around your ranges. Here’s how an array of columns F and G would look:

What’s the result? You simply get a reference to the two ranges after you enter the formula:

The key here is that you can create *any order* of range references in the array formula. We could’ve put G2:G6 first and F2:F6 second, and you would’ve seen the values in `Website` first followed by `Company Name` after entering the formula.

Knowing this, we can create our own lookup “table” using the array formula syntax like so:

Notice how the second argument in the `VLOOKUP` formula is no longer a table, but rather an array of column I followed by columns F to H. In this array, the second “column” is `Company Name` since we are saying column F is the second range of cells after column I. `Market Cap` is now the fourth column in this array:

In order to fill this formula down, we need to turn the range references in the array formula into absolute references as shown above.
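Based on the description above, the formulas in columns C and D would look something like this (the row ranges are taken from the episode’s five-row example and may differ in your sheet):

```
Company Name (2nd column of the array): =VLOOKUP($A2, {$I$2:$I$6, $F$2:$H$6}, 2, FALSE)
Market Cap (4th column of the array):   =VLOOKUP($A2, {$I$2:$I$6, $F$2:$H$6}, 4, FALSE)
```

With the range references absolute, the same array formula can be filled down the whole column without the “table” shifting.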

**Strategy #4: INDEX/MATCH**

As mentioned at the beginning of this post, this is the most common method for looking up values to the left. I won’t give a detailed explanation of how `INDEX/MATCH` works, but here’s how you would get the `Company Name` given the data structure:
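A sketch of that formula, using the same assumed ranges as the array-formula example (`MATCH` finds the row of the `ID` in column I, and `INDEX` returns the value from that row in column F or H):

```
Company Name: =INDEX($F$2:$F$6, MATCH($A2, $I$2:$I$6, 0))
Market Cap:   =INDEX($H$2:$H$6, MATCH($A2, $I$2:$I$6, 0))
```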

I’m a little torn between strategies #3 and #4 since `INDEX/MATCH` is the go-to method for looking up data to the left, and is also more performant than `VLOOKUP` on large data sets. The fact that the array formula in strategy #3 doesn’t involve a nested formula makes it potentially easier to debug in complicated spreadsheets. I haven’t used an array formula in many `VLOOKUP` situations since I learned `INDEX/MATCH` such a long time ago, but I may try this strategy in the future.

Of course, this all becomes irrelevant if you have the `XLOOKUP` function at your disposal, which became available to certain Office 365 subscribers about a year ago (September 2019). This video is a fun poke at `XLOOKUP`, but also holds some truth for the `VLOOKUP` purists out there (start watching at 1:19):

I talk about this in the 2nd half of the episode, but thought it would be worth sharing a passage from *The Critique of Pure Reason* as it relates to betting on your convictions. Listen to the *Knowledge Project* episode for the full background:

The usual touchstone, whether that which someone asserts is merely his persuasion — or at least his subjective conviction, that is, his firm belief — is betting. It often happens that someone propounds his views with such positive and uncompromising assurance that he seems to have entirely set aside all thought of possible error. A bet disconcerts him. Sometimes it turns out that he has a conviction which can be estimated at a value of one ducat, but not of ten. For he is very willing to venture one ducat, but when it is a question of ten he becomes aware, as he had not previously been, that it may very well be that he is in error. If, in a given case, we represent ourselves as staking the happiness of our whole life, the triumphant tone of our judgment is greatly abated; we become extremely diffident, and discover for the first time that our belief does not reach so far. Thus pragmatic belief always exists in some specific degree, which, according to differences in the interests at stake, may be large or may be small.

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

- The ShopTalk Show #424: Web Components, Frameworks vs Vanilla, Accessible Numbers, and SVG Memory Usage
- The Knowledge Project #89: Maria Konnikova: Less Certainty, More Inquiry


The post Dear Analyst #40: A spreadsheet error from two Harvard professors leading to incorrect economic policies after 2008 recession appeared first on .

*See the video below if you want to jump straight to the tutorial:*

Economists Carmen Reinhart and Kenneth Rogoff published a paper in 2010 called *Growth in a Time of Debt* (originally published in the American Economic Review) where they argued:

[…] median growth rates for countries with public debt over 90 percent of GDP are roughly one percent lower than otherwise; average (mean) growth rates are several percent lower.

In 2013, PhD students Thomas Herndon, Michael Ash, and Robert Pollin of the University of Massachusetts, Amherst re-created the study from Reinhart and Rogoff’s paper as part of their PhD program. The students had to analyze the original Excel files that Reinhart and Rogoff used, and they weren’t able to replicate the original results. They cited in their own paper, entitled *Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff*:

[…] coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period.

Reinhart and Rogoff suggested that the debt/GDP ratio and economic growth is simply a correlation, and that correlation still holds after correcting for the spreadsheet mistakes. However, that correlation is not as strong as their original paper posited.

The implications of their findings: news outlets, politicians, and policymakers used the 90% benchmark as a signal that a country is heading for low economic growth. Some notable examples:

- 2012 Republican nominee for the US vice presidency Paul Ryan included the paper in his proposed 2013 budget
- The *Washington Post* editorial board takes it as an economic consensus view, stating that “debt-to-GDP could keep rising — and stick dangerously near the 90 percent mark that economists regard as a threat to sustainable economic growth.”
- Austerity measures are put into place around the world despite the advice from economic advisers, pushing the unemployment rate above 10% in the eurozone

The three main errors that Herndon, Ash, and Pollin discovered are the following:

- Years of high debt and average growth were selectively excluded from the data set
- Countries’ GDP growth rates were not properly weighted
- Summary table excludes high-debt and average-growth countries

This video illustrates the three individual problems with the spreadsheet really clearly:

If you fix these errors, the average real GDP growth rate for countries carrying a public debt-to-GDP ratio of over 90% is actually **2.2%**, not **-0.1%**. In the Google Sheet I shared, you won’t see the correct 2.2% average growth rate since I’m not doing the full analysis and am focusing on the third Excel error stated above.

The third error of incorrectly excluding high-growth countries from the average GDP growth rate is a particularly egregious mistake, and Reinhart and Rogoff admit that they made this simple cell referencing mistake. As you can see in the screenshot below, they simply omit rows 45 to 49 in their `AVERAGE` formula:

Here are three methods Reinhart and Rogoff could have used to ensure that they referenced the correct cells to avoid this mistake:

**Method #1: Check the status bar**

After you select all the cells that contain GDP growth rates in column G, you can look at the dropdown in the bottom right of Excel or Google Sheets to see the average. No formulas required:

You can also get other summary stats like the `SUM`, `MIN`, and `MAX` of your selected range of cells. Probably the easiest method to get a quick sanity check of the averages you’ve calculated in lines 26-27 of the Google Sheet.

**Method #2: “Checksum” formulas**

This one is my preferred method, and is quite common in financial models. Usually you’ll see this type of “error checking” when you want to make sure you’ve captured the correct cell references for a `SUM` formula, but with some extra work you can check for averages too.

You start by writing a formula *below* your actual summary stats (in this case starting on line 28 of the Google Sheet) and create a `SUM` formula of the data:

The big question is this: how do you know if you’ve referenced the correct cells in your “checksum” formula? The hope here is that by writing the `SUM` formula for the second time, in theory, you won’t make the same mistake twice. Obviously this is a big assumption in this method, but let’s assume you’ve properly made the reference for this internal error-checking formula.

The next formula below the “checksum” is a “count” formula:

Notice how it’s not a `COUNTA` formula. This is because the table contains the “n.a.” text, so a `COUNTA` formula would be incorrect since it would count all values in the column. We only want the numeric values, hence the reason for using `COUNT`.

Finally, the “checkaverage” formula compares your *actual* average in line 26 with the result of `checksum` / `count`. If the values aren’t equal, then you’ll get the text “Error” as the result of the `IF` formula:

Since line 26 references the “incorrect” averages used in Reinhart and Rogoff’s paper, we get errors across the board. This “checksum” or “checkaverage” methodology gives you a visual indicator on whether your calculated results are properly referencing all the cells in the range instead of a subset. Instead of writing a “checksum” and “count” formula, you could simplify the “checkaverage” formula to this:

We simply put the `SUM` and `COUNT` formulas inside the first argument of the `IF` statement.
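Putting the pieces together, the error-checking rows could look like this (the cell references are illustrative, assuming the computed average sits in line 26 and the growth rates in G5:G24):

```
Line 28 (“checksum”):     =SUM(G5:G24)
Line 29 (“count”):        =COUNT(G5:G24)
Line 30 (“checkaverage”): =IF(G26 = G28 / G29, "OK", "Error")
Simplified version:       =IF(G26 = SUM(G5:G24) / COUNT(G5:G24), "OK", "Error")
```

Filled across each debt/GDP bucket column, a single “Error” flag immediately shows which average is referencing an incomplete range.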

**Method #3: Rebuild the averages with a PivotTable**

This method also relies on you selecting the proper cells to build your PivotTable. Again, assuming you don’t make the same mistake twice, selecting the cells in the range should be a pretty simple task. After you select the cells (B4:G24 in this case), you build a PivotTable with `Country` in the Rows and the four debt/GDP buckets in the values. You then summarize each metric with the `AVERAGE` selection:

The “Grand Total” on the last line of the PivotTable contains the average across all growth rates. You can then compare these numbers to your computed numbers on the first sheet that contains your table.

People don’t check their analyses with the above 3 methods because it takes extra work and…well…people are lazy. In addition to putting in error checks to ensure you are not making simple spreadsheet errors like this, there are other strategies you can use to ensure others can replicate your work to detect potential errors.

Reinhart and Rogoff didn’t make their full underlying data public. They only shared their spreadsheet after Herndon, Ash, and Pollin reached out to them as the trio was trying to replicate their results. Some other strategies:

- Upload your results to a public repository like GitHub early on in your analysis and “open source” your data
- Write detailed steps on experimental design, procedures, equipment, data processing, and statistical methods used so others can replicate your experiment

I really liked this quote from a commenter about the Excel error on this Stat Modeling blog:

I’d like to see how many researchers expose themselves to such criticism. Uploading a raw dataset is one thing but allowing people to see all your intermediate calculations in messy detail is rare.

Too often we’re caught up in doing all the number crunching ourselves and then sharing the output once we think we’ve finished the analysis. As this example suggests, sharing your data set and model *as you are doing the analysis* can prevent a blunder like this from happening.

In the second half of this episode, I discuss an article in *The Verge* about how the HUGO Gene Nomenclature Committee had to rename genes because of Excel’s simple feature of auto-formatting dates. Gene names like “MARCH1” and “SEPT1” get re-formatted to the dates “1-Mar” and “1-Sep” when these values are entered into Excel. I thought it was interesting to see the scientific community bending to this standard feature in Excel, given the widespread use of Excel in the scientific community.

- The Verge: Scientists rename human genes to stop Microsoft Excel from misreading them as dates
- This Week In Startups #948: HackerOne CEO Mårten Mickos shares insights on how he grew his bug bounty army to 400,000 strong by providing a path to hack for good, most common security vulnerabilities, worst security breaches, hacking the Pentagon, protecting the open source that unites us & scaling a company culture that defaults to disclosure

The post Dear Analyst #40: A spreadsheet error from two Harvard professors leading to incorrect economic policies after 2008 recession appeared first on .

]]>See the video below if you want to jump straight to the tutorial:

https://youtu.be/mXUynkQQ1uM

Background

Economists Carmen Reinhart and Kenneth Rogoff published a paper in 2010 called Growth in a Time of Debt (originally published in the American Economic Review) where they argued:

[...] median growth rates for countries with public debt over 90 percent of GDP are roughly one percent lower than otherwise; average (mean) growth rates are several percent lower.

In 2013, PhD students Thomas Herndon, Michael Ash, and Robert Pollin of the University of Massachusetts, Amherst re-created the study from Reinhart and Rogoff's paper as part of their PhD program. The students analyzed the original Excel files that Reinhart and Rogoff used, and they weren't able to replicate the original results. They cited in their own paper entitled Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff:

[...] coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period.

Reinhart and Rogoff suggested that the relationship between the debt/GDP ratio and economic growth is simply a correlation, and that the correlation still holds after correcting for the spreadsheet mistakes. However, the correlation is not as strong as their original paper posited.

Why this was a big deal

The implications of their findings resulted in news outlets, politicians, and policymakers using the 90% benchmark as a signal that a country is heading for low economic growth. Some notable examples:

- 2012 Republican nominee for the US vice presidency Paul Ryan included the paper in his proposed 2013 budget
- The Washington Post editorial board took it as an economic consensus view, stating that "debt-to-GDP could keep rising — and stick dangerously near the 90 percent mark that economists regard as a threat to sustainable economic growth."
- Austerity measures were put into place around the world despite the advice from economic advisers, pushing the unemployment rate above 10% in the eurozone

3 main Excel spreadsheet problems with the model

The three main errors that Herndon, Ash, and Pollin discovered are the following:

- Years of high debt and average growth were selectively excluded from the data set
- Countries' G...]]>

The post Dear Analyst #39: Generate a random list of names from a list of popular 90s TV characters appeared first on .

]]>`RAND` function, but let’s take it a step further and say you want to give the end user the ability to dictate the number of random names returned. Start with your list of names in column B. This can be any list you want to randomize. My list is just a bunch of TV characters from shows I watched when I was a kid.

In column A, you put the `RAND` function and copy it all the way down to the bottom of your list. You’ll get a column of random decimal numbers. It doesn’t look that useful now, but this random number column will drive the rest of the tool to generate your list of random names:

It sounds kind of weird: why would you sort a random list of numbers? What does that even mean? As you have probably seen, every time you refresh your Google Sheet or commit an Excel formula by hitting ENTER, all those random numbers in column A will change. This means if you sort this list of random numbers, the sorted list will change too. I put a spacer in column C, so in cell D2 you enter this formula:

The `SORT` function takes in a range of cells as the first parameter, the sort `index` as the second (which is just the number column we want to sort on, column #1), and then `true` or `false` for sorting in ascending or descending order. You can also put 0 to indicate `false`, which is what I did in this example to sort in descending order.

The nice thing about the `SORT` function is that it automatically fills the formula down to the bottom of your data set. This is a relatively new function in Excel since it kind of acts like dynamic array formulas or array-entered formulas. The formula kind of “spills” down for you as your list grows so you don’t have to worry about dragging the formula down until the last row in your data set.

What does this column of sorted random numbers do for us? Well, we know that each random number in this *sorted* column corresponds to one of the numbers in column A where we generated the random number. So in column E, we just do a `VLOOKUP` using column D as our lookup value and columns A:B as our lookup table to get the name associated with the random number in column D:

This is not the usual way you might use `VLOOKUP` because you’re usually using `VLOOKUP` with some unique identifier as the lookup value. Column A isn’t really a unique “TV character ID” since that “ID” changes all the time with the `RAND` function. We don’t really care about that, because now when you refresh the Sheet, column E will always have a random list of names:

In the above gif I’m just pressing COMMAND + R a few times to refresh the Sheet so that the `RAND` function in column A constantly changes.
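The whole RAND + SORT + VLOOKUP pattern is what programmers call a decorate-sort-undecorate shuffle. A rough Python equivalent, with a made-up list of names:

```python
import random

# Column B: any list you want to randomize
names = ["Zack Morris", "Kelly Kapowski", "Screech Powers", "Mr. Belding"]

# Column A: a RAND() decimal next to each name
keyed = [(random.random(), name) for name in names]

# Column D: SORT the random keys (descending, like the 0/false argument)
keyed.sort(key=lambda pair: pair[0], reverse=True)

# Column E: the "VLOOKUP" step - map each sorted key back to its name
shuffled = [name for _, name in keyed]

print(shuffled)  # same names, new random order on every run
```

Every “refresh” (re-run) regenerates the random keys, which is exactly why the spreadsheet reshuffles each time `RAND` recalculates.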

We could stop here since you now have a random list of names in column E. Let’s take this a step further and give the end user the ability to *choose the number* of random names from the list.

We’re already doing some hacking with `VLOOKUP` and using it in a way that it probably wasn’t made to be used, so let’s do something similar with the `OFFSET` function. Cell H1 is just my “user input” cell where I’m getting the number of results from the user. This is a hard-coded number the user has to input. Then in cell H2, I have this `OFFSET` formula:

Let’s break this down by each parameter:

- **E2** – This is the “starting point” for my `OFFSET` function
- **0** – I don’t want to move any rows up/down
- **0** – I don’t want to move any columns left/right
- **H1** – References my user input cell indicating how many *rows* of data I want to return from my `OFFSET` (e.g. the “height” of the range)
- **1** – How many columns to return (e.g. the “width” of the range)

Now as you put a number in cell H1, the list of random names will grow and shrink. If you put a number that is larger than the number of names in your list, then it will just return the max number of names from your list (in random order, of course):
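In code terms, the `OFFSET` step just takes the first N rows of the shuffled list, capped at the list’s length. A small Python sketch (the function name and character list are made up):

```python
import random

def random_names(names, how_many):
    """Shuffle the list (the RAND + SORT step), then return the first
    how_many entries (the OFFSET step). Asking for more names than exist
    simply returns the whole shuffled list, just like OFFSET capping out
    at the height of the range."""
    shuffled = sorted(names, key=lambda _: random.random())
    return shuffled[:how_many]

characters = ["Cory Matthews", "Topanga Lawrence", "Shawn Hunter"]
print(random_names(characters, 2))   # two names, random order
print(random_names(characters, 10))  # all three names, random order
```

Python’s slice already caps at the list length, which mirrors how the spreadsheet just returns every name when H1 exceeds the list size.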

A caveat I point out at the end of this episode is that while you *can* build this random list of names generator in Excel or Google Sheets, a spreadsheet may not be the best tool for the job. There are hundreds of random list generator apps that may be built specifically for your industry, be it education or hospitality. Sometimes it’s just easier to do it in a spreadsheet because all our data is there, but constantly question if the tool you are using is the right one for the job.

There’s a similar template in the Coda gallery which generates a random list of teams of players based on the number of teams and players you have. Just another nifty way of approaching the same problem in a different tool. *Disclosure: I work at Coda*.

- Google Cloud Platform Podcast #226: Documentation in Developer Practices with Riona Macnamara

The post Dear Analyst #39: Generate a random list of names from a list of popular 90s TV characters appeared first on .

]]>https://www.youtube.com/watch?v=icKppdnxJRk


The post Dear Analyst #38: Breaking down an Excel error that led to a $6.2B loss at JPMorgan Chase appeared first on .

]]>I’m not going to pretend like I know the intricacies of all the financial products involved here, so you can read the Wikipedia article if you want the full details. In 2012, there was a CDS (credit default swap) product called CDX IG 9 that the trader at JPMorgan may have made large bets on, and he ended up on the wrong side of the bet. The London trader’s name is Bruno Iksil, and it was a classic scenario of a gambler trying to get out of his losses by doubling down on black at the roulette table.

Multiple investigations were undertaken by the authorities in the U.S. and U.K., and the investigations show that a variety of institutional failures may have facilitated the large bets made by the London Whale. This HBR article by Ben Heineman, Jr. provides a nice summary of all the key players:

- **London traders** – The traders simply didn’t understand the complexity of the derivative products they were buying and selling
- **Chief Investment Office (CIO)** – The head of the CIO didn’t monitor the trading strategies or put in the proper controls for the portfolio of products the office was buying. The Value at Risk (VaR) model was flawed (see more below).
- **Firm-wide leaders** – Not enough oversight by the CFO and CEO (Jamie Dimon)
- **Board and Risk Policy Committee** – The committee was told that everything was fine with the CIO, and didn’t get an accurate picture of what risk officers really felt about the risky trades being made.

There is a 130-page report created by JPMorgan Chase in 2012 which details what happened internally that led to this debacle. In my opinion, the juicy stuff starts in the appendix starting on page 121 of the report. I read off some parts of this appendix in this episode, but the appendix basically details issues with the VaR models created by one of the quantitative modelers at JPMorgan to more accurately value the complex trades that were happening. Or at least they thought the model was more accurate.

At the very end of the appendix, there’s a section called “Discovery of Problems with the New VaR Model and Discontinuance” where the report details the Excel error that contributed to the large inaccuracies in how the model valued risk.

This is how the error is described in the report (emphasis mine):

Following that decision, further errors were discovered in the Basel II.5 model, including, most significantly, an operational error in the calculation of the relative changes in hazard rates and correlation estimates. Specifically, **after subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average**, as the modeler had intended.

*Note: I don’t have domain expertise in VaR models, synthetic credit derivatives, or trading in general. The following example is my over-simplification of the error based on what’s written in the report.*

The report talks about hazard rates (for what I assume relate to the default of corporate loans in this case) and how the changes in the hazard rates were improperly calculated. Here’s a simple table from the Google Sheet showing fictitious dates, hazard rates, and the change in rates:
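To make the reported mistake concrete, here’s the divide-by-sum versus divide-by-average difference with two hypothetical hazard rates (not the actual values from JPMorgan’s model):

```python
old_rate, new_rate = 0.0100, 0.0105  # hypothetical hazard rates on consecutive days

difference = new_rate - old_rate

relative_change = difference / ((new_rate + old_rate) / 2)  # intended: divide by average
erroneous_change = difference / (new_rate + old_rate)       # the bug: divide by sum

# Dividing by the sum of two rates instead of their average cuts the
# measured relative change in half, understating volatility (and VaR)
print(relative_change, erroneous_change)
```

Since the sum of two numbers is always twice their average, the buggy formula systematically halves every measured change, which is consistent with the report’s finding that the model understated risk.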

Now here’s what happens when you apply a `SUM` vs. an `AVERAGE` to the “Change in %” column:

This is hitting the border of my knowledge of growth rates and time periods, but the *sum of changes* will always be 5X the *average of changes* given there are 5 values we are summing/averaging.
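That 5X relationship is just arithmetic: the sum of n values is always n times their average, no matter what the values are. A quick check with made-up changes:

```python
changes = [0.02, -0.01, 0.04, -0.03, 0.03]  # fictitious "Change in %" values

total = sum(changes)            # what SUM returns
average = total / len(changes)  # what AVERAGE returns

# With 5 values, SUM is always exactly 5x AVERAGE, regardless of the data
print(total / average)  # 5, up to floating-point rounding
```

This is why the *ratio* between the two never tells you anything interesting; only the absolute sizes of the numbers do.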

The magnitude of the difference between the `SUM` and the `AVERAGE` is not what I think is interesting, but rather the *absolute difference* between the `SUM` and `AVERAGE`. Here is a chart plotting the same data:

Based on this chart, can you estimate what the *average* of the Change in % is? Looks like something around 0%, but 3% doesn’t feel *that* far off. The point I’m trying to make is that unless you are monitoring the `SUM` and `AVERAGE` consistently over time to detect any outliers, it will be difficult to know whether you made the formula mistake in the first place. With the presence of outliers, it becomes more clear that you might have an error in your model. Here’s the other table from the Google Sheet with intentionally skewed hazard rates:

Here we see the magnitude of the difference is still 5X, but the *absolute difference* is much wider. This would cause an analyst to look deeper into the model and try to figure out why there is such a large discrepancy. But this is only because there are fictitious hazard rates. In the case of JPMorgan Chase, my hunch is that the gap between the lower and upper bound of daily hazard rates was really narrow, so detecting a change like this would’ve been very difficult without the proper controls in place.

This reminds me of the tale of the boiling frog:

Urban myth has it that if you put a frog in a pot of boiling water it will instantly leap out. But if you put it in a pot filled with pleasantly tepid water and gradually heat it, the frog will remain in the water until it boils to death. (Source)

Without a really hot pot of boiling water, it was too late for JPMorgan to detect there was something wrong with the CDS trades, and the proverbial frog boils to death.

One frame for this egregious Excel error is Hanlon’s Razor:

“Never attribute to malice that which is adequately explained by stupidity”, known in several other forms. It is a philosophical razor which suggests a way of eliminating unlikely explanations for human behavior. (Source)

Perhaps the modeler cannot be blamed for his Excel error because it was an error that he had no way of knowing about or predicting. I’m not trying to remove blame from the modeler, but it’s an interesting frame to analyze the problem because this is a spreadsheet error that is difficult to prevent unless you have other models and risk controls that are able to predict this type of error in advance. There are many other cases of Excel errors that led to false calculations that cost firms millions of dollars, and it’s hard to say if one can blame the modeler for “malice” or plain stupidity.

Quick plug for a new Excel class I just launched today on Skillshare. It’s an intermediate Excel class for cleaning and analyzing data.

- a16z Podcast: The Future of Decision-Making–3 Startup Opportunities

The post Dear Analyst #38: Breaking down an Excel error that led to a $6.2B loss at JPMorgan Chase appeared first on .


The post Dear Analyst #37: Text manipulation functions to extract domain names from email addresses appeared first on .

]]>`FIND`, `RIGHT`, and `LEN` functions to extract the domain name from an email address (e.g. the “tesla.com” from “john.smith@tesla.com”). Here’s the Google Sheet if you want to make a copy for yourself to follow along.

The first step is to use the `FIND` function to find the location of the “@” symbol in the email address. The `FIND` function takes two required arguments and one optional argument. You’re basically finding the *index location* of where that character or string exists within the cell:

In the case of “john.smith@amazon.com,” the `FIND` function would return 11 since the “@” symbol starts at the 11th position within the email address. Pretty simple right?

The next part is a little trickier. Now that we know the position of the “@” symbol, we want all the characters *after* the “@” symbol to get the domain of the email address. There are multiple ways of doing this, but I chose to use the `RIGHT` and `LEN` functions. To make this more clear, I could have put the `LEN` function in its own column, but decided to nest it within the `RIGHT` function:

The `RIGHT` function takes two arguments and simply returns the number of characters from the “right” of the text you give it (in this case the email address). Since we don’t know how many characters to pull from each e-mail address, we use the result of the `LEN(A2) - B2` formula which tells us how many characters to pull from the right of the email address.

`LEN(A2)` gives us the length of the entire text (for “john.smith@amazon.com” it’s 21). If we subtract the index position of the “@” symbol from that length, we’ll get the exact number of characters to pull for each unique email address. Pretty nifty.
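The same arithmetic translates almost directly to Python, keeping Excel’s 1-based position for the “@” so the `LEN(A2) - B2` subtraction matches. The email address is the post’s own example:

```python
email = "john.smith@amazon.com"

at_position = email.find("@") + 1  # FIND("@", A2): 1-based position -> 11
total_length = len(email)          # LEN(A2) -> 21

# RIGHT(A2, LEN(A2) - B2): grab that many characters from the right
domain = email[-(total_length - at_position):]

print(at_position, total_length, domain)  # 11 21 amazon.com
```

A negative slice in Python is the closest analogue to `RIGHT`: it counts characters from the end of the string rather than the start.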

*Note: The “Position of @” column also could’ve been nested in the 3rd column (and replaced the current cell reference of B2).*

I typically use a combination of `FIND`, `LEN`, and `MID` to extract the text I need from a longer piece of text. Once you master these few functions, you’ll be able to pull anything you want out of a long piece of text to get “clean” data.

- The Tim Ferriss Show #444: Hugh Jackman on Best Decisions, Daily Routines, The 85% Rule, Favorite Exercises, Mind Training, and Much More
- EconTalk: Robert Lerman on Apprenticeships

The post Dear Analyst #37: Text manipulation functions to extract domain names from email addresses appeared first on .


The post Dear Analyst #36: What The Economist’s model for the 2020 presidential election can teach us about forecasting appeared first on .

]]>The model utilizes machine learning and multiple data sources and it’s easy to get caught up in the details. Here are the key takeaways as described by Dan Rosenheck, the data editor at *The Economist*:

- Machine learning is used to create equations to predict the 2020 presidential outcome
- Early polls are not as reliable early on in the election cycle
- Partisan non-response bias can result in a supporter being more likely or unlikely to respond to a pollster when there is extremely good or bad news about that supporter’s party or candidate

**A caveat**: *The Economist*‘s model and the various forecasting techniques they use are definitely outside of my knowledge and skillset. Most of this episode is me learning more about the model and interpreting some of the results. You don’t have to be a statistics programmer or data science professional to appreciate what the data team has done at *The Economist*. If you are working with data in any capacity, pushing yourself to learn about subjects that push your comfort zone will only make you more knowledgeable about the data analysis process.

One key finding from the model is that polls conducted in the first half of the year during the election cycle are a pretty weak predictor of results. On the other hand, fundamental measures like the president’s approval rating, GDP growth, and whether there is an incumbent running for re-election are much better predictors. This chart shows the difference between poll results and fundamentals for predicting the outcome in 1992:

The model primarily relies on these fundamental indicators, but over time the polls become a better indicator for predicting the outcome. In the last week leading up to the election in November, more weight is applied to the polls than the fundamentals.

This visualization below shows that early polls tend to *overestimate* a party’s share of the vote (in this case the Democratic share) compared to fundamental indicators. As you get closer to election day, however, the polls start to become a better predictor:

One downside *The Economist* points out with other models that try to forecast the presidential election is that equations are created that *overfit* to historical data points. Think about it: if you tried to create an equation to predict who would win the NBA championship in 2020 based on 1990s data, you may create an equation that leans heavily to the Bulls. Unfortunately, Michael Jordan isn’t playing anymore and the 2020 NBA season is now being played in a bubble in Orlando.

*The Economist* utilizes machine learning to better predict the outcome of the presidential election, employing two techniques which I’ll try to explain in layman’s terms from reading the post:

- Elastic-net regularisation – Simplify the equation you’re using to predict the outcome
- Leave-one-out-cross-validation – Split your data into pieces and apply the machine learning to each piece to predict outcomes

#2 is a pretty common technique I’ve seen used in finance. Take actual results and see if you can predict what *would’ve happened *if you applied your forecast to last quarter or last year.

In the context of the presidential election, let’s say the model is trying to predict what the outcome of the 1948 election would’ve been (the incumbent Harry Truman defeated Thomas Dewey). The training is done on all the other years of data *except* for 1948. Then the learnings from these other years are used to see which model was best at predicting the outcome in 1948.
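Leave-one-out cross-validation is easy to sketch without any machine-learning library. Here the “model” is just a mean predictor over made-up incumbent vote shares, which is far simpler than what *The Economist* does, but the hold-one-out mechanic is the same:

```python
def leave_one_out_errors(observations):
    """For each observation, fit on every *other* observation and
    measure how far the prediction lands from the held-out value."""
    errors = []
    for i, actual in enumerate(observations):
        training = observations[:i] + observations[i + 1:]  # leave one year out
        prediction = sum(training) / len(training)          # "fit" the mean model
        errors.append(abs(prediction - actual))             # out-of-sample error
    return errors

# Hypothetical incumbent-party vote shares by election year
vote_shares = [0.52, 0.48, 0.51, 0.45, 0.54]
print(leave_one_out_errors(vote_shares))
```

Averaging these held-out errors across candidate models is what lets you pick the one that generalizes best rather than the one that merely fits history.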

The model also looks at state-level polling data. What’s interesting about the state model is how it uses demographic data like population density and the share of voters that are white evangelical Christians to determine how similar two states are in terms of voter preferences:

In the visualization above, Wisconsin is more similar to Ohio than Nevada is to Ohio.

I’ve never heard the term “partisan non-response bias” before and think the way the team is accounting for this bias in their model makes the model more accurate and unique. They take polling data from major sources like *ABC* and *The Washington Post* and track the changes in poll results *over time*. This means they can account for any irregularities in the data so that large swings in opinion due to news about a candidate don’t impact the model too much.

One visualization that caught my eye in the source code *The Economist* released is this one showing the model results vs. the polls vs. actuals from the 2008, 2012, and 2016 elections. Notice how in 2008 and 2012 the variability between the model, prior, and result are much closer together than in 2016? Just shows the level of uncertainty that went into the 2016 prediction.

Speaking of uncertainty, I like this commit message as the team was refining the model back in March.

- Software Engineering Daily: Data Intensive Application with Martin Kleppmann

The post Dear Analyst #36: What The Economist’s model for the 2020 presidential election can teach us about forecasting appeared first on .

]]>Source: The Economist

Key takeaways and a caveat

The model utilizes machine learning and multiple data sources and it's easy to get caught up in the details. Here are the key takeaways as described by Dan Rosenhack, the data editor at The Economist:

* Machine learning is used to create equations to predict the 2020 presidential outcome* Early polls are not as reliable early on in the election cycle* Partisan non-response bias can result in a supporter being more likely or unlikely to respond to a pollster when there is extremely good or bad news about that supporter's party or candidate

A caveat: The Economist's model and the various forecasting techniques they use are definitely outside of my knowledge and skillset. Most of this episode is me learning more about the model and interpreting some of the results. You don't have to be a statistics programmer or data science professional to appreciate what the data team has done at The Economist. If you are working with data in any capacity, pushing yourself to learn about subjects that push your comfort zone will only make you more knowledgable about the data analysis process.

Fundamentals vs. early polling

One key finding from the model is that polls conducted in the first half of the election year are a pretty weak predictor of the result. On the other hand, fundamental measures like the president's approval rating, GDP growth, and whether an incumbent is running for re-election are much better predictors. This chart shows the difference between poll results and fundamentals for predicting the outcome in 1992:

Source: The Economist

The model primarily relies on these fundamental indicators, but over time the polls become a better indicator for predicting the outcome. In the last week leading up to the election in November, more weight is applied to the polls than to the fundamentals.

This visualization below shows that early polls tend to overestimate a party's share of the vote (in this case the Democratic share) compared to fundamental indicators. As you get closer to election day, however, the polls start to become a better predictor:

Source: The Economist

Overfitting data

One downside *The Economist* points out with other models that try to forecast the presidential election is that they create equations that overfit to historical data points. Think about it: if you tried to create an equation to predict who would win the NBA championship in 2020 based on 1990s data, you might create an equation that leans heavily toward the Bulls. Unfortunately, Michael Jordan isn't playing anymore, and the 2020 NBA season is now being played in a bubble in Orlando.

Had to mention Jordan somewhere in this post :)

The post Dear Analyst #35: Analyzing what people dream about with the Shape of Dreams data visualization appeared first on .

Importance of data visualization

Data visualizations are just as important (if not more important) than the number crunching and analysis of the data itself. While Excel and Google Sheets are the standard tools for analyzing data, there are a variety of tools for creating charts and visualizations such as Tableau, Google’s Data Studio, and Microsoft’s own Power BI.

I’ve posted about the power of data visualizations in the past including New York Times’ data bootcamp (that teaches data visualization), data visualizations to model COVID-19, and my own class on creating a data-driven presentation. Creating meaningful data visualizations requires you to understand the technical aspects of aggregating data and actually creating the visualization itself. It also requires the creative side of telling a story around the visualization. Federica does an amazing job of telling a story about the Google Search queries about what we collectively dream about as a society.

Structure of Shape of Dreams

I really like how Federica gives the reader two options: a story about the data, where she takes you through the visualizations with key takeaways, and the ability to explore the data yourself. In the first chapter, she simply shows the most common types of dreams by keyword across different languages:

When you explore the data, you can use the arrow keys to see the dreams people search for by language and by year which leads to some interesting results:

Varying the types of visualizations

As you go through chapter 2 and chapter 3, you see Federica utilizing different types of visualizations to better tell the story behind the dream Google Searches. A motif she uses across the visualizations is a flower’s petals, and you’re able to interact with the petals in chapter 2. To summarize what I imagine to be an extremely large dataset, we see some general categories of dreams in chapter 2:

Federica discovers that searches in English, Portuguese, and Spanish aggregate up to dreams about animals, family, and relationships.

You’ll see a more traditional time-series chart in chapter 3 showing the popularity of a certain type of dream over time. I’d be curious to see the trend of dreams about “pregnancy” in 2020 given the pandemic:

My favorite visualization is in chapter 4 where you’ll see a network type of visualization that shows two metrics:

- Languages that share common searches about dreams
- The number of dreams in common between languages

We actually use a similar type of visualization at work when we want to see how our customers are related to each other inside an organization (and how they share their Coda docs). What I love about the visualization above is that it shows how *connected* we are as a society given the same types of dreams we have (and subsequently search for on Google).

I also discuss a new podcast I started listening to called *Against the Rules* by one of my favorite authors, Michael Lewis. The episode is all about research (and companies) helping you optimize your conversations with people to get the most benefit from them. Lewis poses the million-dollar question at the end of the episode: what are the ethics of using this data to optimize *all* of your conversations in life, from business to romance?

This question is probably getting addressed already at Harvard Business School. Lewis interviews Professor Allison Wood Brooks in the episode, who teaches a class at HBS called How to Talk Gooder in Business and Life. If you don’t have access to these types of classes and resources, will that put you at a disadvantage later on in your career, in negotiating a business deal, or in finding a romantic partner?

Taken to the extreme, this reminds me of this scene from the season finale of Westworld (spoiler alert):

- Against the Rules: The Data Coach

The post Dear Analyst #34: Trick for finding column index for VLOOKUPs using pride events data appeared first on .

`VLOOKUP` tips. Given that it’s pride month, we’ll be applying this tip to a list of all pride events in the United States. Here is the Google Sheet if you want to follow along with this example. Here’s the scenario: you have a super large table in Excel or Google Sheets (by large I mean there are many columns) and you need to do a `VLOOKUP` on the 25th column. Instead of counting 25 columns from the left of your lookup column, you can use this column index trick to quickly get the column you’re after.
Creating column indexes above your lookup table

In the screenshot above, you’ll notice that each column has the column index above it. Each index is a simple formula: the previous column index plus 1:
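As a sketch of what that index row might look like (the cell references here are hypothetical — assume the indexes live in row 1 and the table starts in column A): put 1 in A1, then in B1 enter:

```
=A1+1
```

and drag it across the remaining columns. Alternatively, `=COLUMN()` in row 1 produces 1, 2, 3, … automatically when the table starts in column A, and doesn’t need to be re-dragged if columns get reordered.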

This might feel a little strange because we’re used to having the column headers in the first row of our table. Putting this column index *above* the column header, however, makes it easier to provide the `col_index` parameter your `VLOOKUP` formula needs. In this list of pride events, if I want to get the `Start` column pulled into my `VLOOKUP` formula, I simply reference the column index above the column header instead of writing out the number “5” (note that `PrideEvents` is a named range representing A2:E270 in my list of pride events):
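A sketch of what that formula might look like (the exact cells are hypothetical — here E1 holds the index above the `Start` header and G2 holds the event you’re looking up):

```
=VLOOKUP(G2, PrideEvents, E$1, FALSE)
```

Referencing `E$1` avoids hard-coding the 5, and locking the row with `$` lets you drag the formula down without the index reference shifting. The `FALSE` argument forces an exact match on the lookup value.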

Putting the column index above your new column headers

In this second example, I put the column index above the *new table* where I want to pull in data from my list of pride events. Notice that the *order* of columns I want to pull does not match the column order from my lookup table. The simple trick here is that I do a simple cell reference to the column index above the main table so that I know that the order of the columns I want to pull back in this case is 3, 5, 2:
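As a sketch (again with hypothetical cell references — say the new table starts in column H with its index row 3, 5, 2 across H1:J1, and column G holds the lookup values), one formula in H2 can be dragged both across and down:

```
=VLOOKUP($G2, PrideEvents, H$1, FALSE)
```

The mixed references do the work: `$G2` keeps the lookup value pinned to column G as you drag across, while `H$1` keeps the column index pinned to row 1 as you drag down.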

One of the benefits of this trick is that you can move columns around in your lookup table and this `VLOOKUP` formula will still work, *only if* you “reset” the column indexes above your lookup table column headers to be sequential (1, 2, 3, etc.). This is kind of annoying because any time I switch columns around, I have to re-drag the formula of the previous cell plus 1 in row 1 where my column indexes are. Hopefully your columns aren’t moving around too much and this solution works for you.

Using the MATCH() function to find the column index

This is a little more advanced, but another solution is to use the `MATCH` function to match the column name in your *new* table with the column names in your *lookup* table:

Instead of doing a simple reference to the column index in that first row of my *new* table, I have this `MATCH` function, which tries to match `Location`, in this case, with the column headers in the lookup table ($A$2:$E$2 represents the column headers from my list of pride events). If it finds a “match,” the `MATCH` function returns the column index. You could actually do this solution without having that column index above your new table columns by putting the `MATCH` function directly in your `VLOOKUP` formula, but it might make the formula more difficult to debug in the future.
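A sketch of both variants (hypothetical cells again — here the new table’s header, such as `Location`, sits in H2, and $A$2:$E$2 holds the lookup table’s headers). In the index row above the new table, `MATCH` finds the position of the header below it:

```
=MATCH(H2, $A$2:$E$2, 0)
```

Or, skipping the helper index row, nest it directly inside the `VLOOKUP` (with $G3 as a lookup value):

```
=VLOOKUP($G3, PrideEvents, MATCH(H$2, $A$2:$E$2, 0), FALSE)
```

The final 0 in `MATCH` requests an exact match on the header name, just as `FALSE` does for the `VLOOKUP` lookup value.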

Pride Easter egg in Google Sheets

To celebrate pride month, here’s a fun Easter egg you’ll find in Google Sheets if you type out “PRIDE” in separate columns (you’ll also see this in the Google Sheets example for this blog post):

- The Pomp Letter: We Need More Software Engineering And Less Financial Engineering
