Dear Analyst (https://www.thekeycuts.com/category/podcast/) – A show made for analysts: data, data analysis, and software. This is a podcast made by a lifelong analyst. I cover topics including Excel, data analysis, and tools for sharing data. In addition to data analysis topics, I may also cover topics related to software engineering and building applications. I also do a roundup of my favorite podcasts and episodes.

Dear Analyst #43: Setting up workflows that scale – from spreadsheets to tools & applications
https://www.thekeycuts.com/dear-analyst-43-setting-up-workflows-that-scale-from-spreadsheets-to-tools-applications/
Mon, 14 Sep 2020

This episode is the audio from a presentation I gave a few weeks ago to members of Betaworks based in NYC. Betaworks is a startup accelerator, co-working space, and community of founders. No-code is a pretty hot topic right now, and in this presentation I talk about how the spreadsheet is one of the first no-code “platforms” and how your spreadsheet skills can be extended to build real tools. The presentation is adapted from a talk I gave last year at Webflow’s No-Code Conference. I embedded the “slides” at the bottom of the post, and here is a link to the slides if you want to look on your own.

Summary of presentation

  1. The skills you’ve learned in Excel/Google Sheets — including data structuring — translate to building workflows for any part of your business
  2. Thinking beyond spreadsheets as a way to do data analysis or “number crunching”
  3. Any tool that helps automate or solve some workflow at your company can be built with spreadsheets
  4. Why learning spreadsheets can set you up well for learning “no-code” tools

Spreadsheet examples from presentation

During the presentation, I showed actual spreadsheets (Excel and Google Sheets) I’ve built in the past for freelance clients and friends. The main concept I’m trying to convey is that each of these spreadsheets looks and feels more like an application than a model that forecasts out certain values. Each of these examples consists of three core elements:

  1. Database – A place to store information
  2. User Input – Fields and forms for someone to fill out
  3. Calculations/Display – Formulas (e.g. “business logic”) to make the spreadsheet output something for you (the administrator) or the user

My 2 cents: When you’re building an application in a spreadsheet, you’re extending the original purpose and audience Excel and Google Sheets were meant to serve: financial models for accountants. But this is what makes the spreadsheet so versatile. The fact that an analyst can string together formulas to make a spreadsheet look and feel like an application is what gives the spreadsheet power. This innovation also pushes Microsoft, Google, and other platforms to release new features that give analysts the ability to build tools, not just models.

I’ve written extensively about this subject in the past, so I’ll leave my soliloquy at that. On to the examples.

Bachelorette planning Google Sheet

The first example I discuss is this bachelorette party planning Google Sheet I built for a friend. This spreadsheet has been duplicated quite a few times by friends of friends, and all it does is help a bride-to-be figure out which weekend works best for a bachelorette party.

The key insight is that the database is everything from column B onwards and row 3 and below. All the availability for each person is captured in each of these cells and there’s some conditional formatting to give the bride a visual indicator to see when a weekend is available.

The user input is the ability for each friend the Google Sheet is shared with to edit the cells. “Yes,” “No,” and “Maybe” are the only inputs that matter for this Google Sheet. Finally, the calculations are in rows 31-33, which tally up the user inputs for each weekend so the bride can see which weekend is the “most free” for her friends.
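As a rough sketch of what those tallies could look like (the exact ranges here are assumptions for illustration, not necessarily the layout of the shared Sheet), each cell in rows 31-33 can be a simple COUNTIF over that weekend's column:

=COUNTIF(B3:B29, "Yes") – number of friends free for the weekend in column B (row 31)
=COUNTIF(B3:B29, "Maybe") – number of tentative friends (row 32)
=COUNTIF(B3:B29, "No") – number of friends who can't make it (row 33)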

There are countless iPhone and Android apps you can download to do this exact same thing, but this spreadsheet does one thing and does it well: help brides figure out which weekend to plan a bachelorette party.

Splitting costs with friends

This splitting costs with friends blog post is by far the most popular post on my blog since I published it in 2014 (thanks Google search!). Every day I still get requests to give people edit access to the Google Sheet (please just make a copy of it instead of requesting edit access). Here’s the Google Sheet if you want to make a copy for yourself.

Similar to the previous example, the database is all the items, costs, and who participated in each cost, from row 2 down. The user input is the cells themselves, but the most important part of the Google Sheet is the 1s and 0s from column C onward. Those 1s and 0s represent whether a friend or family member “participated” in the cost. This allows the spreadsheet to do some basic calculations to figure out who owes what.

Rows 26-28 are the calculations the trip organizer can glance at to see who is owed money and who owes money. Again, there are numerous apps and custom tools you can pay for or download to split costs with friends, and this Google Sheet mimics the features of those apps in a more bare-bones way.
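To make the mechanics concrete, here is one hedged way those calculations could work (the column letters, row ranges, and the helper column are assumptions for illustration, not necessarily the exact layout of the shared Sheet). A hypothetical helper column computes each item's per-participant share, and each person's total is a SUMPRODUCT of their 1/0 flags against that share:

=IF(SUM(C2:K2)=0, 0, B2/SUM(C2:K2)) – per-participant share of the item in row 2 (filled down in a hypothetical helper column L)
=SUMPRODUCT(C2:C25, $L$2:$L$25) – total owed by the person tracked in column C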

Patient intake system

This example shows how a spreadsheet can be extended well beyond what it was originally intended to do. This was for one of my consulting clients who needed a new CRM system for managing new patients at their clinic.

The Excel file basically lets the operations manager at the clinic quickly “move” new patients from one spreadsheet to another using a VBA macro. To mimic the look and feel of an application, I drew these blue and green buttons using the shape feature in Excel and tied a macro to each button. The database consists of patient details, the user input is simply each row of data, and the calculations involve these macros that move data from one spreadsheet to another.

This gets into an important concept that an Excel file or Google Sheet is not that great for: workflows. Since everything is usually calculated in real time in a spreadsheet, it can be difficult to do an if-this-then-that type of workflow without using a macro or script (see my last post on automating a tedious fill-values-down task).

“Slides” from Betaworks presentation

The rest of the presentation includes tools and tips for building applications with other no-code tools. Slides are below:

Original talk from Webflow’s No-Code Conference in 2019:

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

  • No other podcasts for this episode given how long this episode is!

Dear Analyst #42: Filling values down into empty cells programmatically with Google Apps Script & VBA tutorial
https://www.thekeycuts.com/dear-analyst-filling-values-down-into-empty-cells-programmatically-with-google-apps-script-vba-tutorial/
Mon, 07 Sep 2020

SPACs (Special Purpose Acquisition Companies) or “blank check” companies have been in the news recently, so I used some real SPAC data for this episode. Your spreadsheet has empty cells in column A, and these empty cells should be filled with values. Your task is to fill each value down until you find the next cell with a value, at which point you need to fill that new value down. This episode walks through how to do this programmatically with a script in Google Apps Script (for Google Sheets) and VBA (for Excel). This is the Google Sheet associated with the episode. The Google Apps Script is here and the VBA script is here. See a quick example of what the issue is in the gif below and how the script “fills in” the values for you.


See the video below if you want to jump straight to the tutorial:
https://www.youtube.com/watch?v=t-32QkyjKVE


Why is this data structure a problem?

You’ve inherited a spreadsheet and the data structure looks like this:

It’s a list of data but there are empty cells in column A. This is usually a category or dimension in your data set that needs to be “filled down” so that the data set is complete. In the Google Sheet, each row represents one person that is associated with a given SPAC, but the SPAC Ticker column is incomplete. You’ll usually get this type of data structure through the following:

  • Data was manually created by someone who didn’t fill down the values in column A since they thought it was a “category”
  • You are working with a data set that originally came from a PivotTable, but you only have the “values” from the PivotTable, not the PivotTable itself

This data structure is a problem because if you want to do any type of analysis on this data, it will be extremely difficult since you have missing values in column A. Sorting, filtering, and PivotTables are all out of the question if your data set looks like that screenshot.

Solving this with keyboard shortcuts

Totally doable for this Google Sheet. This is what you could do:

All I’m doing above is the following (on PC):

  1. SHIFT+CONTROL+DOWN ARROW – Select all the empty cells from the current cell with a value up until the next cell with a value
  2. SHIFT+UP ARROW – Reduce the selection by one row
  3. CONTROL+D – Fill the value from the first cell in the selection down
  4. CONTROL+DOWN ARROW – Skip to the next value that needs to be filled down

The obvious tradeoff here is time vs. human error. Every time I have to do this task on a spreadsheet, I think about whether it was worth filling the values down “manually” using keyboard shortcuts or using a VBA script (in Excel) to do this programmatically. It really depends on the number of rows. For the example SPAC Google Sheet, doing this with keyboard shortcuts takes 10 seconds tops. If this spreadsheet were 1,000,000 rows, then we would have a problem.

Don’t worry, I got you. Here’s the script you can use to do this programmatically.

Using Google Apps Script in Google Sheets

First off, here’s the script you can use for Google Sheets (gist here). Just 14 lines of code and you’re good to go:

Never used macros or Google Apps Script before? It’s super simple. First go to Tools then Script Editor:

You may be asked to authenticate your Google account so just hit Yes to all those screens. Copy/paste the script into the editor:

Go to File and Save in order to save the script into the Google Apps Script project. Go back to Google Sheets and go to Tools, Macros, and click Import to import the fillValuesDown function into Google Sheets. Now you can use this function as a macro in your Google Sheet:

You can close out the Google Apps Script editor and now click on Tools, Macros, and click on fillValuesDown to run the script on your dataset:

How does the script work?

The script utilizes the Spreadsheet service for Google Apps Script to access the data object for your Google Sheet (more on that below). The script is really only 12 lines long, and does the following in sequential order:

  1. Sets the spreadsheet variable so that we can use the active worksheet you’re on
  2. Sets the currentRange variable to start from A2 to the last row in the table
  3. Two more variables are set: newRange to store the new range of values we want to put into column A, and newFillValue which is kind of like an intermediate variable used in the loop
  4. The script goes through all values in currentRange (including the blank ones) and adds all the correct values to the newRange array
  5. The currentRange is then set equal to newRange to get all the “correct” values into column A
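To make those five steps concrete, here is a minimal sketch of what such a fillValuesDown function could look like. This is an illustration based on the description above, not necessarily the exact code in the gist:

function fillValuesDown() {
  // 1. Use the active worksheet
  var spreadsheet = SpreadsheetApp.getActiveSheet();
  // 2. Column A from row 2 down to the last row in the table
  var currentRange = spreadsheet.getRange(2, 1, spreadsheet.getLastRow() - 1, 1);
  // 3. newRange holds the rebuilt column; newFillValue remembers the last non-empty value seen
  var newRange = [];
  var newFillValue = '';
  var values = currentRange.getValues();
  // 4. Walk every cell (including the blanks) and push the correct value into newRange
  for (var i = 0; i < values.length; i++) {
    if (values[i][0] !== '') {
      newFillValue = values[i][0];
    }
    newRange.push([newFillValue]);
  }
  // 5. Write the completed list of values back into column A
  currentRange.setValues(newRange);
}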

On the backend, the currentRange array looks like this:

[['HZAC'], [], ['FST'], [], [] , []...]

The purpose of newRange is to create a new array that is a complete list of values:

[['HZAC'], ['HZAC'], ['FST'], ['FST'], ['FST'] , ['FST']...]

Recording macros vs. programming Google Sheets

When I first started learning macros, the first thing I did was record my keystrokes and break down what the backend “code” looked like. Here’s what recording a macro looks like:

When you open up the script editor, you’ll see this:

There’s a lot of activate() and getCurrentCell() functions being called. You can then deconstruct all these keystrokes to build a script that accomplishes the task. But here’s the key difference between recording keystrokes versus working with the data object:

You are programming keystrokes instead of the Google Sheets application.

Other advantages of programming the application instead of the keystrokes:

  • Utilizes less compute resources and runs faster
  • Easier to debug
  • Easier to adapt to more scenarios and use cases

In the keystroke world, you are literally telling Google Sheets to select cells, select ranges, and move the cursor around, which doesn’t seem like a big deal. When you are working with hundreds of thousands of rows, though, this could cause serious performance issues. Since Google Apps Script runs in the cloud, you may not see these performance deficiencies, but you’ll definitely see this in your Excel workbooks.

Speaking of Excel workbooks…

Using the VBA script for Excel

The structure of the VBA script is pretty similar to the Google Apps Script; it’s just slightly different syntax. I’m not going to walk through how to set this up since it’s pretty similar to the Google Sheets setup. In the VBA script, you do end up doing some “cell selection” like in line 8. Most of the script, however, works with the Excel object model, so the script should run pretty quickly regardless of the size of your Excel file.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #41: How to do a VLOOKUP to the “left” without INDEX/MATCH with TikTok data
https://www.thekeycuts.com/dear-analyst-how-to-do-a-vlookup-to-the-left-without-index-match-with-tiktok-data/
Mon, 31 Aug 2020


Since TikTok is in the news right now about who is going to buy them, I thought using some fake-ish TikTok acquisition data would be relevant for this episode. A classic Excel/Google Sheets challenge: how to do a VLOOKUP to the “left,” i.e. your lookup column is not the first column in your lookup table. There are all sorts of strategies to overcome this issue with how your data is structured. Notably, the INDEX/MATCH strategy is the most commonly-cited strategy when good ol’ VLOOKUP is not at your disposal. In this episode I walk through a strategy that allows you to use VLOOKUP: array formulas. Skip to strategy #3 below if you want to see the answer. Here is the associated Google Sheet for this episode if you want to follow along.

Was trying to find some gif associated with “looking up” 🙃

See the video below if you want to jump straight to the tutorial:
https://www.youtube.com/watch?v=6JluR45VJl4


Why the VLOOKUP won’t work

If you are new to why VLOOKUP won’t work in this scenario (see Google Sheet), take a look at the data structure below:

We have ID in column A and we want to find Company Name and Market Cap in columns C and D, respectively, for these IDs. The ID in column A is the unique identifier for the row, and we need to do a lookup to Company ID in column I.

While you can eyeball the result for the first row (“Triller” is the company for ID 3), we want to find a scalable solution using formulas.

As you start writing the VLOOKUP formula in column C, you’ll start to notice the problem: the Company ID column is not the first column of the table where you need to look up the ID value from column A:

Here are a few strategies for solving this problem (#3 is probably the one you haven’t seen before).

Strategy #1: Move the lookup column to the first column position

This is not the most ideal solution, but you could just simply cut and paste the Company ID column and move it to the left-most “first” column of your lookup table. In Excel you would have to do a cut and paste, but in Google Sheets you can just drag and drop the column into the proper position:

Now the VLOOKUP for Company Name will work correctly since Company ID is the first column in your lookup table:

I don’t like this strategy because it involves some manual cutting and pasting of columns. If your lookup table isn’t static (e.g. might be sales data that gets added daily), then you might be ruining the “structure” of your data on subsequent updates. Let’s see what else we can do.

Strategy #2: Make copies of the columns to the right of the lookup column

Also not an ideal solution, but it works in one-off cases where your data is static and you don’t care about showing your back-end work to a colleague. It looks like data is duplicated, but you’re basically referencing existing columns in your table so that those columns appear to the “right” of your lookup column:

Now you can do a VLOOKUP for columns I to K to get the Company Name and Market Cap values to show up in columns C and D:

Strategy #3 (preferred): Use array formulas

A relatively unknown feature in Google Sheets is that you can create your own “tables” using array formulas. An array is simply a range of cells, and you can combine multiple ranges by wrapping them in curly brackets and separating them with a comma (to place them side by side) or a semicolon (to stack them on top of each other). Here’s how an array of columns F and G would look:
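For example, something like this (in a US-locale Sheet; the exact ranges are just for illustration):

={F2:F6, G2:G6}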

What’s the result? You simply get a reference to the two ranges after you enter the formula:

The key here is that you can create any order of range references in the array formula. We could’ve put G2:G6 first and F2:F6 second, and you would’ve seen the values in Website first followed by Company Name after entering the formula.

Knowing this, we can create our own lookup “table” using the array formula syntax like so:

Notice how the second argument in the VLOOKUP formula is no longer a table, but rather an array of column I followed by columns F to H. In this array, the second “column” is Company Name since we are saying column F is the second range of cells after column I. Market Cap is now the fourth column in this array:
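A hedged version of what that VLOOKUP could look like for Company Name, assuming the Company IDs sit in I2:I6 and the return columns in F2:H6 (adjust to the actual ranges in the Sheet):

=VLOOKUP(A2, {$I$2:$I$6, $F$2:$H$6}, 2, FALSE)

For Market Cap, the third argument becomes 4 instead of 2.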

In order to fill this formula down, we need to turn the range references in the array formula into absolute references as shown above.

Strategy #4 (most common): INDEX/MATCH

As mentioned at the beginning of this post, this is the most common method for looking up values to the left. I won’t give a detailed explanation of how INDEX/MATCH works, but here’s how you would get the Company Name given the data structure:
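For completeness, a hedged version of that INDEX/MATCH (same assumed ranges as above):

=INDEX($F$2:$F$6, MATCH(A2, $I$2:$I$6, 0))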

Which strategy should you use?

I’m a little torn between strategies #3 and #4 since INDEX/MATCH is the go-to method for looking up data to the left, and is also more performant than VLOOKUP on large data sets. The fact that the array formula in strategy #3 doesn’t involve a nested formula makes it potentially easier to debug in complicated spreadsheets. I haven’t used an array formula in many VLOOKUP situations since I learned INDEX/MATCH such a long time ago, but I may try this strategy in the future.

Of course, this all becomes irrelevant if you have the XLOOKUP function at your disposal which became available to certain Office 365 subscribers about a year ago (September 2019). This video is a fun poke at XLOOKUP, but also holds some truth for the VLOOKUP purists out there (start watching at 1:19):

A little Kant and poker

I talk about this in the 2nd half of the episode, but thought it would be worth sharing a passage from The Critique of Pure Reason as it relates to betting on your convictions. Listen to the Knowledge Project episode for the full background:

The usual touchstone, whether that which someone asserts is merely his persuasion — or at least his subjective conviction, that is, his firm belief — is betting. It often happens that someone propounds his views with such positive and uncompromising assurance that he seems to have entirely set aside all thought of possible error. A bet disconcerts him. Sometimes it turns out that he has a conviction which can be estimated at a value of one ducat, but not of ten. For he is very willing to venture one ducat, but when it is a question of ten he becomes aware, as he had not previously been, that it may very well be that he is in error. If, in a given case, we represent ourselves as staking the happiness of our whole life, the triumphant tone of our judgment is greatly abated; we become extremely diffident, and discover for the first time that our belief does not reach so far. Thus pragmatic belief always exists in some specific degree, which, according to differences in the interests at stake, may be large or may be small.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #40: A spreadsheet error from two Harvard professors leading to incorrect economic policies after 2008 recession
https://www.thekeycuts.com/dear-analyst-a-spreadsheet-error-that-potentially-led-to-incorrect-economic-and-austerity-policies-after-2008-recession/
Mon, 24 Aug 2020

It’s 2010, and the world is coming out of recession. Two Harvard professors–one of whom is a former economist for the IMF and chess Grandmaster–publish a paper suggesting that a country with a high public debt-to-GDP ratio of over 90% is associated with low economic growth. Turns out the Excel model the professors use is riddled with some basic statistical and formula errors. The results potentially lead to incorrect economic policies, austerity measures, and high unemployment around the world. This is a Google Sheet which shows one of the spreadsheet errors, and I show how you can prevent such an error in this post.

See the video below if you want to jump straight to the tutorial:
https://youtu.be/mXUynkQQ1uM

Background

Economists Carmen Reinhart and Kenneth Rogoff published a paper in 2010 called Growth in a Time of Debt (originally published in the American Economic Review) where they argued:

[…] median growth rates for countries with public debt over 90 percent of GDP are roughly one percent lower than otherwise; average (mean) growth rates are several percent lower.

In 2013, PhD students Thomas Herndon, Michael Ash, and Robert Pollin of the University of Massachusetts, Amherst re-created the study from Reinhart and Rogoff’s paper as part of their PhD program. The students had to analyze the original Excel files that Reinhart and Rogoff used, and they weren’t able to replicate the original results. They cited in their own paper entitled Does High Public Debt Consistently Stifle Economic Growth? A Critique of Reinhart and Rogoff:

[…] coding errors, selective exclusion of available data, and unconventional weighting of summary statistics lead to serious errors that inaccurately represent the relationship between public debt and GDP growth among 20 advanced economies in the post-war period.

Reinhart and Rogoff suggested that the debt/GDP ratio and economic growth is simply a correlation, and that correlation still holds after correcting for the spreadsheet mistakes. However, that correlation is not as strong as their original paper posited.

Why this was a big deal

The implications of their findings resulted in news outlets, politicians, and policymakers using the 90% benchmark as a signal that a country is heading for low economic growth. Some notable examples:

  • 2012 Republican nominee for the US vice presidency Paul Ryan included the paper in his proposed 2013 budget
  • The Washington Post editorial board takes it as an economic consensus view, stating that “debt-to-GDP could keep rising — and stick dangerously near the 90 percent mark that economists regard as a threat to sustainable economic growth.”
  • Austerity measures are put into place around the world despite the advice from economic advisers, pushing the unemployment rate above 10% in the eurozone

3 main Excel spreadsheet problems with the model

The three main errors that Herndon, Ash, and Pollin discovered are the following:

  1. Years of high debt and average growth were selectively excluded from the data set
  2. Countries’ GDP growth rates were not properly weighted
  3. Summary table excludes high-debt and average-growth countries

This video illustrates the three individual problems with the spreadsheet really clearly:

If you fix these errors, the average real GDP growth rate for countries carrying a public debt-to-GDP ratio of over 90% is actually 2.2%, not -0.1%. In the Google Sheet I shared, you won’t see the correct 2.2% average growth rate since I’m not doing the full analysis and am focusing only on the third Excel error stated above.

Fixing incorrect cell references for average GDP growth rates

The third error of incorrectly excluding high-growth countries from the average GDP growth rate is a particularly egregious mistake, and Reinhart and Rogoff admit that they made this simple cell referencing mistake. As you can see in the screenshot below, they simply omit rows 45 to 49 in their AVERAGE formula:

Source: https://statmodeling.stat.columbia.edu/

Here are three methods Reinhart and Rogoff could have used to ensure that they referenced the correct cells to avoid this mistake:

Method 1: Check the summary dropdown in the bottom-right

After you select all the cells that contain GDP growth rates in column G, you can look at the dropdown in the bottom right of Excel or Google Sheets to see the average. No formulas required:

You can also get other summary stats like the SUM, MIN, and MAX of your selected range of cells. Probably the easiest method to get a quick sanity check of your averages that you’ve calculated in lines 26-27 of the Google Sheet.

Method 2: Adding a checksum/checkaverage formula to compare results

This one is my preferred method, and is quite common in financial models. Usually you’ll see this type of “error checking” when you want to make sure you’ve captured the correct cell references for a SUM formula, but with some extra work you can check for averages too.

You start by writing a formula below your actual summary stats (in this case starting on line 28 of the Google Sheet) and create a SUM formula of the data:
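For one of the debt/GDP buckets, the “checksum” is just a plain SUM over that bucket’s full column of growth rates. As a hedged illustration (assuming that bucket’s growth rates sit in G5:G24; the actual ranges in the Sheet may differ):

=SUM(G5:G24)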

The big question is this: how do you know if you’ve referenced the correct cells in your “checksum” formula? The hope here is that by writing the SUM formula for the second time, in theory, you won’t make the same mistake twice. Obviously this is a big assumption in this method, but let’s assume you’ve properly made the reference for this internal error-checking formula.

The next formula below the “checksum” is a “count” formula:

Notice how it’s not a COUNTA formula. This is because the table contains the “n.a.” text, so a COUNTA formula would be incorrect since it would count all values in the column. We only want the numeric values, hence the reason for using COUNT.

Finally, the “checkaverage” formula compares your actual average in line 26 with the result of checksum / count. If the values aren’t equal, then you’ll get the text “Error” as the result of the IF formula:
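Using the same assumed ranges as above (growth rates in G5:G24, the calculated average in G26, the checksum in G28, and the count in G29), the “count” and “checkaverage” formulas could look like this:

=COUNT(G5:G24) – counts only the numeric growth rates and ignores the “n.a.” cells
=IF(G26 <> G28/G29, "Error", "OK") – flags the average in G26 if it doesn't match checksum ÷ count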

Since line 26 references the “incorrect” averages used in Reinhart and Rogoff’s paper, we get errors across the board. This “checksum” or “checkaverage” methodology gives you a visual indicator on whether your calculated results are properly referencing all the cells in the range instead of a subset. Instead of writing a “checksum” and “count” formula, you could simplify the “checkaverage” formula to this:

We simply put the SUM and COUNT formulas inside the first argument of the IF statement.
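In other words, something along these lines (same assumed ranges as above):

=IF(G26 <> SUM(G5:G24)/COUNT(G5:G24), "Error", "OK")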

Method 3: Create a PivotTable and compare results

This method also relies on you selecting the proper cells to build your PivotTable. Again, assuming you don’t make the same mistake twice, selecting the cells in the range should be a pretty simple task. After you select the cells (B4:G24 in this case), you build a PivotTable with Country in the Rows and the four debt/GDP buckets in the values. You then summarize each metric with the AVERAGE selection:

The “Grand Total” on the last line of the PivotTable contains the average across all growth rates. You can then compare these numbers to your computed numbers on the first sheet that contains your table.

Lessons to be learned for your own models

People don’t check their analyses with the above 3 methods because it takes extra work and…well…people are lazy. In addition to putting in error checks to ensure you are not making simple spreadsheet errors like this, there are other strategies you can use to ensure others can replicate your work to detect potential errors.

For Reinhart and Rogoff, they didn’t make their full underlying data public. They only shared their spreadsheet after Herndon, Ash and Pollin reached out to them as the trio was trying to replicate their results. Some other strategies:

  • Upload your results to a public repository like GitHub early on in your analysis and “open source” your data
  • Write detailed steps on experimental design, procedures, equipment, data processing, and statistical methods used so others can replicate your experiment

I really liked this quote from a commenter about the Excel error on this Stat Modeling blog:

I’d like to see how many researchers expose themselves to such criticism. Uploading a raw dataset is one thing but allowing people to see all your intermediate calculations in messy detail is rare.

Too often we’re caught up in doing all the number crunching ourselves and then sharing the output once we think we’ve finished the analysis. As this example suggests, sharing your data set and model as you are doing the analysis can prevent a blunder like this from happening.

Auto date formatting and human gene naming problems

In the second half of this episode, I discuss an article in The Verge about how the HUGO Gene Nomenclature Committee had to rename gene names because of Excel’s simple feature of auto-formatting dates. Gene names like “MARCH1” and “SEPT1” get re-formatted to the dates “1-Mar” and “1-Sep” when these values are entered into Excel. I thought this was interesting to see the scientific community bending to this standard feature in Excel given the widespread use of Excel in the scientific community.

Source: The Verge

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #39: Generate a random list of names from a list of popular 90s TV characters
https://www.thekeycuts.com/dear-analyst-generate-a-random-list-of-names-from-a-list-of-popular-90s-tv-characters/
Mon, 10 Aug 2020

Let’s say you have a set list of names (in this case TV characters from popular 90s TV shows). You want Google Sheets/Excel to generate a random list of names from your list as if you were picking names out of a hat. How would you do this? It most likely would involve the RAND function, but let’s take it a step further and say you want to give the end user the ability to dictate the number of random names to return from your list (e.g. out of my list of 100, give me 5 random names). This is the Google Sheet with all the completed formulas. In addition to the audio format of this episode, I’m also going to start releasing the video tutorial:
https://www.youtube.com/watch?v=icKppdnxJRk

Create your list in column B

Start with your list of names in column B. This can be any list you want to randomize. My list is just a bunch of TV characters from shows I watched when I was a kid.

Source: Fandom

In column A, you put the RAND function and copy it all the way down to the bottom of your list. You’ll get a column of random decimal numbers. It doesn’t look that useful now, but this random number column will drive the rest of the tool to generate your list of random names:

Sort this random list of numbers

It sounds kind of weird, why would you sort a random list of numbers? What does that even mean? As you have probably seen, every time you refresh your Google Sheet or commit an Excel formula by hitting ENTER, all those random numbers in column A will change. This means if you sort this list of random numbers, the sorted list will change too. I put a space in column C so in cell D2, you enter this formula:
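A hedged version of that formula (assuming the random numbers and names live in A2:B31; adjust the last row to match your list):

=SORT(A2:B31, 1, 0)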

The SORT function takes in a range of cells as the first parameter, the sort index as the 2nd (which is just the number column we want to sort on, column #1), and then true or false for sorting in ascending or descending order. You can also put 0 to indicate false, which is what I did in this example to sort in descending order.

The nice thing about the SORT function is that it automatically fills the formula down to the bottom of your data set. This is a relatively new function in Excel since it kind of acts like dynamic array formulas or array-entered formulas. The formula kind of “spills” down for you as your list grows so you don’t have to worry about dragging the formula down until the last row in your data set.

A good ‘ol VLOOKUP

What does this column of sorted random numbers do for us? Well, we know that each random number in this sorted column corresponds to one of the numbers in column A where we generated the random number. So in column E, we just do a VLOOKUP using column D as our lookup value and columns A:B as our lookup table to get the name associated with the random number in column D:
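Something like this in E2, filled down (a sketch based on the description above):

=VLOOKUP(D2, $A:$B, 2, FALSE)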

This is not the usual way you might use VLOOKUP because you’re usually using VLOOKUP with some unique identifier as the lookup value. Column A isn’t really a unique “TV character ID” since that “ID” changes all the time with the RAND function. We don’t really care about that, because now when you refresh the Sheet, column E will always have a random list of names:

In the above gif I’m just pressing COMMAND + R a few times to refresh the Sheet so that the RAND function in column A constantly changes.

We could stop here since you now have a random list of names in column E. Let’s take this a step further and give the end user the ability to choose the number of random names from the list.

User input with OFFSET

We’re already doing some hacking with VLOOKUP and using it in a way that it probably wasn’t made to be used, so let’s do something similar with the OFFSET function. Cell H1 is just my “user input” cell where I’m getting the number of results from the user. This is a hard-coded number the user has to input. Then in cell H2, I have this OFFSET formula:

Let’s break this down by each parameter:

  • E2 – This is the “starting point” for my OFFSET function
  • 0 – I don’t want to move any rows up/down
  • 0 – I don’t want to move any columns left/right
  • H1 – References my user input cell indicating how many rows of data I want to return from my OFFSET (e.g. “height” of the range)
  • 1 – How many columns to return (e.g. “width” of the range)
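Put together, the formula in H2 might look like this (a sketch assuming the random names spill down from E2 and the user’s input lives in H1):

=OFFSET(E2, 0, 0, H1, 1)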

Now as you put a number in cell H1, the list of random names will grow and shrink. If you put a number that is more than the list of names you have, then it will just return the max number of names from your list (in random order, of course):

Picking the right tool for the job

A caveat I point out at the end of this episode is that while you can build this random list of names generator in Excel or Google Sheets, a spreadsheet may not be the best tool for the job. There are hundreds of random list generator apps that may be built specifically for your industry, be it education or hospitality. Sometimes it’s just easier to do it in a spreadsheet because all our data is there, but constantly question whether the tool you are using is the right one for the job.

There’s a similar template in the Coda gallery which generates a random list of teams of players based on the number of teams and players you have. Just another nifty way at approaching the same problem in a different tool. Disclosure: I work at Coda.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #38: Breaking down an Excel error that led to a $6.2B loss at JPMorgan Chase (Tue, 04 Aug 2020) https://www.thekeycuts.com/dear-analyst-38-breaking-down-an-excel-error-that-led-to-six-billion-loss-at-jpmorgan-chase/

The post Dear Analyst #38: Breaking down an Excel error that led to a $6.2B loss at JPMorgan Chase appeared first on .

]]>
You blink a few times at the screen and realize what you’re seeing is not a typo. $6.2B has left your bank due to some rogue trader making untimely bets on the market. That’s B as in billion. You call up the modeler who was supposed to make sure this never happens to your bank. The modeler takes a closer look at his model, and realizes that he made a fundamental error in how he calculates one value that caused the dominoes to fall. This is the story of the “London Whale” at JPMorgan Chase in 2012 who cost the bank $6.2B and a breakdown of the Excel error that may have caused the whole thing. This is the Google Sheet if you want to follow along with the Excel error.

Derivative of a derivative

I’m not going to pretend like I know the intricacies of all the financial products involved here, so you can read the Wikipedia article if you want the full details. In 2012, there was a CDS (credit default swap) product called CDX IG 9 that the trader at JPMorgan may have made large bets on, and he ended up on the wrong side of the bet. The London trader’s name is Bruno Iksil, and it was a classic scenario of a gambler trying to get out of his losses by doubling down on black at the roulette table.

Source: The Fiscal Times

Multiple investigations were conducted by authorities in the U.S. and U.K., and they show that a variety of institutional failures may have facilitated the large bets made by the London Whale. This HBR article by Ben Heineman, Jr. provides a nice summary of all the key players:

  • London traders – The traders simply didn’t understand the complexity of the derivative products they were buying and selling
  • Chief Investment Office (CIO) – The head of the CIO didn’t monitor the trading strategies or put in the proper controls for the portfolio of products the office was buying. The Value at Risk (VaR) model was flawed (see more below).
  • Firm-wide Leaders – Not enough oversight by the CFO and CEO (Jamie Dimon)
  • Board and Risk Policy Committee – The committee was told that everything was fine with the CIO, and didn’t get accurate pictures of what risk officers really felt about the risky trades being made.

Appendix of the Task Force Report by JPMorgan

There is a 130-page report created by JPMorgan Chase in 2012 which details what happened internally that led to this debacle. In my opinion, the juicy stuff starts in the appendix on page 121 of the report. I read off some parts of this appendix in the episode, but it basically details issues with the VaR models created by one of the quantitative modelers at JPMorgan to more accurately value the complex trades that were happening. Or at least they thought the model was more accurate.

At the very end of the appendix, there’s a section called “Discovery of Problems with the New VaR Model and Discontinuance” where the report details the Excel error that contributed to the large inaccuracies in how the model valued risk.

The $6.2B Excel error

This is how the error is described in the report (emphasis mine):

Following that decision, further errors were discovered in the Basel II.5 model, including, most significantly, an operational error in the calculation of the relative changes in hazard rates and correlation estimates. Specifically, after subtracting the old rate from the new rate, the spreadsheet divided by their sum instead of their average, as the modeler had intended.
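Taken literally, the mistake the report describes can be sketched with two one-line formulas. This is a hypothetical reconstruction for illustration only (assuming the old and new hazard rates sit in cells B2 and B3), not the bank’s actual spreadsheet:

Intended: =(B3-B2)/AVERAGE(B2:B3)
Actual: =(B3-B2)/SUM(B2:B3)

Because the average of two numbers is half their sum, the second formula returns a value half as large as the first, which understates how much the rates moved.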

Note: I don’t have domain expertise in VaR models, synthetic credit derivatives, or trading in general. The following example is my over-simplification of the error based on what’s written in the report.

The report talks about hazard rates (for what I assume relate to the default of corporate loans in this case) and how the changes in the hazard rates were improperly calculated. Here’s a simple table from the Google Sheet showing fictitious dates, hazard rates, and the change in rates:

Now here’s what happens when you apply a SUM vs. an AVERAGE to the “Change in %” column:

This is pushing the limits of my knowledge of growth rates and time periods, but the sum of the changes will always be 5X the average of the changes, since there are 5 values we are summing/averaging.
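To make that 5X relationship concrete with made-up numbers (purely illustrative, not the values in the Google Sheet): suppose the five changes in C2:C6 are 1%, 2%, -1%, 3%, and 0%. Then:

=SUM(C2:C6) returns 5%
=AVERAGE(C2:C6) returns 1%

An average is just the sum divided by the number of values, so with 5 rows the SUM is always exactly 5 times the AVERAGE, no matter what the individual changes are.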

The difficulty with detecting this type of formula error

What I find interesting is not the ratio between the SUM and the AVERAGE, but rather the absolute difference between the two. Here is a chart plotting the same data:

Based on this chart, can you estimate the average of the Change in % column? It looks like something around 0%, but 3% wouldn’t feel that far off either. The point I’m trying to make is that unless you are monitoring the SUM and AVERAGE consistently over time, it will be difficult to know whether you made the formula mistake in the first place. Only when outliers show up does it become clear that you might have an error in your model. Here’s the other table from the Google Sheet with intentionally skewed hazard rates:

Here we see the ratio is still 5X, but the absolute difference is much wider. This would cause an analyst to look deeper into the model and try to figure out why there is such a large discrepancy. But that’s only because these hazard rates are fictitious and intentionally skewed. In the case of JPMorgan Chase, my hunch is that the gap between the lower and upper bound of daily hazard rates was really narrow, so detecting a change like this would’ve been very difficult without the proper controls in place.

This reminds me of the tale of the boiling frog:

Urban myth has it that if you put a frog in a pot of boiling water it will instantly leap out. But if you put it in a pot filled with pleasantly tepid water and gradually heat it, the frog will remain in the water until it boils to death. (Source)

Without a really hot pot of boiling water, JPMorgan detected too late that something was wrong with the CDS trades, and the proverbial frog boiled to death.

Hanlon’s Razor

One frame for this egregious Excel error is Hanlon’s Razor:

“Never attribute to malice that which is adequately explained by stupidity”, known in several other forms. It is a philosophical razor which suggests a way of eliminating unlikely explanations for human behavior. (Source)

Perhaps the modeler cannot be blamed for his Excel error because it was an error he had no way of knowing about or predicting. I’m not trying to remove blame from the modeler, but it’s an interesting frame for analyzing the problem, because this is a spreadsheet error that is difficult to prevent unless you have other models and risk controls that can catch this type of error in advance. There are many other cases of Excel errors leading to incorrect calculations that cost firms millions of dollars, and it’s hard to say whether one can blame the modeler for “malice” or plain stupidity.

New intermediate Excel class on Skillshare

Quick plug for a new Excel class I just launched today on Skillshare. It’s an intermediate Excel class for cleaning and analyzing data.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #37: Text manipulation functions to extract domain names from email addresses (Mon, 27 Jul 2020) https://www.thekeycuts.com/dear-analyst-text-manipulation-functions-to-extract-domain-names-from-email-addresses/

The post Dear Analyst #37: Text manipulation functions to extract domain names from email addresses appeared first on .

]]>
In Excel or Google Sheets, text manipulation is usually associated with data cleaning, data cleansing, and data transformation. Sometimes your data is “dirty” and needs to be categorized in a different way, or you need to “extract” a piece of text from another piece of text. In this example, we use a combination of the FIND, RIGHT, and LEN functions to extract the domain name from an email address (e.g. the “tesla.com” from “john.smith@tesla.com”). Here’s the Google Sheet if you want to make a copy for yourself to follow along.

Start with finding the @

The first step is to use the FIND function to find the location of the “@” symbol in the email address. The FIND function takes two required arguments and one optional argument. You’re basically finding the index location of where a character or string exists within the cell:

In the case of “john.smith@amazon.com,” the FIND function would return 11 since the “@” symbol starts at the 11th position within the email address. Pretty simple right?
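Assuming the email address lives in cell A2 (the layout implied by the cell references later in the post), the formula would look something like this, returning 11 for "john.smith@amazon.com":

=FIND("@", A2)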

Nesting LEN inside the RIGHT function

The next part is a little trickier. Now that we know the position of the “@” symbol, we want all the characters after the “@” symbol to get the domain of the email address. There are multiple ways of doing this, but I chose to use the RIGHT and LEN functions. To make this clearer, I could have put the LEN function in its own column, but decided to nest it within the RIGHT function:

The RIGHT function takes two arguments and simply returns the number of characters from the “right” of the text you give it (in this case the email address). Since we don’t know how many characters to pull from each e-mail address, we use the result of the LEN(A2) - B2 formula which tells us how many characters to pull from the right of the email address.

LEN(A2) gives us the length of the entire text (for “john.smith@amazon.com” it’s 21). If we subtract the index position of the “@” symbol from that length, we’ll get the exact number of characters to pull for each unique email address. Pretty nifty.
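Putting that together, and following the cell references used above (email address in A2, position of the “@” symbol in B2), the formula is along these lines:

=RIGHT(A2, LEN(A2) - B2)

For "john.smith@amazon.com" that works out to RIGHT(A2, 21 - 11), which returns the last 10 characters: "amazon.com".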

Note: The “Position of @” column also could’ve been nested in the 3rd column (and replaced the current cell reference of B2).
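In other words, the whole extraction could be collapsed into a single nested formula, something like the sketch below. The helper column just makes each step easier to see and debug.

=RIGHT(A2, LEN(A2) - FIND("@", A2))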

I typically use a combination of FIND, LEN, and MID to extract the text I need from a longer piece of text. Once you master these few functions, you’ll be able to pull anything you want out of a long piece of text to get “clean” data.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

  • The Tim Ferriss Show #444: Hugh Jackman on Best Decisions, Daily Routines, The 85% Rule, Favorite Exercises, Mind Training, and Much More
  • EconTalk: Robert Lerman on Apprenticeships

Dear Analyst #36: What The Economist’s model for the 2020 presidential election can teach us about forecasting (Mon, 13 Jul 2020) https://www.thekeycuts.com/dear-analyst-what-the-economists-model-for-the-2020-presidential-election-can-teach-us-about-forecasting/

The post Dear Analyst #36: What The Economist’s model for the 2020 presidential election can teach us about forecasting appeared first on .

]]>
On a recent episode of The Intelligence, the data editor at The Economist spoke about a U.S. presidential election forecast their publication is working on. I looked more into their model and discuss some of the features and parameters of the model and what makes their forecast unique. Some of the techniques used in The Economist’s model can be used with your own forecasting use cases. To see a summary of The Economist’s model, see this page. Learn more about how the model works on this page.

Source: The Economist

Key takeaways and a caveat

The model utilizes machine learning and multiple data sources and it’s easy to get caught up in the details. Here are the key takeaways as described by Dan Rosenhack, the data editor at The Economist:

  1. Machine learning is used to create equations to predict the 2020 presidential outcome
  2. Polls are not as reliable early in the election cycle
  3. Partisan non-response bias can result in a supporter being more likely or unlikely to respond to a pollster when there is extremely good or bad news about that supporter’s party or candidate

A caveat: The Economist’s model and the various forecasting techniques they use are definitely outside of my knowledge and skillset. Most of this episode is me learning more about the model and interpreting some of the results. You don’t have to be a statistics programmer or data science professional to appreciate what the data team has done at The Economist. If you are working with data in any capacity, pushing yourself to learn about subjects outside your comfort zone will only make you more knowledgeable about the data analysis process.

Fundamentals vs. early polling

One key finding from the model is that polls conducted in the first half of the year during the election cycle are a pretty weak predictor of results. On the other hand, fundamental measures like the president’s approval rating, GDP growth, and whether there is an incumbent running for re-election are much better predictors. This chart shows the difference between poll results and fundamentals for predicting the outcome in 1992:

Source: The Economist

The model primarily relies on these fundamental indicators, but over time the polls become a better indicator for predicting the outcome. In the last week leading up to the election in November, more weight is applied to the polls than the fundamentals.

This visualization below shows that early polls tend to overestimate a party’s share of the vote (in this case the Democratic share) compared to fundamental indicators. As you get closer to election day, however, the polls start to become a better predictor:

Source: The Economist

Overfitting data

One downside The Economist points out with other models that try to forecast the presidential election is that equations are created that overfit to historical data points. Think about it: if you tried to create an equation to predict who would win the NBA championship in 2020 based on 1990s data, you may create an equation that leans heavily to the Bulls. Unfortunately, Michael Jordan isn’t playing anymore and the 2020 NBA season is now being played in a bubble in Orlando.

Had to mention Jordan somewhere in this post 🙂

The Economist utilizes machine learning to better predict the outcome of the presidential election and utilizes two techniques which I’ll try to explain in layman’s terms from reading the post:

  1. Elastic-net regularisation – Simplify the equation you’re using to predict the outcome
  2. Leave-one-out-cross-validation – Split your data into pieces and apply the machine learning to each piece to predict outcomes

#2 is a pretty common technique I’ve seen used in finance. Take actual results and see if you can predict what would’ve happened if you applied your forecast to last quarter or last year.

In the context of the presidential election, let’s say the model is trying to predict what the outcome of the 1948 election would’ve been (the incumbent Harry Truman defeated Thomas Dewey). The training model is done on all the other years of data except for 1948. Then use the learnings from these other years to see which model was best at predicting the outcome in 1948.

State polling

The model also looks at state-level polling data. What’s interesting about the state model is how it uses demographic data like population density and the share of voters that are white evangelical Christians to determine how similar two states are in terms of voter preferences:

Source: The Economist

In the visualization above, Wisconsin is more similar to Ohio than Nevada is to Ohio.

A note about partisan non-response bias

I had never heard this term before, and I think the way the team accounts for this bias makes their model more accurate and unique. They take polling data from major sources like ABC and The Washington Post and track the changes in poll results over time. This means they can account for irregularities in the data, so that large swings in opinion due to news about a candidate don’t impact the model too much.

Looking at the us-potus-model repo

One visualization that caught my eye in the source code The Economist released is this one showing the model results vs. the polls vs. actuals from the 2008, 2012, and 2016 elections. Notice how in 2008 and 2012 the model, the prior, and the result are much closer together than in 2016? It just shows the level of uncertainty that went into the 2016 prediction.

2008

2012

2016

Speaking of uncertainty, I like this commit message from when the team was refining the model back in March:

We have chronic uncertainty.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #35: Analyzing what people dream about with the Shape of Dreams data visualization (Mon, 29 Jun 2020) https://www.thekeycuts.com/dear-analyst-analyzing-what-people-dream-about-with-the-shape-of-dreams-data-visualization/

The post Dear Analyst #35: Analyzing what people dream about with the Shape of Dreams data visualization appeared first on .

]]>
Have you ever wondered what the underlying meaning of your dreams is? Chances are you may have tried Googling something like “What does it mean to dream about [INSERT DREAM].” In The Shape of Dreams, Federica Fragapane answers this very question of what people around the world dream about by using Google Search queries from 2009 to 2019. Federica uses a mix of data storytelling and data visualizations to show what we collectively dream about based on what we search for in Google. The key takeaway: someone on the opposite side of the world probably has similar dreams as you, showing that we are more connected than we think.

Shape of Dreams

Importance of data visualization

Data visualizations are just as important (if not more important) than the number crunching and analysis of the data itself. While Excel and Google Sheets are the standard tools for analyzing data, there are a variety of tools for creating charts and visualizations such as Tableau, Google’s Data Studio, and Microsoft’s own Power BI.

Source: Melting Asphalt

I’ve posted about the power of data visualizations in the past including New York Times’ data bootcamp (that teaches data visualization), data visualizations to model COVID-19, and my own class on creating a data-driven presentation. Creating meaningful data visualizations requires you to understand the technical aspects of aggregating data and actually creating the visualization itself. It also requires the creative side of telling a story around the visualization. Federica does an amazing job of telling a story about the Google Search queries about what we collectively dream about as a society.

Structure of Shape of Dreams

I really like how Federica gives the reader two options: read the story about the data where she takes you through the visualizations with key takeaways and also gives you the ability to explore the data yourself. In the first chapter, she simply shows the most common types of dreams by keyword across different languages:

Who doesn’t dream about their teeth falling off?

When you explore the data, you can use the arrow keys to see the dreams people search for by language and by year which leads to some interesting results:

Varying the types of visualizations

As you go through chapter 2 and chapter 3, you see Federica utilizing different types of visualizations to better tell the story behind the dream Google Searches. A motif she uses across the visualizations is a flower’s petals, and you’re able to interact with the petals in chapter 2. To summarize what I imagine to be an extremely large dataset, we see some general categories of dreams in chapter 2:

Federica discovers that searches in English, Portuguese, and Spanish aggregate up to dreams about animals, family, and relationships.

You’ll see a more traditional time-series chart in chapter 3 showing the popularity of a certain type of dream over time. I’d be curious to see the trend of dreams about “pregnancy” in 2020 given the pandemic:

A network of dreams

My favorite visualization is in chapter 4 where you’ll see a network type of visualization that shows two metrics:

  • Languages that share common searches about dreams
  • The number of dreams in common between languages

We actually use a similar type of visualization at work when we want to see how our customers are related to each other inside an organization (and how they share their Coda docs). What I love about the visualization above is that it shows how connected we are as a society given the same type of dreams we have (and subsequently search for on Google).

Using data to get an edge on human conversations

I also discuss a new podcast I started listening to called Against the Rules by one of my favorite authors, Michael Lewis. The episode is all about how there is research (and there are companies) helping you optimize your conversations with people to get the most benefit out of them. Lewis poses the million-dollar question at the end of the episode: what are the ethics of using this data to optimize all of your conversations in life, from business to romance?

This question is probably getting addressed already at Harvard Business School. Lewis interviews Professor Allison Wood Brooks in the episode, who teaches a class at HBS called How to Talk Gooder in Business and Life. If you don’t have access to these types of classes and resources, will that put you at a disadvantage later on in your career, when negotiating a business deal, or when finding a romantic partner?

Taken to the extreme, this reminds me of this scene from the season finale of Westworld (spoiler alert):

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #34: Trick for finding column index for VLOOKUPs using pride events data (Mon, 22 Jun 2020) https://www.thekeycuts.com/dear-analyst-34-trick-for-finding-column-index-for-vlookups-using-pride-events-data/

The post Dear Analyst #34: Trick for finding column index for VLOOKUPs using pride events data appeared first on .

]]>
This is one of my favorite VLOOKUP tips. Given that it’s pride month, we’ll be applying this tip to a list of all pride events in the United States. Here is the Google Sheet if you want to follow along with this example. Here’s the scenario: you have a super large table in Excel or Google Sheets (by large I mean there are many columns) and you need to do a VLOOKUP on the 25th column. Instead of counting 25 columns from the left of your lookup column, you can use this column index trick to quickly get the column you’re after.

Creating column indexes above your lookup table

In the screenshot above, you’ll notice that each column has the column index above it. This is a simple formula that takes the previous column index and adds 1:
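As a sketch (assuming the indexes sit in row 1 above the headers, the layout described later in the post): put a hard-coded 1 in A1, then enter the formula below in B1 and copy it across the row, so each column’s index is the previous one plus 1.

=A1+1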

This might feel a little strange because we’re used to having the column headers in the first row of our table. By having this column index above the column header, however, it becomes easier to provide the col_index parameter your VLOOKUP formula needs. In this list of pride events, if I want to get the Start column pulled into my VLOOKUP formula, I simply reference the column index above the column header instead of writing out the number “5” (note that PrideEvents is a named range representing A2:E270 in my list of pride events):
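For example (the cell addresses here are illustrative, not from the actual sheet): if the event name you’re looking up is in G2 and the index above the Start column’s header sits in E1, the formula might look like:

=VLOOKUP(G2, PrideEvents, $E$1, FALSE)

Referencing $E$1 instead of typing 5 means you never have to count columns by hand.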

Putting the column index above your new column headers

In this second example, I put the column index above the new table where I want to pull in data from my list of pride events. Notice that the order of columns I want to pull does not match the column order in my lookup table. The trick here is that I use a simple cell reference to the column index above the main table, so I know that the order of the columns I want to pull back in this case is 3, 5, 2:

One of the benefits of this trick is that you can move columns around in your lookup table and this VLOOKUP formula will still work, as long as you “reset” the column indexes above your lookup table column headers so they stay sequential (1, 2, 3, etc.). This is kind of annoying because any time I switch columns around, I have to re-drag the previous-cell-plus-1 formula in row 1 where my column indexes are. Hopefully your columns aren’t moving around too much and this solution works for you.

Using the MATCH() function to find the column index

This is a little more advanced, but another solution is to use the MATCH function to match the column name in your new table with the column names in your lookup table:

Instead of doing a simple reference to the column index in that first row of my new table, I have this MATCH function which tries to match Location, in this case, with the column headers in the lookup table ($A$2:$E$2 represents the column headers from my list of pride events). If it finds a “match,” the MATCH function returns the column index. You could actually do this without having that column index above your new table columns by putting the MATCH function directly in your VLOOKUP formula, but that might make the formula more difficult to debug in the future.
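A minimal sketch of that MATCH formula (the header range comes from the post; the hard-coded "Location" stands in for a reference to your new table’s header cell):

=MATCH("Location", $A$2:$E$2, 0)

The 0 asks for an exact match, and the result is the position of "Location" within the lookup table’s headers, which is exactly what VLOOKUP’s third argument needs.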

Pride Easter egg in Google Sheets

To celebrate pride month, here’s a fun Easter egg you’ll find in Google Sheets if you type out “PRIDE” in separate columns (you’ll also see this in the Google Sheets example for this blog post):

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
