Dear Analyst #45: Thinking long-term for structuring your dataset using U.S. public food assistance data
Podcast: Play in new window | Download
Subscribe: Spotify | TuneIn | RSS
When you need to capture some data in a structured way, you’ll open up an Excel file or Google Sheet and just start throwing data into the spreadsheet. Not much thinking; just copy and paste. As that dataset grows, the original structure you had set up for that spreadsheet may not be ideal. Specifically, the dataset is not ideal for putting into a PivotTable. Long-term, I’d argue that all your spreadsheets should be structured in a way that’s suitable for a PivotTable (which makes it ready for storing in a traditional database). This post explores how you can structure a dataset that looks like 99% of data out there into a structure you can analyze in a PivotTable. Link to the Google Sheet is here.
Video walkthrough of Google Sheet:
Why this is important
Telling someone that their data should be structured is a platitude like “such is life” and “forgive and forget.” Let’s be more specific in how this statement can impact your work.
To be specific: 9 times out of 10, structure your data so that it can always be analyzed in a PivotTable.
Consider this scenario:
- Your accounting team needs your group to start forecasting expense for next month’s budget
- You start gathering the data and throw it into a spreadsheet
- Every month new data gets added to the spreadsheet, and perhaps the CFO wants to get more granular analyses on the forecast
- You start adding additional columns to the spreadsheet and perhaps summary tables in other sheets in the file
- Other teams now need to see your data to understand how your team’s decisions will impact their decisions
- This spreadsheet ends up being too hard to maintain, so there’s an internal project to put this data into a real database (some ERP solution)
- One quarter of planning goes by, and another quarter for implementation
- 6 months later, the business has changed, the structure of the database needs to be adjusted, and the data engineer role still needs to be filled
This concocted scenario is quite extreme, but the key lesson is this:
Focusing on the schema and structure of your spreadsheet today takes time and requires you to think about how your data will be used and maintained in the future.
U.S. public food assistance dataset
I’ve started browsing Kaggle’s to find interesting datasets recently, and this one caught my attention since it looks at spending and household participation related to a public food assistance program called SNAP. As the creator of the dataset discusses, there are many issues with collecting government datasets. Data is spread out across different agencies, there are multiple formats, and data is sometimes aggregated. This makes consolidating the data a pain. These problems may sound familiar if you’re working at a large organization.
The “Raw” sheet in the Google Sheet simply shows the cost, households participating, and total people associated with the SNAP program for the 2019 fiscal year across four states (CA, IL, LA, NY):
In your organization, this could be sales data, headcount data, COGS, whatever. The key thing about this dataset is that you have all the numbers organized by month across the top. This table would be great for a simple time series analysis where you may want to see the cost per household for California over time. But what if you need to build out a more dynamic dashboard looking at various metrics for just a few months or a subset of states?
Pivoting this data
If you create a PivotTable with this data, you’ll run into this issue of having to select individual month names to put into the Values section of the PivotTable builder. We only have twelve months of data for FY19, imagine if this we had to this for ten years worth of data going back to FY09.
Some people asked me about what a “denormalized dataset” means in the context of Excel/Google Sheets I mentioned this term in the previous episode. We need to “denormalize” this data so that it’s easier to pivot off of. This means putting in data that may repeat itself in a certain column, but this helps with structuring the data properly for a PivotTable.
In Excel, there is hacky way of denormalizing your data, and it involves going through the antiquated PivotTable wizard (which I believe you can only access via old Excel keyboard shortcuts). I don’t think the PivotTable wizard is available in the ribbon in recent versions of Excel.
This video below shows you how to do it. It involves checking a radio button for “Multiple consolidation ranges” and then double-clicking in the grand total of the sum of Values in the PivotTable. It’s not pretty, but it works:
Unfortunately for Google Sheets users, that PivotTable wizard isn’t available. If you find a similar workaround let me know.
Moving time periods to rows in Google Sheets
Whenever you see time periods (in this case, months in 2018 and 2019) organized across the columns, think about how you can put those time periods into one column. This starts the process of denormalization. You want something that looks like this:
When you pivot off of the Period
column in the PivotTable, you can then filter for and group your values by specific dates:
Moving metrics from rows to columns
In the original data set, there’s a Metric
column which contains metrics we care about for each state (Cost, Households, and Persons). This structure will make a PivotTable very hard to organize and analyze because you will have to filter for a specific metric in order to get any meaningful statistics from your dataset. Additionally, this structure is mixing data types (e.g. Cost is in dollars and Households is a number).
Whenever you see metrics organized in this manner, think about moving each individual metric to its own column:
Now, each of these columns is a value you can drag and drop into the “Values” section of the PivotTable. This means you can get summary results or drill down into a specific state’s numbers:
Transposing the data
Setting up the data structure to look like the structure in the “Solution” sheet of the Google Sheet does take a little spreadsheet gymnastics. The easiest method I’ve found is to apply the TRANSPOSE
function to the original dataset and then do some copying/pasting. Here’s what a TRANSPOSE
looks like:
The nice thing about this function is that it puts all your time periods (months in this case) into its own column. Each metric also is organized in a top-to-down fashion. The problem is that each state’s data is still organized across the top. At this point, you’re doing a copy and paste to consolidate the 13 columns that result from the TRANSPOSE
function into the 5 columns we ultimately care about: State
, Period
, Cost
, Households
, and Persons
.
Setting things up for a database
You may be wondering what other benefits there are for having this data structure besides the ease of creating a PivotTable. If your data ever ends up in a regular database (e.g. SQL), this is the ideal data structure for that tool.
I’ve seen scenarios at different organizations where an Excel file or Google Sheet has hundreds of thousands of rows that represent critical business data cobbled together over time. There comes a point in time from an organizational perspective where that data needs to be put into a database for ease of querying. A data engineer will have to do some data manipulation or run an ETL process to convert the data into a suitable format for a database. Guess what? You can help your data engineer out by getting this structure correct from day one.
Data down good :), data right bad 🙁
To summarize how your data should “grow” over time (big data ain’t going nowhere), your data should NOT grow right:
Instead, it should grow down:
Other Podcasts & Blog Posts
In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
- Acquired Podcast Season 7 Episode #3: Epic Games
- People I (Mostly) Admire Ep #2: Mayim Bialik
Trackbacks/Pingbacks
[…] your organization uses. If you have control over your data pipeline, this means you can setup your data structures for long-term success (see episode […]