Dear Analyst
https://www.thekeycuts.com/category/podcast/
A show made for analysts: data, data analysis, and software. This is a podcast made by a lifelong analyst. I cover topics including Excel, data analysis, and tools for sharing data. In addition to data analysis topics, I may also cover topics related to software engineering and building applications. I also do a roundup of my favorite podcasts and episodes.

Dear Analyst #122: Designing an online version of Excel to help Uber China compete with DiDi on driver incentives with Matt Basta
https://www.thekeycuts.com/dear-analyst-122-designing-an-online-version-of-excel-to-help-uber-china-compete-with-didi-on-driver-incentives-with-matt-basta/
Mon, 04 Dec 2023

There are only so many ways to make Excel “fun.” If you’ve been following this blog/podcast, you know that stories about the financial modeling competition and spreadsheet errors that lead to catastrophic financial losses are what make a 1980s tool somewhat interesting to read and listen to. There are also numerous tutorials and TikTok influencers who teach Excel for those who are actually in the tool day in and day out. Meet Matt Basta, a software engineer by trade. He published a story on his own blog called No sacred masterpieces which is worth reading in its entirety, as it’s all about Excel. In this episode, we discuss highlights from Matt’s time at Uber, how he built a version of Excel online to help Uber China compete with DiDi, and how Uber completely scrapped the project weeks later after DiDi acquired Uber China.

Business intelligence at Uber through the eyes of a software engineer

I don’t normally speak with software engineers on the podcast, but Matt’s story from his time at Uber will resonate with anyone who works at a high-growth startup and lives in Excel. Matt’s story has everything: tech, cutthroat competition, drama, and of course, Excel.

Matt has worked at a variety of high-growth startups like Box, Uber, Stripe, and now Runway. He joined Uber in 2016 and worked on a team called “Crystal Ball,” which was part of the business intelligence organization. The goal of this team was to create and develop a platform that analysts and business folks could use to figure out how much to charge for rides, how much to offer in driver incentives, etc. All the core number crunching that makes Uber run.

As per Matt’s blog post, employees were working on one of two major initiatives at Uber in 2016:

  1. Redesigning the core Uber app
  2. Uber China

As Matt told his story, it reminded me of all the news articles that came out in 2016 about Uber’s rapid expansion in markets like China. The issue is that a large incumbent existed in China: DiDi. This comes up later in Matt’s story.

Getting data to the city teams to calculate driver incentives

From the perspective of the Crystal Ball team, all they wanted to do was set up a data pipeline so that data about the app could be shared with analysts. Analysts would then download these files and crunch numbers in R, a process that would take hours. In 2016, Uber was competing directly with DiDi to get drivers on the platform. The city team would use the data provided by the Crystal Ball team to figure out how much of an incentive to offer a driver so that the driver would choose to drive with Uber instead of DiDi for that ride.

Source: Forbes

The problem was that the city team in China was using these giant Excel files that would take a long time to calculate. In order to compete with DiDi, Uber China would need a much faster way to calculate the incentives to offer drivers. This is where Matt’s team came in.

The only other “tool” the city team had at their disposal was the browser. The city team still wanted the flexibility of the spreadsheet, so Matt’s team’s strategy was to put the spreadsheet in the browser. At this point, you might be wondering how in the world this became the solution to the problem. Matt’s blog post goes into much more detail on the stakeholders, constraints, and variables that led his team to go in this direction.

Luckily, Matt had worked on a similar tool while at Box, so he re-used code from that previous project. During his time at Box, Box had Box Notes and Dropbox had Dropbox Paper. Both of these products were based on the open source tool Etherpad for real-time collaborative document editing. Matt thought, why not build something similar for spreadsheets?

Source: Dropbox

Discovering nuances about Excel

In the blog post, Matt talks about discovering Excel’s circular references. We all know that circular references can break your models, but Excel’s calculation engine can also recalculate a circular reference repeatedly (iterative calculation) to see whether the computed value of the cell converges. I think this is how the Goal Seek feature works in Excel to a certain extent.

Source: Microsoft
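
To make the idea of iterative calculation concrete, here’s a minimal sketch in Python (not Excel’s actual engine) of re-evaluating a self-referencing formula until it stops changing or hits an iteration cap. The formula, cap, and tolerance below are made up for illustration.

```python
def iterate_circular(formula, initial=0.0, max_iterations=100, tolerance=1e-6):
    """Repeatedly re-evaluate a self-referencing formula until it converges.

    Roughly analogous to Excel's iterative calculation setting, which caps the
    number of passes and stops early once the change falls below a threshold.
    `formula` takes the cell's previous value and returns its next value.
    """
    value = initial
    for _ in range(max_iterations):
        new_value = formula(value)
        if abs(new_value - value) < tolerance:
            return new_value  # converged
        value = new_value
    return value  # hit the iteration cap without converging


# Example: a cell whose formula references itself, e.g. =0.5*A1 + 10.
# The loop converges to the fixed point, 20.
print(iterate_circular(lambda prev: 0.5 * prev + 10))
```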

When Matt’s online version of Excel was released internally, the head of finance was upset since you could see how the formulas were calculated in the tool. From Matt’s team’s perspective, they did what they were supposed to do: they put Excel in the browser and figured you should be able to see the formulas in the cells.

According to the head of finance, there were spies from DiDi who would apply for internships at Uber China just to get competitive data. Needless to say, Matt removed the ability to see formulas in his tool.

DiDi buys Uber China

Matt and the Crystal Ball team spent 6 months helping the Uber China team with their data needs. Internally, Matt’s team didn’t get an all-hands invite or anything regarding the acquisition of Uber China by DiDi. People just found out through the news. Eventually, then-CEO of Uber Travis Kalanick sent out a message regarding the acquisition. Matt’s tool was scrapped immediately.

Matt open-sourced the code for this WebSheets tool and the calculation engine lives on GitHub here. We chatted about the feedback Matt has received about his blog post, and you can see the comments on HackerNews. As usual, there are people chiming in saying Matt could’ve done this or that better. Whenever there is a mention of Excel on HackerNews, you’ll inevitably see people talking about how billions of dollars of their company’s business still runs off of someone’s Excel file. Interestingly, one of the resources Matt used to learn about Excel is Martin Shkreli’s YouTube channel where Shkreli walks through building out a financial model. Putting aside misgivings about Shkreli’s character, the videos are actually super educational:

Excel’s fast feedback loop

This is where Matt’s story turns into takeaways and learnings that make it more than a story about Uber China and Excel. Matt built something from scratch and had to come to terms with the fact that it no longer had a business purpose. The tool is just a way to achieve the business objective. If the business objective changes, then the tool may become obsolete.

Hearing Matt’s perspective about Excel was quite refreshing since, prior to this Crystal Ball project, he wasn’t an analyst in the weeds of Excel every day. However, he worked with said analysts every day to understand their requirements and, more importantly, why they were so tied to Excel. Excel allows you to create a fast feedback loop to test an idea or an assumption. The reason the city team stuck with Excel and put up with the hours of calculation time is that building similar functionality with code would’ve been too difficult.

Founders will use Excel before writing code.

To the analysts and data scientists Matt worked with, writing formulas was their version of programming. Unlike traditional programming, Excel users don’t have to develop unit tests, build integrations, and deal with piping data in/out. Another interesting tidbit Matt brought up about the internal workings of the city team at the time is that there was no expectation that a given Excel file would live for more than a week. Each file would solve a specific problem at that point in time, and then get discarded as it too became obsolete.

Planning and forecasting on IBM software

Following this Crystal Ball project, Matt started working on the financial engineering team within Uber. His next project was figuring out how much revenue Uber would make in 2017. The tool they used was IBM TM1, which you can think of as a self-hosted alternative to Anaplan. I had never heard of this tool from an FP&A perspective, but my guess is that it’s similar to Oracle Hyperion (the tool I used back in the day).

Source: Lodestar Solutions

There were analysts working with this tool who would turn Excel spreadsheet data into TM1 code for planning purposes. The problem is that TM1 code is not strongly typed, so analysts would constantly break the tool when trying to write code for it. TM1 was originally created by one person, and the platform was later acquired by IBM. Uber even invited one of TM1’s chief architects to talk to Uber’s analysts about the tool. According to the creator of TM1, Manny Perez, TM1 was the first “functional database” in the 1980s and exploited in-memory computing. Apparently there’s a cult following around Manny and the creation of TM1. So much so that a documentary was released a few years ago aptly named Beyond the Spreadsheet: The Story of TM1:

Not gonna lie, this seems like a super interesting documentary given that the foundation of the story discusses spreadsheets at length. How about this description from the film’s website to spark some excitement around corporate planning software:

But as long ago as 1983, a light-bulb idea went off in the head of an employee at oil distributor Exxon. Manny Perez realized he could give business users the freedom to create at scale but also the control and collaboration prevalent in other technologies today. He thought his solution to the problem was so elegant and obvious, it would become instantly ubiquitous. It didn’t. To achieve his ultimate aims, he would need to pioneer and master many facets of technology, staying true to the spirit of user freedom whilst battling waves of competitors selling solutions that enriched themselves but not their customers. Eventually, with thousands of companies globally using his solution, and with a passionate community of followers, his inspiration and perspiration was validated when IBM acquired his technology in 2008.

Source: tm1.film

Back to Matt’s work with TM1. His goal was to make it easier for analysts to work with the software. He built a programming language on top of what the analysts were coding. The new language had type inference and checking to prevent errors from occurring in TM1.
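
Matt’s actual language isn’t shown in the episode, so here’s only a rough sketch (in Python, with an invented expression format and type rules) of the kind of check a typed layer can run before code ever reaches a planning system like TM1: infer a type for each expression and reject mismatches early.

```python
# A toy type checker for simple planning expressions, purely illustrative.
# Expressions are nested tuples like ("+", ("num", 5), ("str", "FY17")).

def infer(expr):
    """Return the inferred type of an expression or raise a TypeError."""
    tag = expr[0]
    if tag == "num":
        return "number"
    if tag == "str":
        return "string"
    if tag in ("+", "-", "*", "/"):
        left, right = infer(expr[1]), infer(expr[2])
        if left != "number" or right != "number":
            raise TypeError(f"'{tag}' expects numbers, got {left} and {right}")
        return "number"
    raise TypeError(f"unknown expression tag: {tag}")


# Caught before it ever hits the planning system:
try:
    infer(("+", ("num", 5), ("str", "FY17")))
except TypeError as err:
    print(err)  # '+' expects numbers, got number and string
```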

Tips for Excel users

Given Matt’s extensive experience building on top of Excel and working with analysts all day at Uber, I thought it would be interesting to get the tips he has for us Excel users. A key question worth pondering is when the business evolves to a point where Excel no longer makes sense as the tool of record. I’m sure many of you have worked with files that handle business-critical processes at your company and have wondered: this data should probably live in a database or something more secure than Excel.

Source: KaiNexus Blog

Realistically, moving the data and process off of Excel involves a team of engineers writing code where everything is hosted on a server. The resourcing required for this speaks to the speed and immediacy of Excel’s value when your team needs to work fast. Should your team go down this route and write code instead of spreadsheets, Matt encourages all analysts to do one thing: provide good documentation.

This helps with the migration process when you have to work with a team of engineers. Tactically, this can mean something as simple as adding a comment to a cell in your file, leaving notes in the cell itself, or even creating a text box with the notes in it. How many times have you inherited a file and spent hours spelunking around trying to figure out how it was constructed? Good documentation helps everyone.
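
If the documentation lives in the workbook itself, you can even add cell comments programmatically. Here’s a small sketch using the openpyxl library; the cell, value, and note text are placeholders.

```python
from openpyxl import Workbook
from openpyxl.comments import Comment

wb = Workbook()
ws = wb.active

# Leave a note on the cell that drives a key calculation so the next analyst
# (or the engineering team doing a migration) knows where the number comes from.
ws["B2"] = 125000
ws["B2"].comment = Comment(
    "Weekly incentive budget. Source: finance export, refreshed every Monday.",
    "Data Team",
)

wb.save("driver_incentives_documented.xlsx")  # placeholder filename
```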

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #121: Fabricating and skewing Excel survey data about honesty with behavioral economists Dan Ariely and Francesca Gino
https://www.thekeycuts.com/dear-analyst-121-fabricating-and-skewing-survey-data-about-honesty-with-behavioral-economists-dan-ariely-and-francesca-gino/
Mon, 23 Oct 2023

One of the more popular courses you could take at my college to fulfill the finance major requirements was Behavioral Finance. The main “textbook” was Inefficient Markets, and we learned about qualitative ways to value a security beyond what the efficient market hypothesis purports. During the financial crisis of 2008, psychology professor and behavioral economist Dan Ariely published Predictably Irrational to much fanfare. The gist of the book is that humans are less rational than economic theory tells us. Armed with the knowledge that humans are irrational (what a surprise) when it comes to investing and other aspects of life, a capitalist would try to find the edge in a situation and turn a profit. That is, until recent reports surfaced showing that the results of Dan Ariely’s experiments were fabricated (Ariely partially admits to it). This episode looks at how the data was potentially fabricated to skew the final results.

Dan Ariely. Source: Wikipedia

Background on the controversy surrounding Dan Ariely’s fabricated data

In short, Ariely’s main experiment coming under fire is one he ran with an auto insurance company. The auto insurance company asks customers to provide odometer readings. Ariely claims that if you “nudge” the customer first by having them sign an “honesty declaration” at the top of the form saying they won’t lie on the odometer reading, they will provide more accurate (higher) readings.

I was a fan of Predictably Irrational. It was an easy read, and Ariely’s storytelling in his TED talk from 15 years ago is compelling. I first heard that Ariely’s experiments were coming under scrutiny from this Planet Money episode called Did two honesty researchers fabricate their data? The episode walks through how Ariely became a thought leader and used his status to get paid behavioral economics consulting gigs and to give talks. Apparently the Israeli Ministry of Finance paid Ariely to look into ways to reduce traffic congestion. In the Planet Money episode, they talk about how other behavioral scientists like Professor Michael Sanders applied Ariely’s findings in a project with the Guatemalan government aimed at encouraging businesses to accurately report taxes. Sanders was the one who originally questioned the efficacy of Ariely’s findings. Here is part of the abstract from the paper Sanders wrote with his co-authors:

The trial involves short messages and choices presented to taxpayers as part of a CAPTCHA pop-up window immediately before they file a tax return, with the aim of priming honest declarations. […] Treatments include: honesty declaration; information about public goods; information about penalties for dishonesty, questions allowing a taxpayer to choose which public good they think tax money should be spent on; or questions allowing a taxpayer to state a view on the penalty for not declaring honestly. We find no impact of any of these treatments on the average amount of tax declared. We discuss potential causes for this null effect and implications for ‘online nudges’ around honesty priming.

Professor Michael Sanders

If you want to dive deeper into Dan Ariely’s story, how he rose to fame, and the events surrounding this controversy, this New Yorker article by Gideon Lewis-Kraus is well researched and reported. NPR also did a podcast episode about this a few months ago. This undergraduate student only has one video in his YouTube account, but it tells the story about Ariely quite well:

Instead of discussing Ariely’s career and his character, I’m going to focus on the data irregularities in the Excel file Ariely used to come up with the findings from the auto insurance experiment. This podcast/newsletter is about data analysis, after all.

Instead of dissecting the Excel file myself, I’m basically going to re-hash the findings from this Data Colada blog post. Data Colada is a blog run by three behavioral scientists: Uri Simonsohn, Leif Nelson, and Joe Simmons. Their posts demonstrate how “p-hacking” is used to massage data to get the results you want.

Irregularity #1: Uniform distribution vs. normal distribution of miles driven

This is the raw driving dataset from the experiment (download the file here). Each row represents an individual insurance policy and each column shows the odometer reading for each car in the policy before and after the form was presented to the customer.

The average number of miles driven per year, irrespective of this experiment, is around 13,000. In this dataset, you would expect to see a lot of numbers around 13,000, a few numbers below 1,000, and a few numbers above 50,000 (as an example). This is what a normal distribution, or bell curve, looks like:

Source: Math Is Fun

In Ariely’s dataset, there is a uniform distribution of miles driven. This means the number of people driving 1,000 miles per year is similar to the number who drove 13,000 miles/year and the number who drove 50,000 miles/year.

Source: Data Colada

No bell curve. No normal distribution. This by itself makes the dataset very suspect. One could argue that the data points were cherry-picked to massage the data a certain way, but the other irregularities will show that something more sinister was at play. You’ll also notice in the chart created by Data Colada that the data abruptly stops at 50,000 miles per year. Although 50,000 miles driven per year is a lot, it’s highly unlikely that there are no observations above 50,000.
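
If you want to run this kind of sanity check yourself, here’s a rough sketch on simulated numbers (not Ariely’s actual data): compare a plausibly real, roughly normal sample of annual mileage against a uniformly generated one and look at how the values spread across bins.

```python
import random
from collections import Counter

random.seed(42)

# Simulated data, purely for illustration: a "plausible" roughly normal sample
# centered near 13,000 miles/year vs. a uniform sample capped at 50,000.
plausible = [max(0, int(random.gauss(13_000, 5_000))) for _ in range(10_000)]
suspicious = [random.randint(0, 50_000) for _ in range(10_000)]

def bin_counts(values, width=10_000):
    """Count observations per mileage bin, e.g. 0-10k, 10k-20k, ..."""
    return Counter((v // width) * width for v in values)

print("plausible :", sorted(bin_counts(plausible).items()))
print("suspicious:", sorted(bin_counts(suspicious).items()))
# The plausible sample piles up around 10k-20k; the uniform one has roughly
# equal counts in every bin and stops dead at 50,000, which is the shape
# Data Colada flagged in the reported dataset.
```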

Irregularity #2: Mileage reported after people were shown the form is not rounded (and RANDBETWEEN() may have been used)

People in the experiment were asked to recall their mileage driven and write the number on a piece of paper. If you were to report a large number like this from memory, you’d probably round it to the nearest 100 or 1,000. In the screenshot below, you’ll see that some of the reported mileages are indeed rounded. What’s peculiar is that the mileage reported after people were shown the form (Column D) was generally not rounded at all:

Did these customers all of a sudden remember their mileage driven down to the single digit? Highly suspect. Data Colada suggests that the RANDBETWEEN() function in Excel was used to fabricate the mileage in Column D. The reasoning is that RANDBETWEEN() returns arbitrary integers in a range, so the values it produces won’t be rounded to the nearest 100 or 1,000 the way human-reported numbers tend to be.

Even the numbers in Column C (mileage reported before people were shown the form) seem suspect given how precise most of them are. If Ariely or members of his lab did in fact use RANDBETWEEN() to generate the mileage in Column D, they could’ve at least tried to hide it better using the ROUND() function, which would let them round the numbers to the nearest 100 or 1,000. This is just pure laziness.

This chart from Data Colada further shows how the last digit in the baseline mileage (before people were shown the form) is disproportionately 0. This supports the idea that those numbers were genuinely reported by customers, who tend to round. The last digit in the updated mileage (after people were shown the form) again has a uniform distribution, further adding to the evidence that the numbers were fabricated.

Source: Data Colada
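
Here’s a minimal sketch of that last-digit test, again on simulated numbers rather than the actual dataset: human-style reports that are often rounded skew heavily toward a trailing 0, while uniformly generated integers spread evenly across 0 through 9.

```python
import random
from collections import Counter

random.seed(7)

def last_digit_distribution(values):
    """Count how often each final digit (0-9) appears."""
    return Counter(abs(int(v)) % 10 for v in values)

# Simulated "human-reported" mileage: most people round to the nearest 100.
human_reported = [
    round(random.gauss(13_000, 5_000), -2) if random.random() < 0.8
    else random.gauss(13_000, 5_000)
    for _ in range(10_000)
]

# Simulated fabricated mileage: RANDBETWEEN-style uniform integers.
fabricated = [random.randint(0, 50_000) for _ in range(10_000)]

print("human     :", sorted(last_digit_distribution(human_reported).items()))
print("fabricated:", sorted(last_digit_distribution(fabricated).items()))
# The human-style sample's last digit is overwhelmingly 0; the fabricated
# sample's last digits are close to uniform, matching the pattern in the
# Data Colada chart.
```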

Irregularity #3: Two fonts randomly used throughout Excel file

This is by far the most amateur mistake when it comes to judging the validity of any dataset. When you open the Excel file, something instantly feels off about the data. That’s because half of the rows are in Calibri font (the default Excel font) and the other half are in Cambria font (another font that ships with Office).

Were some of the rows copied and pasted from another Excel file into the main file and then sorted in some fashion? Did someone incorrectly select half the data and set it to Cambria?

According to Data Colada, the numbers probably started out in Calibri and the RANDBETWEEN() function was used again to generate a number between 0 and 1,000 to be added to the number in Calibri. The resulting number is in Cambria:

Source: Data Colada

To recap what the data hacking looks like with this irregularity:

  1. 13,000 baseline car readings are composed of Calibri and Cambria font (almost exactly 50/50)
  2. 6,500 “accurate” observations are in Calibri
  3. 6,500 new observations were fabricated in Cambria
  4. To mask the new observations, a random number between 0 and 1,000 was added to the original numbers in Calibri to form the fabricated numbers in Cambria

In the screenshot above, this pattern of the Cambria number being almost identical to the Calibri number is what leads Data Colada to believe that the Cambria numbers (half the dataset) are fabricated.
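
Here’s a rough sketch of how you could test for that pairing pattern yourself, assuming you had the values split out by font (the values below are made up): for each Cambria number, check whether some Calibri number sits at most 1,000 below it.

```python
# Illustrative check for the "Cambria = Calibri + RANDBETWEEN(0, 1000)" pattern.
# The values below are made up; in practice you'd read them out of the workbook
# along with each cell's font.

calibri_values = [12_340, 8_905, 27_150, 3_480, 45_010]
cambria_values = [12_919, 9_517, 27_898, 3_921, 45_733]

def has_calibri_parent(cambria_value, calibri_values, max_offset=1_000):
    """True if some Calibri value sits within [value - max_offset, value]."""
    return any(0 <= cambria_value - c <= max_offset for c in calibri_values)

matches = sum(has_calibri_parent(v, calibri_values) for v in cambria_values)
print(f"{matches}/{len(cambria_values)} Cambria values pair with a Calibri value")
# If nearly every Cambria value has a Calibri "parent" within 1,000 below it,
# that's strong evidence the second half of the data was derived from the first.
```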

To put the cherry on top of this font irregularity, very few of the numbers in Cambria font are rounded. As discussed in irregularity #2 above, using RANDBETWEEN() without ROUND() produces numbers that aren’t rounded. Not having rounded numbers is, again, highly suspicious when you consider that these mileage numbers were reported by humans, who tend to round large numbers.

Source: Data Colada

Why did Ariely allegedly fabricate the numbers?

Easy. Fame, notoriety, and consulting gigs. Again, I’d read the New Yorker piece to learn more about Ariely’s background and character. The narrative Ariely wanted to tell was that nudges have an outsize impact on behavior, and the data was skewed to prove this.

Source: Resourceaholic

Ariely actually acknowledged Data Colada’s analysis and basically responded with “I’ll check my data better next time” over email. The New Yorker article raises the possibility that someone at the auto insurance company fabricated the data before it was sent to Ariely, which means Ariely can claim he had no hand in fabricating the data.

Even if that were the case, wouldn’t you at least scroll through the dataset and notice, I don’t know, that the data is in two different fonts? Your future TED talks, published books, and paid consulting gigs are dependent on your findings from this Excel file and you don’t bother to check its validity? The file is just over 13,000 rows long, so it’s not even that huge of a dataset. While not on the same scale, this narrative feels similar to what happened with Theranos. Similar to Elizabeth Holmes, Ariely claims he can’t recall who sent him datasets or how the data was transformed (as reported in the New Yorker).

Excel mistakes are different from fabricating data

I’ve dissected a few Excel blunders on the podcast such as the error that led to a $6.2B loss at JPMorgan Chase, Enron’s spreadsheet woes, the DCF spreadsheet error leading to a mistake with a Tesla acquisition, and many others. In these cases, the pilot simply misused the instrument which led to a massive mistake.

With the fabricated data in Ariely’s experiment, Ariely, members of his lab, or someone at the auto insurance company knowingly massaged the data with the intention of not getting caught. Better auditing or controls cannot prevent data dredging of this magnitude.

Perhaps Ariely (or whoever fabricated the data) knew that if they could tell the narrative that “nudging” does indeed lead to changes in human behavior, there would be a sizeable financial payout somewhere down the line.

Source: GetYarn

Blowing the whistle on Ariely

In the Planet Money episode referenced earlier, Professor Michael Sanders is credited with first calling bullshit on Ariely’s findings after his own failed project with the Guatemalan government. Data Colada’s blog post really made clear what issues existed in Ariely’s spreadsheet.

Data Colada kind of reminds me of the European Spreadsheet Risks Interest Group (EuSpRIG), a group of individuals who document all these Excel errors in the hopes that analysts won’t make the same mistakes. By detailing Ariely’s spreadsheet tactics, hopefully it will be easier to spot issues like this in the future.

The New Yorker article shows that it’s hard to evaluate the true intentions of each party in this case. It’s easy to point fingers at Ariely and say he committed spreadsheet fraud for his own personal gain. But what about Data Colada? While the behavioral scientists behind the blog seem like upstanding citizens, who knows what benefit they stand to gain from uncovering these issues and calling out fraud? Simmons, Nelson, and Simonsohn also get their share of the limelight in this recent WSJ article highlighting the impact of the group’s research.

Leif Nelson, Uri Simonsohn, and Joe Simmons. Source: WSJ

Like Ariely, maybe more consulting gigs get thrown their way based on their ability to take down high profile authors and scientists? Remember when Hindenburg Research came out with the hit piece on Nikola leading to the resignation of the CEO? Not only did Hindenburg stand to gain from short-selling the stock, they also drew more attention to their investment research services. They also probably got more inbound interest from people who have an axe to grind with some other company CEO and want to take down the company.

Open source wins the day

I’ve been a fan of open source ever since I got into software because, well, the whole fucking Internet runs on it. One of my favorite data cleaning tools (OpenRefine) is completely free to use and is just as powerful as Microsoft Power Query for cleaning data.

Source: Rocket.Chat

The beautiful thing about open source is that anyone can analyze and investigate how the code really works. There is no narrative about what the tool or library can do. These same values should also be applied to researchers and scientists. I really like how the Data Colada team ended their post on Ariely’s spreadsheet issues:

There will never be a perfect solution, but there is an obvious step to take: Data should be posted.  The fabrication in this paper was discovered because the data were posted. If more data were posted, fraud would be easier to catch. And if fraud is easier to catch, some potential fraudsters may be more reluctant to do it. Other disciplines are already doing this. For example, many top economics journals require authors to post their raw data. There is really no excuse. All of our journals should require data posting. Until that day comes, all of us have a role to play. As authors (and co-authors), we should always make all of our data publicly available. And as editors and reviewers, we can ask for data during the review process, or turn down requests to review papers that do not make their data available. A field that ignores the problem of fraud, or pretends that it does not exist, risks losing its credibility. And deservedly so.

Hopefully this episode nudges you in the right direction.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Source: Rachel E Cinelli

Dear Analyst #120: Marketing attribution, sensitivity models, and building data infrastructure from the ground up with Zach Wilner
https://www.thekeycuts.com/dear-analyst-120-marketing-attribution-sensitivity-models-and-building-data-infrastructure-from-the-ground-up-with-zach-wilner/
Mon, 09 Oct 2023

Data analytics and business analytics are still relatively new areas of study (in terms of academics). The subject borders business and computer science. When I went to school, the only data analytics classes available were special electives offered through our school’s continuing education department. In this episode, I spoke with Zach Wilner, who currently leads data and analytics at Pair Eyewear. Zach is “classically trained” in data analytics (if one can call it such) since he studied business analytics at Boston College. He worked at various DTC (direct-to-consumer) companies like Wayfair and Bombas before landing at Pair (also a DTC company). In addition to discussing marketing attribution and pricing projects, Zach also talks about building Pair Eyewear’s data infrastructure from 0 and how to build the team around it.

Scaling a data stack in a step-wise approach

When Zach joined Pair, there wasn’t really much of a data infrastructure in place. People wanted to analyze and visualize data but didn’t know where to pull the data from. The classic multiple data silos problem.

The easy thing to do would’ve been to take the data stack from Bombas or Wayfair and try to implement it at Pair. Instead, Zach asked: what if we started with a blank slate? With the help of a consultant, Zach spent 6 months building out a data warehouse with dbt, Stitch, and other ETL tools. After the foundation was in place, he focused on BI and implemented Looker and Heap. The goal is to make analytics as self-service as possible. Today, 60%-70% of the company uses Looker actively.

From a marketing analytics perspective, most DTC companies have similar marketing channels (e.g. Shopify, Facebook, TikTok). This means Zach could set up similar telemetry for tracking all of Pair’s marketing initiatives. One area the team spent some time on was health data; they decided they wouldn’t pursue HIPAA compliance or deal with PHI data.

Customer centric vs. marketing attribution model

Marketing attribution. A never-ending battle between marketing channels and data to figure out which channel gives your company the best bang for your buck. The reason I know this problem hasn’t been solved yet is that new marketing attribution vendors pop up every year claiming to be the end-all-be-all omnichannel tracking tool. If you work in martech, you’ve seen the industry evolve from last-click to multi-touch models.

Source: WordStream

Zach worked with Pair’s head of marketing to figure out what model would work for the company. Surprise surprise, they started with the data. Using the data, they answered questions like how many sessions does it take before a customer makes a purchase? How many ads does the customer need to see before they make a purchase?

The team decided to build a home-grown attribution model and called it a customer-centric attribution model. They basically looked at how individual customers viewed Pair’s different marketing messages and optimized spend based on the customer. They validated the attribution by comparing their results with lift studies from Facebook.
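
The episode doesn’t spell out the model’s internals, so what follows is only a generic sketch of the customer-centric, multi-touch idea: gather each customer’s touchpoints and split conversion credit across them (even weighting here, purely for illustration, with made-up events).

```python
from collections import defaultdict

# Made-up touchpoint log: (customer_id, channel) events leading up to a purchase.
touchpoints = [
    ("cust_1", "facebook"), ("cust_1", "tiktok"), ("cust_1", "email"),
    ("cust_2", "facebook"), ("cust_2", "facebook"),
    ("cust_3", "tiktok"),
]
conversions = {"cust_1", "cust_3"}  # customers who actually purchased

def attribute(touchpoints, conversions):
    """Split one unit of conversion credit evenly across a converting
    customer's touchpoints (a simple linear multi-touch model)."""
    by_customer = defaultdict(list)
    for customer, channel in touchpoints:
        by_customer[customer].append(channel)

    credit = defaultdict(float)
    for customer in conversions:
        channels = by_customer.get(customer, [])
        for channel in channels:
            credit[channel] += 1 / len(channels)
    return dict(credit)

print(attribute(touchpoints, conversions))
# facebook ~0.33, tiktok ~1.33, email ~0.33: channel credit you could then
# compare against lift studies before reallocating spend.
```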

Using a sensitivity model to experiment with pricing

Pair’s business model is built around limited-edition drops. This means a lot of one-unit orders when the drops happen. With the longevity of the business in mind, the team asked what would happen if they encouraged customers to purchase two items at a time, less frequently, instead of relying on these one-time higher-priced drops.

Source: SoundCloud (Mokos)

Again, they started with the data. They looked at the distribution of their order values. As expected, they saw a normal distribution of orders and could see the average order value across all customers. Using this data, they could figure out what order minimum customers were reaching for. Then came the sensitivity model to find the tradeoff between a lower conversion rate and a higher order value.
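
Here’s a minimal sketch of that kind of sensitivity analysis, with all numbers invented: expected revenue per visitor is conversion rate times average order value, so you can tabulate how much conversion you can afford to lose at each candidate order minimum.

```python
# Hypothetical scenarios for an order-minimum experiment. Each entry:
# (order minimum, assumed conversion rate, assumed average order value).
scenarios = [
    (0,   0.050, 62.0),   # no minimum (baseline)
    (75,  0.046, 78.0),   # modest minimum, small conversion hit
    (100, 0.040, 95.0),   # higher minimum, bigger conversion hit
    (125, 0.032, 110.0),  # aggressive minimum
]

print(f"{'minimum':>8} {'conv %':>7} {'AOV':>7} {'rev / visitor':>14}")
for minimum, conversion_rate, avg_order_value in scenarios:
    revenue_per_visitor = conversion_rate * avg_order_value
    print(f"{minimum:>8} {conversion_rate:>7.1%} {avg_order_value:>7.2f} "
          f"{revenue_per_visitor:>14.2f}")
# The "best" threshold is the one where the higher order value more than
# offsets the conversion drop; with these made-up numbers, the $100 row wins.
```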

Hiring the right people for your data team

The sequencing of how Zach went about hiring members to join his data team might sound familiar to folks. The first hire was an analytics engineer, the Swiss army knife of the data world. The analytics engineer can help build the tech stack and do analysis. This breakdown of data engineer, analytics engineer, and data analyst is always good to know:

Source: LearnSQL

Once the data infrastructure is in place, Zach then hired the data analysts who do the more traditional exploratory analysis and dashboarding. From there, Zach built out a consumer insights team. The analytics team is now doing full-stack stuff which goes beyond Excel and Tableau. They are diving into dbt and machine learning as well.

Zach talked about encouraging data analysts to be generalists. One reason people leave their current job or employer is simply being bored with the work. If an analyst is a generalist, they can grow and learn and be excited about other aspects of their role. They will have the opportunity to touch multiple departments. More importantly, they can approach company problems from multiple angles.

What keeps Zach up at night: building in-house vs. managed services

Build vs. buy. No matter how trite this debate may seem to some of you, I think it’s always interesting to hear how different companies view this problem. There’s always a new set of constraints, contexts, and tools against which to consider the tradeoff. What doesn’t change, however, is that there is never a clear answer. Even when you think you’ve made the right decision, that can all change next quarter.

Source: Customer Success Memes

One of the things that keeps Zach up at night is whether a certain task should be delegated to a managed service like Stitch or Fivetran. These tools make it easy to tap into APIs. They also allow teams to move quicker and get to impact faster. The problem is that it opens up your company to more risk. If one of the APIs or providers happens to go down, you’re at the mercy of the provider. Zach talked about an issue Stitch had with the Shopify API and how there was nothing his team could do about it.

The other side is that you build in-house and everything is under your control. This requires more resources and you move slower. According to Zach, this tradeoff is something he revisits often, and the work is never quite done even when you think it’s done.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #119: Developing the holy “grail” model at Lyft, user journeys, and hidden analytics with Sean Taylor
https://www.thekeycuts.com/dear-analyst-119-developing-the-holy-grail-model-at-lyft-user-journeys-and-hidden-analytics-with-sean-taylor/
Mon, 18 Sep 2023

Future Dear Analyst episodes will get more sporadic since, well, life gets in the way. Unfortunately curiosity (in most cases) doesn’t pay the bills. Nevertheless, when I come across an idea or person that I think is worth sharing or learning more about, I’ll try my best to post. In this episode, I interview the Chief Scientist of a data startup who did his PhD at NYU Stern and was on track to becoming a professor. Then he got an internship at Facebook and everything changed. The speed of learning at a tech company outpaced what he was used to in academia. Over the years, Sean Taylor has worked with and spoken to hundreds of data analysts and statisticians. We’ll dive into his data science work at Lyft, his notion of “hidden analytics,” and why he’s obsessed with user journeys in modern applications.

Modeling the Lyft marketplace and creating the GRAIL model

Sean worked at Facebook for 5 years as a research scientist working on general data problems. Eventually he joined the revenue operations science team at Lyft. His team’s goal was to help grow the marketplace of riders and drivers on the platform. One of the most important aspects of the marketplace is the forecast. As Lyft runs promotions and enters new cities, how do you ensure there are enough drivers for the riders and vice versa?

The team ultimately decided that a simple cohort methodology would be best to help set the forecast for both drivers and riders. Every rider, for instance, would belong to a cohort based on when they first signed up for Lyft, when they booked their first ride, etc. There’s a “liquidation curve” for each cohort that eventually hugs the x-axis. There is much more detail about the cohort methodology in this blog post by the Lyft Engineering team from 2019.

Despite being such a simple model, it worked surprisingly well. Here are the goals of the model, taken from the blog post mentioned in the previous paragraph (a rough sketch of the cohort idea follows the list):

  1. Forecast the behavior of each observed cohort and use it to project how many rides are taken or driver hours are provided within a specific cohort
  2. Forecast the behavior of the cohorts that are yet to be seen.
  3. Aggregate all the projected rides and driver hours to make forecasts for both the demand and supply side of our business.
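
To make the cohort idea concrete, here’s a bare-bones sketch with invented numbers and retention curves (far simpler than Lyft’s actual work): each signup cohort contributes rides according to a decaying retention curve, and the total forecast is the sum across cohorts.

```python
# Toy cohort-based forecast: each monthly signup cohort's ride volume decays
# along a "liquidation curve" that eventually hugs the x-axis.

cohorts = {
    # cohort label: (number of riders, rides per rider in their first month)
    "2016-01": (10_000, 4.0),
    "2016-02": (12_000, 4.2),
    "2016-03": (15_000, 4.1),
}

def retention(months_since_signup):
    """Invented decay curve: activity halves roughly every three months."""
    return 0.5 ** (months_since_signup / 3)

def forecast_rides(cohorts, horizon_months=6):
    """Project total rides for each future month by summing over cohorts."""
    cohort_list = list(cohorts.values())  # ordered oldest -> newest
    totals = []
    for month in range(horizon_months):
        total = 0.0
        for index, (riders, rides_per_rider) in enumerate(cohort_list):
            # Age of this cohort at the forecast month; the newest cohort is
            # one month old at the start of the forecast window.
            months_since_signup = month + (len(cohort_list) - index)
            total += riders * rides_per_rider * retention(months_since_signup)
        totals.append(round(total))
    return totals

print(forecast_rides(cohorts))
# Adding a row for each not-yet-seen cohort (goal #2) completes the projection.
```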

Sean talked about how there were flaws in the model, and one of those flaws is that a marketplace is very fluid and evolves over time. When a rider is exposed to high prices, it may lead to churn, which was also not included in the model. Sean’s team tried building a better model called GRAIL, but Sean left Lyft before completing it.

Source: Symposiums

Speaking of Lyft’s data team, I had mentioned Amundsen, an open source data discovery platform Lyft released in 2019 (blog post). It’s great to see the data team at Lyft giving back to the ecosystem to help data analysts and data scientists do their job better!

Discovering a bug that cost the company $15M per year

One of the best feelings as a data analyst is using data to uncover the root cause or underlying trends in a given business situation. One might say this is like Moneyball, where the Oakland A’s realize that on-base percentage (OBP) is the best predictor of player performance.

Source: Hire an Esquire

Sean believes there is a lot that data analysts do that is not necessarily taught in school or on the job. You’re expected to understand the business and how everyday business operations translate into the numbers on the dashboard.

When you’re working on a project because you are curious about it, rather than being forced to come up with an analysis, you are able to land the bigger wins that really move the needle. Sean calls this type of work “hidden analytics,” or as I like to say, there is much more behind the numbers.

Sean’s colleague at Lyft came across an anomaly in the data and just started pulling on the thread. His colleague ultimately found a bug in the marketplace in how Lyft was disbursing driver incentives. Sean talks about how his colleague’s curiosity led them to discover the bug in the first place, and how squashing it ended up saving Lyft $15M per year.

Why the systems for collecting user journey data are broken

Modern websites and applications collect a ton of data, but the actual user journey is harder to quantify. A customer signs up for a tool or service, goes through an onboarding process, and might engage with the tool at various times in the future. Modeling and visualizing this data in a spreadsheet or a SQL database can be difficult. With these tools, you are aggregating data, and parts of the user journey might be improperly reduced to a single number when there is much more nuance to a user’s journey on a website.

Source: Wikipedia

Users are in different states when using a website or app. Sessionizing data has become the default way to capture the path a user takes, but there are still many micro-sessions within a single experience, like registering an account on a website.

Sean discusses this concept in the context of a rider taking or not taking a ride booked on Lyft. The customer requests the ride, and perhaps declines the first ride and books the second ride. The basic conversion rate would be 50%, but that statistic doesn’t answer why the customer didn’t book the first ride. Perhaps the customer couldn’t find the right address with the first ride, and just gave up. Perhaps the driver was too far away.
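
Here’s a rough sketch of why a single conversion number hides the story, using invented events: walk the event sequence per ride request and record how each request ended instead of collapsing everything into one rate.

```python
# Invented event log for one rider session: (request_id, event) in time order.
events = [
    ("req_1", "ride_requested"),
    ("req_1", "driver_eta_shown"),
    ("req_1", "request_cancelled"),   # rider gave up on the first request
    ("req_2", "ride_requested"),
    ("req_2", "driver_eta_shown"),
    ("req_2", "ride_completed"),
]

def summarize_requests(events):
    """Group events by request and record how each request ended."""
    sequences = {}
    for request_id, event in events:
        sequences.setdefault(request_id, []).append(event)
    return {request_id: seq[-1] for request_id, seq in sequences.items()}

outcomes = summarize_requests(events)
converted = sum(outcome == "ride_completed" for outcome in outcomes.values())
print(f"conversion rate: {converted / len(outcomes):.0%}")   # 50%
print("per-request outcomes:", outcomes)
# The 50% figure alone can't tell you why req_1 failed; keeping the ordered
# event sequence (long ETA? wrong pickup address?) is what makes the journey
# analyzable.
```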

Balancing usability and expressivity in data tools

Browse any Hacker News article and you’ll inevitably see devs talking about why you should just build your own tool on-prem with code. The main reason is that you can fully customize the app if you know how to code. I’ve discussed at length on this podcast and through content I’ve created for my company how the need for low-code and no-code tools redefines who a “builder” is in a company.

Sean’s current company (Motif Analytics) is trying to strike that balance between giving data analysts and data scientists the ability to express their data question without diving right into the code. In terms of user journey data, Sean says most people use Amplitude, Mixpanel, or other similar tools. While these tools allow you to execute common data tasks, there are certain things these tools block you from doing. Python notebooks, for instance, are very expressive. But you kind of need to be an expert to use them to their full potential.

Source: Jupyter

Sean talks about how he drew inspiration from Ruby on Rails in terms of how the creators had strong opinions about how to do web development. I also first learned about web development through a Ruby on Rails book and it’s interesting to see how many of the patterns from Rails are still seen in frameworks using PHP or Javascript.

As we discussed the platform Sean and his team are building, we got into the weeds about a little-known SQL clause called MATCH_RECOGNIZE. There apparently isn’t much documentation about it, and the creators behind the SQL standard rushed this pattern-matching feature into the language because competitors were coming out with similar functionality. Nothing like real-world drama impacting the open source world!
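
MATCH_RECOGNIZE is a row-pattern-matching clause available in engines like Oracle and Snowflake. As a very loose illustration of the idea (not the SQL syntax itself), here’s a tiny Python analogue that maps an ordered event stream to symbols and searches for a pattern with a regular expression; the events are invented.

```python
import re

# Ordered event stream for one rider, reduced to single-letter symbols so a
# regular expression can stand in for a row-pattern match. This is only a
# loose analogue of what SQL's MATCH_RECOGNIZE expresses over table rows.
SYMBOLS = {"ride_requested": "R", "request_cancelled": "C", "ride_completed": "D"}
events = ["ride_requested", "request_cancelled", "ride_requested", "ride_completed"]

stream = "".join(SYMBOLS[event] for event in events)  # "RCRD"

# Pattern: a request that gets cancelled, followed by a request that completes.
pattern = re.compile(r"RC(?:RD)")

for match in pattern.finditer(stream):
    print("cancel-then-complete sequence found at positions", match.span())
```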

Start with the questions instead of the tools

We ended the conversation with a bit of career talk. Sean talks about intrinsic motivation being the number one driving force in his career. While tools come and go, he said domain expertise is something that can give budding analysts a leg up when searching for their next role. Technical skills, unfortunately, are slowly becoming a commodity. What never goes out of style? Asking the right questions.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #118: Uncovering trends and insights behind Facebook News Feed, Reels, and Recommendations using data science with Akos Lada https://www.thekeycuts.com/dear-analyst-118-uncovering-trends-and-insights-behind-facebook-news-feed-reels-and-recommendations-using-data-science-with-akos-lada/ https://www.thekeycuts.com/dear-analyst-118-uncovering-trends-and-insights-behind-facebook-news-feed-reels-and-recommendations-using-data-science-with-akos-lada/#respond Mon, 03 Jul 2023 05:41:00 +0000 https://www.thekeycuts.com/?p=53604 No, this isn’t an episode about how Facebook’s algorithm and feed works. The data science function is popping up in companies small and large given the amount of data swimming around. No other company understand the power and influence that data science can have on the customer experience than Facebook (Meta, to be exact). Akos […]

No, this isn’t an episode about how Facebook’s algorithm and feed works. The data science function is popping up in companies small and large given the amount of data swimming around. Few companies understand the power and influence that data science can have on the customer experience better than Facebook (Meta, to be exact). Akos Lada is Facebook’s Director of Data Science for Feed Ranking and Recommendations. Akos has always been interested in the intersection of social science and data, so this role at Facebook seems fitting. In this episode, Akos discusses what the analytics team does at Facebook, an analytics framework his team developed and open-sourced, A/B testing, and more.

What does the data science team do at Facebook?

I know the company is called Meta, but I grew up calling it Facebook, so I’m just going to stick with Facebook for now. The data science team actually consists of two teams: Analytics and Central Applied Science.

The Analytics team partners with product managers and engineers, and their focus is on delivering long-term value for users (you’ll hear a lot about this during this episode). There is also another data science team Akos used to work on, called Central Applied Science (formerly known as Core Data Science), which is a smaller team that focuses on scientific problems and research that every product team at Facebook might be able to benefit from. One of the frameworks the Central Applied Science team created and open-sourced is called Ax. This framework helps optimize any kind of experiment, including machine learning experiments, A/B tests, and simulations.

Making better decisions with the GTMF model

Akos’ team published a blog post on four analytics best practices at Facebook which is worth a read. The impetus for this blog post was one question: how does Facebook drive more long-term value for users?

There are many different lenses you can put on to answer this question. Of course, Akos’ team treats this question as a data science question. The Ground Truth Maturity Framework (GTMF) improves ground truth data–the data that powers Facebook’s machine learning models. In a sense, the GTMF model ensures your data is clean. One place where GTMF is used is News Feed ranking. The team’s ultimate goal with News Feed is trying to figure out if a post is something you would want to click on. You can read more about how machine learning is used in the News Feed algorithm here.

Running A/B experiments to figure out the right number of notifications to send to Facebook users

Akos discusses at length his team’s experimentation frameworks. One interesting insight is that the longer his team kept experiments running (say, one year), the more the outcome of the experiment would change. One of the more surprising results from a long-term experiment his team ran was that sending fewer notifications to users led to better long-term value for users (e.g. clicking on more posts). In the short term, sending fewer notifications naturally leads to fewer people engaging with posts.

At the end of the day, this is a behavioral science challenge. Given the amount of data Akos’ team can analyze, they suggested that the product team drastically reduce the number of notifications being sent to Facebook users. You can read more about this experiment and the results here on the Facebook Analytics team’s blog.

While the data science team has so much data at their disposal to make data-driven decisions, Akos talks a bit about how the team also uses intuition when making decisions. In an organization as large as Facebook, you can run multiple experiments at a time, evaluate the results, and then ensure the knowledge and insights are spread across product teams. While the results from an experiment on News Feed may not necessarily apply to other product teams, other products at Facebook like Instagram and WhatsApp can benefit from the institutional knowledge.

What the future holds for data science at Facebook

There is a saying at Facebook that the work is only 1% done. Akos talks about how the data science field in general is a relatively new field that really began in the last decade. Compared to other fields like economics, data science is still in its infancy.

Akos’ team is investing more time in machine learning systems, neural networks, reinforcement learning, and all the new and sexy data science topics you’ve been reading about in the last few years. Akos’ interest in data science goes beyond Facebook as he’s published academic papers such as this one about heterogeneous causal effects. Akos talks about his fascination with how activity can change when nodes are connected to each other (referring to Facebook’s social graph). If someone sees a post and they find it interesting, they will share that post with their friends. Then those friends share that same post with their friends. Given the connected nature of the social graph, how can Akos’ team help suggest posts that you might like? Facebook’s recommendation system is built on a concept called collaborative filtering.

Advice for aspiring data scientists

It seems like a tradition now to ask people on the podcast about advice they have for upcoming data analysts, engineers, and scientists. Akos’ advice was a bit sobering but exactly what aspiring data scientists should keep in mind as they find their next role. It’s a tough time in the tech world, but don’t be discouraged. Akos believes that despite the downturn, data science will continue to grow as technology becomes ever more prevalent in our lives. Now is the time to double down on building your skills. One of the reasons Facebook has their Analytics blog is to share their insights with the community in the hopes that data scientists can build off of Facebook’s work. Akos talks a bit about the generative AI trend, but he’s still focused on how regular “generic” AI can still help people around the world.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #117: New 2023 Google Sheets functions for data manipulation that already exist in Excel https://www.thekeycuts.com/dear-analyst-117-new-2023-google-sheets-functions-for-data-manipulation-that-already-exist-in-excel/ https://www.thekeycuts.com/dear-analyst-117-new-2023-google-sheets-functions-for-data-manipulation-that-already-exist-in-excel/#respond Tue, 23 May 2023 05:09:00 +0000 https://www.thekeycuts.com/?p=53152 The Google Workspace team announced a slew of Google Sheets functions a few months ago (February 2023). These functions look familiar and that’s because Microsoft Excel released most of them two years ago. I never had a chance to play around with the new functions in Excel since I don’t have the latest Office 365 […]

The Google Workspace team announced a slew of new Google Sheets functions a few months ago (February 2023). These functions look familiar, and that’s because Microsoft Excel released most of them two years ago. I never had a chance to play around with the new functions in Excel since I don’t have the latest Office 365 version. Now that they are live in Google Sheets, I played around with them and found them pretty useful for data manipulation. What’s interesting about these new functions is that they help with both super basic data organization use cases and more advanced data cleaning use cases. Here’s a rundown of some of the new functions and, more importantly, examples of real-life use cases. If you want a copy of the Google Sheet I use in this episode, go here.

Watch a tutorial showing all the new Google Sheets functions in 2023: https://www.youtube.com/watch?v=YQ8BG5frI3E

What’s interesting about these “new” Google Sheets functions?

Here’s a quick rant on these “new” Google Sheets functions. They aren’t new. They are basically a direct copy of what exists in Excel already (if you have Office 365). I think Google Sheets has some pretty awesome features that differentiate it from Excel (auto-fill, collaboration features, it’s free, etc.), but I’ve always viewed Google Sheets as a tool that is playing catchup to Excel. These functions are an example of Google playing catchup with Excel’s features versus coming up with something new.

These “new” functions in Google Sheets also highlight something Microsoft discovered a few years ago about how people are using spreadsheets: data is not organized in a structured way. You have time periods across the columns and the rows. You have headers and sub-headers. People don’t typically organize and clean their data for the purposes of a PivotTable but rather for ease of use. With this in mind, I think these new Google Sheets functions are targeted at the beginner spreadsheet user who may just be using Google Sheets to show who’s sitting at different tables at a banquet dinner or showing a shift schedule.

Next to each function, I also put a usefulness rating (🌶 being not useful and 🌶🌶🌶🌶🌶 being really useful) based on what I think would be useful for a beginner Google Sheets user.

1) EPOCHTODATE() – Turn computer-generated dates into a human-readable date format

USEFULNESS RATING: 🌶

This is a pretty basic one. You’ll typically get epoch dates when getting output from a database or any type of computer-generated date/time. It’s usually a long string of numbers, and EPOCHTODATE simply converts that “computer time” into a date and time that we humans can comprehend.

I gave this a rating of 1 because I don’t see many instances where you’ll have the epoch time format in your spreadsheet, save for the rare occasion when you have a Unix export of data that contains these epoch timestamps.
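As a quick illustration (the cell reference is just a placeholder): if cell A2 holds a Unix timestamp in seconds, the first formula below converts it to a readable date and time, and the optional second argument tells Google Sheets the timestamp is in milliseconds instead.

=EPOCHTODATE(A2)

=EPOCHTODATE(A2, 2)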

2) TOROW(), TOCOL() – Arrange a bunch of cells into a single row or column

USEFULNESS RATING: 🌶🌶🌶🌶🌶

Also a pretty simple formula that helps with basic data manipulation tasks. Big fan of this one because it removes the need to cut and paste ranges of data on top of each other. I think TOCOL() will be used more often just because you typically want to get a continuous list of values in one column. Here’s an example where you have a bunch of names arranged by groups (perhaps groups of students in a class) and you just want to get all the names in one column:

There are also some interesting options that let you remove errors and blanks, as well as control how the data should be “scanned” and put together. Someone just asked me how to do a data manipulation task similar to this, and using TOCOL() with the scan_by_column flag set to false does the trick.
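A rough example with a placeholder range: this pulls a grid of names into a single column, ignores blank cells (the 1 in the second argument), and scans row by row because scan_by_column is set to FALSE.

=TOCOL(A2:D6, 1, FALSE)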

3) CHOOSEROWS(), CHOOSECOLS() – Choose which rows or columns you want from a data set

USEFULNESS RATING: 🌶🌶🌶🌶

I would put these new functions in the camp of “making it easier to filter out the data I don’t need.” I find this useful when you know you want to quickly get the top 3 scores, or maybe the top score and bottom score, from a list of test scores, for instance. There are probably a bunch of other use cases I’m not able to think of, but in general it’s a really useful function for quickly “pulling out” the rows or columns of data you need from a data set. CHOOSEROWS() in action:

While we’re at it, I’d say CHOOSECOLS() is equally as useful because you can just pull out the columns of data that matter for you. In this case, you can pull out the list of students and just the scores from the subjects that matter to you. This feels like a more user-friendly version of the {} syntax for concatenating different ranges to create a custom range (typically used for creating a custom VLOOKUP formula with multiple conditions).
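A couple of quick sketches with placeholder ranges: the first grabs the first three rows of a data set, the second grabs the first and last rows (negative indices count from the bottom), and the third keeps only the first and third columns.

=CHOOSEROWS(A2:D20, 1, 2, 3)

=CHOOSEROWS(A2:D20, 1, -1)

=CHOOSECOLS(A2:D20, 1, 3)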

4) WRAPROWS(), WRAPCOLS() – Turn a bunch of cells into a specified number of rows or columns

USEFULNESS RATING: 🌶🌶

Kind of an interesting formula for a specific use case (I think). You put in a list of cells, and then the number of rows or columns you want to turn the list into. I don’t find these formulas that useful because your data has to be in really bad shape to warrant using them. Then again, I may not be thinking of all the use cases where one would use these formulas.

For instance, you might have a list of employees with their location, job, etc. all listed out versus properly arranged in columns. This is where you would use the WRAPROWS() formula:

A more realistic use case is you have a list of names and you want to put them into 3 groups. You would use WRAPROWS() to quickly put this list of names into 3 columns:

In this case the number of names doesn’t fit perfectly into 3 columns, so there are two N/As at the end. There’s a handy pad_with parameter which kind of acts like an IFERROR() function, where you can just put in a placeholder value for those extra cells:
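For example (again, a placeholder range), this wraps a single-column list of names into rows of three, and the third argument pads any leftover cells with an empty string instead of showing #N/A.

=WRAPROWS(A2:A21, 3, "")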

5) VSTACK(), HSTACK() – Stack rows from different sheets on top of each other

USEFULNESS RATING: 🌶🌶🌶

I think the reason why VSTACK() might be useful is when you have data coming in on multiple sheets. The data is also structured the same across those three sheets. Then you can have one primary sheet that aggregates all the data using VSTACK().

Not sure when you might use HSTACK() but the example Google shows is when you’re combining dates together. Kind of a weird scenario, but sure whatever.

In this Google Sheet, I have 3 sheets called shows1, shows2, and shows3. Each sheet has the same columns in the same order, but the data is different between the three:

Then with VSTACK(), you can “add” or concatenate all these data sources together on one page:

Again, this assumes your data is structured exactly the same across sheets, or even across ranges on the same spreadsheet. If it is, then using VSTACK() can be a nice way to put together these “disparate” data sources compared to using the bracket syntax {}. Like CHOOSEROWS(), this feels like Google Sheets simply making the {} syntax easier to use.
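Using the sheet names from this example, the aggregation formula might look something like the line below (the ranges are placeholders for however many rows each sheet actually has):

=VSTACK(shows1!A2:C10, shows2!A2:C10, shows3!A2:C10)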

6) LET() – Assign the result of a formula to a variable to use in the future

USEFULNESS RATING: 🌶🌶

I have mixed feelings about the usefulness of this formula. The capability technically already exists via named ranges; this is just the formula version of named ranges. I also wouldn’t say it’s that much easier to understand compared to a named range, hence the 2-pepper rating. It’s also not a “beginner” function.

Say you have a bunch of product ratings like in the table below. In the Average Score column, you want to put the word “High” if the average rating for a product is greater than 4. If the average rating is between 3-4, then you want the word “Medium.” 3 or below should say “Low”:

Today, you might write a simple formula like this to get this output of “High” and “Low”:

=if(average(B44:D44)>4,"High",if(average(B44:D44)>3,"Medium","Low"))

A typical nested IF statement. Now with the LET() function, you simply assign the AVERAGE(B44:D44) result to a variable. The formula below would output the same exact thing as the nested IF statement above:

=LET(avg_rating, average(B44:D44), if(avg_rating>4,"High",if(avg_rating>3,"Medium","Low")))

Here’s a look at the formula in the context of the example:

The formula doesn’t look that much “easier” compared to writing out the nested IF statement. But for more complicated formulas beyond a regular average, this could make the formula much more readable and easier to debug.

One reason I like this function is that it starts to bridge the gap between working in a spreadsheet and using Google Apps Script (or Office Script if you’re in Excel). Starting to treat things like variables might make the learning curve to scripting in Google Apps Script easier and more approachable to a Google Sheets user who has never touched an Apps Script.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst Episode #116: Will Microsoft’s AI Copilot for Excel replace the need for analysts? https://www.thekeycuts.com/dear-analyst-episodes-116-will-microsofts-ai-copilot-for-excel-replace-the-need-for-analysts/ https://www.thekeycuts.com/dear-analyst-episodes-116-will-microsofts-ai-copilot-for-excel-replace-the-need-for-analysts/#respond Mon, 27 Mar 2023 15:48:55 +0000 https://www.thekeycuts.com/?p=53277 This news is a bit old but I figured it’s juicy enough to talk about its future implications on Excel and artificial intelligence in general. Mid-March 2023, Microsoft announced Copilot, it’s artificial intelligence bet that will supposedly change the way we work. The video discusses how Copilot integrates with Office 365 and all your Microsoft […]

This news is a bit old, but I figured it’s juicy enough to talk about its future implications on Excel and artificial intelligence in general. In mid-March 2023, Microsoft announced Copilot, its artificial intelligence bet that will supposedly change the way we work. The video discusses how Copilot integrates with Office 365 and all your Microsoft apps, including Excel. Around minute 18:00, they show a demo of how Copilot helps you find trends, make adjustments to your models, and more. It’s quite impressive. You can watch just that segment from the presentation here: https://www.youtube.com/watch?v=I-waFp6rLc0. I watched the video a few times and wondered: will Copilot eliminate the need for entry-level data analysts? Only time will tell.

Breaking down the features in Copilot for Excel

This is the corporate marketing blurb from the Microsoft blog post announcing Copilot for Excel:

Copilot in Excel works alongside you to help analyze and explore your data. Ask Copilot questions about your data set in natural language, not just formulas. It will reveal correlations, propose what-if scenarios, and suggest new formulas based on your questions—generating models based on your questions that help you explore your data without modifying it. Identify trends, create powerful visualizations, or ask for recommendations to drive different outcomes. Here are some example commands and prompts you can try:

  • Give a breakdown of the sales by type and channel. Insert a table.
  • Project the impact of [a variable change] and generate a chart to help visualize.
  • Model how a change to the growth rate for [variable] would impact my gross margin.

The video shows the above 3 bullet points using a dataset of product sales by country:

Finding key trends with Copilot for Excel

The first demo involves giving Copilot a prompt like “analyze the data and give me 3 trends.” The output is something you might expect if you’ve done anything with ChatGPT:

This feature in Copilot is table stakes and a version of this came out in Google Sheets in 2017. The Explore panel in Google Sheets can provide similar summary trends on your data and suggest charts you should add to your analysis. Google Sheets has slowly been adding AI-like features over the last few years, so don’t sleep on Google Workspace’s own AI announcement. Below is a dataset of hotels and their locations and I simply clicked on the Explore option in the bottom-right of the Google Sheet:

The trends don’t come in a free-form text format but the different widgets are interesting. The first widget shows additional questions you might ask of your dataset (and Google Sheets spits out the answer). Then the most common visualizations like Pivot Tables and charts are displayed afterwards which makes it easy to analyze and visualize your data. This leads into the next feature in Copilot for Excel: visualizing your data.

Visualizing your data with Copilot for Excel

What’s old is new. As I explained in the previous section, Google Sheets’ Explore panel already has a flavor of this feature. The next prompt for Copilot is “Show me a breakdown of Proseware sales growth.” Yes, it’s natural language. Yes, humans are lazy and it’s easy just to ask a question in plain English and get an answer back. But the summaries, data, and charts already exist in Google Sheets. This just happens to be Excel’s implementation of the Explore feature, with the AI as the entry point:

I like that Copilot responds to the prompt by saying:

Remember to check for accuracy.

That doesn’t inspire much confidence in you, Copilot! Nonetheless, Copilot does a few things that are interesting:

  • Created a chart with a title and the title has selective formatting (assuming the AI made the “Sales” word foreground color green)
  • The tables are nicely formatted with clear headers, formatted percentages, and growth rates
  • The background colors for all the cells are white (common formatting trick for making your visualizations stand out more)
  • Columns are re-sized to fit the width of the products and the growth rates
  • Column A and Row 1 are very narrow in width and height, respectively (another common trick to making dashboards look cleaner)

Was this all AI or just smoke and mirrors?

It’s hard to say which of the above formatting operations were done by the AI versus a human who just cleaned up the spreadsheet for a demo.

Does the AI know that a summary table looks better when the background color cells are all white?

Does the AI know that analysts like to make column A and row 1 super narrow/short so that the charts and tables are flush against the edges of the spreadsheet?

If Copilot knew all this, that’s pretty slick. But this just so happens to be the vanilla formatting you’ll see in a dashboard devoid of any custom coloring or branding. It will be interesting to see how an analyst would train Copilot to create visualizations that match the theme and brand guidelines for existing reports.

The next prompt is “Help me visualize what contributed to the decline in sales growth?” The interesting leap that Copilot makes here is translating a very simple business question into a feature (conditional formatting to highlight what contributed to the decline):

But simply applying conditional formatting to a table of numbers is not nearly as impressive as all the formatting steps the AI did in the previous step to create the table in the first place.

What-if scenario analysis with Copilot for Excel

This is probably the most interesting part of the demo. The next prompt is:

What would have happened if Reusable Containers had maintained the prior quarter’s growth rate?

Before Copilot, you’d have to think about duplicating your summary table and setting up cell references to replace the current growth rate with another number. Assuming this is not some human playing around with data for the demo, Copilot does the whole thing for you:

What’s impressive is that Copilot was able to copy the original summary table and paste it directly to the right of it. This makes comparing the growth rates easy. It was also able to change the title to reflect the answer to the original prompt. Finally, the step-by-step bullet points tell you exactly what Copilot did to create the analysis.

Perhaps this type of analysis is “easy” for Copilot since you have a relatively simple summary table with clearly spelled out products and growth rates. What if there are more variables involved or there are other one-off factors that would impact the analysis? According to the longer Copilot demo, Copilot has access to the full corpus of data for your organization, so it should have the domain knowledge that someone who works in the business would have. This means you could ask Copilot questions whose answers are tucked away in some Outlook email, Teams thread, or PowerPoint slide. That’s pretty freaking cool.

The question still remains: Will Copilot replace the need for data analysts?

Source: The Wall Street Journal

If the analysis is as simple as what Microsoft showed in this demo, I think the answer is yes.

If you’re an entry-level analyst, this type of task is very common. You have a dataset where you need to build summary tables and put them into PowerPoint decks to present during meetings. Your manager tells you: “Hey, what would growth look like for Reusable Containers if we didn’t completely tank last quarter and used historical growth rates?” You would probably follow a similar step-by-step process as the above screenshot shows. Copilot appears to be able to do the basic analyst grunt work and format the analysis in a clear visualization.

Why Copilot won’t replace analysts at large enterprises

While Copilot does look impressive, it definitely won’t replace human data analysts who understand nuance, context, and business knowledge at large enterprises. If you are a startup and building a model from scratch, Copilot might be a good solution to get something off the ground and running. The Microsoft demo clearly shows that this is possible. I can foresee a few situations where Copilot would not be used in a large enterprise:

  1. A lot of money is on the line – The Copilot prompt already tells you to “check for accuracy.” If you are working on a multi-million dollar deal, you best be sure you have a human taking a look at the numbers.
  2. Company culture may not be captured in Microsoft applications – As much as our knowledge is “written down” in Word, Outlook, and Teams, there is a lot that is not formally written down in these applications. Humans understand the nuances about company culture and how that can impact the analyses and dashboards analysts create.
  3. Existing templates have already been created – In a large enterprise, you are most likely copying an existing file to build a model or dashboard. That institutional knowledge has resulted in well-formatted dashboards where Copilot may not add much value (if formatting is a big part of the task).

Long story short, I’d love to see Copilot tackle a more complicated task that can’t be solved with a simple template. If you’re well versed in Excel, doing what this demo did by “hand” might take all of 15 minutes, and you build the knowledge of how to do this analysis on your own in the future. That knowledge makes debugging and troubleshooting models easier.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #115: How to count the number of colored cells or formatted cells in Google Sheets https://www.thekeycuts.com/dear-analyst-115-how-to-count-the-number-of-colored-cells-or-formatted-cells-in-google-sheets/ https://www.thekeycuts.com/dear-analyst-115-how-to-count-the-number-of-colored-cells-or-formatted-cells-in-google-sheets/#respond Mon, 27 Feb 2023 17:59:47 +0000 https://www.thekeycuts.com/?p=53182 Counting the number of colored cells or formatted cells in Google Sheets or Excel seems like it should be a basic operation. Unfortunately after much Googling, it doesn’t seem as easy as it looks. I came across this Mr. Excel forum thread where someone asks how to count the number of rows where there is […]

Counting the number of colored cells or formatted cells in Google Sheets or Excel seems like it should be a basic operation. Unfortunately, after much Googling, it doesn’t seem as easy as it looks. I came across this Mr. Excel forum thread where someone asks how to count the number of rows where there is a colored cell. The answers range from VBA to writing formulas that indicate whether a cell should be colored to the usual online snark. I think the basic issue is this: a majority of Excel or Google Sheets users will have a list of data, and they will color-code cells to make it easier to read or comprehend the data. No fancy formulas or PivotTables. Just coloring and formatting cells so that important ones stick out. I thought this would be a simple exercise, but after reading the thread, I came up with a few solutions that work but have drawbacks. The Google Sheet for this episode is here.

Video walkthrough: https://www.youtube.com/watch?v=h-hdZPGDbDg

Color coding HR data

In the Mr. Excel thread, the original poster talks about their HR data set and the rules their team uses to color-code their data set. Many people in the thread talk about setting up rules for conditional formatting (which I agree with). But it sounds like people just look through the data set and manually color code the cells based on the “Color Key” mentioned in the post:

I think this manual color coding of cells is very common. Yes, someone could write conditional formatting logic to automate the formatting and color coding of these cells. But for most people, I’d argue just eyeballing the dataset and quickly switching the background or foreground color of the cell is easier, faster, and more understandable for a beginner spreadsheet user. If there isn’t that much data, then manually color coding cells feels less onerous.

I put a subset of the data into this Google Sheet and manually color-coded some of the cells in column B below:

Method #1 for counting colored cells: Filter by color and the SUBTOTAL formula

The quickest way to count the number of cells that have a certain color format is to filter the column by color. After applying the filter to all the column headers, you can filter a column by the cell’s background color through the column header menu. Filter by color -> Fill color -> Desired color:

Let’s say I filter this column by the yellow background color. You’ll see this results in a filtered data set with 9 rows remaining:

In order to actually count the number of cells in this filtered data set, you might be tempted to do a COUNTA() formula, but let’s see what happens when I put this into cell B51:

The formula counts all the rows in the data set including the rows that have been filtered out. Instead, you can use the SUBTOTAL() formula which magically returns the sum, count, etc. for a filtered data set. The key is to use the value “3” for the first parameter to tell Google Sheets to count only the cells in the filtered data set:
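A formula along these lines would go below the data set (the range is just an example based on the colored data living in column B); the 3 tells SUBTOTAL to perform a COUNTA on only the visible, filtered rows.

=SUBTOTAL(3, B2:B50)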

I don’t think this is the usual use case for the SUBTOTAL formula. But like many formulas in Google Sheets/Excel, it works! To recap on this method:

Pros

  • Easy to use and implement
  • Doesn’t require the use of VBA or Google Apps Script
  • Since it’s a formula, it’s dynamic and can change as your data changes (with caveats)

Cons

  • Requires a few steps to get it to work (e.g. filter your data set by a color)
  • Each time you want to count the number of formatted cells, you need to re-filter by a different color
  • Since your data is filtered, you can’t easily update the source data, and you have to re-filter by a color afterwards

Method #2: Filtered views to allow for dynamic updating of data with the SUBTOTAL formula

This is an extension of method #1. One of the cons of method #1 is that once you’ve filtered your data set, you need to un-filter the data set if you want to add or remove formatting from your cells. For instance, in column B we have a bunch of yellow colored cells. If you want to highlight another cell as yellow and then re-count the number of cells that are colored yellow, you have to un-filter the data set, highlight the cell that needs to be colored yellow, re-filter the column, and re-write the SUBTOTAL formula (assuming you put it at the bottom of column B):

To avoid filtering and un-filtering the data set, you can create a filtered view of the data set. Additionally, you can put the SUBTOTAL formula somewhere that’s not at the bottom of the data set. Let’s first create a filtered view just on the background color yellow and we’ll call it “Yellow Cells”:

Now you can quickly switch between the filtered view of yellow-colored cells and the unfiltered data set:

Then we can put the SUBTOTAL formula somewhere below the bottom of the data set. Notice how, when we switch between the filtered view and the unfiltered data set, the SUBTOTAL formula automatically updates:

While this method is an improvement on method #1, it still has some drawbacks. A recap of this method:

Pros

  • Easily switch between the filtered and unfiltered data set
  • Update cells with new colors and have that flow into the SUBTOTAL formula dynamically

Cons

  • Filtered views are not an easily discoverable feature in Google Sheets
  • Still requires you to go through the Data menu and flip back and forth when you want to count the number of colored cells

Method #3: A macro to count the number of colored or formatted cells in a range

Almost all the other solutions for counting the number of colored or formatted cells on the Internet refer to a VBA script for Excel. This is a macro for Google Sheets using Google Apps Script. You can copy and paste the script from this gist. When you run the CountFormattedCells macro in Google Sheets, it counts all the cells that have a background color in column B below. It then outputs the count of cells in cell C52 after you’ve selected a range of cells where you want to count the colored cells:

If you want to specify a color to count, you can fill cell C53 with the color you want to count. Let’s say I want to count only the green cells. I would color cell C53 green, select all the cells where I want to find the color green, and then run the macro:

The key to making this work is setting some variables up in the script. The two variables you have to set are outputNumberOfFormattedCells and cellWithFormatToCount. The cells you pick will depend on the specific spreadsheet you’re working with. In the script below, you’ll see that you have to edit the first two variables to fit the needs of your Google Sheet:


function CountFormattedCells() {
  
  // Output the number of formatted cells somewhere in your spreadsheet
  var outputNumberOfFormattedCells = 'C52'

  // Cell that contains the color you want to count. Default is blank.
  var cellWithFormatToCount = 'C53'

  var spreadsheet = SpreadsheetApp.getActive();
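  // getBackgrounds() returns a 2D array of hex color strings, one per cell in the currently selected range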
  var currentRangeColors = spreadsheet.getActiveRange().getBackgrounds();
  if (cellWithFormatToCount !== '') { var cellWithFormat = spreadsheet.getRange(cellWithFormatToCount).getBackground(); }
  var formattedCellCount = 0
  for (var i in currentRangeColors) {
    for (var j in currentRangeColors[i]) {
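      // '#ffffff' (white) is treated as no fill: with no target color set, any non-white background is counted;
      // otherwise only cells whose background matches cellWithFormatToCount are counted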
      if (currentRangeColors[i][j] !== '#ffffff' && cellWithFormatToCount == '') {
        formattedCellCount++
      } else if (cellWithFormatToCount !== '' && currentRangeColors[i][j] == cellWithFormat) {
        formattedCellCount++
      }
    }
  }
  if (outputNumberOfFormattedCells != '') {
    spreadsheet.getRange(outputNumberOfFormattedCells).setValue(formattedCellCount)
  }
};

The macro is very easy to use, but it does require knowing how to add macros to your Google Sheet and how to edit the script in Google Apps Script. The recap for this method:

Pros

  • Script is easy to copy and paste into Google Apps Script and works right out of the box
  • Just two variables to customize
  • Doesn’t require any filtering of your data set or any formulas
  • Can assign a keyboard shortcut to the macro to quickly run the macro
  • Could assign a time-based trigger to the macro so that it runs every minute or hour to give you a “dynamic” count (see the sketch after the cons list below)

Cons

  • Requires knowledge of macros and editing a Google Apps Script
  • May need to change the location of the cell where you output the count of colored cells if your data changes a lot over time
  • Requires running the macro each time you want to get an updated count of the colored cells
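On that last point about a time-based trigger, here’s a minimal sketch of what that could look like, assuming the CountFormattedCells macro above is saved in the same Apps Script project. One caveat: the macro counts whatever range is currently selected, so for a scheduled run you’d probably want to adapt it to read a fixed range instead.


function createHourlyCountTrigger() {
  // Re-run the CountFormattedCells macro above once per hour
  ScriptApp.newTrigger('CountFormattedCells')
    .timeBased()
    .everyHours(1)
    .create();
}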

Bottom line

None of these methods are that simple or easy to use in my opinion. Usually I have a preferred method for solving some Google Sheets or Excel problem, but in this case I can’t say I like or dislike a method over another one. If I had to pick one, I’d use method #3 since I’m comfortable with macros and editing Google Apps Scripts. But the Google Apps Script solution is far from easy to use for a beginner to Google Sheets.

The SUBTOTAL formula is indeed much easier to implement, but also comes with the added inconvenience of constantly filtering and unfiltering your data set.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst Episode #114: How a small real estate investment company uses modern data and cloud tools to make data-driven decisions https://www.thekeycuts.com/episode-114-how-a-small-real-estate-investment-company-uses-modern-data-and-cloud-tools-to-make-data-driven-decisions/ https://www.thekeycuts.com/episode-114-how-a-small-real-estate-investment-company-uses-modern-data-and-cloud-tools-to-make-data-driven-decisions/#comments Tue, 17 Jan 2023 06:58:00 +0000 https://www.thekeycuts.com/?p=52480 When you think of data pipelines, data warehouses, and ETL tools, you may be thinking about some large enterprise that is collecting and processing data from IoT devices or from a mobile app. These companies are using tools from AWS and Google Cloud to build these complex workflows to get data to where it needs […]

When you think of data pipelines, data warehouses, and ETL tools, you may be thinking about some large enterprise that is collecting and processing data from IoT devices or from a mobile app. These companies are using tools from AWS and Google Cloud to build complex workflows to get data to where it needs to be. In this episode, you’ll hear about a relatively small company that is using modern cloud and data tools rivaling those of the aforementioned enterprises. Elite Development Group is a real estate investment and construction company based in York, Pennsylvania with fewer than 50 employees. Doug Walters is the Director of Strategy and Technology at Elite, and he discusses how data at Elite was trapped in Quickbooks and in their various other tools like property management software. He spearheaded projects to build data connectors that aggregate various data sources into a modern data stack that helps the company make real estate decisions.

Data is stuck in silos

Elite Development Group consists of a few divisions: HVAC, home performance, energy efficiency, etc. All the typical functions you’d expect a real estate company to have. Doug first started working in IT support and realized their company didn’t have easy access to their data to make data-driven decisions. You’ve probably heard this phrase over and over again:

Data is trapped in silos.

You buy some off-the-shelf software (in this case property management) that is meant for one specific use case. Over time, that data needs to be merged with your customer data or sales data. You end up exporting the data in these silos to CSVs to further combine these data sources down the line. For Elite, data was trapped in property management software, Quickbooks, you name it.

Starting the process to export data

After doing a survey of their tools, Doug realized that there weren’t many APIs to easily extract data from the source. So he helped set up data scrapers to get data off of the HTML pages. He also used tools like Docparser to extract data from Word docs and PDFs.

Most data was either in XLS or CSV format, so Doug was able to set up an automated system where every night he’d get an email with a CSV dump from their property management system. This data then ended up in a Google Sheet for everyone to see and collaborate on. After doing this with property management, Doug started exploring getting the data out from their work order tracking system.
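As a rough illustration of how little plumbing this kind of pull needs (this is not Doug's actual setup, which relied on the emailed CSV dump): if an export were reachable at a URL instead of an inbox, a single Google Sheets formula could keep a tab populated with it. The URL below is only a placeholder:

=IMPORTDATA("https://example.com/property-management-export.csv")

IMPORTDATA parses a hosted CSV or TSV file straight into the sheet and re-fetches it periodically, which is roughly the "nightly refresh into a shared Google Sheet" idea in one formula.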

Creating accurate construction cost estimates

One activity Doug wanted to shine the data lens on was cost estimates as they relate to construction. Hitting budgets is a big part of the construction process. You have multiple expenditures for a job, and each job needs to have a specific estimate tied to it. This could all be done in Excel or Google Sheets, but given the importance of this data, Doug decided to create something more durable. He created an internal database where each cost estimate gets a specific Estimate ID: a unique identifier assigned to that cost estimate.

Since Elite uses Quickbooks for their accounting, each project had to be tied to the unique Estimate ID established previously. Then each work order got its own unique Work Order ID. With these identifiers in place, Elite is able to run reports on all their projects to see what the cost estimates and actual expenditures were for a job, and they can do a traditional budget-to-actual variance analysis.
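To make the budget-to-actual idea concrete, here is a minimal spreadsheet sketch with hypothetical tab and column names (Elite's real reports come out of their warehouse and BI tools, covered below). Assume an Estimates tab with the Estimate ID in column A and the estimated cost in column B, and a WorkOrders tab with the Estimate ID in column A and the actual cost in column D. On the Estimates tab, in C2 and D2:

=SUMIF(WorkOrders!A:A, A2, WorkOrders!D:D)  -> total actual spend tied to the Estimate ID in A2
=B2-C2                                      -> variance (estimate minus actual); a negative number means the project is over budget

Fill those two formulas down and you have a rough budget-to-actual variance report per estimate.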

The result? Project teams could start to see when they were about to hit their budgets in real time.

More importantly, this started Doug down a journey of seeing how far he could automate the data extraction and reporting for his company. With that initial implementation, the data could only get refreshed every 24 hours. He eventually set up the system so that any user could click a button to refresh a report. The data workflow evolved from exporting data into Excel and Google Sheets to using dedicated data connectors and business intelligence software.

Income lost due to vacancy metric

When Elite prioritizes which projects to work on, they look at a metric called “income lost due to vacancy.” Without the different data connectors and systems Doug helped set up, this metric wouldn’t exist. It essentially helps a property owner figure out how much income they are losing due to vacancies.

When looking at a portfolio of properties to improve, Elite can use this metric to figure out which project would have more high-rent units available. Previously, they would have to rely on intuition to figure out where to invest more time and money into projects.
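The episode doesn't spell out Elite's exact formula for the metric, but a common way to express it (purely illustrative, with hypothetical tab and range names) is market rent multiplied by the time a unit sits empty. If a Units tab held each unit's monthly market rent in column C and its vacant days over the period in column D:

=SUMPRODUCT(Units!C2:C100, Units!D2:D100/30)  -> approximate income lost to vacancy: monthly market rent x vacant months, summed across units

Even a rough number like this makes it possible to rank projects by how much rent is being left on the table.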

Building out the data stack

The list of tools Elite uses to extract and process data rivals that of large enterprises. Here is a rundown of Elite’s data stack:

  • Fivetran for data loading and extraction
  • AWS Redshift as the data warehouse
  • Google Cloud functions to run one-off tasks
  • dbt for transformation and for pushing data into a datamart
  • Sisense to create actionable insights

There are multiple data connectors involved in the ETL process as well. With all these modern tools, Elite is able to get the most up-to-date data every 5 to 15 minutes.

As Elite went through this data journey, Doug and his team started to ask some of their vendors to develop an API so they could get more data out. Their data vendors would push back and say they’ve never seen these requests from such a small company. Typically these data requests come from their largest customers, which shows how deeply Doug’s team has thought about automating their data workflows.

Advice for small companies working with big data

Doug gives some practical advice on how to use some of these tools that are supposedly meant for large enterprises. The first thing is to experiment with spreadsheets before diving deep into a complicated workflow. Doing your due diligence in a spreadsheet is low stakes and helps you uncover all the various relationships between your data.

In terms of learning how to use these tools, Doug mentioned that most of these vendors have their own free or paid workshops and tutorials. I’m always surprised by how much general data training these vendors provide that may not even be about their software. You can learn about databases, SQL, and data analysis from these vendors.

At a high level, Doug says that the data you collect and visualize needs to be tied to some business strategy. These overall goals might include increasing revenue, increasing customer satisfaction, or ensuring your employees are developing new skills. At Elite, the data has allowed the team to look at their portfolio of real estate from the 30,000-foot level all the way down to individual transactions. Data is actually helping them solve real business problems.

And one last plug for Google Sheets: Doug talked about how you would have to hire someone who was an “Excel guru” or a data analyst to help you decipher your Google Sheets files. Now Google Sheets has become so robust, extensible, and–dare I say–easy to use that anyone in the company can pick it up and mold it to their needs. No one ever gets fired for using a Google Sheet 😉.

Other Podcasts & Blog Posts

No other podcasts mentioned in this episode!

Dear Analyst #113: Top 5 data analytics predictions for 2023 https://www.thekeycuts.com/dear-analyst-113-top-5-data-analytics-trends-for-2023/ Tue, 27 Dec 2022 06:16:00 +0000

It’s that time of the year again when data professionals look at their data predictions from 2022, decide what they were wrong about, and think: “this must be the year for XYZ.” Aside from the fact that these types of predictions are 100% subjective and nearly impossible to verify, it’s always fun to play armchair quarterback and make a forecast about the future (see why forecasts are flawed in this episode about Superforecasting). One caveat about predicting what will happen in 2023: my predictions are based on what other people are talking about, not necessarily what they are doing. The only data point I have on what’s actually happening within organizations is what I see happening in my own organization. So take everything with a grain of salt and let me know if these predictions resonate with you!

1) Artificial intelligence and natural language processing don’t eat your lunch

How could a prediction for 2023 not include something about artificial intelligence? It seems like the tech world was mesmerized by ChatGPT in the second half of 2022, and I can’t blame them. The applications and use cases are pretty slick and mind-blowing. Internally at my company, we’ve already started testing out this technology for summarizing meeting notes, and it works quite well and saves a human from having to manually summarize the notes. My favorite application of AI shared on Twitter (where else do you discover new technologies? Scientific journals?) is this bot that argues with a Comcast agent and successfully gets a discount on an Internet plan:

https://twitter.com/jbrowder1/status/1602353465753309195

These examples are all fun and cute and may help you save on your phone bill, but I’m more interested in how AI will be used inside organizations to improve data quality.

Data quality is always an issue when you’re collecting large amounts of data in real time every day. Historically, analysts and data engineers have run SQL queries to find data with missing values or duplicate values. With AI, could some of these manual queries and UPDATE and INSERT commands be replaced with a system that intelligently fills in the data for you? In a recent episode with Korhonda Randolph, Korhonda talks about fixing data by sometimes calling up customers to get their correct info, which then gets inputted into a master data management system. David Yakobovitch talks about some interesting companies in episode 101 that smartly help you augment your data using AI.

We’ve also seen examples of AI helping people code via Codex, for example. I think this might be an interesting trend to look out for as the demand for data engineers from organizations outpaces supply. Could an organization cut some corners and rely on Codex to develop some of this core infrastructure for their data warehouse? Seems unlikely if you ask me, but given the current funding environment for startups, who knows what a startup founder might do as runways shrink.

2) Enforcing data privacy and regulation in your user database

This trend has been going on since the introduction of GDPR in 2018. As digital transformation pushes all industries to move online, data privacy laws like GDPR and CCPA force companies to make data security and governance the number one priority for all the data they store. User data in particular. Any company that has a website where you can transact allows you to create a user account. Many municipalities have a dedicated app where you can buy bus and metro tickets straight from the app. Naturally, they ask you to create a profile where your various payment methods are stored.

When it comes to SaaS tools, the issue of data privacy becomes even trickier to navigate. Many user research and user monitoring services tout their ability to show organizations what users and customers are “doing” on those organizations’ websites and apps. Every single click, mouseover, and keystroke can be tracked. How much of this information do you store? What do you anonymize? It’s a cat and mouse game where user monitoring software vendors claim they can track everything about your customers, but then you have to temper what information you actually process and store. The data team at my own company is constantly checking these data privacy regulations to ensure that we implement data storage policies that reflect current legislation.

Source: DIGIT

A closely related area to data privacy is data governance. The number of data governance vendors who help your organization ensure your data strategy is compliant has increased dramatically over the years as a result of data regulation and protection laws.

To bring this back to a personal use case, type your email address into haveibeenpwned.com. This website basically tells you which companies have had data breaches and whether your personal information may have been compromised. To take this another step, try Googling your name and your phone number or address in quotes (e.g. “John Smith 123-123-1234”). You’ll be surprised by how many of these “people finder” websites have your personal information and that of your family members. One of the many websites you’ve signed up for probably had a breach, and this information is now out there being aggregated by these websites; you have to manually ask these websites to take your information out of their databases. Talk about data governance.

3) Data operations and observability tools manage the data lifecycle

I’m seeing this happen within my own company and others. DevOps not only monitors the health of your organization’s website and mobile app, but also its databases and data warehouse. It’s becoming more important for companies undergoing digital transformation to maintain close to 100% uptime so that customers can access their data whenever they want. Once you give your customers and users a taste of accessing their data no matter where they are, you can’t go back.

I think it’s interesting to think about treating your “data as code” and applying concepts of versioning from software engineering to your data systems. Sean Scott talks about data as code in episode #96. The ETL process is completely automated, and a data engineer or analyst can clone the source code that defines how transformations happen to the underlying data.

I’m a bit removed from my own organization’s data systems and tooling, but I do know that the data pipeline consists of many microservices and dependencies. Observability tools help you understand this whole system and ensure that if a dependency fails, you have ways to keep your data flowing to the right endpoints. I guess the bigger question is whether microservices is the right architecture for your data systems vs. a monolith. Fortunately, this type of question is way beyond my pay grade.

Source: DevCamp

4) Bringing ESG data to the forefront

You can see this trend happening more and more, especially in consumer transportation. Organizations are more conscious about their impact on the environment, with various ESG initiatives underway. To verify that organizations are following new regulations, the SEC and other regulatory bodies rely on quality data to ensure compliance.

One can guess which industries will be most impacted by providing this ESG data, but I imagine other ancillary industries will be affected too. Perhaps more data vendors will pop up to help with auditing this data so that organizations can meet compliance standards. Who knows. All I know is that consumers are asking for it, and as a result this data is required to be disclosed.

Google Flights showing CO2 emissions

We know that cloud computing and storage get cheaper every year (think Moore’s Law). Cheap from a monetary perspective, but what about the environmental impact? An interesting thought exercise is tracing the life of a query when you open Instagram on your phone and start viewing your timeline of photos. The storage and compute resources are monetarily cheap to serve that request, but there is still a data center running on electricity and water that needs to process it. Apparently data centers account for 1.8% of electricity use and 0.5% of greenhouse gas emissions in the United States (source).

When I think about all the cronjobs and DAGs that run every second to patch up a database or serve up photos to someone’s Instagram feed, I wonder how many of these tasks are unnecessarily taxing our data centers. I have created a few Google Apps Scripts over the years (like creating events from email or syncing Google Sheets with Coda). You could have these scripts run every minute or every 5 minutes, but is it necessary? Considering that Google Apps Script is a 100% free service, it’s hard to understand the “cost” of running a script that hits a Google data center somewhere and may be moving gigabytes of data from one server to another. I started thinking about the cost of keeping these scripts alive for simple personal productivity hacks like creating calendar events from email. Sure, my personal footprint is small, but when you have millions of people running scripts, that naturally becomes a much bigger problem.

I still have a lot to learn about this area and my views are influenced by simple visualizations like the one above. It all starts with quality ESG data!

5) Organizations help employees acquire data literacy and data storytelling skills

This trend is a bit self-serving since I teach various online classes about Excel and Google Sheets. But as data tools like Mode, Looker, and Google Data Studio pervade organizations, it’s not just the analysts who are expected to know how to use and understand these tools. Unfortunately, data skills are not always taught in middle school or high school (they certainly weren’t taught when I was growing up). Yet the top skills we need when entering the workforce are related to using spreadsheets and analyzing data (I talk about this subject in episode 22, referencing this Freakonomics episode). This episode with Sean Tibor and Kelly Schuster-Paredes is also worth a listen, as Sean and Kelly were teachers who incorporated Python into the classroom.

In 2019, The New York Times provided a “data bootcamp” for reporters so that they could better work with data and tell stories with data. The Google Sheets files and training material from this bootcamp are still publicly available here. You can read more about this initiative by Lindsey Cook–an editor for digital storytelling and training at The Times–here. The U.S. Department of Education also believes that basic data literacy skills should be introduced earlier in the curriculum and they created this whole deck on why these skills are important. This is one of my favorite slides from that deck:

Source: U.S. Department of Education

What does this mean for organizations in 2023? Upskilling employees in data literacy and storytelling could mean online classes or simply a 1- or 2-day training with your data team. Interestingly, data vendors already provide a ton of free training. While some of this training can be specific to the data platform itself (like Google’s Analytics Academy), other platforms provide general training on databases, SQL, and Excel. So if you don’t pay for the training, at least utilize the free training provided by Mode, Looker, Google Data Studio, Tableau, etc.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
