Dear Analyst
https://www.thekeycuts.com/category/podcast/
A show made for analysts: data, data analysis, and software. This is a podcast made by a lifelong analyst. I cover topics including Excel, data analysis, and tools for sharing data. In addition to data analysis topics, I may also cover topics related to software engineering and building applications. I also do a roundup of my favorite podcasts and episodes.

Dear Analyst #122: Designing an online version of Excel to help Uber China compete with DiDi on driver incentives with Matt Basta
https://www.thekeycuts.com/dear-analyst-122-designing-an-online-version-of-excel-to-help-uber-china-compete-with-didi-on-driver-incentives-with-matt-basta/
Mon, 04 Dec 2023

There are only so many ways to make Excel “fun.” If you’ve been following this blog/podcast, you know that stories about the financial modeling competition and spreadsheet errors that lead to catastrophic financial losses are what make a 1980s tool somewhat interesting to read and listen to. There are also numerous tutorials and TikTok influencers who teach Excel for those who are actually in the tool day in and day out. Meet Matt Basta, a software engineer by trade. He published a story on his own blog called No sacred masterpieces which is worth reading in its entirety, as it’s all about Excel. In this episode, we discuss highlights from Matt’s time at Uber, how he built a version of Excel online to help Uber China compete with DiDi, and how Uber completely scrapped the project weeks later after DiDi acquired Uber China.

Business intelligence at Uber through the eyes of a software engineer

I don’t normally speak with software engineers on the podcast, but Matt’s story from his time at Uber will resonate with anyone who works at a high-growth startup and lives in Excel. Matt’s story has everything: tech, cutthroat competition, drama, and of course, Excel.

Matt has worked at a variety of high-growth startups like Box, Uber, Stripe, and now Runway. He joined Uber in 2016 and worked on a team called “Crystal Ball,” which was part of the business intelligence organization. The goal of this team was to create and develop a platform that analysts and business folks could use to figure out how much to charge for rides, how much to offer in driver incentives, etc. All the core number crunching that makes Uber run.

As per Matt’s blog post, employees were working on one of two major initiatives at Uber in 2016:

  1. Redesigning the core Uber app
  2. Uber China

As Matt told his story, it reminded me of all the news articles that came out in 2016 about Uber’s rapid expansion in markets like China. The issue is that a large incumbent existed in China: DiDi. This comes up later in Matt’s story.

Getting data to the city teams to calculate driver incentives

From the perspective of the Crystal Ball team, all they wanted to do was set up a data pipeline so that data about the app could be shared with analysts. Analysts would then download these files and crunch numbers in R, a process that would take hours. In 2016, Uber was competing directly with DiDi to get drivers on the platform. The city team would use the data provided by the Crystal Ball team to figure out how much of an incentive to offer a driver so that the driver would choose to drive with Uber instead of DiDi for that ride.

Source: Forbes

The problem was that the city team in China was using these giant Excel files that would take a long time to calculate. In order to compete with DiDi, Uber China would need a much faster way to calculate the incentives to offer drivers. This is where Matt’s team came in.

The only other “tool” the city team had at their disposal was the browser. The city team still wanted the flexibility of the spreadsheet, so Matt’s team’s strategy was to put the spreadsheet in the browser. At this point, you might be wondering how in the world this became the solution to the problem. Matt’s blog post goes into much more detail on the stakeholders, constraints, and variables that led his team to go in this direction.

Luckily, Matt had worked on a similar tool while at Box, so he re-used code from that previous project. During his time at Box, Box had Box Notes and Dropbox had Dropbox Paper. Both of these products were based on the open source tool Etherpad for real-time collaborative document editing. Matt thought, why not build something similar for spreadsheets?

Source: Dropbox

Discovering nuances about Excel

In the blog post, Matt talks about discovering Excel’s circular references. We all know that circular references can break your models, but Excel’s calculation engine can also recalculate a circular reference repeatedly (iterative calculation) to see whether the computed value of the cell converges. I think this is how the Goal Seek feature works in Excel to a certain extent.

Source: Microsoft
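
To make the idea of iterative calculation concrete, here’s a minimal sketch in Python (not Excel’s actual engine) of re-evaluating a self-referencing formula until it stops changing or hits an iteration cap. The formula, cap, and tolerance below are made up for illustration.

```python
def iterate_circular(formula, initial=0.0, max_iterations=100, tolerance=1e-6):
    """Repeatedly re-evaluate a self-referencing formula until it converges.

    Roughly analogous to Excel's iterative calculation setting, which caps the
    number of passes and stops early once the change falls below a threshold.
    `formula` takes the cell's previous value and returns its next value.
    """
    value = initial
    for _ in range(max_iterations):
        new_value = formula(value)
        if abs(new_value - value) < tolerance:
            return new_value  # converged
        value = new_value
    return value  # hit the iteration cap without converging


# Example: a cell whose formula references itself, e.g. =0.5*A1 + 10.
# The loop converges to the fixed point, 20.
print(iterate_circular(lambda prev: 0.5 * prev + 10))
```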

When Matt’s online version of Excel was released internally, the head of finance was upset since you could see how the formulas were calculated in the tool. From Matt’s team’s perspective, they did what they were supposed to do: they put Excel in the browser and figured you should be able to see the formulas in the cells.

According to the head of finance, there were spies from DiDi who would apply for internships at Uber China just to get competitive data. Needless to say, Matt removed the ability to see formulas in his tool.

DiDi buys Uber China

Matt and the Crystal Ball team spent 6 months helping the Uber China team with their data needs. Internally, Matt’s team didn’t get an all-hands invite or anything regarding the acquisition of Uber China by DiDi. People just found out through the news. Eventually, then-CEO of Uber Travis Kalanick sent out a message regarding the acquisition. Matt’s tool was scrapped immediately.

Matt open-sourced the code for this WebSheets tool and the calculation engine lives on GitHub here. We chatted about the feedback Matt has received about his blog post, and you can see the comments on HackerNews. As usual, there are people chiming in saying Matt could’ve done this or that better. Whenever there is a mention of Excel on HackerNews, you’ll inevitably see people talking about how billions of dollars of their company’s business still runs off of someone’s Excel file. Interestingly, one of the resources Matt used to learn about Excel is Martin Shkreli’s YouTube channel where Shkreli walks through building out a financial model. Putting aside misgivings about Shkreli’s character, the videos are actually super educational:

Excel’s fast feedback loop

This is where Matt’s story turns into takeaways and learnings that make it more than a story about Uber China and Excel. Matt built something from scratch and had to come to terms with the fact that it no longer had a business purpose. The tool is just a way to achieve the business objective. If the business objective changes, then the tool may become obsolete.

Hearing Matt’s perspective about Excel was quite refreshing since, prior to this Crystal Ball project, he wasn’t an analyst in the weeds of Excel every day. However, he worked with said analysts every day to understand their requirements and, more importantly, why they were so tied to Excel. Excel allows you to create a fast feedback loop to test an idea or an assumption. The reason the city team stuck with Excel and put up with the hours of calculation time is that building similar functionality with code would’ve been too difficult.

Founders will use Excel before writing code.

To the analysts and data scientists Matt worked with, writing formulas was their version of programming. Unlike traditional programming, Excel users don’t have to develop unit tests, build integrations, and deal with piping data in/out. Another interesting tidbit Matt brought up about the internal workings of the city team at the time is that there was no expectation that a given Excel file would live for more than a week. Each file would solve a specific problem at that point in time, and then get discarded as it too became obsolete.

Planning and forecasting on IBM software

Following this Crystal Ball project, Matt started working on the financial engineering team within Uber. His next project was figuring out how much revenue Uber would make in 2017. The tool they used was IBM TM1, which you can think of as a self-hosted alternative to Anaplan. I had never heard of this tool from an FP&A perspective, but my guess is that it’s similar to Oracle Hyperion (the tool I used back in the day).

Source: Lodestar Solutions

There were analysts working with this tool who would turn Excel spreadsheet data into TM1 code for planning purposes. The problem is that TM1 code is not strongly typed, so analysts would constantly break the tool when trying to write code for it. TM1 was originally created by one person, and the platform was later acquired by IBM. Uber even invited one of TM1’s chief architects to talk to Uber’s analysts about the tool. According to the creator of TM1, Manny Perez, TM1 was the first “functional database” in the 1980s and exploited in-memory computing. Apparently there’s a cult following around Manny and the creation of TM1. So much so that a documentary was released a few years ago aptly named Beyond the Spreadsheet: The Story of TM1:

Not gonna lie, this seems like a super interesting documentary given that the foundation of the story discusses spreadsheets at length. How about this description from the film’s website to spark some excitement around corporate planning software:

But as long ago as 1983, a light-bulb idea went off in the head of an employee at oil distributor Exxon. Manny Perez realized he could give business users the freedom to create at scale but also the control and collaboration prevalent in other technologies today. He thought his solution to the problem was so elegant and obvious, it would become instantly ubiquitous. It didn’t. To achieve his ultimate aims, he would need to pioneer and master many facets of technology, staying true to the spirit of user freedom whilst battling waves of competitors selling solutions that enriched themselves but not their customers. Eventually, with thousands of companies globally using his solution, and with a passionate community of followers, his inspiration and perspiration was validated when IBM acquired his technology in 2008.

Source: tm1.film

Back to Matt’s work with TM1. His goal was to make it easier for analysts to work with the software. He built a programming language on top of what the analysts were coding. The new language had type inference and checking to prevent errors from occurring in TM1.
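
Matt’s actual language isn’t shown in the episode, so here’s only a rough sketch (in Python, with an invented expression format and type rules) of the kind of check a typed layer can run before code ever reaches a planning system like TM1: infer a type for each expression and reject mismatches early.

```python
# A toy type checker for simple planning expressions, purely illustrative.
# Expressions are nested tuples like ("+", ("num", 5), ("str", "FY17")).

def infer(expr):
    """Return the inferred type of an expression or raise a TypeError."""
    tag = expr[0]
    if tag == "num":
        return "number"
    if tag == "str":
        return "string"
    if tag in ("+", "-", "*", "/"):
        left, right = infer(expr[1]), infer(expr[2])
        if left != "number" or right != "number":
            raise TypeError(f"'{tag}' expects numbers, got {left} and {right}")
        return "number"
    raise TypeError(f"unknown expression tag: {tag}")


# Caught before it ever hits the planning system:
try:
    infer(("+", ("num", 5), ("str", "FY17")))
except TypeError as err:
    print(err)  # '+' expects numbers, got number and string
```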

Tips for Excel users

Given Matt’s extensive experience building on top of Excel and working with analysts all day at Uber, I thought it would be interesting to get the tips he has for us Excel users. A key question worth pondering is when the business evolves to a point where Excel no longer makes sense as the tool of record. I’m sure many of you have worked with files that handle business-critical processes at your company and have wondered: this data should probably live in a database or something more secure than Excel.

Source: KaiNexus Blog

Realistically, moving the data and process off of Excel involves a team of engineers writing code where everything is hosted on a server. The resourcing required for this speaks to the speed and immediacy of Excel’s value when your team needs to work fast. Should your team go down this route and write code instead of spreadsheets, Matt encourages all analysts to do one thing: provide good documentation.

This helps with the migration process when you have to work with a team of engineers. Tactically, this can mean something as simple as adding a comment to a cell in your file, leaving notes in the cell itself, or even creating a text box with the notes in it. How many times have you inherited a file and spent hours spelunking around trying to figure out how it was constructed? Good documentation helps everyone.
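
If the documentation lives in the workbook itself, you can even add cell comments programmatically. Here’s a small sketch using the openpyxl library; the cell, value, and note text are placeholders.

```python
from openpyxl import Workbook
from openpyxl.comments import Comment

wb = Workbook()
ws = wb.active

# Leave a note on the cell that drives a key calculation so the next analyst
# (or the engineering team doing a migration) knows where the number comes from.
ws["B2"] = 125000
ws["B2"].comment = Comment(
    "Weekly incentive budget. Source: finance export, refreshed every Monday.",
    "Data Team",
)

wb.save("driver_incentives_documented.xlsx")  # placeholder filename
```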

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #121: Fabricating and skewing Excel survey data about honesty with behavioral economists Dan Ariely and Francesca Gino
https://www.thekeycuts.com/dear-analyst-121-fabricating-and-skewing-survey-data-about-honesty-with-behavioral-economists-dan-ariely-and-francesca-gino/
Mon, 23 Oct 2023

One of the more popular courses you could take at my college to fulfill the finance major requirements was Behavioral Finance. The main “textbook” was Inefficient Markets, and we learned about qualitative ways to value a security beyond what the efficient market hypothesis purports. During the financial crisis of 2008, psychology professor and behavioral economist Dan Ariely published Predictably Irrational to much fanfare. The gist of the book is that humans are less rational than economic theory tells us. Armed with the knowledge that humans are irrational (what a surprise) when it comes to investing and other aspects of life, a capitalist would try to find the edge in a situation and turn a profit. That is, until recent reports surfaced showing that the results of Dan Ariely’s experiments were fabricated (Ariely partially admits to it). This episode looks at how the data was potentially fabricated to skew the final results.

Dan Ariely. Source: Wikipedia

Background on the controversy surrounding Dan Ariely’s fabricated data

In short, Ariely’s main experiment coming under fire is one he ran with an auto insurance company. The auto insurance company asks customers to provide odometer readings. Ariely claims that if you “nudge” the customer first by having them sign an “honesty declaration” at the top of the form saying they won’t lie on the odometer reading, they will provide more accurate (higher) readings.

I was a fan of Predictably Irrational. It was an easy read, and Ariely’s storytelling in his TED talk from 15 years ago is compelling. I first heard that Ariely’s experiments were coming under scrutiny from this Planet Money episode called Did two honesty researchers fabricate their data? The episode walks through how Ariely became a thought leader and used his status to get paid behavioral economics consulting gigs and to give talks. Apparently the Israeli Ministry of Finance paid Ariely to look into ways to reduce traffic congestion. In the Planet Money episode, they talk about how other behavioral scientists like Professor Michael Sanders applied Ariely’s findings in a project with the Guatemalan government aimed at encouraging businesses to accurately report taxes. Sanders was the one who originally questioned the efficacy of Ariely’s findings. Here is part of the abstract from the paper Sanders wrote with his co-authors:

The trial involves short messages and choices presented to taxpayers as part of a CAPTCHA pop-up window immediately before they file a tax return, with the aim of priming honest declarations. […] Treatments include: honesty declaration; information about public goods; information about penalties for dishonesty, questions allowing a taxpayer to choose which public good they think tax money should be spent on; or questions allowing a taxpayer to state a view on the penalty for not declaring honestly. We find no impact of any of these treatments on the average amount of tax declared. We discuss potential causes for this null effect and implications for ‘online nudges’ around honesty priming.

Professor Michael Sanders

If you want to dive deeper into Dan Ariely’s story, how he rose to fame, and the events surrounding this controversy, this New Yorker article by Gideon Lewis-Kraus is well researched and reported. NPR also did a podcast episode about this a few months ago. This undergraduate student only has one video in his YouTube account, but it tells the story about Ariely quite well:

Instead of discussing Ariely’s career and his character, I’m going to focus on the data irregularities in the Excel file Ariely used to come up with the findings from the auto insurance experiment. This podcast/newsletter is about data analysis, after all.

Instead of dissecting the Excel file myself, I’m basically going to re-hash the findings from this Data Colada blog post. Data Colada is a blog run by three behavioral scientists: Uri Simonsohn, Leif Nelson, and Joe Simmons. Their posts demonstrate how “p-hacking” is used to massage data to get the results you want.

Irregularity #1: Uniform distribution vs. normal distribution of miles driven

This is the raw driving dataset from the experiment (download the file here). Each row represents an individual insurance policy and each column shows the odometer reading for each car in the policy before and after the form was presented to the customer.

The average number of miles driven per year, irrespective of this experiment, is around 13,000. In this dataset, you would expect to see a lot of numbers around 13,000, a few numbers below 1,000, and a few numbers above 50,000 (as an example). This is what a normal distribution, or bell curve, looks like:

Source: Math Is Fun

In Ariely’s dataset, there is a uniform distribution of miles driven. This means the number of people driving 1,000 miles per year is similar to the number who drove 13,000 miles/year and the number who drove 50,000 miles/year.

Source: Data Colada

No bell curve. No normal distribution. This by itself makes the dataset very suspect. One could argue that the data points were cherry-picked to massage the data a certain way, but the other irregularities will show that something more sinister was at play. You’ll also notice in the chart created by Data Colada that the data abruptly stops at 50,000 miles per year. Although 50,000 miles driven per year is a lot, it’s highly unlikely that there are no observations above 50,000.
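
If you want to run this kind of sanity check yourself, here’s a rough sketch on simulated numbers (not Ariely’s actual data): compare a plausibly real, roughly normal sample of annual mileage against a uniformly generated one and look at how the values spread across bins.

```python
import random
from collections import Counter

random.seed(42)

# Simulated data, purely for illustration: a "plausible" roughly normal sample
# centered near 13,000 miles/year vs. a uniform sample capped at 50,000.
plausible = [max(0, int(random.gauss(13_000, 5_000))) for _ in range(10_000)]
suspicious = [random.randint(0, 50_000) for _ in range(10_000)]

def bin_counts(values, width=10_000):
    """Count observations per mileage bin, e.g. 0-10k, 10k-20k, ..."""
    return Counter((v // width) * width for v in values)

print("plausible :", sorted(bin_counts(plausible).items()))
print("suspicious:", sorted(bin_counts(suspicious).items()))
# The plausible sample piles up around 10k-20k; the uniform one has roughly
# equal counts in every bin and stops dead at 50,000, which is the shape
# Data Colada flagged in the reported dataset.
```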

Irregularity #2: Mileage reported after people were shown the form is not rounded (and RANDBETWEEN() may have been used)

People in the experiment were asked to recall their mileage driven and write the number on a piece of paper. If you were to report a large number like this from memory, you’d probably round it to the nearest 100 or 1,000. In the screenshot below, you’ll see that some of the reported mileages are indeed rounded. What’s peculiar is that the mileage reported after people were shown the form (Column D) was generally not rounded at all:

Did these customers all of a sudden remember their mileage driven down to the single digit? Highly suspect. Data Colada suggests that the RANDBETWEEN() function in Excel was used to fabricate the mileage in Column D. The reasoning is that RANDBETWEEN() returns arbitrary integers in a range, so the values it produces won’t be rounded to the nearest 100 or 1,000 the way human-reported numbers tend to be.

Even the numbers in Column C (mileage reported before people were shown the form) seem suspect given how precise most of them are. If Ariely or members of his lab did in fact use RANDBETWEEN() to generate the mileage in Column D, they could’ve at least tried to hide it better using the ROUND() function, which would let them round the numbers to the nearest 100 or 1,000. This is just pure laziness.

This chart from Data Colada further shows how the last digit in the baseline mileage (before people were shown the form) is disproportionately 0. This supports the idea that those numbers were genuinely reported by customers, who tend to round. The last digit in the updated mileage (after people were shown the form) again has a uniform distribution, further adding to the evidence that the numbers were fabricated.

Source: Data Colada
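
Here’s a minimal sketch of that last-digit test, again on simulated numbers rather than the actual dataset: human-style reports that are often rounded skew heavily toward a trailing 0, while uniformly generated integers spread evenly across 0 through 9.

```python
import random
from collections import Counter

random.seed(7)

def last_digit_distribution(values):
    """Count how often each final digit (0-9) appears."""
    return Counter(abs(int(v)) % 10 for v in values)

# Simulated "human-reported" mileage: most people round to the nearest 100.
human_reported = [
    round(random.gauss(13_000, 5_000), -2) if random.random() < 0.8
    else random.gauss(13_000, 5_000)
    for _ in range(10_000)
]

# Simulated fabricated mileage: RANDBETWEEN-style uniform integers.
fabricated = [random.randint(0, 50_000) for _ in range(10_000)]

print("human     :", sorted(last_digit_distribution(human_reported).items()))
print("fabricated:", sorted(last_digit_distribution(fabricated).items()))
# The human-style sample's last digit is overwhelmingly 0; the fabricated
# sample's last digits are close to uniform, matching the pattern in the
# Data Colada chart.
```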

Irregularity #3: Two fonts randomly used throughout Excel file

This is by far the most amateur mistake when it comes to judging the validity of any dataset. When you open the Excel file, something instantly feels off about the data. That’s because half of the rows are in Calibri font (the default Excel font) and the other half are in Cambria font (another font that ships with Office).

Were some of the rows copied and pasted from another Excel file into the main file and then sorted in some fashion? Did someone incorrectly select half the data and set it to Cambria?

According to Data Colada, the numbers probably started out in Calibri and the RANDBETWEEN() function was used again to generate a number between 0 and 1,000 to be added to the number in Calibri. The resulting number is in Cambria:

Source: Data Colada

To recap what the data hacking looks like with this irregularity:

  1. 13,000 baseline car readings are composed of Calibri and Cambria font (almost exactly 50/50)
  2. 6,500 “accurate” observations are in Calibri
  3. 6,500 new observations were fabricated in Cambria
  4. To mask the new observations, a random number between 0 and 1,000 was added to the original numbers in Calibri to form the fabricated numbers in Cambria

In the screenshot above, this pattern of the Cambria number being almost identical to the Calibri number is what leads Data Colada to believe that the Cambria numbers (half the dataset) are fabricated.
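
Here’s a rough sketch of how you could test for that pairing pattern yourself, assuming you had the values split out by font (the values below are made up): for each Cambria number, check whether some Calibri number sits at most 1,000 below it.

```python
# Illustrative check for the "Cambria = Calibri + RANDBETWEEN(0, 1000)" pattern.
# The values below are made up; in practice you'd read them out of the workbook
# along with each cell's font.

calibri_values = [12_340, 8_905, 27_150, 3_480, 45_010]
cambria_values = [12_919, 9_517, 27_898, 3_921, 45_733]

def has_calibri_parent(cambria_value, calibri_values, max_offset=1_000):
    """True if some Calibri value sits within [value - max_offset, value]."""
    return any(0 <= cambria_value - c <= max_offset for c in calibri_values)

matches = sum(has_calibri_parent(v, calibri_values) for v in cambria_values)
print(f"{matches}/{len(cambria_values)} Cambria values pair with a Calibri value")
# If nearly every Cambria value has a Calibri "parent" within 1,000 below it,
# that's strong evidence the second half of the data was derived from the first.
```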

To put the cherry on top of this font irregularity, very few of the numbers in Cambria font are rounded. As discussed in irregularity #2 above, using RANDBETWEEN() without ROUND() produces numbers that aren’t rounded. Not having rounded numbers is, again, highly suspicious when you consider that these mileage numbers were reported by humans, who tend to round large numbers.

Source: Data Colada

Why did Ariely allegedly fabricate the numbers?

Easy. Fame, notoriety, and consulting gigs. Again, I’d read the New Yorker piece to learn more about Ariely’s background and character. The narrative Ariely wanted to tell was that nudges have an outsize impact on behavior, and the data was skewed to prove this.

Source: Resourceaholic

Ariely actually acknowledged Data Colada’s analysis and basically responded with “I’ll check my data better next time” over email. The New Yorker article raises the possibility that someone at the auto insurance company fabricated the data before it was sent to Ariely, which means Ariely can claim he had no hand in fabricating the data.

Even if that were the case, wouldn’t you at least scroll through the dataset and notice, I don’t know, that the data is in two different fonts? Your future TED talks, published books, and paid consulting gigs are dependent on your findings from this Excel file and you don’t bother to check its validity? The file is just over 13,000 rows long, so it’s not even that huge of a dataset. While not on the same scale, this narrative feels similar to what happened with Theranos. Similar to Elizabeth Holmes, Ariely claims he can’t recall who sent him datasets or how the data was transformed (as reported in the New Yorker).

Excel mistakes are different from fabricating data

I’ve dissected a few Excel blunders on the podcast such as the error that led to a $6.2B loss at JPMorgan Chase, Enron’s spreadsheet woes, the DCF spreadsheet error leading to a mistake with a Tesla acquisition, and many others. In these cases, the pilot simply misused the instrument which led to a massive mistake.

With the fabricated data in Ariely’s experiment, Ariely, members of his lab, or someone at the auto insurance company knowingly massaged the data with the intention of not getting caught. Better auditing or controls cannot prevent data dredging of this magnitude.

Perhaps Ariely (or whoever fabricated the data) knew that if they could tell the narrative that “nudging” does indeed lead to changes in human behavior, there would be a sizeable financial payout somewhere down the line.

Source: GetYarn

Blowing the whistle on Ariely

In the Planet Money episode referenced earlier, Professor Michael Sanders is credited with first calling bullshit on Ariely’s findings after his own failed project with the Guatemalan government. Data Colada’s blog post really made clear what issues existed in Ariely’s spreadsheet.

Data Colada kind of reminds me of the European Spreadsheet Risks Interest Group (EuSpRIG), a group of individuals who document all these Excel errors in the hopes that analysts won’t make the same mistakes. By detailing Ariely’s spreadsheet tactics, hopefully it will be easier to spot issues like this in the future.

The New Yorker article shows that it’s hard to evaluate the true intentions of each party in this case. It’s easy to point fingers at Ariely and say he committed spreadsheet fraud for his own personal gain. But what about Data Colada? While the behavioral scientists behind the blog seem like upstanding citizens, who knows what benefit they stand to gain from uncovering these issues and calling out fraud? Simmons, Nelson, and Simonsohn also get their share of the limelight in this recent WSJ article highlighting the impact of the group’s research.

Leif Nelson, Uri Simonsohn, and Joe Simmons. Source: WSJ

Like Ariely, maybe more consulting gigs get thrown their way based on their ability to take down high profile authors and scientists? Remember when Hindenburg Research came out with the hit piece on Nikola leading to the resignation of the CEO? Not only did Hindenburg stand to gain from short-selling the stock, they also drew more attention to their investment research services. They also probably got more inbound interest from people who have an axe to grind with some other company CEO and want to take down the company.

Open source wins the day

I’ve been a fan of open source ever since I got into software because, well, the whole fucking Internet runs on it. One of my favorite data cleaning tools (OpenRefine) is completely free to use and is just as powerful as Microsoft Power Query for cleaning data.

Source: Rocket.Chat

The beautiful thing about open source is that anyone can analyze and investigate how the code really works. There is no narrative about what the tool or library can do. These same values should also be applied to researchers and scientists. I really like how the Data Colada team ended their post on Ariely’s spreadsheet issues:

There will never be a perfect solution, but there is an obvious step to take: Data should be posted.  The fabrication in this paper was discovered because the data were posted. If more data were posted, fraud would be easier to catch. And if fraud is easier to catch, some potential fraudsters may be more reluctant to do it. Other disciplines are already doing this. For example, many top economics journals require authors to post their raw data. There is really no excuse. All of our journals should require data posting. Until that day comes, all of us have a role to play. As authors (and co-authors), we should always make all of our data publicly available. And as editors and reviewers, we can ask for data during the review process, or turn down requests to review papers that do not make their data available. A field that ignores the problem of fraud, or pretends that it does not exist, risks losing its credibility. And deservedly so.

Hopefully this episode nudges you in the right direction.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Source: Rachel E Cinelli

Dear Analyst #120: Marketing attribution, sensitivity models, and building data infrastructure from the ground up with Zach Wilner
https://www.thekeycuts.com/dear-analyst-120-marketing-attribution-sensitivity-models-and-building-data-infrastructure-from-the-ground-up-with-zach-wilner/
Mon, 09 Oct 2023

Data analytics and business analytics are still relatively new areas of study (in terms of academics). The subject borders business and computer science. When I went to school, the only data analytics classes available were special electives offered through our school’s continuing education department. In this episode, I spoke with Zach Wilner, who currently leads data and analytics at Pair Eyewear. Zach is “classically trained” in data analytics (if one can call it such) since he studied business analytics at Boston College. He worked at various DTC (direct-to-consumer) companies like Wayfair and Bombas before landing at Pair (also a DTC company). In addition to discussing marketing attribution and pricing projects, Zach also talks about building Pair Eyewear’s data infrastructure from 0 and how to build the team around it.

Scaling a data stack in a step-wise approach

When Zach joined Pair, there wasn’t really much of a data infrastructure in place. People wanted to analyze and visualize data but didn’t know where to pull the data from. The classic multiple data silos problem.

The easy thing to do would’ve been to take the data stack from Bombas or Wayfair and try to implement it at Pair. Instead, Zach asked: what if we started with a blank slate? With the help of a consultant, Zach spent 6 months building out a data warehouse with dbt, Stitch, and other ETL tools. After the foundation was in place, he focused on BI and implemented Looker and Heap. The goal is to make analytics as self-service as possible. Today, 60%-70% of the company uses Looker actively.

From a marketing analytics perspective, most DTC companies have similar marketing channels (e.g. Shopify, Facebook, TikTok). This means Zach could set up similar telemetry for tracking all of Pair’s marketing initiatives. One area the team spent some time on was health data; they decided they wouldn’t pursue HIPAA compliance or deal with PHI data.

Customer centric vs. marketing attribution model

Marketing attribution. A never-ending battle between marketing channels and data to figure out which channel gives your company the best bang for your buck. The reason I know this problem hasn’t been solved yet is that new marketing attribution vendors pop up every year claiming to be the end-all-be-all omnichannel tracking tool. If you work in martech, you’ve seen the industry evolve from last-click to multi-touch models.

Source: WordStream

Zach worked with Pair’s head of marketing to figure out what model would work for the company. Surprise surprise, they started with the data. Using the data, they answered questions like how many sessions does it take before a customer makes a purchase? How many ads does the customer need to see before they make a purchase?

The team decided to build a home-grown attribution model and called it a customer-centric attribution model. They basically looked at how individual customers viewed Pair’s different marketing messages and optimized spend based on the customer. They validated the attribution by comparing their results with lift studies from Facebook.
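
The episode doesn’t spell out the model’s internals, so what follows is only a generic sketch of the customer-centric, multi-touch idea: gather each customer’s touchpoints and split conversion credit across them (even weighting here, purely for illustration, with made-up events).

```python
from collections import defaultdict

# Made-up touchpoint log: (customer_id, channel) events leading up to a purchase.
touchpoints = [
    ("cust_1", "facebook"), ("cust_1", "tiktok"), ("cust_1", "email"),
    ("cust_2", "facebook"), ("cust_2", "facebook"),
    ("cust_3", "tiktok"),
]
conversions = {"cust_1", "cust_3"}  # customers who actually purchased

def attribute(touchpoints, conversions):
    """Split one unit of conversion credit evenly across a converting
    customer's touchpoints (a simple linear multi-touch model)."""
    by_customer = defaultdict(list)
    for customer, channel in touchpoints:
        by_customer[customer].append(channel)

    credit = defaultdict(float)
    for customer in conversions:
        channels = by_customer.get(customer, [])
        for channel in channels:
            credit[channel] += 1 / len(channels)
    return dict(credit)

print(attribute(touchpoints, conversions))
# facebook ~0.33, tiktok ~1.33, email ~0.33: channel credit you could then
# compare against lift studies before reallocating spend.
```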

Using a sensitivity model to experiment with pricing

Pair’s business model is built around limited-edition drops. This means a lot of one-unit orders when the drops happen. With the longevity of the business in mind, the team asked what would happen if they encouraged customers to purchase two items at a time, less frequently, instead of relying on these one-time higher-priced drops.

Source: SoundCloud (Mokos)

Again, they started with the data. They looked at the distribution of their order values. As expected, they saw a normal distribution of orders and could see the average order value across all customers. Using this data, they could figure out what order minimum customers were reaching for. Then came the sensitivity model to find the tradeoff between a lower conversion rate and a higher order value.
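
Here’s a minimal sketch of that kind of sensitivity analysis, with all numbers invented: expected revenue per visitor is conversion rate times average order value, so you can tabulate how much conversion you can afford to lose at each candidate order minimum.

```python
# Hypothetical scenarios for an order-minimum experiment. Each entry:
# (order minimum, assumed conversion rate, assumed average order value).
scenarios = [
    (0,   0.050, 62.0),   # no minimum (baseline)
    (75,  0.046, 78.0),   # modest minimum, small conversion hit
    (100, 0.040, 95.0),   # higher minimum, bigger conversion hit
    (125, 0.032, 110.0),  # aggressive minimum
]

print(f"{'minimum':>8} {'conv %':>7} {'AOV':>7} {'rev / visitor':>14}")
for minimum, conversion_rate, avg_order_value in scenarios:
    revenue_per_visitor = conversion_rate * avg_order_value
    print(f"{minimum:>8} {conversion_rate:>7.1%} {avg_order_value:>7.2f} "
          f"{revenue_per_visitor:>14.2f}")
# The "best" threshold is the one where the higher order value more than
# offsets the conversion drop; with these made-up numbers, the $100 row wins.
```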

Hiring the right people for your data team

The sequencing of how Zach went about hiring members to join his data team might sound familiar to folks. The first hire was an analytics engineer, the Swiss army knife of the data world. The analytics engineer can help build the tech stack and do analysis. This breakdown of data engineer, analytics engineer, and data analyst is always good to know:

Source: LearnSQL

Once the data infrastructure is in place, Zach then hired the data analysts who do the more traditional exploratory analysis and dashboarding. From there, Zach built out a consumer insights team. The analytics team is now doing full-stack stuff which goes beyond Excel and Tableau. They are diving into dbt and machine learning as well.

Zach talked about encouraging data analysts to be generalists. One reason people leave their current job or employer is simply being bored with the work. If an analyst is a generalist, they can grow and learn and be excited about other aspects of their role. They will have the opportunity to touch multiple departments. More importantly, they can approach company problems from multiple angles.

What keeps Zach up at night: building in-house vs. managed services

Build vs. buy. No matter how trite this debate may seem to some of you, I think it’s always interesting to hear how different companies view this problem. There’s always a new set of constraints, contexts, and tools against which to consider the tradeoff. What doesn’t change, however, is that there is never a clear answer. Even when you think you’ve made the right decision, that can all change next quarter.

Source: Customer Success Memes

One of the things that keeps Zach up at night is whether a certain task should be delegated to a managed service like Stitch or Fivetran. These tools make it easy to tap into APIs. They also allow teams to move quicker and get to impact faster. The problem is that it opens up your company to more risk. If one of the APIs or providers happens to go down, you’re at the mercy of the provider. Zach talked about an issue Stitch had with the Shopify API and how there was nothing his team could do about it.

The other side is that you build in-house and everything is under your control. This requires more resources and you move slower. According to Zach, this tradeoff is something he revisits often, and the work is never quite done even when you think it’s done.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #119: Developing the holy “grail” model at Lyft, user journeys, and hidden analytics with Sean Taylor
https://www.thekeycuts.com/dear-analyst-119-developing-the-holy-grail-model-at-lyft-user-journeys-and-hidden-analytics-with-sean-taylor/
Mon, 18 Sep 2023

Future Dear Analyst episodes will get more sporadic since, well, life gets in the way. Unfortunately curiosity (in most cases) doesn’t pay the bills. Nevertheless, when I come across an idea or person that I think is worth sharing or learning more about, I’ll try my best to post. In this episode, I interview the Chief Scientist of a data startup who did his PhD at NYU Stern and was on track to becoming a professor. Then he got an internship at Facebook and everything changed. The speed of learning at a tech company outpaced what he was used to in academia. Over the years, Sean Taylor has worked with and spoken to hundreds of data analysts and statisticians. We’ll dive into his data science work at Lyft, his notion of “hidden analytics,” and why he’s obsessed with user journeys in modern applications.

Modeling the Lyft marketplace and creating the GRAIL model

Sean worked at Facebook for 5 years as a research scientist working on general data problems. Eventually he joined the revenue operations science team at Lyft. His team’s goal was to help grow the marketplace of riders and drivers on the platform. One of the most important aspects of the marketplace is the forecast. As Lyft runs promotions and enters new cities, how do you ensure there are enough drivers for the riders and vice versa?

The team ultimately decided that a simple cohort methodology would be best to help set the forecast for both drivers and riders. Every rider, for instance, would belong to a cohort based on when they first signed up for Lyft, when they booked their first ride, etc. There’s a “liquidation curve” for each cohort that eventually hugs the x-axis. There is much more detail about the cohort methodology in this blog post by the Lyft Engineering team from 2019.

Despite being such a simple model, it worked surprisingly well. Here are the goals of the model, taken from the blog post mentioned in the previous paragraph (a rough sketch of the cohort idea follows the list):

  1. Forecast the behavior of each observed cohort and use it to project how many rides are taken or driver hours are provided within a specific cohort
  2. Forecast the behavior of the cohorts that are yet to be seen.
  3. Aggregate all the projected rides and driver hours to make forecasts for both the demand and supply side of our business.
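
To make the cohort idea concrete, here’s a bare-bones sketch with invented numbers and retention curves (far simpler than Lyft’s actual work): each signup cohort contributes rides according to a decaying retention curve, and the total forecast is the sum across cohorts.

```python
# Toy cohort-based forecast: each monthly signup cohort's ride volume decays
# along a "liquidation curve" that eventually hugs the x-axis.

cohorts = {
    # cohort label: (number of riders, rides per rider in their first month)
    "2016-01": (10_000, 4.0),
    "2016-02": (12_000, 4.2),
    "2016-03": (15_000, 4.1),
}

def retention(months_since_signup):
    """Invented decay curve: activity halves roughly every three months."""
    return 0.5 ** (months_since_signup / 3)

def forecast_rides(cohorts, horizon_months=6):
    """Project total rides for each future month by summing over cohorts."""
    cohort_list = list(cohorts.values())  # ordered oldest -> newest
    totals = []
    for month in range(horizon_months):
        total = 0.0
        for index, (riders, rides_per_rider) in enumerate(cohort_list):
            # Age of this cohort at the forecast month; the newest cohort is
            # one month old at the start of the forecast window.
            months_since_signup = month + (len(cohort_list) - index)
            total += riders * rides_per_rider * retention(months_since_signup)
        totals.append(round(total))
    return totals

print(forecast_rides(cohorts))
# Adding a row for each not-yet-seen cohort (goal #2) completes the projection.
```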

Sean talked about how there were flaws in the model, and one of those flaws is that a marketplace is very fluid and evolves over time. When a rider is exposed to high prices, it may lead to churn, which was also not included in the model. Sean’s team tried building a better model called GRAIL, but Sean left Lyft before completing it.

Source: Symposiums

Speaking of Lyft’s data team, I had mentioned Amundsen, an open source data discovery platform Lyft released in 2019 (blog post). It’s great to see the data team at Lyft giving back to the ecosystem to help data analysts and data scientists do their job better!

Discovering a bug that cost the company $15M per year

One of the best feelings as a data analyst is using data to uncover the root cause or underlying trends in a given business situation. One might say this is like Moneyball, where the Oakland A’s realize that on-base percentage (OBP) is the best predictor of player performance.

Source: Hire an Esquire

Sean believes there is a lot that data analysts do that is not necessarily taught in school or on the job. You’re expected to understand the business and how everyday business operations translate into the numbers on the dashboard.

When you’re working on a project because you are curious about it, rather than being forced to come up with an analysis, you are able to land the bigger wins that really move the needle. Sean calls this type of work “hidden analytics,” or as I like to say, there is much more behind the numbers.

Sean’s colleague at Lyft came across an anomaly in the data and just started pulling on the thread. His colleague ultimately found a bug in the marketplace in how Lyft was disbursing driver incentives. Sean talks about how his colleague’s curiosity led them to discover the bug in the first place, and how squashing it ended up saving Lyft $15M per year.

Why the systems for collecting user journey data are broken

Modern websites and applications collect a ton of data, but the actual user journey is harder to quantify. A customer signs up for a tool or service, goes through an onboarding process, and might engage with the tool at various times in the future. Modeling and visualizing this data in a spreadsheet or a SQL database can be difficult. With these tools, you are aggregating data, and parts of the user journey might be improperly reduced to a single number when there is much more nuance to a user’s journey on a website.

Source: Wikipedia

Users are in different states when using a website or app. Sessionizing data has become the default way to capture the path a user takes, but there are still many micro-sessions within a single experience, like registering an account on a website.

Sean discusses this concept in the context of a rider taking or not taking a ride booked on Lyft. The customer requests the ride, and perhaps declines the first ride and books the second ride. The basic conversion rate would be 50%, but that statistic doesn’t answer why the customer didn’t book the first ride. Perhaps the customer couldn’t find the right address with the first ride, and just gave up. Perhaps the driver was too far away.
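
Here’s a rough sketch of why a single conversion number hides the story, using invented events: walk the event sequence per ride request and record how each request ended instead of collapsing everything into one rate.

```python
# Invented event log for one rider session: (request_id, event) in time order.
events = [
    ("req_1", "ride_requested"),
    ("req_1", "driver_eta_shown"),
    ("req_1", "request_cancelled"),   # rider gave up on the first request
    ("req_2", "ride_requested"),
    ("req_2", "driver_eta_shown"),
    ("req_2", "ride_completed"),
]

def summarize_requests(events):
    """Group events by request and record how each request ended."""
    sequences = {}
    for request_id, event in events:
        sequences.setdefault(request_id, []).append(event)
    return {request_id: seq[-1] for request_id, seq in sequences.items()}

outcomes = summarize_requests(events)
converted = sum(outcome == "ride_completed" for outcome in outcomes.values())
print(f"conversion rate: {converted / len(outcomes):.0%}")   # 50%
print("per-request outcomes:", outcomes)
# The 50% figure alone can't tell you why req_1 failed; keeping the ordered
# event sequence (long ETA? wrong pickup address?) is what makes the journey
# analyzable.
```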

Balancing usability and expressivity in data tools

Browse any Hacker News article and you’ll inevitably see devs talking about why you should just build your own tool on-prem with code. The main reason is that you can fully customize the app if you know how to code. I’ve discussed at length on this podcast and through content I’ve created for my company how the need for low-code and no-code tools redefines who a “builder” is in a company.

Sean’s current company (Motif Analytics) is trying to strike that balance between giving data analysts and data scientists the ability to express their data question without diving right into the code. In terms of user journey data, Sean says most people use Amplitude, Mixpanel, or other similar tools. While these tools allow you to execute common data tasks, there are certain things these tools block you from doing. Python notebooks, for instance, are very expressive. But you kind of need to be an expert to use them to their full potential.

Source: Jupyter

Sean talks about how he drew inspiration from Ruby on Rails in terms of how the creators had strong opinions about how to do web development. I also first learned about web development through a Ruby on Rails book and it’s interesting to see how many of the patterns from Rails are still seen in frameworks using PHP or Javascript.

As we discussed the platform Sean and his team are building, we got into the weeds about a little-known SQL clause called MATCH_RECOGNIZE. There apparently isn’t much documentation about it, and the creators behind the SQL standard rushed this pattern-matching feature into the language because competitors were coming out with similar functionality. Nothing like real-world drama impacting the open source world!
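
MATCH_RECOGNIZE is a row-pattern-matching clause available in engines like Oracle and Snowflake. As a very loose illustration of the idea (not the SQL syntax itself), here’s a tiny Python analogue that maps an ordered event stream to symbols and searches for a pattern with a regular expression; the events are invented.

```python
import re

# Ordered event stream for one rider, reduced to single-letter symbols so a
# regular expression can stand in for a row-pattern match. This is only a
# loose analogue of what SQL's MATCH_RECOGNIZE expresses over table rows.
SYMBOLS = {"ride_requested": "R", "request_cancelled": "C", "ride_completed": "D"}
events = ["ride_requested", "request_cancelled", "ride_requested", "ride_completed"]

stream = "".join(SYMBOLS[event] for event in events)  # "RCRD"

# Pattern: a request that gets cancelled, followed by a request that completes.
pattern = re.compile(r"RC(?:RD)")

for match in pattern.finditer(stream):
    print("cancel-then-complete sequence found at positions", match.span())
```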

Start with the questions instead of the tools

We ended the conversation with a bit of career talk. Sean talks about intrinsic motivation being the number one driving force in his career. While tools come and go, he said domain expertise is something that can give budding analysts a leg up when searching for their next role. Technical skills, unfortunately, are slowly becoming a commodity. What never goes out of style? Asking the right questions.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #118: Uncovering trends and insights behind Facebook News Feed, Reels, and Recommendations using data science with Akos Lada https://www.thekeycuts.com/dear-analyst-118-uncovering-trends-and-insights-behind-facebook-news-feed-reels-and-recommendations-using-data-science-with-akos-lada/ https://www.thekeycuts.com/dear-analyst-118-uncovering-trends-and-insights-behind-facebook-news-feed-reels-and-recommendations-using-data-science-with-akos-lada/#respond Mon, 03 Jul 2023 05:41:00 +0000 https://www.thekeycuts.com/?p=53604 No, this isn’t an episode about how Facebook’s algorithm and feed works. The data science function is popping up in companies small and large given the amount of data swimming around. No other company understand the power and influence that data science can have on the customer experience than Facebook (Meta, to be exact). Akos […]

No, this isn’t an episode about how Facebook’s algorithm and feed works. The data science function is popping up in companies small and large given the amount of data swimming around. Few companies understand the power and influence that data science can have on the customer experience better than Facebook (Meta, to be exact). Akos Lada is Facebook’s Director of Data Science for Feed Ranking and Recommendations. Akos has always been interested in the intersection of social science and data, so this role at Facebook seems fitting. In this episode, Akos discusses what the analytics team does at Facebook, an analytics framework his team developed and open-sourced, A/B testing, and more.

What does the data science team do at Facebook?

I know the company is called Meta, but I grew up calling it Facebook, so I’m just going to stick with Facebook for now. The data science team actually consists of two teams: Analytics and Central Applied Science.

The Analytics team partners with product managers and engineers, and their focus is on delivering long-term value for users (you’ll hear a lot about this during this episode). There is also another data science team Akos used to work on, called Central Applied Science (formerly known as Core Data Science), which is a smaller team that focuses on scientific problems and research that every product team at Facebook might be able to benefit from. One of the frameworks the Central Applied Science team created and open-sourced is called Ax. This framework helps optimize any kind of experiment, including machine learning experiments, A/B tests, and simulations.

Making better decisions with the GTMF model

Akos’ team published a blog post on four analytics best practices at Facebook which is worth a read. The impetus for this blog post was one question: how does Facebook drive more long-term value for users?

There are many different lenses you can put on to answer this question. Of course, Akos’ team treats this question as a data science question. The Ground Truth Maturity Framework (GTMF) improves ground truth data–the data that powers Facebook’s machine learning models. In a sense, the GTMF model ensures your data is clean. One place where GTMF is used is News Feed ranking. The team’s ultimate goal with News Feed is trying to figure out if a post is something you would want to click on. You can read more about how machine learning is used in the News Feed algorithm here.

Running A/B experiments to figure out the right number of notifications to send to Facebook users

Akos discusses at length his team’s experimentation frameworks. One interesting insight is that the longer his team kept experiments running (say, one year), the more the outcome of the experiment would change. One of the more surprising results from a long-term experiment his team ran was that sending fewer notifications to users led to better long-term value for users (e.g. clicking on more posts). In the short term, sending fewer notifications naturally leads to fewer people engaging with posts.

At the end of the day, this is a behavioral science challenge. Given the amount of data Akos’ team can analyze, they suggested that the product team drastically reduce the number of notifications being sent to Facebook users. You can read more about this experiment and the results here on the Facebook Analytics team’s blog.

While the data science team has so much data at their disposal to make data-driven decisions, Akos talks a bit about how the team also uses intuition when making decisions. In an organization as large as Facebook, you can run multiple experiments at a time, evaluate the results, and then ensure the knowledge and insights are spread across product teams. While the results from an experiment on News Feed may not necessarily apply to other product teams, other products at Facebook like Instagram and WhatsApp can benefit from the institutional knowledge.

What the future holds for data science at Facebook

There is a saying at Facebook that the work is only 1% done. Akos talks about how the data science field in general is a relatively new field that really began in the last decade. Compared to other fields like economics, data science is still in its infancy.

Akos’ team is investing more time in machine learning systems, neural networks, reinforcement learning, and all the new and sexy data science topics you’ve been reading about in the last few years. Akos’ interest in data science goes beyond Facebook as he’s published academic papers such as this one about heterogeneous causal effects. Akos talks about his fascination with how activity can change when nodes are connected to each other (referring to Facebook’s social graph). If someone sees a post and they find it interesting, they will share that post with their friends. Then those friends share that same post with their friends. Given the connected nature of the social graph, how can Akos’ team help suggest posts that you might like? Facebook’s recommendation system is built on a concept called collaborative filtering.

Advice for aspiring data scientists

It seems like a tradition now to ask people on the podcast about advice they have for upcoming data analysts, engineers, and scientists. Akos’ advice was a bit sobering but exactly what aspiring data scientists should keep in mind as they find their next role. It’s a tough time in the tech world, but don’t be discouraged. Akos believes that despite the downturn, data science will continue to grow as technology becomes ever more prevalent in our lives. Now is the time to double down on building your skills. One of the reasons Facebook has their Analytics blog is to share their insights with the community in the hopes that data scientists can build off of Facebook’s work. Akos talks a bit about the generative AI trend, but he’s still focused on how regular “generic” AI can still help people around the world.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!

Dear Analyst #117: New 2023 Google Sheets functions for data manipulation that already exist in Excel https://www.thekeycuts.com/dear-analyst-117-new-2023-google-sheets-functions-for-data-manipulation-that-already-exist-in-excel/ https://www.thekeycuts.com/dear-analyst-117-new-2023-google-sheets-functions-for-data-manipulation-that-already-exist-in-excel/#respond Tue, 23 May 2023 05:09:00 +0000 https://www.thekeycuts.com/?p=53152 The Google Workspace team announced a slew of Google Sheets functions a few months ago (February 2023). These functions look familiar and that’s because Microsoft Excel released most of them two years ago. I never had a chance to play around with the new functions in Excel since I don’t have the latest Office 365 […]

The Google Workspace team announced a slew of new Google Sheets functions a few months ago (February 2023). These functions look familiar, and that’s because Microsoft Excel released most of them two years ago. I never had a chance to play around with the new functions in Excel since I don’t have the latest Office 365 version. Now that they are live in Google Sheets, I played around with them and found them pretty useful for data manipulation. What’s interesting about these new functions is that they help with both super basic data organization use cases and more advanced data cleaning use cases. Here’s a rundown of some of the new functions and, more importantly, examples of real-life use cases. If you want a copy of the Google Sheet I use in this episode, go here.

Watch a tutorial showing all the new Google Sheets functions in 2023: https://www.youtube.com/watch?v=YQ8BG5frI3E

What’s interesting about these “new” Google Sheets functions?

Here’s a quick rant on these “new” Google Sheets functions. They aren’t new. They are basically a direct copy of what exists in Excel already (if you have Office 365). I think Google Sheets has some pretty awesome features that differentiate it from Excel (auto-fill, collaboration features, it’s free, etc.), but I’ve always viewed Google Sheets as a tool that is playing catchup to Excel. These functions are an example of Google playing catchup with Excel’s features versus coming up with something new.

These “new” functions in Google Sheets also highlight something Microsoft discovered a few years ago about how people are using spreadsheets: data is not organized in a structured way. You have time periods across the columns and the rows. You have headers and sub-headers. People don’t typically organize and clean their data for the purposes of a PivotTable but rather for ease of use. With this in mind, I think these new Google Sheets functions are targeted at the beginner spreadsheet user who may just be using Google Sheets to show who’s sitting at different tables at a banquet dinner or showing a shift schedule.

Next to each function, I also put a usefulness rating (🌶 being not useful and 🌶🌶🌶🌶🌶 being really useful) based on what I think would be useful for a beginner Google Sheets user.

1) EPOCHTODATE() – Turn computer-generated dates into a human-readable date format

USEFULNESS RATING: 🌶

This is a pretty basic one. You’ll typically get epoch dates when getting output from a database or any type of computer-generated date/time. It’s usually a long string of numbers, and EPOCHTODATE simply converts that “computer time” into a date and time that we humans can comprehend.

I gave this a rating of 1 because I don’t see many instances where you’ll have the epoch time format in your spreadsheet, save for the rare occasion when you have a Unix export of data that contains these epoch timestamps.
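As a quick illustration (the cell reference is just a placeholder): if cell A2 holds a Unix timestamp in seconds, the first formula below converts it to a readable date and time, and the optional second argument tells Google Sheets the timestamp is in milliseconds instead.

=EPOCHTODATE(A2)

=EPOCHTODATE(A2, 2)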

2) TOROW(), TOCOL() – Arrange a bunch of cells into a single row or column

USEFULNESS RATING: 🌶🌶🌶🌶🌶

Also a pretty simple formula that helps with basic data manipulation tasks. Big fan of this one because it removes the need to cut and paste ranges of data on top of each other. I think TOCOL() will be used more often just because you typically want to get a continuous list of values in one column. Here’s an example where you have a bunch of names arranged by groups (perhaps groups of students in a class) and you just want to get all the names in one column:

There are also some interesting options that let you remove errors and blanks, as well as control how the data should be “scanned” and put together. Someone just asked me how to do a data manipulation task similar to this, and using TOCOL() with the scan_by_column flag set to false does the trick.
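A rough example with a placeholder range: this pulls a grid of names into a single column, ignores blank cells (the 1 in the second argument), and scans row by row because scan_by_column is set to FALSE.

=TOCOL(A2:D6, 1, FALSE)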

3) CHOOSEROWS(), CHOOSECOLS() – Choose which rows or columns you want from a data set

USEFULNESS RATING: 🌶🌶🌶🌶

I would put these new functions in the camp of “making it easier to filter out the data I don’t need.” I find this useful when you know you want to quickly get the top 3 scores, or maybe the top score and bottom score, from a list of test scores, for instance. There are probably a bunch of other use cases I’m not able to think of, but in general it’s a really useful function for quickly “pulling out” the rows or columns of data you need from a data set. CHOOSEROWS() in action:

While we’re at it, I’d say CHOOSECOLS() is equally as useful because you can just pull out the columns of data that matter for you. In this case, you can pull out the list of students and just the scores from the subjects that matter to you. This feels like a more user-friendly version of the {} syntax for concatenating different ranges to create a custom range (typically used for creating a custom VLOOKUP formula with multiple conditions).
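A couple of quick sketches with placeholder ranges: the first grabs the first three rows of a data set, the second grabs the first and last rows (negative indices count from the bottom), and the third keeps only the first and third columns.

=CHOOSEROWS(A2:D20, 1, 2, 3)

=CHOOSEROWS(A2:D20, 1, -1)

=CHOOSECOLS(A2:D20, 1, 3)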

4) WRAPROWS(), WRAPCOLS() – Turn a bunch of cells into a specified number of rows or columns

USEFULNESS RATING: 🌶🌶

Kind of an interesting formula for a specific use case (I think). You put in a list of cells, and then the number of rows or columns you want to turn the list into. I don’t find these formulas that useful because your data has to be in really bad shape to warrant using them. Then again, I may not be thinking of all the use cases where one would use these formulas.

For instance, you might have a list of employees with their location, job, etc. all listed out versus properly arranged in columns. This is where you would use the WRAPROWS() formula:

A more realistic use case is you have a list of names and you want to put them into 3 groups. You would use WRAPROWS() to quickly put this list of names into 3 columns:

In this case the number of names doesn’t fit perfectly into 3 columns, so there are two N/As at the end. There’s a handy pad_with parameter which kind of acts like an IFERROR() function, where you can just put in a placeholder value for those extra cells:
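For example (again, a placeholder range), this wraps a single-column list of names into rows of three, and the third argument pads any leftover cells with an empty string instead of showing #N/A.

=WRAPROWS(A2:A21, 3, "")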

5) VSTACK(), HSTACK() – Stack rows from different sheets on top of each other

USEFULNESS RATING: 🌶🌶🌶

I think the reason why VSTACK() might be useful is when you have data coming in on multiple sheets. The data is also structured the same across those three sheets. Then you can have one primary sheet that aggregates all the data using VSTACK().

Not sure when you might use HSTACK() but the example Google shows is when you’re combining dates together. Kind of a weird scenario, but sure whatever.

In this Google Sheet, I have 3 sheets called shows1, shows2, and shows3. Each sheet has the same columns in the same order, but the data is different between the three:

Then with VSTACK(), you can “add” or concatenate all these data sources together on one page:

Again, this assumes your data is structured exactly the same across sheets, or even across ranges on the same spreadsheet. If it is, then using VSTACK() can be a nice way to put together these “disparate” data sources compared to using the bracket syntax {}. Like CHOOSEROWS(), this feels like Google Sheets simply making the {} syntax easier to use.
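Using the sheet names from this example, the aggregation formula might look something like the line below (the ranges are placeholders for however many rows each sheet actually has):

=VSTACK(shows1!A2:C10, shows2!A2:C10, shows3!A2:C10)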

6) LET() – Assign the result of a formula to a variable to use in the future

USEFULNESS RATING: 🌶🌶

I have mixed feelings about the usefulness of this formula. The capability technically already exists via named ranges; this is just the formula version of named ranges. I also wouldn’t say it’s that much easier to understand compared to a named range, hence the 2-pepper rating. It’s also not a “beginner” function.

Say you have a bunch of product ratings like in the table below. In the Average Score column, you want to put the word “High” if the average rating for a product is greater than 4. If the average rating is between 3-4, then you want the word “Medium.” 3 or below should say “Low”:

Today, you might write a simple formula like this to get this output of “High” and “Low”:

=if(average(B44:D44)>4,"High",if(average(B44:D44)>3,"Medium","Low"))

A typical nested IF statement. Now with the LET() function, you simply assign the AVERAGE(B44:D44) result to a variable. The formula below would output the same exact thing as the nested IF statement above:

=LET(avg_rating, average(B44:D44), if(avg_rating>4,"High",if(avg_rating>3,"Medium","Low")))

Here’s a look at the formula in the context of the example:

The formula doesn’t look that much “easier” compared to writing out the nested IF statement. But for more complicated formulas beyond a regular average, this could make the formula much more readable and easier to debug.

One reason I like this function is that it starts to bridge the gap between working in a spreadsheet and using Google Apps Script (or Office Script if you’re in Excel). Starting to treat things like variables might make the learning curve to scripting in Google Apps Script easier and more approachable to a Google Sheets user who has never touched an Apps Script.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst Episode #116: Will Microsoft’s AI Copilot for Excel replace the need for analysts? https://www.thekeycuts.com/dear-analyst-episodes-116-will-microsofts-ai-copilot-for-excel-replace-the-need-for-analysts/ https://www.thekeycuts.com/dear-analyst-episodes-116-will-microsofts-ai-copilot-for-excel-replace-the-need-for-analysts/#respond Mon, 27 Mar 2023 15:48:55 +0000 https://www.thekeycuts.com/?p=53277 This news is a bit old but I figured it’s juicy enough to talk about its future implications on Excel and artificial intelligence in general. Mid-March 2023, Microsoft announced Copilot, it’s artificial intelligence bet that will supposedly change the way we work. The video discusses how Copilot integrates with Office 365 and all your Microsoft […]

This news is a bit old, but I figured it’s juicy enough to talk about its future implications on Excel and artificial intelligence in general. In mid-March 2023, Microsoft announced Copilot, its artificial intelligence bet that will supposedly change the way we work. The video discusses how Copilot integrates with Office 365 and all your Microsoft apps, including Excel. Around minute 18:00, they show a demo of how Copilot helps you find trends, make adjustments to your models, and more. It’s quite impressive. You can watch just that segment from the presentation here: https://www.youtube.com/watch?v=I-waFp6rLc0. I watched the video a few times and wondered: will Copilot eliminate the need for entry-level data analysts? Only time will tell.

Breaking down the features in Copilot for Excel

This is the corporate marketing blurb from the Microsoft blog post announcing Copilot for Excel:

Copilot in Excel works alongside you to help analyze and explore your data. Ask Copilot questions about your data set in natural language, not just formulas. It will reveal correlations, propose what-if scenarios, and suggest new formulas based on your questions—generating models based on your questions that help you explore your data without modifying it. Identify trends, create powerful visualizations, or ask for recommendations to drive different outcomes. Here are some example commands and prompts you can try:

  • Give a breakdown of the sales by type and channel. Insert a table.
  • Project the impact of [a variable change] and generate a chart to help visualize.
  • Model how a change to the growth rate for [variable] would impact my gross margin.

The video shows the above 3 bullet points using a dataset of product sales by country:

Finding key trends with Copilot for Excel

The first demo involves giving Copilot a prompt like “analyze the data and give me 3 trends.” The output is something you might expect if you’ve done anything with ChatGPT:

This feature in Copilot is table stakes and a version of this came out in Google Sheets in 2017. The Explore panel in Google Sheets can provide similar summary trends on your data and suggest charts you should add to your analysis. Google Sheets has slowly been adding AI-like features over the last few years, so don’t sleep on Google Workspace’s own AI announcement. Below is a dataset of hotels and their locations and I simply clicked on the Explore option in the bottom-right of the Google Sheet:

The trends don’t come in a free-form text format but the different widgets are interesting. The first widget shows additional questions you might ask of your dataset (and Google Sheets spits out the answer). Then the most common visualizations like Pivot Tables and charts are displayed afterwards which makes it easy to analyze and visualize your data. This leads into the next feature in Copilot for Excel: visualizing your data.

Visualizing your data with Copilot for Excel

What’s old is new. As I explained in the previous section, Google Sheets’ Explore panel already has a flavor of this feature. The next prompt for Copilot is “Show me a breakdown of Proseware sales growth.” Yes, it’s natural language. Yes, humans are lazy and it’s easy just to ask a question in plain English and get an answer back. But the summaries, data, and charts already exist in Google Sheets. This just happens to be Excel’s implementation of the Explore feature, with the AI as the entry point:

I like that Copilot responds to the prompt by saying:

Remember to check for accuracy.

That doesn’t inspire much confidence in you, Copilot! Nonetheless, Copilot does a few things that are interesting:

  • Created a chart with a title and the title has selective formatting (assuming the AI made the “Sales” word foreground color green)
  • The tables are nicely formatted with clear headers, formatted percentages, and growth rates
  • The background colors for all the cells are white (common formatting trick for making your visualizations stand out more)
  • Columns are re-sized to fit the width of the products and the growth rates
  • Column A and Row 1 are very narrow in width and height, respectively (another common trick to making dashboards look cleaner)

Was this all AI or just smoke and mirrors?

It’s hard to say which of the above formatting operations were done by the AI versus a human who just cleaned up the spreadsheet for a demo.

Does the AI know that a summary table looks better when the background color cells are all white?

Does the AI know that analysts like to make column A and row 1 super narrow/short so that the charts and tables are flush against the edges of the spreadsheet?

If Copilot knew all this, that’s pretty slick. But this just so happens to be the vanilla formatting you’ll see in a dashboard devoid of any custom coloring or branding. It will be interesting to see how an analyst would train Copilot to create visualizations that match the theme and brand guidelines for existing reports.

The next prompt is “Help me visualize what contributed to the decline in sales growth?” The interesting leap that Copilot makes here is translating a very simple business question into a feature (conditional formatting to highlight what contributed to the decline):

But simply applying conditional formatting to a table of numbers is not nearly as impressive as all the formatting steps the AI did in the previous step to create the table in the first place.

What-if scenario analysis with Copilot for Excel

This is probably the most interesting part of the demo. The next prompt is:

What would have happened if Reusable Containers had maintained the prior quarter’s growth rate?

Before Copilot, you’d have to think about duplicating your summary table and setting up cell references to replace the current growth rate with another number. Assuming this is not some human playing around with data for the demo, Copilot does the whole thing for you:

What’s impressive is that Copilot was able to copy the original summary table and paste it directly to the right of it. This makes comparing the growth rates easy. It was also able to change the title to reflect the answer to the original prompt. Finally, the step-by-step bullet points tell you exactly what Copilot did to create the analysis.

Perhaps this type of analysis is “easy” for Copilot since you have a relatively simple summary table with clearly spelled out products and growth rates. What if there are more variables involved or there are other one-off factors that would impact the analysis? According to the longer Copilot demo, Copilot has access to the full corpus of data for your organization, so it should have the domain knowledge that someone who works in the business would have. This means you could ask Copilot questions whose answers are tucked away in some Outlook email, Teams thread, or PowerPoint slide. That’s pretty freaking cool.

The question still remains: Will Copilot replace the need for data analysts?

Source: The Wall Street Journal

If the analysis is as simple as what Microsoft showed in this demo, I think the answer is yes.

If you’re an entry-level analyst, this type of task is very common. You have a dataset where you need to build summary tables and put them into PowerPoint decks to present during meetings. Your manager tells you: “Hey, what would growth look like for Reusable Containers if we didn’t completely tank last quarter and used historical growth rates?” You would probably follow a similar step-by-step process as the above screenshot shows. Copilot appears to be able to do the basic analyst grunt work and format the analysis in a clear visualization.

Why Copilot won’t replace analysts at large enterprises

While Copilot does look impressive, it definitely won’t replace human data analysts who understand nuance, context, and business knowledge at large enterprises. If you are a startup and building a model from scratch, Copilot might be a good solution to get something off the ground and running. The Microsoft demo clearly shows that this is possible. I can foresee a few situations where Copilot would not be used in a large enterprise:

  1. A lot of money is on the line – The Copilot prompt already tells you to “check for accuracy.” If you are working on a multi-million dollar deal, you best be sure you have a human taking a look at the numbers.
  2. Company culture may not be captured in Microsoft applications – As much as our knowledge is “written down” in Word, Outlook, and Teams, there is a lot that is not formally written down in these applications. Humans understand the nuances about company culture and how that can impact the analyses and dashboards analysts create.
  3. Existing templates have already been created – In a large enterprise, you are most likely copying an existing file to build a model or dashboard. That institutional knowledge has resulted in well-formatted dashboards where Copilot may not add much value (if formatting is a big part of the task).

Long story short, I’d love to see Copilot tackle a more complicated task that can’t be solved with a simple template. If you’re well versed in Excel, doing what this demo did by “hand” might take all of 15 minutes, and you build the knowledge of how to do this analysis on your own in the future. That knowledge makes debugging and troubleshooting models easier.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst #115: How to count the number of colored cells or formatted cells in Google Sheets https://www.thekeycuts.com/dear-analyst-115-how-to-count-the-number-of-colored-cells-or-formatted-cells-in-google-sheets/ https://www.thekeycuts.com/dear-analyst-115-how-to-count-the-number-of-colored-cells-or-formatted-cells-in-google-sheets/#respond Mon, 27 Feb 2023 17:59:47 +0000 https://www.thekeycuts.com/?p=53182 Counting the number of colored cells or formatted cells in Google Sheets or Excel seems like it should be a basic operation. Unfortunately after much Googling, it doesn’t seem as easy as it looks. I came across this Mr. Excel forum thread where someone asks how to count the number of rows where there is […]

Counting the number of colored cells or formatted cells in Google Sheets or Excel seems like it should be a basic operation. Unfortunately, after much Googling, it doesn’t seem as easy as it looks. I came across this Mr. Excel forum thread where someone asks how to count the number of rows where there is a colored cell. The answers range from VBA to writing formulas that indicate whether a cell should be colored to the usual online snark. I think the basic issue is this: a majority of Excel or Google Sheets users will have a list of data, and they will color-code cells to make it easier to read or comprehend the data. No fancy formulas or PivotTables. Just coloring and formatting cells so that important ones stick out. I thought this would be a simple exercise, but after reading the thread, I came up with a few solutions that work but have drawbacks. The Google Sheet for this episode is here.

Video walkthrough: https://www.youtube.com/watch?v=h-hdZPGDbDg

Color coding HR data

In the Mr. Excel thread, the original poster talks about their HR data set and the rules their team uses to color-code their data set. Many people in the thread talk about setting up rules for conditional formatting (which I agree with). But it sounds like people just look through the data set and manually color code the cells based on the “Color Key” mentioned in the post:

I think this manual color coding of cells is very common. Yes, someone could write conditional formatting logic to automate the formatting and color coding of these cells. But for most people, I’d argue just eyeballing the dataset and quickly switching the background or foreground color of the cell is easier, faster, and more understandable for a beginner spreadsheet user. If there isn’t that much data, then manually color coding cells feels less onerous.

I put a subset of the data into this Google Sheet and manually color-coded some of the cells in column B below:

Method #1 for counting colored cells: Filter by color and the SUBTOTAL formula

The quickest way to count the number of cells that have a certain color format is to filter the column by color. After applying the filter to all the column headers, you can filter a column by the cell’s background color through the column header menu. Filter by color -> Fill color -> Desired color:

Let’s say I filter this column by the yellow background color. You’ll see this results in a filtered data set with 9 rows remaining:

In order to actually count the number of cells in this filtered data set, you might be tempted to do a COUNTA() formula, but let’s see what happens when I put this into cell B51:

The formula counts all the rows in the data set including the rows that have been filtered out. Instead, you can use the SUBTOTAL() formula which magically returns the sum, count, etc. for a filtered data set. The key is to use the value “3” for the first parameter to tell Google Sheets to count only the cells in the filtered data set:
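A formula along these lines would go below the data set (the range is just an example based on the colored data living in column B); the 3 tells SUBTOTAL to perform a COUNTA on only the visible, filtered rows.

=SUBTOTAL(3, B2:B50)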

I don’t think this is the usual use case for the SUBTOTAL formula. But like many formulas in Google Sheets/Excel, it works! To recap on this method:

Pros

  • Easy to use and implement
  • Doesn’t require the use of VBA or Google Apps Script
  • Since it’s a formula, it’s dynamic and can change as your data changes (with caveats)

Cons

  • Requires a few steps to get it to work (e.g. filter your data set by a color)
  • Each time you want to count the number of formatted cells, you need to re-filter by a different color
  • Since your data is filtered, you can’t easily update the source data, and you have to re-filter by a color afterwards

Method #2: Filtered views to allow for dynamic updating of data with the SUBTOTAL formula

This is an extension of method #1. One of the cons of method #1 is that once you’ve filtered your data set, you need to un-filter the data set if you want to add or remove formatting from your cells. For instance, in column B we have a bunch of yellow colored cells. If you want to highlight another cell as yellow and then re-count the number of cells that are colored yellow, you have to un-filter the data set, highlight the cell that needs to be colored yellow, re-filter the column, and re-write the SUBTOTAL formula (assuming you put it at the bottom of column B):

To avoid filtering and un-filtering the data set, you can create a filtered view of the data set. Additionally, you can put the SUBTOTAL formula somewhere that’s not at the bottom of the data set. Let’s first create a filtered view just on the background color yellow and we’ll call it “Yellow Cells”:

Now you can quickly switch between the filtered view of yellow-colored cells and the unfiltered data set:

Then we can put the SUBTOTAL formula somewhere below the bottom of the data set. Notice how, when we switch between the filtered view and the unfiltered data set, the SUBTOTAL formula automatically updates:

While this method is an improvement on method #1, it still has some drawbacks. A recap of this method:

Pros

  • Easily switch between the filtered and unfiltered data set
  • Update cells with new colors and have that flow into the SUBTOTAL formula dynamically

Cons

  • Filtered views are not an easily discoverable feature in Google Sheets
  • Still requires you to go through the Data menu and flip back and forth when you want to count the number of colored cells

Method #3: A macro to count the number of colored or formatted cells in a range

Almost all the other solutions for counting the number of colored or formatted cells on the Internet refer to a VBA script for Excel. This is a macro for Google Sheets using Google Apps Script. You can copy and paste the script from this gist. When you run the CountFormattedCells macro in Google Sheets, it counts all the cells that have a background color in column B below. It then outputs the count of cells in cell C52 after you’ve selected a range of cells where you want to count the colored cells:

If you want to specify a color to count, you can fill cell C53 with the color you want to count. Let’s say I want to count only the green cells. I would color cell C53 green, select all the cells where I want to find the color green, and then run the macro:

The key to making this work is setting some variables up in the script. The two variables you have to set are outputNumberOfFormattedCells and cellWithFormatToCount. The cells you pick will depend on the specific spreadsheet you’re working with. In the script below, you’ll see that you have to edit the first two variables to fit the needs of your Google Sheet:


function CountFormattedCells() {
  
  // Output the number of formatted cells somewhere in your spreadsheet
  var outputNumberOfFormattedCells = 'C52'

  // Cell that contains the color you want to count. Default is blank.
  var cellWithFormatToCount = 'C53'

  var spreadsheet = SpreadsheetApp.getActive();
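  // getBackgrounds() returns a 2D array of hex color strings, one per cell in the currently selected range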
  var currentRangeColors = spreadsheet.getActiveRange().getBackgrounds();
  if (cellWithFormatToCount !== '') { var cellWithFormat = spreadsheet.getRange(cellWithFormatToCount).getBackground(); }
  var formattedCellCount = 0
  for (var i in currentRangeColors) {
    for (var j in currentRangeColors[i]) {
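      // '#ffffff' (white) is treated as no fill: with no target color set, any non-white background is counted;
      // otherwise only cells whose background matches cellWithFormatToCount are counted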
      if (currentRangeColors[i][j] !== '#ffffff' && cellWithFormatToCount == '') {
        formattedCellCount++
      } else if (cellWithFormatToCount !== '' && currentRangeColors[i][j] == cellWithFormat) {
        formattedCellCount++
      }
    }
  }
  if (outputNumberOfFormattedCells != '') {
    spreadsheet.getRange(outputNumberOfFormattedCells).setValue(formattedCellCount)
  }
};

The macro is very easy to use, but it does require knowing how to add macros to your Google Sheet and how to edit the script in Google Apps Script. The recap for this method:

Pros

  • Script is easy to copy and paste into Google Apps Script and works right out of the box
  • Just two variables to customize
  • Doesn’t require any filtering of your data set or any formulas
  • Can assign a keyboard shortcut to the macro to quickly run the macro
  • Could assign a time-based trigger to the macro so that it runs every minute or hour to give you a “dynamic” count (see the sketch after the cons list below)

Cons

  • Requires knowledge of macros and editing a Google Apps Script
  • May need to change the location of the cell where you output the count of colored cells if your data changes a lot over time
  • Requires running the macro each time you want to get an updated count of the colored cells
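On that last point about a time-based trigger, here’s a minimal sketch of what that could look like, assuming the CountFormattedCells macro above is saved in the same Apps Script project. One caveat: the macro counts whatever range is currently selected, so for a scheduled run you’d probably want to adapt it to read a fixed range instead.


function createHourlyCountTrigger() {
  // Re-run the CountFormattedCells macro above once per hour
  ScriptApp.newTrigger('CountFormattedCells')
    .timeBased()
    .everyHours(1)
    .create();
}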

Bottom line

None of these methods are that simple or easy to use in my opinion. Usually I have a preferred method for solving some Google Sheets or Excel problem, but in this case I can’t say I like or dislike a method over another one. If I had to pick one, I’d use method #3 since I’m comfortable with macros and editing Google Apps Scripts. But the Google Apps Script solution is far from easy to use for a beginner to Google Sheets.

The SUBTOTAL formula is indeed much easier to implement, but also comes with the added inconvenience of constantly filtering and unfiltering your data set.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:

Dear Analyst Episode #114: How a small real estate investment company uses modern data and cloud tools to make data-driven decisions https://www.thekeycuts.com/episode-114-how-a-small-real-estate-investment-company-uses-modern-data-and-cloud-tools-to-make-data-driven-decisions/ https://www.thekeycuts.com/episode-114-how-a-small-real-estate-investment-company-uses-modern-data-and-cloud-tools-to-make-data-driven-decisions/#comments Tue, 17 Jan 2023 06:58:00 +0000 https://www.thekeycuts.com/?p=52480 When you think of data pipelines, data warehouses, and ETL tools, you may be thinking about some large enterprise that is collecting and processing data from IoT devices or from a mobile app. These companies are using tools from AWS and Google Cloud to build these complex workflows to get data to where it needs […]

When you think of data pipelines, data warehouses, and ETL tools, you may be thinking about some large enterprise that is collecting and processing data from IoT devices or from a mobile app. These companies are using tools from AWS and Google Cloud to build complex workflows to get data to where it needs to be. In this episode, you’ll hear about a relatively small company that is using modern cloud and data tools rivaling those of the aforementioned enterprises. Elite Development Group is a real estate investment and construction company based in York, Pennsylvania with fewer than 50 employees. Doug Walters is the Director of Strategy and Technology at Elite, and he discusses how data at Elite was trapped in Quickbooks and in their various other tools like property management software. He spearheaded projects to build data connectors that aggregate various data sources into a modern data stack that helps the company make real estate decisions.

Data is stuck in silos

Elite Development Group consists of a few divisions: HVAC, home performance, energy efficiency, etc. All the typical functions you’d expect a real estate company to have. Doug first started working in IT support and realized their company didn’t have easy access to their data to make data-driven decisions. You’ve probably heard this phrase over and over again:

Data is trapped in silos.

You buy some off-the-shelf software (in this case property management) that is meant for one specific use case. Over time, that data needs to be merged with your customer data or sales data. You end up exporting the data in these silos to CSVs to further combine these data sources down the line. For Elite, data was trapped in property management software, Quickbooks, you name it.

Starting the process to export data

After doing a survey of their tools, Doug realized that there weren’t many APIs to easily extract data from the source. So he helped set up data scrapers to get data off of the HTML pages. He also used tools like Docparser to extract data from Word docs and PDFs.

Most data was either in XLS or CSV format, so Doug was able to set up an automated system where every night he’d get an email with a CSV dump from their property management system. This data then ended up in a Google Sheet for everyone to see and collaborate on. After doing this with property management, Doug started exploring getting the data out from their work order tracking system.
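As a rough illustration of how little plumbing this kind of pull needs (this is not Doug's actual setup, which relied on the emailed CSV dump): if an export were reachable at a URL instead of an inbox, a single Google Sheets formula could keep a tab populated with it. The URL below is only a placeholder:

=IMPORTDATA("https://example.com/property-management-export.csv")

IMPORTDATA parses a hosted CSV or TSV file straight into the sheet and re-fetches it periodically, which is roughly the "nightly refresh into a shared Google Sheet" idea in one formula.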

Creating accurate construction cost estimates

One activity Doug wanted to shine the data lens on was cost estimates as they relate to construction. Hitting budgets is a big part of the construction process. You have multiple expenditures for a job, and each job needs to have a specific estimate tied to it. This could all be done in Excel or Google Sheets, but given the importance of this data, Doug decided to create something more durable. He created an internal database where each cost estimate gets a specific Estimate ID: a unique identifier assigned to that cost estimate.

Since Elite uses Quickbooks for their accounting, each project had to be tied to the unique Estimate ID established previously. Then each work order got its own unique Work Order ID. With these identifiers in place, Elite is able to run reports on all their projects to see what the cost estimates and actual expenditures were for a job, and they can do a traditional budget-to-actual variance analysis.
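To make the budget-to-actual idea concrete, here is a minimal spreadsheet sketch with hypothetical tab and column names (Elite's real reports come out of their warehouse and BI tools, covered below). Assume an Estimates tab with the Estimate ID in column A and the estimated cost in column B, and a WorkOrders tab with the Estimate ID in column A and the actual cost in column D. On the Estimates tab, in C2 and D2:

=SUMIF(WorkOrders!A:A, A2, WorkOrders!D:D)  -> total actual spend tied to the Estimate ID in A2
=B2-C2                                      -> variance (estimate minus actual); a negative number means the project is over budget

Fill those two formulas down and you have a rough budget-to-actual variance report per estimate.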

The result? Project teams could start to see when they were about to hit their budgets in real time.

More importantly, this started Doug down a journey of seeing how far he could automate the data extraction and reporting for his company. With that initial implementation, the data could only get refreshed every 24 hours. He eventually set up the system so that any user could click a button to refresh a report. The data workflow evolved from exporting data into Excel and Google Sheets to using dedicated data connectors and business intelligence software.

Income lost due to vacancy metric

When Elite prioritizes which projects to work on, they look at a metric called “income lost due to vacancy.” Without the different data connectors and systems Doug helped set up, this metric wouldn’t exist. It essentially helps a property owner figure out how much income they are losing due to vacancies.

When looking at a portfolio of properties to improve, Elite can use this metric to figure out which project would have more high-rent units available. Previously, they would have to rely on intuition to figure out where to invest more time and money into projects.
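The episode doesn't spell out Elite's exact formula for the metric, but a common way to express it (purely illustrative, with hypothetical tab and range names) is market rent multiplied by the time a unit sits empty. If a Units tab held each unit's monthly market rent in column C and its vacant days over the period in column D:

=SUMPRODUCT(Units!C2:C100, Units!D2:D100/30)  -> approximate income lost to vacancy: monthly market rent x vacant months, summed across units

Even a rough number like this makes it possible to rank projects by how much rent is being left on the table.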

Building out the data stack

The list of tools Elite uses to extract and process data rivals that of large enterprises. Here is a rundown of Elite’s data stack:

  • Fivetran for data loading and extraction
  • AWS Redshift as the data warehouse
  • Google Cloud functions to run one-off tasks
  • dbt for transformation and for pushing data into a datamart
  • Sisense to create actionable insights

There are multiple data connectors involved in the ETL process as well. With all these modern tools, Elite is able to get the most up-to-date data every 5 to 15 minutes.

As Elite went through this data journey, Doug and his team started to ask some of their vendors to develop an API so they could get more data out. Their data vendors would push back and say they’ve never seen these requests from such a small company. Typically these data requests come from their largest customers, which shows how deeply Doug’s team has thought about automating their data workflows.

Advice for small companies working with big data

Doug gives some practical advice on how to use some of these tools that are supposedly meant for large enterprises. The first thing is to experiment with spreadsheets before diving deep into a complicated workflow. Doing your due diligence in a spreadsheet is low stakes and helps you uncover all the various relationships between your data.

In terms of learning how to use these tools, Doug mentioned that most of these vendors have their own free or paid workshops and tutorials. I’m always surprised by how much general data training these vendors provide that may not even be about their software. You can learn about databases, SQL, and data analysis from these vendors.

At a high level, Doug says that the data you collect and visualize needs to be tied to some business strategy. These overall goals might include increasing revenue, increasing customer satisfaction, or ensuring your employees are developing new skills. At Elite, the data has allowed the team to look at their portfolio of real estate from the 30,000-foot level all the way down to individual transactions. Data is actually helping them solve real business problems.

And one last plug for Google Sheets: Doug talked about how you would have to hire someone who was an “Excel guru” or a data analyst to help you decipher your Google Sheets files. Now Google Sheets has become so robust, extensible, and–dare I say–easy to use that anyone in the company can pick it up and mold it to their needs. No one ever gets fired for using a Google Sheet 😉.

Other Podcasts & Blog Posts

No other podcasts mentioned in this episode!

Dear Analyst #113: Top 5 data analytics predictions for 2023 https://www.thekeycuts.com/dear-analyst-113-top-5-data-analytics-trends-for-2023/ Tue, 27 Dec 2022 06:16:00 +0000

It’s that time of the year again when data professionals look at their data predictions from 2022, decide what they were wrong about, and think: “this must be the year for XYZ.” Aside from the fact that these types of predictions are 100% subjective and nearly impossible to verify, it’s always fun to play armchair quarterback and make a forecast about the future (see why forecasts are flawed in this episode about Superforecasting). One caveat about predicting what will happen in 2023: my predictions are based on what other people are talking about, not necessarily what they are doing. The only data point I have on what’s actually happening within organizations is what I see happening in my own organization. So take everything with a grain of salt and let me know if these predictions resonate with you!

1) Artificial intelligence and natural language processing don’t eat your lunch

How could a prediction for 2023 not include something about artificial intelligence? It seems like the tech world was mesmerized by ChatGPT in the second half of 2022, and I can’t blame them. The applications and use cases are pretty slick and mind-blowing. Internally at my company, we’ve already started testing out this technology for summarizing meeting notes, and it works quite well and saves a human from having to manually summarize the notes. My favorite application of AI shared on Twitter (where else do you discover new technologies? Scientific journals?) is this bot that argues with a Comcast agent and successfully gets a discount on an Internet plan:

https://twitter.com/jbrowder1/status/1602353465753309195

These examples are all fun and cute and may help you save on your phone bill, but I’m more interested in how AI will be used inside organizations to improve data quality.

Data quality is always an issue when you’re collecting large amounts of data in real time every day. Historically, analysts and data engineers have run SQL queries to find data with missing values or duplicate values. With AI, could some of these manual queries and UPDATE and INSERT commands be replaced with a system that intelligently fills in the data for you? In a recent episode with Korhonda Randolph, Korhonda talks about fixing data by sometimes calling up customers to get their correct info, which then gets inputted into a master data management system. David Yakobovitch talks about some interesting companies in episode 101 that smartly help you augment your data using AI.

We’ve also seen examples of AI helping people code via Codex, for example. I think this might be an interesting trend to look out for as the demand for data engineers from organizations outpaces supply. Could an organization cut some corners and rely on Codex to develop some of this core infrastructure for their data warehouse? Seems unlikely if you ask me, but given the current funding environment for startups, who knows what a startup founder might do as runways shrink.

2) Enforcing data privacy and regulation in your user database

This trend has been going on since the introduction of GDPR in 2018. As digital transformation pushes all industries to move online, data privacy laws like GDPR and CCPA force companies to make data security and governance the number one priority for all the data they store. User data in particular. Any company that has a website where you can transact allows you to create a user account. Many municipalities have a dedicated app where you can buy bus and metro tickets straight from the app. Naturally, they ask you to create a profile where your various payment methods are stored.

When it comes to SaaS tools, the issue of data privacy becomes even trickier to navigate. Many user research and user monitoring services tout their ability to show organizations what users and customers are “doing” on those organizations’ websites and apps. Every single click, mouseover, and keystroke can be tracked. How much of this information do you store? What do you anonymize? It’s a cat and mouse game where user monitoring software vendors claim they can track everything about your customers, but then you have to temper what information you actually process and store. The data team at my own company is constantly checking these data privacy regulations to ensure that we implement data storage policies that reflect current legislation.

Source: DIGIT

A closely related area to data privacy is data governance. The number of data governance vendors who help your organization ensure your data strategy is compliant has increased dramatically over the years as a result of data regulation and protection laws.

To bring this back to a personal use case, type your email address into haveibeenpwned.com. This website basically tells you which companies have had data breaches and whether your personal information may have been compromised. To take this another step, try Googling your name and your phone number or address in quotes (e.g. “John Smith 123-123-1234”). You’ll be surprised by how many of these “people finder” websites have your personal information and that of your family members. One of the many websites you’ve signed up for probably had a breach, and this information is now out there being aggregated by these websites; you have to manually ask these websites to take your information out of their databases. Talk about data governance.

3) Data operations and observability tools manage the data lifecycle

I’m seeing this happen within my own company and others. DevOps not only monitors the health of your organization’s website and mobile app, but also its databases and data warehouse. It’s becoming more important for companies undergoing digital transformation to maintain close to 100% uptime so that customers can access their data whenever they want. Once you give your customers and users a taste of accessing their data no matter where they are, you can’t go back.

I think it’s interesting to think about treating your “data as code” and applying concepts of versioning from software engineering to your data systems. Sean Scott talks about data as code in episode #96. The ETL process is completely automated, and a data engineer or analyst can clone the source code that defines how transformations happen to the underlying data.

I’m a bit removed from my own organization’s data systems and tooling, but I do know that the data pipeline consists of many microservices and dependencies. Observability tools help you understand this whole system and ensure that if a dependency fails, you have ways to keep your data flowing to the right endpoints. I guess the bigger question is whether microservices is the right architecture for your data systems vs. a monolith. Fortunately, this type of question is way beyond my pay grade.

Source: DevCamp

4) Bringing ESG data to the forefront

You can see this trend happening more and more, especially in consumer transportation. Organizations are more conscious about their impact on the environment, with various ESG initiatives underway. To verify that organizations are following new regulations, the SEC and other regulatory bodies rely on quality data to ensure compliance.

One can guess which industries will be most impacted by providing this ESG data, but I imagine other ancillary industries will be affected too. Perhaps more data vendors will pop up to help with auditing this data so that organizations can meet compliance standards. Who knows. All I know is that consumers are asking for it, and as a result this data is required to be disclosed.

Google Flights showing CO2 emissions

We know that cloud computing and storage get cheaper every year (think Moore’s Law). Cheap from a monetary perspective, but what about the environmental impact? An interesting thought exercise is tracing the life of a query when you open Instagram on your phone and start viewing your timeline of photos. The storage and compute resources are monetarily cheap to serve that request, but there is still a data center running on electricity and water that needs to process it. Apparently data centers account for 1.8% of electricity use and 0.5% of greenhouse gas emissions in the United States (source).

When I think about all the cronjobs and DAGs that run every second to patch up a database or serve up photos to someone’s Instagram feed, I wonder how many of these tasks are unnecessarily taxing our data centers. I have created a few Google Apps Scripts over the years (like creating events from email or syncing Google Sheets with Coda). You could have these scripts run every minute or every 5 minutes, but is it necessary? Considering that Google Apps Script is a 100% free service, it’s hard to understand the “cost” of running a script that hits a Google data center somewhere and may be moving gigabytes of data from one server to another. I started thinking about the cost of keeping these scripts alive for simple personal productivity hacks like creating calendar events from email. Sure, my personal footprint is small, but when you have millions of people running scripts, that naturally becomes a much bigger problem.

I still have a lot to learn about this area and my views are influenced by simple visualizations like the one above. It all starts with quality ESG data!

5) Organizations help employees acquire data literacy and data storytelling skills

This trend is a bit self-serving since I teach various online classes about Excel and Google Sheets. But as data tools like Mode, Looker, and Google Data Studio pervade organizations, it’s not just the analysts who are expected to know how to use and understand these tools. Unfortunately, data skills are not always taught in middle school or high school (they certainly weren’t taught when I was growing up). Yet the top skills we need when entering the workforce are related to using spreadsheets and analyzing data (I talk about this subject in episode 22, referencing this Freakonomics episode). This episode with Sean Tibor and Kelly Schuster-Paredes is also worth a listen, as Sean and Kelly were teachers who incorporated Python into the classroom.

In 2019, The New York Times provided a “data bootcamp” for reporters so that they could better work with data and tell stories with data. The Google Sheets files and training material from this bootcamp are still publicly available here. You can read more about this initiative by Lindsey Cook–an editor for digital storytelling and training at The Times–here. The U.S. Department of Education also believes that basic data literacy skills should be introduced earlier in the curriculum and they created this whole deck on why these skills are important. This is one of my favorite slides from that deck:

Source: U.S. Department of Education

What does this mean for organizations in 2023? Upskilling employees in data literacy and storytelling could mean online classes or simply a 1- or 2-day training with your data team. Interestingly, data vendors already provide a ton of free training. While some of this training can be specific to the data platform itself (like Google’s Analytics Academy), other platforms provide general training on databases, SQL, and Excel. So if you don’t pay for the training, at least utilize the free training provided by Mode, Looker, Google Data Studio, Tableau, etc.

Other Podcasts & Blog Posts

In the 2nd half of the episode, I talk about some episodes and blogs from other people I found interesting:
