Dear Analyst #119: Developing the holy “grail” model at Lyft, user journeys, and hidden analytics with Sean Taylor

Future Dear Analyst episodes will get more sporadic since, well, life gets in the way. Unfortunately, curiosity (in most cases) doesn’t pay the bills. Nevertheless, when I come across an idea or person that I think is worth sharing or learning more about, I’ll try my best to post. In this episode, I interview the Chief Scientist of a data startup who did his PhD at NYU Stern and was on track to become a professor. Then he got an internship at Facebook and everything changed. The speed of learning at a tech company outpaced what he was used to in academia. Over the years, Sean Taylor has worked with and spoken to hundreds of data analysts and statisticians. We’ll dive into his data science work at Lyft, his notion of “hidden analytics,” and why he’s obsessed with user journeys in modern applications.

Modeling the Lyft marketplace and creating the GRAIL model

Sean worked at Facebook for five years as a research scientist on a wide range of data problems. Eventually he joined the revenue operations science team at Lyft. His team’s goal was to help grow the marketplace of riders and drivers on the platform. One of the most important inputs to that marketplace is the forecast: as Lyft runs promotions and enters new cities, how do you ensure there are enough drivers for the riders and vice versa?

The team ultimately decided that a simple cohort methodology would be best to help set the forecast for both drivers and riders. Every rider, for instance, would belong to a cohort based on when they first signed up for Lyft, when they booked their first ride, etc. There’s a “liquidation curve” for each cohort that eventually hugs the x-axis. There is much more detail about the cohort methodology in this blog post by the Lyft Engineering team from 2019.

Despite its simplicity, the model worked surprisingly well. Here are the model’s goals, taken from the blog post mentioned in the previous paragraph:

  1. Forecast the behavior of each observed cohort and use it to project how many rides are taken or driver hours are provided within a specific cohort
  2. Forecast the behavior of the cohorts that are yet to be seen.
  3. Aggregate all the projected rides and driver hours to make forecasts for both the demand and supply side of our business.
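The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the cohort methodology, not Lyft’s actual model: the decay rate, horizon, and cohort sizes are made-up numbers, and a real liquidation curve would be fit from data rather than assumed to be geometric.

```python
# A minimal sketch of the cohort forecast: each cohort's weekly rides
# decay along a "liquidation curve" that hugs the x-axis, and the total
# forecast is the sum over all cohorts. All numbers are illustrative.

def project_cohort(initial_rides, decay=0.9, horizon=12):
    """Project a cohort's weekly rides forward, decaying toward zero."""
    return [initial_rides * decay**t for t in range(horizon)]

# Observed cohorts, keyed by signup week: rides in their first week
cohorts = {"2019-W01": 1000.0, "2019-W02": 1200.0, "2019-W03": 900.0}

# Steps 1-2: project each observed cohort (yet-to-be-seen cohorts
# would be forecast separately and added to this dict)
projections = {c: project_cohort(r) for c, r in cohorts.items()}

# Step 3: aggregate across cohorts into the demand-side forecast
forecast = [sum(p[t] for p in projections.values()) for t in range(12)]
print(forecast[0])  # 3100.0 -- total projected rides in week 0
```

The same aggregation would be run over driver cohorts to get the supply-side forecast.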

Sean talked about the model’s flaws, and one of them is that a marketplace is very fluid and evolves over time. When a rider is exposed to high prices, that may lead to churn, which was also not captured by the model. Sean’s team tried building a better model called GRAIL, but Sean left Lyft before completing it.

Source: Symposiums

Speaking of Lyft’s data team, I had mentioned Amundsen, an open source data discovery platform Lyft released in 2019 (blog post). It’s great to see the data team at Lyft giving back to the ecosystem to help data analysts and data scientists do their job better!

Discovering a bug that cost the company $15M per year

One of the best feelings as a data analyst is using data to uncover the root cause or underlying trends in a given business situation. One might say this is like Moneyball, where the Oakland A’s realized that on-base percentage (OBP) was the best predictor of player performance.

Source: Hire an Esquire

Sean believes there is a lot that data analysts do that is not necessarily taught in school or on the job. You’re expected to understand the business and how everyday business operations are translated into the numbers on the dashboard.

When you’re working on a project because you are curious about it, rather than being forced to come up with an analysis, you are able to land the bigger wins that really move the needle. Sean calls this type of work “hidden analytics,” or, as I like to say, there is much more behind the numbers.

Sean’s colleague at Lyft came across an anomaly in the data and just started pulling on the thread. The colleague ultimately found a bug in how Lyft was disbursing driver incentives in the marketplace. Sean talks about how his colleague’s curiosity led to discovering the bug in the first place, and squashing it saved Lyft $15M per year.

Why the systems for collecting user journey data are broken

Modern websites and applications collect a ton of data, but the actual user journey is harder to quantify. A customer signs up for a tool or service, goes through an onboarding process, and might engage with the tool at various times in the future. Modeling and visualizing this data in a spreadsheet or a SQL database can be difficult. With these tools you are aggregating data, and parts of the user journey can be improperly reduced to a single number when there is much more nuance to a user’s journey on a website.

Source: Wikipedia

Users are in different states when using a website or app. Sessionizing data has become the default way to capture the path a user takes but there are still many micro-sessions in just one experience like registering your account on a website.
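Sessionizing is usually done by splitting a user’s ordered events wherever the gap between consecutive timestamps exceeds some threshold. Here’s a rough sketch of that gap-based approach; the 30-minute threshold is a common convention, not anything specific to Lyft or to the tools Sean mentions.

```python
# Gap-based sessionization: a user's sorted event timestamps are split
# into sessions whenever the gap between consecutive events exceeds a
# threshold. Timestamps here are minutes since some start time.

def sessionize(timestamps, gap=30):
    """Group timestamps into sessions separated by more than `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)  # continue the current session
        else:
            sessions.append([t])    # gap too large: start a new session
    return sessions

events = [0, 5, 12, 95, 100, 240]
print(len(sessionize(events)))  # 3 sessions
```

Note that each of those sessions may itself contain micro-sessions (like the account-registration flow), which is exactly the nuance a single session count hides.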

Sean discusses this concept in the context of a rider taking (or not taking) a ride booked on Lyft. The customer requests a ride, declines the first one, and books the second. The basic conversion rate would be 50%, but that statistic doesn’t answer why the customer declined the first ride. Perhaps the customer couldn’t find the right pickup address for the first ride and just gave up. Perhaps the driver was too far away.
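A small sketch makes the point: the aggregate rate is easy to compute, but recovering the “why” requires keeping the per-request outcome and reason instead of collapsing everything to one number. The event fields below are hypothetical, not Lyft’s schema.

```python
from collections import Counter

# Two ride requests from one customer: the first declined, the second
# booked. The "reason" field is a hypothetical annotation of why a
# request didn't convert.
requests = [
    {"completed": False, "reason": "wrong_pickup_address"},
    {"completed": True,  "reason": None},
]

# The headline statistic: true, but it hides the failure mode
rate = sum(r["completed"] for r in requests) / len(requests)
print(rate)  # 0.5

# Grouping failures by reason recovers the "why" behind the number
failures = Counter(r["reason"] for r in requests if not r["completed"])
print(failures.most_common(1))  # [('wrong_pickup_address', 1)]
```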

Balancing usability and expressivity in data tools

Browse any Hacker News thread and you’ll inevitably see devs talking about why you should just build your own tool in-house with code. The main reason is that you can fully customize the app if you know how to code. I’ve discussed at length on this podcast, and through content I’ve created for my company, how the need for low-code and no-code tools redefines who a “builder” is in a company.

Sean’s current company (Motif Analytics) is trying to strike that balance between giving data analysts and data scientists the ability to express their data question without diving right into the code. In terms of user journey data, Sean says most people use Amplitude, Mixpanel, or other similar tools. While these tools allow you to execute common data tasks, there are certain things these tools block you from doing. Python notebooks, for instance, are very expressive. But you kind of need to be an expert to use them to their full potential.

Source: Jupyter

Sean talks about how he drew inspiration from Ruby on Rails, whose creators had strong opinions about how to do web development. I also first learned about web development through a Ruby on Rails book, and it’s interesting to see how many of the patterns from Rails still show up in PHP and JavaScript frameworks.

As we discussed the platform Sean and his team are building, we got into the weeds on a little-known SQL clause called MATCH_RECOGNIZE. There apparently isn’t much documentation about it, and the people behind the SQL standard rushed this row pattern-matching feature into the language because competitors were coming out with similar functionality. Nothing like real-world drama impacting the database world!
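For the curious: MATCH_RECOGNIZE lets you match regex-like patterns over ordered rows in a table, which is exactly what user-journey questions need. Here’s a rough Python analogue of the idea (not the SQL syntax itself), using the decline-then-book journey from earlier; the event names are hypothetical.

```python
import re

# An ordered event stream for one user. MATCH_RECOGNIZE would run a
# pattern like this over rows in a table; we mimic it by encoding
# each event as a letter and using an ordinary regex.
events = ["open_app", "request", "decline", "request", "book"]

codes = {"request": "r", "decline": "d", "book": "b"}
stream = "".join(codes.get(e, "x") for e in events)  # -> "xrdrb"

# Pattern: a declined request eventually followed by a booked one
match = re.search(r"rd.*rb", stream)
print(bool(match))  # True: this user declined once before booking
```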

Start with the questions instead of the tools

We ended the conversation with a bit of career talk. Sean talks about intrinsic motivation being the number one driving force in his career. While tools come and go, he said domain expertise is something that can give budding analysts a leg up when searching for their next role. Technical skills, unfortunately, are slowly becoming a commodity. What never goes out of style? Asking the right questions.

Other Podcasts & Blog Posts

No other podcasts or blog posts mentioned in this episode!