Dear Analyst #88: How to learn data science and machine learning from scratch with Santiago Viquez

Companies are generating more big data these days, so dumping the data into a CSV for analysis just doesn’t cut it anymore. Sure, you could use Power Query or Power BI, but more analysts are turning to Python and platforms built for big data processing. The next step is to use machine learning to help predict what the future might look like. Santiago Viquez is currently a data analytics mentor at Springboard, an education platform helping students prepare for new careers. On the side, Santiago has built a ton of cool projects related to data science, natural language processing, and more. In this conversation we dig into how Santiago learned data science from scratch during the pandemic, and how he thinks analysts should learn data science.

Started at the bottom now we’re at a multinational corporation

Santiago studied physics in Costa Rica, but realized he didn’t want to pursue a career in physics. After doing some research, he realized a career in data analytics and data science would be more suitable. Having known a little bit of Python, he started applying to a few positions and eventually got his data analytics career started as an intern at a small startup. His internship turned into a full-time role as a data analyst which he kept for two years.

Santiago left the startup and went in the complete opposite direction in terms of company size. He had roles in data analysis and data science at large corporations like Walmart and UPS working remotely the entire time. During his time at Walmart, he started working part-time at Springboard helping students land careers in data analytics.

The experience working at a startup versus a large company is night and day. We’ve seen stories of people like Preksha in episode 85 and Lauren in episode 64 make completely new transitions to a career in data. But we don’t hear about the data analytics professional who moves from startup to large company too often.

One example Santiago brought up is how corporations frame problems. You typically have clear success metrics, KPIs, stakeholders, and data sources to work with. At a startup, you are defining the problem by yourself. It’s just you. You’re in charge of collecting the data sources, providing analyses to key stakeholders, and owning the entire model or analysis end-to-end.

Reducing food waste for restaurants in Costa Rica with data science

When Santiago was a consultant, he was helping a big restaurant group in Costa Rica figure out ways to reduce food waste. The restaurant group consisted of 30-40 restaurants (which is big for Costa Rica). Each restaurant had its own manager and each manager would request food from various suppliers. The problem was that some managers were good at forecasting how much food they would need for the next 10-15 days, while others were not.

Santiago’s goal was to create a tool that would help each manager predict how much food to order from the suppliers. The first phase of the project was gathering data. In this case, Santiago had to get the recipes from each restaurant manager. These recipes were then joined with each restaurant’s sales data to see the volume of ingredients required.

The interesting thing is that each recipe had to be broken down to its most granular ingredients. If it was a taco recipe, this meant getting tortillas. In order to make tortillas, you need flour. So the ingredient to procure from the supplier would be flour.
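To make the taco-to-flour idea concrete, here is a minimal sketch of how sold dishes could be expanded into raw ingredients. This is not Santiago's actual code. The recipe table, the quantities, and the function names are all hypothetical, and the sketch assumes recipes never refer back to themselves.

```python
from collections import defaultdict

# Hypothetical recipe table: each item maps to the components needed for ONE unit.
# Anything that does not appear as a key is treated as a raw ingredient.
RECIPES = {
    "taco": {"tortilla": 2, "beef_kg": 0.1},
    "tortilla": {"flour_kg": 0.03},
}


def raw_ingredients(item, quantity, recipes, totals=None):
    """Recursively expand an item into the raw ingredients a supplier would deliver."""
    if totals is None:
        totals = defaultdict(float)
    components = recipes.get(item)
    if components is None:
        # No recipe for this item, so it is already a raw ingredient.
        totals[item] += quantity
        return totals
    for component, per_unit in components.items():
        raw_ingredients(component, quantity * per_unit, recipes, totals)
    return totals


def ingredients_for_sales(sales, recipes):
    """Combine units sold per dish with the recipes to get total raw ingredient needs."""
    totals = defaultdict(float)
    for dish, units_sold in sales.items():
        raw_ingredients(dish, units_sold, recipes, totals)
    return dict(totals)


# Made-up sales figures for illustration only.
sales = {"taco": 100}
needs = ingredients_for_sales(sales, RECIPES)
print(needs)
```

Notice that tortillas never show up in the result. Only flour and beef do, since those are the items you would actually order from a supplier.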

After Santiago collected the data, the fun part began: modeling and forecasting which ingredients were essential to the recipes. Many other variables affect how much of each raw ingredient to order, like how long the ingredient can sit on the shelf before it goes bad. His team would just find information on the web to see how long the shelf life was for a certain ingredient.
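One way shelf life could factor into an order is sketched below. The episode doesn't describe the actual model, so treat this as a simple heuristic under my own assumptions: you only order enough to cover the days until the next order or the ingredient's shelf life, whichever is shorter. The function name, parameters, and numbers are hypothetical.

```python
def order_quantity(daily_usage_forecast, days_until_next_order, shelf_life_days, on_hand=0.0):
    """Suggest how much of an ingredient to order (in the same unit as the forecast).

    Assumption: stock older than its shelf life is wasted, so there is no point
    covering more days than the ingredient can survive on the shelf.
    """
    days_to_cover = min(days_until_next_order, shelf_life_days)
    needed = daily_usage_forecast * days_to_cover
    # Never suggest a negative order when there is already enough stock.
    return max(0.0, needed - on_hand)


# Made-up numbers: flour keeps for months, fresh cilantro only for a few days.
flour = order_quantity(daily_usage_forecast=6.0, days_until_next_order=14,
                       shelf_life_days=180, on_hand=10.0)
cilantro = order_quantity(daily_usage_forecast=2.0, days_until_next_order=14,
                          shelf_life_days=5)
print(flour, cilantro)
```

A short shelf life caps the order, which in practice would mean ordering that ingredient more often rather than in bulk.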

At the end of the day, he set up benchmarks for each restaurant and they were able to reduce food waste by 15-20% per restaurant.

Tips on how to learn data science if he were to do it all over again

Santiago wrote this awesome blog post right when the pandemic hit. It’s all about how he would learn data science from scratch if he were to do it all over again. He wrote the post because he was isolated in his house and just got to thinking: I got into the data science field kind of randomly. I gained most of my skills on the job. What would I have done differently to learn data science?

The way I like to learn is by doing.

Santiago likes to start at high-level concepts and then get deeper into specific topics. He might start with watching a YouTube video on neural networks instead of trying to learn a neural network model right away. The YouTube videos and blog posts would spark his curiosity to want to dig deeper into a topic.

Here’s a step-by-step on the tools and skills he would learn for aspiring data scientists:

Learn Python through online courses or through Kaggle
Data viz tools. People forget this is an important skill and just want to go straight into modeling stuff.
Start implementing models with a library like scikit-learn
Try your hand at Kaggle competitions
Go deeper into neural networks and more advanced topics

I would start with learning Python through courses or Kaggle. Then I’d learn how to visualize things. A lot of people forget this step and just want to model stuff. After you know the basics of Python and visualization, I’d start learning about implementing models with a library like scikit-learn. Then you move on to Kaggle competitions.

I’d highly recommend reading Santiago’s full blog post if you’re interested in learning data science from scratch. It’s Santiago’s most popular blog post by orders of magnitude. More than 200,000 people have read the blog post. After he published the post, famous YouTubers started creating videos similar to Santiago’s post.

I love posts like this because they save you from repeating the mistakes that someone else already made while learning a new topic.

Building a data science trivia game to help you prep for data science interviews

Santiago has always wanted to create a physical card game. Instead of making a physical game, he created a data science trivia game to help people prepare for data science interviews.

The way Santiago built the game is pretty interesting. He collected 200 questions from people in the RStudio community, the Apple community, and other online communities. He also reached out to Kaggle, who sent him a bunch of great interview questions. His wife designed the cards in Figma, from the colors to the typography. He put all the questions in a Google Sheet. There happens to be a Figma-Google Sheets plugin where you can sync data from Google Sheets to your designs in Figma.

He put the game up on Gumroad and to date has made 500 sales. Santiago believes the success of the game was due to the communities he worked with to get the questions, testimonials from customers, and building his game in public. It was the first time Santiago promoted his own project and got involved with different communities, instead of just being a participant or viewer from the sidelines.

Create your own Harry Potter fan fiction with a bot

One last project Santiago built on the side was a Harry Potter story generator using machine learning and natural language processing. Santiago has been a fan of the Harry Potter series since university. Before he knew about data analysis or machine learning, he’d read stories about how people would teach a bot to write a fictional story based on famous books or TV shows like Game of Thrones. With his new data science skills, he wanted to do the same thing for Harry Potter.

The project involved getting all the text from every Harry Potter book. This text was then fed into a neural network. He then used Streamlit, an open source Python framework for building and sharing data apps, to build the actual “data app.”

On the app, you say you want your new story to include Dumbledore and that the “temperature” of the story would be “normal” or “weird.” The “temperature” is a scale for how close the story would feel to the actual Harry Potter series versus something more outlandish.

Source: Harry Potter And The Deep Learning Experiment
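If you’re curious how a “temperature” setting typically works in text generation, here is a minimal sketch of temperature-scaled sampling. I’m assuming the app follows this common approach, since the episode doesn’t go into the implementation. The words, scores, and the specific temperature values standing in for “normal” and “weird” are all made up for illustration.

```python
import math
import random


def softmax_with_temperature(scores, temperature):
    """Turn raw next-word scores into probabilities.

    Low temperature sharpens the distribution (safer, more predictable picks);
    high temperature flattens it (unlikely words get chosen more often).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max before exponentiating for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_word(words, scores, temperature, rng=random):
    """Pick one next word at random, weighted by the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(words, weights=probs, k=1)[0]


# Hypothetical scores a trained network might assign to candidate next words.
words = ["wand", "broom", "spaceship"]
scores = [3.0, 2.0, 0.5]

normal = softmax_with_temperature(scores, temperature=0.5)  # stand-in for "normal"
weird = softmax_with_temperature(scores, temperature=2.0)   # stand-in for "weird"
print(normal)
print(weird)
print(sample_next_word(words, scores, temperature=2.0))
```

With the lower temperature, almost all of the probability lands on the most likely word. With the higher one, an out-of-place word like “spaceship” has a real chance of appearing, which is what makes the story feel more outlandish.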

Building and learning in public

I’ve talked about building and learning in public in a variety of episodes, and it was awesome hearing Santiago share how it has impacted his side projects. He went from seeing others build in public to actively participating in the movement. I’d say KP from On Deck coined the term a few years ago.

When Santiago was building his Harry Potter story generator, he already knew how to use Streamlit and had some basic experience with neural networks through online classes. But the online classes didn’t compare to the experience he gained from applying the skills to a real side project.

Similar to his project reducing food waste, he had to set his own project metrics, define data sources, and more importantly, figure out how to get his project seen. This is where the actual learning happens. You get error messages you’ve never seen, you Google stuff, and read a lot of Stack Overflow. The next time you come across these errors, however, you’ll have the experience of knowing how to handle the error or know that you came across a Stack Overflow post on how to solve it.

If you are learning, learn in public. Talk about stuff you’re learning because it will not only help you, but help others who are looking to learn the same thing.

Other Podcasts & Blog Posts

No other blog posts/podcasts mentioned in this episode!
