It’s that time of the year again when data professionals look back at their data predictions from 2022, decide what they were wrong about, and think: “this must be the year for XYZ.” Aside from the fact that these types of predictions are 100% subjective and nearly impossible to verify, it’s always fun to play armchair quarterback and make a forecast about the future (see why forecasts are flawed in this episode about Superforecasting). The reason predicting what will happen in 2023 is tricky is that my predictions are based on what other people are talking about, not necessarily what they are doing. The only data point I have on what’s actually happening within organizations is what I see happening in my own. So take everything with a grain of salt and let me know if these predictions resonate with you!
1) Artificial intelligence and natural language processing don’t eat your lunch
How could a prediction for 2023 not include something about artificial intelligence? It seems like the tech world was mesmerized by ChatGPT in the second half of 2022, and I can’t blame them. The applications and use cases are pretty slick and mind-blowing. Internally at my company, we’ve already started testing this technology for summarizing meeting notes; it works quite well and saves a human from having to summarize the notes manually. My favorite application of AI shared on Twitter (where else do you discover new technologies? Scientific journals?) is this bot that argues with a Comcast agent and successfully gets a discount on an Internet plan:
These examples are all fun and cute and may help you save on your phone bill, but I’m more interested in how AI will be used inside organizations to improve data quality.
Data quality is always an issue when you’re collecting large amounts of data in real time every day. Historically, analysts and data engineers have run SQL queries to find records with missing or duplicate values. With AI, could some of this manual querying and these INSERT commands be replaced with a system that intelligently fills in the data for you? In a recent episode with Korhonda Randolph, Korhonda talks about fixing data by sometimes calling up customers to get their correct info, which then gets inputted into a master data management system. David Yakobovitch talks about some interesting companies in episode 101 that smartly help you augment your data using AI.
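To make the manual workflow concrete, here’s a minimal sketch (my own toy example, not from any of the episodes mentioned) of the kind of missing-value and duplicate checks an analyst might run by hand today, using an in-memory SQLite table:

```python
import sqlite3

# Toy customer table with one missing email and one duplicate row,
# standing in for the kind of data an analyst audits by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "Ada", "ada@example.com"),
        (2, "Grace", None),              # missing value
        (3, "Ada", "ada@example.com"),   # duplicate of row 1
    ],
)

# Rows with missing emails.
missing = conn.execute(
    "SELECT id, name FROM customers WHERE email IS NULL"
).fetchall()

# Duplicate (name, email) pairs.
dupes = conn.execute(
    """SELECT name, email, COUNT(*) AS n
       FROM customers
       GROUP BY name, email
       HAVING n > 1"""
).fetchall()

print(missing)  # [(2, 'Grace')]
print(dupes)    # [('Ada', 'ada@example.com', 2)]
```

An AI-assisted system would presumably run checks like these continuously and propose the fix (or the INSERT) itself, rather than waiting for a human to write the query.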
We’ve also seen examples of AI helping people code via Codex, for example. I think this might be an interesting trend to look out for as the demand for data engineers from organizations outpaces supply. Could an organization cut some corners and rely on Codex to develop some of this core infrastructure for their data warehouse? Seems unlikely if you ask me, but given the current funding environment for startups, who knows what a startup founder might do as runways shrink.
2) Enforcing data privacy and regulation in your user database
This trend has been going on since the introduction of GDPR in 2018. As digital transformation pushes all industries to move online, data privacy laws like GDPR and CCPA force companies to make security and governance the number one priority for all the data they store. User data in particular. Any company with a website where you can transact lets you create a user account. Most municipalities have a dedicated app where you can buy bus and metro tickets straight from the app. Naturally, they ask you to create a profile where your various payment methods are stored.
When it comes to SaaS tools, the issue of data privacy becomes even trickier to navigate. Many user research and user monitoring services tout their ability to show organizations what users and customers are “doing” on their websites and apps. Every single click, mouseover, and keystroke can be tracked. How much of this information do you store? What do you anonymize? It’s a cat and mouse game where user monitoring software vendors claim they can track everything about your customers, but then you have to temper what information you actually process and store. The data team at my own company is constantly checking these data privacy regulations to ensure that we implement data storage policies that reflect current legislation.
A closely related area to data privacy is data governance. Data governance vendors who help your organization ensure your data strategy is compliant have increased dramatically over the years as a result of data regulation and protection laws.
To bring this back to a personal use case, type your email address into haveibeenpwned.com. This website tells you which companies have had data breaches and whether your personal information may have been compromised. To take this another step, try Googling your name and your phone number or address in quotes (e.g. “John Smith 123-123-1234”). You’ll be surprised by how many of these “people finder” websites have your personal information and that of your family members. One of the many websites you’ve signed up for probably had a breach, and this information is now out there being aggregated by these websites; you have to manually ask them to take your information out of their databases. Talk about data governance.
3) Data operations and observability tools manage the data lifecycle
I’m seeing this happen within my own company and others. DevOps teams not only monitor the health of your organization’s website and mobile app, but also your databases and data warehouse. It’s becoming more important for companies undergoing digital transformation to maintain close to 100% uptime so that customers can access their data whenever they want. Once you give your customers and users a taste of accessing their data no matter where they are, you can’t go back.
I think it’s interesting to think about treating your “data as code” and applying concepts of versioning from software engineering to your data systems. Sean Scott talks about data as code in episode #96. The ETL process is completely automated, and a data engineer or analyst can clone the source code for how transformations happen to the underlying data.
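One minimal way to sketch the “data as code” idea (this is my own illustration, not Sean Scott’s implementation) is to version a dataset by hashing both the transformation that produced it and its input, so a change to either yields a new version:

```python
import hashlib

def transform(rows):
    """Example transformation: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r)]

def dataset_version(func, rows):
    # Hash the transformation's bytecode (a stand-in for versioning its
    # source) together with the input data, so any change to the code
    # or the data produces a different version identifier.
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(repr(rows).encode())
    return h.hexdigest()[:12]

raw = [(1, "Ada"), (2, None), (3, "Grace")]
clean = transform(raw)
version = dataset_version(transform, raw)

print(clean)    # [(1, 'Ada'), (3, 'Grace')]
print(version)  # short hash that changes if the code or data changes
```

Real data-as-code tooling is far more sophisticated, but the core appeal is the same: like cloning a Git repo, anyone can reproduce exactly how a given version of the data was derived.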
I’m a bit removed from my own organization’s data systems and tooling, but I do know that the data pipeline consists of many microservices and dependencies. Observability tools help you understand this whole system and ensure that if a dependency fails, you have ways to keep your data flowing to the right endpoints. I guess the bigger question is whether microservices is the right architecture for your data systems vs. a monolith. Fortunately, this type of question is way beyond my pay grade.
4) Bringing ESG data to the forefront
You can see this trend happening more and more, especially in consumer transportation. Organizations are more conscious about their environmental impact, with various ESG initiatives. To ensure organizations are following new regulations, the SEC and other regulatory bodies rely on quality data to verify compliance.
One can guess which industries will be most impacted by providing this ESG data, but I imagine other ancillary industries will be affected too. Perhaps more data vendors will pop up to help with auditing this data so that organizations can meet compliance standards. Who knows. All I know is that consumers are asking for it, and as a result this data is required to be disclosed.
We know that cloud computing and storage get cheaper every year (think Moore’s Law). Cheap from a monetary perspective, but what about the environmental impact? An interesting thought exercise is tracing the life of a query when you open Instagram on your phone and start viewing your timeline of photos. The storage and compute resources are monetarily cheap to serve that request, but there is still a data center running on electricity and water that needs to process it. Apparently data centers account for 1.8% of electricity use and 0.5% of greenhouse gas emissions in the United States (source).
When I think about all the cronjobs and DAGs that run every second to patch up a database or serve photos to someone’s Instagram feed, I wonder how many of these tasks are unnecessarily taxing our data centers. I have created a few Google Apps Scripts over the years (like creating events from email or syncing Google Sheets with Coda). You could have these scripts run every minute or every 5 minutes, but is it necessary? Considering that Google Apps Script is a 100% free service, it’s hard to understand the “cost” of running a script that hits a Google data center somewhere and may be moving gigabytes of data from one server to another. I started thinking about the cost of keeping these scripts alive for simple personal productivity hacks like creating calendar events from email. Sure, my personal footprint is small, but when you have millions of people running scripts, that naturally becomes a much bigger problem.
I still have a lot to learn about this area and my views are influenced by simple visualizations like the one above. It all starts with quality ESG data!
5) Organizations help employees acquire data literacy and data storytelling skills
This trend is a bit self-serving as I teach various online classes about Excel and Google Sheets. But as data tools like Mode, Looker, and Google Data Studio pervade organizations, analysts are no longer the only ones expected to know how to use and understand these tools. Unfortunately, data skills are not always taught in middle school or high school (they certainly weren’t taught when I was growing up). Yet the top skills we need when entering the workforce are related to using spreadsheets and analyzing data (I talk about this subject in episode 22 referencing this Freakonomics episode). This episode with Sean Tibor and Kelly Schuster-Paredes is also worth a listen, as Sean and Kelly were teachers who incorporated Python into the classroom.
In 2019, The New York Times provided a “data bootcamp” for reporters so that they could better work with data and tell stories with data. The Google Sheets files and training material from this bootcamp are still publicly available here. You can read more about this initiative by Lindsey Cook–an editor for digital storytelling and training at The Times–here. The U.S. Department of Education also believes that basic data literacy skills should be introduced earlier in the curriculum and they created this whole deck on why these skills are important. This is one of my favorite slides from that deck:
What does this mean for organizations in 2023? Upskilling employees in data literacy and storytelling could mean online classes or simply a 1- or 2-day training with your data team. Interestingly, data vendors already provide a ton of free training. While some of this training can be specific to the data platform itself (like Google’s Analytics Academy), other platforms provide general training on databases, SQL, and Excel. So if you don’t pay for training, at least utilize the free training provided by Mode, Looker, Google Data Studio, Tableau, etc.
Other Podcasts & Blog Posts
In the 2nd half of the episode, I talk about some episodes and blog posts from other people that I found interesting:
- Making Sense #299: Steps in the right direction – A conversation with Russ Roberts