The Impact of AI on Data Engineering (Or, is it the Other Way Around?)

September 19, 2023

Generative AI is THE topic of the moment, and it likely will be for a long time because its impact is so far-reaching and, well, culture-shaping.

The Wall Street Journal now carries a separate section dedicated just to AI. Snowflake and Databricks have gone so far as to make AI a focus of their annual, industry-defining conferences (note that “AI” has been added to both event titles). Companies of all sizes are creating new executive roles for Chief AI Officers.

Stories of robot lawyers and AI-generated operas that rival Puccini’s best work make for great copy, but what gets lost in all of the hype are the details of how AI will impact the actual work being done by data engineers. Do AI innovations create more work for data teams? Does AI change the nature of what data experts focus their attention on? Or does it replace the people in these currently vital roles? 

We’ll get to all these questions and more, but let’s first get to the punchline: the emphasis on AI is going to create two things that are especially critical for data engineers:

More Data

Note that, contrary to reports of “unwieldy amounts of new data,” AI is not in the business of creating new data just for its own sake. It can be used to generate synthetic data or augment existing data, which gives data engineers additional resources to work with. That additional data must still be analyzed and managed, but it can be developed and used in ways that support data engineering efforts.

Generative AI techniques, such as generative adversarial networks (GANs) and variational autoencoders (VAEs), can generate synthetic data that mimics the characteristics and patterns of real-world data. This synthetic data can be used to supplement limited or incomplete datasets, enabling data engineers to perform more comprehensive analyses and testing without relying solely on actual data.
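
To make the idea concrete, here is a minimal Python sketch of synthetic data generation. It is deliberately far simpler than a GAN or VAE: it fits a single Gaussian to one numeric column and samples from it. The function names and the `real_order_amounts` values are hypothetical and exist only for illustration.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the real values."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real values."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements, e.g. order amounts in dollars.
real_order_amounts = [12.5, 14.0, 13.2, 15.8, 12.9, 14.4, 13.7, 16.1]

# A much larger synthetic sample that follows the same rough shape.
synthetic_order_amounts = sample_synthetic(real_order_amounts, n=1000)
```

A real generative model would capture relationships across columns and non-Gaussian shapes; this stand-in only shows the basic pattern of learning from real data and then sampling from what was learned.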

By creating more diverse and abundant data, AI improves the robustness and utility of data-driven models and systems. This, in turn, enables data engineers to build and fine-tune more effective machine learning models, develop comprehensive data pipelines, and design more robust data architectures.

More Data Gaps

The availability of more data, whether through synthetic generation or augmentation, will help data engineers identify and address gaps in data observability. With a larger and more diverse dataset, they have a broader foundation to analyze and monitor the performance, quality, and integrity of the data. The increased volume of data can provide additional insights and patterns that may not have been evident with limited datasets, enabling data engineers to detect anomalies, errors, or missing information more effectively.

At the same time, the expansion of data through generative AI also presents challenges for data engineers in terms of data observability. The generated or augmented data will likely introduce new complexities and nuances that need to be thoroughly understood and managed. Data engineers must ensure that the newly created data aligns with the desired properties and rules of the real-world data they are working with; in other words, there will be a need for rigorous discipline around data reliability. They will need to verify that the generative AI models used for data creation are reliable and produce data that accurately represents the original data sources. That, of course, necessitates having reliable data to feed the models. More on that later.
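
One way to check that generated data resembles its source is to compare the two distributions directly. The sketch below computes a two-sample Kolmogorov-Smirnov statistic in plain Python. The `looks_consistent` helper and its `max_gap=0.2` threshold are assumptions chosen for illustration, not a recommended setting.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def looks_consistent(real, synthetic, max_gap=0.2):
    """Hypothetical acceptance rule: pass when the distributions stay close."""
    return ks_statistic(real, synthetic) <= max_gap
```

A statistic of 0.0 means the two samples have identical empirical distributions, and 1.0 means they do not overlap at all. In practice a check like this would run per column, alongside rule-based checks for business constraints.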

More Data Should Evolve Into Better Data With AI

Key Points

  • More data doesn't necessarily equal better data.
  • AI can enhance data engineering by improving data discovery and access.
  • AI-powered data analytics tools automate the extraction of insights from complex datasets.
  • AI contributes to data democratization by enabling self-service analytics and visualization.
  • Data observability ensures the reliability, quality, and accuracy of generated data.
  • The performance of large language models like GPT-4 can vary over time, posing challenges for data teams using generative AI for data products.

I was hesitant to write this heading because, to be clear, there are no simple equations in data. More data doesn’t equal better data, and even AI can’t create the perfect algorithm to make it so. However, AI offers some gifts to data engineers for aspects of their jobs that previously were problematic. Let’s start by saying that AI has the potential to greatly improve access to data and make it more ubiquitous by addressing several key challenges in data availability, integration, and analysis. That access is the “more data” part. Let’s look at how AI can transform that into “better data.”

Every data engineer probably already understands that AI technologies can enhance data discovery and access by automating the process of identifying and retrieving relevant data from various sources. Machine learning algorithms are employed to analyze metadata, understand data schemas, and recommend relevant datasets based on user requirements. This streamlines the data search process, saving time and effort for data consumers and enabling them to access a wider range of data. This alone eliminates time-intensive parts of the data engineer’s job while simultaneously providing them with more of what they need to improve the operational aspects of it. 
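
As a rough illustration of metadata-driven discovery, the sketch below ranks datasets by how many query terms appear in their names, descriptions, and column names. This is simple keyword overlap, not a machine learning model, and the `catalog` contents and function names are hypothetical.

```python
def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(text.lower().replace("_", " ").split())


def recommend(query, catalog, top_k=2):
    """Return up to top_k dataset names ranked by overlap with the query."""
    query_terms = tokenize(query)
    scored = []
    for name, meta in catalog.items():
        terms = (
            tokenize(name)
            | tokenize(meta["description"])
            | tokenize(" ".join(meta["columns"]))
        )
        score = len(query_terms & terms)
        if score:
            scored.append((score, name))
    # Highest score first; ties broken alphabetically by dataset name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical metadata catalog.
catalog = {
    "orders_daily": {
        "description": "Daily customer orders with revenue",
        "columns": ["order_id", "customer_id", "revenue"],
    },
    "web_sessions": {
        "description": "Clickstream sessions per visitor",
        "columns": ["session_id", "visitor_id", "duration"],
    },
    "customer_profiles": {
        "description": "Customer master data",
        "columns": ["customer_id", "region"],
    },
}

suggestions = recommend("customer revenue", catalog)
```

A production system would likely replace the overlap score with learned embeddings or usage signals, but the shape of the problem, scoring datasets against a user's stated need, stays the same.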

AI also facilitates data integration and interoperability. Data often resides in disparate systems and formats, making it challenging to consolidate and combine for analysis. AI techniques like natural language processing, entity resolution, and data mapping can help in harmonizing and aligning data from different sources. With the automation of these integration tasks, AI reduces the complexity and effort required to unify data from diverse systems, enabling a more comprehensive and holistic view of the information.
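
Entity resolution is the easiest of these techniques to sketch. The example below uses Python's standard `difflib.SequenceMatcher` to pair company names from two systems that spell them differently. The sample names, the `match_entities` helper, and the 0.85 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower()
    )
    return " ".join(cleaned.split())


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_entities(left, right, threshold=0.85):
    """Pair each left name with its most similar right name above the threshold."""
    matches = {}
    for left_name in left:
        best = max(right, key=lambda right_name: similarity(left_name, right_name))
        if similarity(left_name, best) >= threshold:
            matches[left_name] = best
    return matches


# Hypothetical records from two systems that describe the same companies.
crm_names = ["Acme Corp.", "Globex, Inc", "Initech"]
billing_names = ["ACME Corp", "Globex Inc.", "Umbrella LLC"]

resolved = match_entities(crm_names, billing_names)
```

Here "Initech" is left unmatched because nothing in the billing list is similar enough, which is usually the safer failure mode when unifying records.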

Access to data health information is critical for AI models to be supplied with accurate and timely data

Then, consider that AI-powered data analytics tools can automate the process of extracting insights from large and complex datasets. Machine learning algorithms can identify patterns, correlations, and anomalies in data, revealing valuable insights that may be difficult or time-consuming for humans to discover. This automation of data analysis enables users to quickly derive actionable insights and make data-driven decisions, ultimately democratizing the use of data and empowering a broader range of users.

There are also huge advantages in terms of data democratization. AI can contribute to data democratization by enabling self-service analytics and visualization capabilities. Natural language processing and chatbot interfaces allow users to interact with data using everyday language, eliminating the need for specialized query languages or technical expertise. This empowers individuals across organizations to independently access and explore data, fostering a culture of data-driven decision-making at all levels.

Data engineers are always faced with the overarching issue of data quality and reliability. While they’re focusing on building data pipelines and optimizing their performance, they’re also burdened by the ever-present concern of the accuracy of data in their enterprise and how early they can ensure its reliability. Automated data validation and cleansing algorithms can detect and correct errors, inconsistencies, and missing values in datasets. This improves the overall quality of data, making it more reliable and trustworthy for analysis and decision-making purposes.
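
A minimal sketch of automated validation might look like the following. It only detects problems and sets bad rows aside rather than correcting them, and the field names, ranges, and `validate_and_clean` function are hypothetical.

```python
def validate_and_clean(rows, required, ranges):
    """Split rows into clean rows and a list of (row_index, problem) findings."""
    clean, findings = [], []
    for index, row in enumerate(rows):
        # A required field counts as missing when it is absent, None, or empty.
        problems = [
            f"missing {field}" for field in required if row.get(field) in (None, "")
        ]
        # Range checks apply only to values that are actually present.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if value not in (None, "") and not (low <= value <= high):
                problems.append(f"{field} out of range")
        if problems:
            findings.extend((index, problem) for problem in problems)
        else:
            clean.append(row)
    return clean, findings


# Hypothetical order records with one missing and one implausible amount.
rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 3, "amount": -4.0},
]

clean_rows, findings = validate_and_clean(
    rows, required=["order_id", "amount"], ranges={"amount": (0, 10_000)}
)
```

Keeping the findings alongside the clean rows matters: the list of what was rejected, and why, is often as useful to a data team as the cleaned dataset itself.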

Data observability becomes increasingly critical as AI creates more data and evolves it into better data because it ensures the reliability, quality, and accuracy of the generated data. Data engineers will get relief and support from data observability because it allows for real-time monitoring and analysis of this data. They’ll see the errors, behavioral anomalies, and data inconsistencies that could impact data quality, and can turn that all into actionable remediation.
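
As a simple example of what such monitoring can look like, the sketch below flags a daily row count that deviates sharply from recent history using a z-score. The sample counts, the `is_anomalous` helper, and the threshold of 3.0 are assumptions for illustration, not a description of any particular product.

```python
import statistics


def is_anomalous(history, latest, z_threshold=3.0):
    """Flag the latest value when it sits far outside the historical spread."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # With no historical variation, any change at all is unusual.
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


# Hypothetical row counts from the last seven daily loads of one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_160]

normal_day = is_anomalous(daily_row_counts, 10_100)
broken_load = is_anomalous(daily_row_counts, 4_300)
```

The same pattern extends to freshness, null rates, or schema changes: establish what normal looks like, then alert when a new observation falls well outside it.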

Acceldata’s Chief Product Officer, Ramon Chen, recently pointed to a joint study by AI researchers at Stanford and UC Berkeley on performance drift in large language models (LLMs) over time. The resulting paper, “How Is ChatGPT’s Behavior Changing Over Time?”, indicates that the performance and efficacy of GPT-4, especially for coding and compositional tasks, is changing rapidly, and that there is not yet a way to accurately determine whether results are improving or declining.

This is a problem not just of data sources, but also of behavior, integration, and model development. The question any data team has to ask is whether this drift will continue, whether it is cyclical, or whether we simply do not know. Any one of those circumstances is problematic for anyone seriously considering using generative AI to produce data products.

We’re not suggesting that data observability is a salve that can be liberally applied to improve LLM performance, but ultimately, in the context of AI-generated data, data observability ensures that data engineers can confidently leverage the wealth of data at their disposal (this is where the “more data” and “better data” intersect) to drive better business decisions.

More AI, More Data, More Insights

“You’re gonna need a bigger boat…”

- Chief Brody (Jaws, 1975)

Key Points

  • Enterprises in the era of public cloud and digital transformation need multi-cloud software tools for efficient data handling.
  • Data observability offers comprehensive visibility into the data platforms that use AI-enhanced data.
  • Real-time observability is crucial for monitoring application and infrastructure health to ensure uninterrupted service and performance.
  • Data observability monitors large language model (LLM) outputs and ensures their stability.
  • Growing complexity in applications, distributed workloads, and cloud-native technologies necessitate data observability solutions.
  • Cloud-native technologies generate massive amounts of performance data that need continuous monitoring.
  • Frequent software releases in the modern development lifecycle require vigilant performance monitoring.
  • The increasing volume of data from various sources intensifies the challenges faced by data engineers in monitoring data applications and infrastructure.

As the adoption of public cloud and digital transformation continues, companies will increasingly rely on software tools specifically designed for multi-cloud environments. These tools enable efficient data ingestion, management, and extraction of valuable insights from the expanding data volumes. Maintaining real-time observability, which involves monitoring the health of applications and infrastructure, is crucial for ensuring uninterrupted service, optimal performance, and a consistent user experience.

Melissa Knox, Global Head of Software Investment Banking at Morgan Stanley, emphasizes the significant cost associated with downtime. By her calculations, when applications or infrastructure fail, digital businesses can lose millions of dollars per hour. Data engineers are already leveraging AI to predict and prevent outages, downtime, and subpar user experiences, thereby mitigating financial losses and maintaining high service quality. Their key tool in this is data observability.

There’s no question that data observability is powerful. While all data tools provide some level of visibility into their own operational behavior, without the comprehensive nature of a data observability solution, data teams will only have myopic views into specific groups of tools. For example, lakehouse monitoring and data warehouse insights are critical to assess and manage the operations of a data environment. But those types of insights neglect the activity of data happening outside of lakehouses and warehouses. And the problem isn’t just that it’s an incomplete picture, but that the lack of omnipresent visibility creates blind spots. Every data leader knows that what you don’t know can be a killer.

Consider all of this in the context of large language models (LLMs). Data observability becomes instrumental in recognizing any potential drifts in their outputs, ensuring their stability and consistency. And because of its unique foundational view across different layers and platforms of the environment, data observability aids in comprehending the repercussions of fine-tuning and other adjustments on the overall efficacy of these models.
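
One simple way to recognize output drift is to replay a fixed set of questions against a model at different points in time and compare the results. The sketch below assumes the answers have already been collected. The questions, answers, `drift_report` function, and 5% tolerance are hypothetical and are not taken from any study.

```python
def accuracy(answers, expected):
    """Fraction of evaluation questions answered exactly as expected."""
    correct = sum(1 for q, want in expected.items() if answers.get(q) == want)
    return correct / len(expected)


def drift_report(baseline_answers, current_answers, expected, tolerance=0.05):
    """Compare two snapshots of model output on the same evaluation set."""
    baseline = accuracy(baseline_answers, expected)
    current = accuracy(current_answers, expected)
    return {
        "baseline": baseline,
        "current": current,
        "drifted": abs(current - baseline) > tolerance,
    }


# Hypothetical fixed evaluation set and two snapshots of model answers.
expected = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no"}
march_answers = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no"}
june_answers = {"q1": "yes", "q2": "yes", "q3": "yes", "q4": "no"}

report = drift_report(march_answers, june_answers, expected)
```

The report flags a change in either direction, since an unexplained improvement can break downstream assumptions just as easily as a decline.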

The role of data observability in the AI landscape becomes far more pronounced when you think about Knox’s estimates. This is especially true as the shift towards distributed applications and the adoption of cloud-native technologies has brought about significant changes in the way teams observe and manage their application environments.

First off, consider the growing complexity of applications, with interconnected workloads and services spanning on-premises and cloud environments, as well as the use of ephemeral components such as containers orchestrated by Kubernetes. All of this has rendered traditional IT monitoring tools inadequate. Addressing it requires a data observability platform that can operate in a unified fashion across on-prem, cloud, multi-cloud, and hybrid environments, and that enables real-time monitoring and historical analysis to offer a comprehensive view of the entire system. Given today’s complex architectures, there is no other way.

Data engineers need insight into the volume of performance data, including metrics, logs, traces, and events. Cloud-native technologies such as microservices, serverless architectures, and Kubernetes generate a massive number of services, components, and functions that need to be continuously monitored for optimal performance. The sheer scale and dynamic nature of these environments demand rigorous monitoring capabilities to identify and address any performance issues.

Also, consider the pace of change throughout the software development lifecycle, especially as more enterprises leverage their data investments to produce a myriad of data products. This has a direct impact on applications and infrastructure. The traditional model of one or two major software releases per year has given way to more frequent, incremental releases. This presents a significant challenge from an application performance perspective, as each release introduces the potential for performance issues. Continuous deployment and integration practices require vigilant monitoring to detect and resolve performance-related problems as new code is regularly deployed into production.

The volume of data that must be monitored across data applications and infrastructure is growing rapidly, day after day. This is the challenge that data engineers face, and it’s a race against exponential growth. Each additional data source requires insight into its own activity, and also into every interaction it has with other applications and sources. It’s unwieldy from square one.
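
A quick back-of-the-envelope calculation shows how fast this gets out of hand. The sketch below makes the conservative assumption that only pairs of sources interact. The `monitoring_points` function is a hypothetical helper, not a standard metric.

```python
from math import comb


def monitoring_points(n_sources):
    """Each source on its own, plus every possible pairwise interaction."""
    return n_sources + comb(n_sources, 2)


small_estate = monitoring_points(10)    # 10 sources + 45 pairs
large_estate = monitoring_points(100)   # 100 sources + 4,950 pairs
```

Under this assumption, growing from 10 to 100 sources raises the number of things to watch from 55 to 5,050. The sources grew tenfold, but the monitoring burden grew by a factor of more than ninety, and it climbs faster still if interactions involve more than two systems at a time.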

The Future of Data Engineering and AI Collaboration

The collaboration between data engineering and AI is rapidly evolving, and as we’ve seen in this piece, it’s critical to recognize that more data doesn't automatically translate to better data. It's not just about quantity; for modern enterprises to capitalize on generative AI, they must emphasize the quality and accessibility of data. When data teams keep this focus, they can effectively apply AI to enhance data engineering by streamlining data discovery, automating analytics, and fostering data democratization.

AI-driven data analytics tools have the power to unlock valuable insights from complex datasets, empowering organizations to make data-driven decisions faster, and with more applicability to business operations. Through better access to reliable data, users across organizations can use data independently, which fosters a culture of data-driven decision-making.

Data observability provides the foundation for the effective use of AI because it ensures the reliability, quality, and accuracy of generated data. It plays a crucial role in monitoring the outputs of large language models like GPT-4, helping maintain their stability and consistency over time.

As the data landscape continues to evolve, collaboration between AI and data engineering will be crucial for organizations aiming to stay ahead in their data-driven journeys. By harnessing the best of both worlds, they can drive innovation, optimize operations, and make informed decisions in an increasingly data-rich world.

The Collaboration Between Data Observability and AI

  • Collaboration between generative AI and data engineering can transform data-driven operations.
  • Generative AI can automate code production and labor-intensive tasks, boosting data engineer productivity.
  • Rigorous validation and testing are essential for ensuring the accuracy and reliability of generated code.
  • Continuous oversight is necessary to maintain the efficacy of AI models in code generation.
  • Generative AI complements data engineers' work by accelerating solution development.
  • Data engineers retain their strategic roles as data environment architects and caretakers.
  • The human component provides context, intelligence, and judgment that AI models are not built to provide.

Learn more about Acceldata's acquisition of Bewgle, which will deepen enterprise data observability capabilities for AI and LLM.

Photo by CHUTTERSNAP on Unsplash
