Do I Need Testing, Monitoring, Both, or Neither for My Data Pipelines?
Introduction
If you have a data pipeline that moves data from a source (like cloud storage) into a data warehouse, how do you know that the data at the destination meets your expectations? You can use data testing, a set of coded rules that check whether the data conforms to certain requirements.
Aside from testing, another method of checking data involves monitoring and observability. A data monitoring solution collects multiple metrics about your data throughout the data pipeline. These metrics can be used to detect problems early on, both manually and algorithmically, without explicitly testing for them.
In this article, you’ll learn about data testing, data monitoring, and how to get started with both.
Why You Need Data Testing and Monitoring
Both data testing and data monitoring are crucial to your data engineering practices. The following are more details about each method.
Data Testing
You can’t always control how data is collected or what you collect. There’s uncertainty involved, whether you’re scraping the web, using sensors, or collecting user input from open text fields. When you’re building data products, though, you need to maintain a certain level of data quality in order to process, store, or visualize data.
There are many dimensions to data quality. The following are some key factors:
Data validity: To store dates or times, they need to be in the correct format. A `MM/DD/YY` string could be misinterpreted if `YYYY-MM-DD` is expected.
Data uniqueness: No two rows in a table should be the same.
Data completeness: Every row should contain all of the data it’s expected to have; no required fields should be left empty.
Data consistency: If data in multiple places is not identical when it should be, it isn’t consistent. For example, when a customer profile exists in the e-commerce platform and the CRM, the address should be the same in both places.
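To make these dimensions concrete, here’s a minimal sketch in Python with pandas, assuming a hypothetical `customers` table with illustrative column names (consistency usually requires comparing the same record across systems, so it’s omitted here):

```python
import pandas as pd

# Hypothetical customer snapshot; column names are illustrative.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "signup_date": ["2023-01-15", "2023-02-30", "2023-03-01", None],
    "email": ["a@example.com", "b@example.com", "b@example.com", None],
})

# Validity: dates must parse as YYYY-MM-DD; invalid values become NaT.
parsed = pd.to_datetime(customers["signup_date"], format="%Y-%m-%d", errors="coerce")
invalid_dates = customers[parsed.isna() & customers["signup_date"].notna()]

# Uniqueness: no two rows should share a customer_id.
duplicated = customers[customers.duplicated(subset="customer_id", keep=False)]

# Completeness: every row should have an email address.
missing_email = customers[customers["email"].isna()]

print(len(invalid_dates), len(duplicated), len(missing_email))  # 1 2 1
```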
To manage data that’s not behaving as expected, data engineers implement tests throughout an organization’s data pipelines. Data tests are based on assumptions that must be confirmed in order for data to be processed as planned. Erroneous data needs to be managed as well. That could mean marking the data, processing it differently, storing it for later processing, or triggering a request for manual intervention.
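As a sketch of how erroneous data might be routed rather than silently dropped (the not-null rule and the dead-letter handling are hypothetical):

```python
import pandas as pd

# Hypothetical incoming batch; the rule below is illustrative.
batch = pd.DataFrame({"order_id": [101, 102, 103],
                      "amount": [19.99, None, 5.00]})

# Assumption to confirm before processing: every order has an amount.
is_valid = batch["amount"].notna()

processed = batch[is_valid]            # continues down the pipeline
quarantined = batch[~is_valid].copy()  # marked and stored for later handling
quarantined["rejected_reason"] = "missing amount"

# A real pipeline might write `quarantined` to a dead-letter table
# or trigger a request for manual intervention instead of printing.
print(f"{len(processed)} rows processed, {len(quarantined)} quarantined")
```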
Data Monitoring
Data monitoring is closely related to data testing, in that they’re both meant to improve or preserve data quality, but monitoring starts from a different philosophy. Instead of testing your data against known scenarios, monitoring your data means collecting, storing, and analyzing multiple properties of your data. In other words, it’s collecting data about your data over time. When a data monitoring system detects something seemingly anomalous, it alerts you. Modern monitoring systems also provide observability of your data pipelines; by assessing the output of a data system, they indicate what might be the root cause of an anomaly.
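As a rough illustration of “data about your data,” the following sketch records a few per-run metrics as a growing time series (the metric names and the JSONL file stand in for a real metrics store):

```python
import datetime
import json
import pandas as pd

def collect_metrics(df: pd.DataFrame, table: str) -> dict:
    """Gather simple per-run metrics about a table's contents."""
    return {
        "table": table,
        "measured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "row_count": len(df),
        "null_rate": float(df.isna().mean().mean()),  # overall share of nulls
    }

# Appending one record per pipeline run builds the time series that a
# dashboard or anomaly detector can later analyze.
snapshot = pd.DataFrame({"customer_id": [1, 2, 3], "plan": ["pro", None, "free"]})
with open("metrics.jsonl", "a") as f:
    f.write(json.dumps(collect_metrics(snapshot, "customers")) + "\n")
```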
Data monitoring can address multiple issues:
Unknown unknowns: To write data tests, you need to know the scenarios you want to test for in advance. Large organizations might have hundreds or thousands of tests in place, but they can’t catch data issues they didn’t even know could happen. Data monitoring notifies them about such oddities so they can quickly find the root cause.
Data changes: Downstream tests are rarely designed to catch data drift, or changes in the data input.
Data pipeline changes: Businesses evolve and so do their data products. Implemented changes often break the existing logic downstream in ways that tests can’t account for. Proper monitoring tools can help quickly identify these problems in testing and production environments.
Testing debt: An organization’s data pipelines might have been in place for years, dating from an era when the internal data maturity was low and testing wasn’t a priority. With such technical debt, debugging pipelines can take an eternity. Monitoring tools can help organizations set up proper tests.
Data Testing vs. Data Monitoring
There are two key differences between data testing and data monitoring: deliberateness and specificity.
Data testing is deliberate, meaning you’re comparing a data object such as a single value, row, or table to a list of rules that the data object must follow. For example, a test ensures that a new customer row has both a first name and last name. Data monitoring is *indeterminate*, meaning that by tracking certain metrics over time, you’ll establish a baseline of what’s normal. If these metrics start deviating, you’ll investigate the cause. Say that you have a model that predicts which customers will churn. Since machine learning models are probabilistic, any data drift will cause your model’s prediction accuracy to decrease. If you monitor the share of inactive customers in the daily snapshots of your customer database, you’ll be able to catch data drift early and modify your model.
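For instance, a crude baseline-and-deviation check might look like this (the metric history and threshold are made up for illustration):

```python
import statistics

# Hypothetical daily share of inactive customers from recent snapshots.
history = [0.12, 0.11, 0.13, 0.12, 0.11, 0.12, 0.13]
today = 0.21

baseline = statistics.mean(history)
spread = statistics.stdev(history)

# Flag today's value if it falls more than three standard deviations
# from the baseline (a simple, common drift signal).
z_score = (today - baseline) / spread
if abs(z_score) > 3:
    print(f"Possible data drift: inactive share {today:.0%} (z = {z_score:.1f})")
```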
Data testing is specific, meaning a test validates a data object at one particular point in the data pipeline, while monitoring is valuable because of its holistic properties. By tracking various metrics in multiple components in the data pipeline over time, data engineers can interpret anomalies in relation to the whole data ecosystem. Say you’re tracking the number of new customers in the database daily, and one day the number of new customers drops. But you’re also tracking an identical metric in the raw or staging tables, as well as duplicate values in those tables. While the individual metrics aren’t very telling, analyzing them together helps you assess if your data pipeline is malfunctioning or if there was a decrease in customer acquisition.
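A toy version of that holistic reasoning, with illustrative stage names and thresholds:

```python
# Hypothetical daily counts of new customers at each pipeline stage.
counts = {"raw": 480, "staging": 478, "warehouse": 210}

loss_after_staging = 1 - counts["warehouse"] / counts["staging"]
source_drop = 1 - counts["staging"] / counts["raw"]

# If raw and staging agree but the warehouse lags far behind, the
# pipeline (not customer acquisition) is the likely culprit.
if loss_after_staging > 0.10 and source_drop < 0.05:
    print(f"Pipeline issue suspected: {loss_after_staging:.0%} of rows lost after staging")
elif source_drop > 0.10:
    print("Drop appears upstream; check the source system or the business metric")
```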
Why You Need Both
It’s not possible to develop tests for every possible data issue. Data monitoring helps you detect problems that your data tests can’t cover. But monitoring also augments your tests: your tests detect data quality issues, while monitoring indicates what the root cause could be.
You gain several advantages by monitoring a data stack on top of writing and implementing business logic tests:
You can make changes to a data system with confidence. In order to prevent regression, your data tests check if business logic is respected, while monitoring detects unforeseen problems that occur downstream.
You can detect changes in source systems early, infer the likely cause, and determine what changes your data processing system needs — for example, changes in volume, variety, and veracity — so that modifications can be handled promptly.
Getting Started with Data Testing
Data tests can be implemented and executed in multiple phases of the DataOps development process. The following are types of tests to include.
Unit Testing
Data engineers run unit tests while developing changes to the data pipelines. They are executed on tiny isolated components (units), like extractions, loads, or transformations. By running a single data object through these individual processes, you can check if the output matches the expectation.
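For example, a unit test for a hypothetical transformation function might look like this (runnable with pytest):

```python
import pandas as pd

def normalize_emails(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical transformation unit: trim and lowercase emails."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()
    return out

def test_normalize_emails():
    # A single small data object exercises this one unit in isolation.
    raw = pd.DataFrame({"email": ["  Alice@Example.COM "]})
    result = normalize_emails(raw)
    assert result.loc[0, "email"] == "alice@example.com"
```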
End-to-End Testing
An end-to-end test is a straightforward way of checking if a complete data pipeline behaves as expected. Typically you would run this test once changes are deployed and integrated in a staging environment. The test requires a data object for which you have the initial and the expected final form. When you run the initial data object through your data pipeline, you can compare your expected form to the actual result.
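In code, the idea reduces to comparing a known input’s expected final form with the pipeline’s actual output (the pipeline function here is a stand-in for the real thing):

```python
import pandas as pd

def run_pipeline(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for a real extract-transform-load chain."""
    out = df.dropna(subset=["email"]).copy()
    out["email"] = out["email"].str.lower()
    return out.reset_index(drop=True)

# A data object with a known initial and expected final form.
initial = pd.DataFrame({"email": ["A@X.COM", None, "b@x.com"]})
expected = pd.DataFrame({"email": ["a@x.com", "b@x.com"]})

# Raises an AssertionError if the pipeline's output deviates.
pd.testing.assert_frame_equal(run_pipeline(initial), expected)
```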
Data Quality Testing
Unlike the previous tests, a data quality test runs continuously with every new piece of data that flows through a component of a pipeline. Every time a data object is processed, the test runs a routine check of the output for specific properties, such as null values, date formats, capitalization, and data type. One popular open source framework for implementing data quality tests is Great Expectations.
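As a sketch (the Great Expectations API has changed considerably across versions; this assumes the legacy pandas-dataset interface):

```python
import great_expectations as ge
import pandas as pd

# Wrap a DataFrame so each expectation runs as a routine output check.
df = ge.from_pandas(pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2023-01-01", "2023-01-02", "2023-01-03"],
}))

# Expectations mirror the quality dimensions discussed earlier.
no_nulls = df.expect_column_values_to_not_be_null("customer_id")
unique_ids = df.expect_column_values_to_be_unique("customer_id")
valid_dates = df.expect_column_values_to_match_strftime_format(
    "signup_date", "%Y-%m-%d"
)

print(no_nulls.success, unique_ids.success, valid_dates.success)
```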
Data quality tests can be implemented in various steps and tools across a data pipeline. Typically, modern data stacks use a tool like Airflow or Prefect to orchestrate the data processing steps from ingestion to output via directed acyclic graphs (DAGs).
Implementing tests as a component of a DAG offers three main advantages:
1. Tests can be implemented at every step, not just at the end of a pipeline.
2. Tests are run at the moment of processing.
3. Tests are centralized in the same tool, making it easier to maintain an overview of them.
Getting started is relatively straightforward. For example, you could use a Great Expectations operator in your Airflow DAG.
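A minimal sketch, assuming the `airflow-provider-great-expectations` package (operator arguments vary across provider versions, and `customers_checkpoint` is a hypothetical checkpoint name):

```python
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG(
    dag_id="customers_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Validates the data at this step of the DAG; upstream and
    # downstream tasks would be chained around it.
    validate_customers = GreatExpectationsOperator(
        task_id="validate_customers",
        data_context_root_dir="great_expectations",
        checkpoint_name="customers_checkpoint",
    )
```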
Adding tests to the orchestrator is mainly suited to extract, transform, load (ETL) workloads, where the transformations are often scattered across multiple tools. For extract, load, transform (ELT) workloads, in which transformation is performed inside the data warehouse, tests can be added at the end of the pipeline. For example, dbt, a popular tool for managing transformations within the ELT paradigm, offers a dedicated test module.
You can also implement tests inside specific data processing tools. For example, Apache Beam offers PAssert, while Deequ can test data in Spark workloads. This approach adds tests at the lowest level of abstraction: your processed data object is tested in real time within the processing tool that handles it, so nothing can introduce data quality issues between processing and testing.
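For instance, with Apache Beam’s Python SDK (the Python counterpart of the Java-side PAssert), the assertion runs as part of the pipeline itself:

```python
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

with TestPipeline() as p:
    emails = (
        p
        | beam.Create(["  Alice@X.com ", "b@x.com"])
        | beam.Map(lambda e: e.strip().lower())
    )
    # The check executes within the processing tool itself when the
    # pipeline runs at the end of this block.
    assert_that(emails, equal_to(["alice@x.com", "b@x.com"]))
```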
Getting Started with Data Monitoring
To implement data monitoring, you need to create logs, process them, visualize them, and set up alerts. This can mean developing an in-house monitoring solution: you can write logs via Python logging, stream them to a NoSQL database, visualize them with a tool like Grafana, and train an anomaly detection algorithm to alert you when monitoring metrics deviate.
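The logging piece of such an in-house setup can be as simple as emitting structured records (the metric names and values below are placeholders):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline.metrics")

def log_metric(table: str, name: str, value: float) -> None:
    # One JSON object per line is easy to ship to a NoSQL store
    # and chart in a tool like Grafana.
    logger.info(json.dumps({"table": table, "metric": name, "value": value}))

log_metric("customers", "row_count", 4821)
log_metric("customers", "null_rate", 0.03)
```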
Data monitoring and observability tools also offer these capabilities out of the box. For instance, it takes only a few minutes to integrate Telmai with your data stack to detect row-count drift, schema changes, distribution shifts, and outliers.
Conclusion
Data testing and data monitoring provide different ways of analyzing your data pipelines, and each method has clear advantages. As demonstrated in this article, you can achieve more complete and actionable results by using both. Tools that support these methods will help you optimize your data workflow.
One such tool is [Telmai](https://www.telm.ai/). The no-code platform offers fast data monitoring and automated alerts on data drift, as well as data and metadata analysis so that you can determine the cause of a problem. To learn more, sign up for a free starter account.
Credits: This article was first published on Telmai by Roel Peters.