Why You Need to Shift Your Data Monitoring Left

Mona Rakibe
Jun 8, 2022

In today’s world, data-driven businesses have the advantage. Data monitoring helps you keep your data at high quality by assessing collected data against a set standard. It is typically integrated into data pipelines as metrics that can be updated as needed, along with notifications that are sent to data owners when drift is identified.
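To make this concrete, here is a minimal sketch of one such check: a null-rate metric on a column with a threshold and a notification to the data owner. The function names, the `order_amount.null_rate` metric, and the 5% threshold are illustrative assumptions, not part of any particular monitoring tool.

```python
from typing import Iterable, Optional


def null_rate(values: Iterable[Optional[float]]) -> float:
    """Fraction of records in which the value is missing."""
    values = list(values)
    return sum(v is None for v in values) / len(values) if values else 0.0


def notify_owner(metric: str, value: float, threshold: float) -> None:
    # Stand-in for a real notification channel (Slack, PagerDuty, email, ...).
    print(f"ALERT: {metric}={value:.2%} exceeds threshold {threshold:.2%}")


def check_null_rate(column: Iterable[Optional[float]], threshold: float = 0.05) -> bool:
    rate = null_rate(column)
    if rate > threshold:
        notify_owner("order_amount.null_rate", rate, threshold)
        return False
    return True


if __name__ == "__main__":
    check_null_rate([10.0, None, 12.5, None, None, 9.9])  # 50% nulls -> alert
```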

Data monitoring enables you to address inaccuracies and potential issues more quickly. It also helps you increase organizational efficiency and make more accurate business decisions. However, even this practice may not be giving you a complete picture of your data. If you only implement monitoring once your data has been loaded into a data warehouse, you might be missing crucial information earlier in the process.

This article will explain the data monitoring process and where traditional pipelines might be falling short. You’ll also learn strategies to address these problems so that you can optimize your data usage.

What Is a Data Warehouse?

A data warehouse provides centralized storage for data that’s been acquired for processing from multiple sources. The warehouse stores data as it’s retrieved over time, so it holds many versions of that data. This is in contrast to transactional databases, which keep only the most recent, real-time state of the data.

A well-built data warehouse executes complex queries in short periods of time, delivers a large amount of data quickly, and allows authorized team members to access segments of data for business insights. It serves as the functional foundation for producing analytical reports and dashboards.

The goal of data warehousing is to make it possible to analyze chronological data. Data collected from multiple sources, including historical data, can provide insights into a company’s performance and help guide its future activities. Data entered into the warehouse does not change, making the warehouse the source for analytics on prior events and changes over time. This means warehoused data must be kept in a safe, retrievable, and manageable way.

Data Pipelines

A data pipeline is a set of connected stages for processing data from its source or multiple sources to a preset destination. At the start of the pipeline, the data is ingested and goes through a sequence of steps to produce a specific output. That output becomes the input for the next phase in the pipeline. Data pipelines can be built to follow either the extract, transform, load (ETL) or the extract, load, transform (ELT) data integration methodology.

In an extract, transform, load (ETL) data pipeline, data is first ingested from one or multiple sources. Next, the data is transformed in a staging environment, which could consist of one or more microservices. When the desired transformations are complete, the data is persisted in the data warehouse. Extract, load, transform (ELT) data pipelines also start by extracting data from a source, but the next step is to persist the raw data in staging storage. When processed data is needed, cleaning, enriching, and transformation are handled at consumption time.
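The following is a minimal sketch of the two orderings, assuming placeholder extract, transform, and load functions that stand in for real connectors and services rather than any specific framework.

```python
from typing import Dict, List

Record = Dict[str, str]


def extract() -> List[Record]:
    # Pretend this pulls rows from a source system.
    return [{"name": " Ada ", "country": "gb"}, {"name": "Grace", "country": "us"}]


def transform(records: List[Record]) -> List[Record]:
    # Cleaning/enrichment, e.g. trimming whitespace and normalizing codes.
    return [{"name": r["name"].strip(), "country": r["country"].upper()} for r in records]


def load(records: List[Record], destination: str) -> None:
    print(f"loaded {len(records)} records into {destination}")


def run_etl() -> None:
    # ETL: transform happens before the data reaches the warehouse.
    load(transform(extract()), destination="warehouse")


def run_elt() -> List[Record]:
    # ELT: raw data is persisted first; transformation happens at consumption time.
    raw = extract()
    load(raw, destination="staging")
    return transform(raw)


if __name__ == "__main__":
    run_etl()
    run_elt()
```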

ELT data pipelines have seen increasing adoption over the traditional ETL methodology in recent times. Because loading comes before transformation in this approach, and monitoring typically occurs at the load step, ELT pipelines already place data monitoring relatively far left. The ETL approach performs its transformations before loading, so there are more operations that can benefit from monitoring before the data reaches the warehouse. The examples in this article follow the ETL methodology.

Example data pipeline

In the above diagram, the process begins when the data sources are identified. They could be user-generated data, data from the web, or application-related data. These sources are connected to the service(s) at the ingestion stage to extract the data. Next, transformation is typically handled by microservices. The raw data will be persisted in a data lake — data storage that can contain raw and preprocessed data. The required cleaning and transformational processes are performed until the data is in the desired state, after which it is stored in the data warehouse. The data warehouse serves as the final destination of processed data.

Data Monitoring in a Data Warehouse

Data warehousing is the final step in the data pipeline, but it’s also traditionally when data monitoring is performed. This means that the data has been persisted in the data warehouse before any bugs or other issues are identified and corrected. This is analogous to a feedback control system that only monitors quality in the output and then attempts to correct the cause of problems. The output is already contaminated, however.

Any problem with data already in a data warehouse will only raise an alert at the end of the pipeline. Since data pipelines typically involve multiple microservices, this makes finding the problem’s source more difficult. This is because the data from ingestion will have been transformed through a series of operations and in some cases combined with data from other sources. To identify the problem, data engineers will need to investigate the various stages in the transformation pipeline and analyze data stored in pre-warehouse locations like data lakes.
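A minimal sketch of the kind of backtracking this requires: re-running the same quality metric against the data saved after each stage (in a data lake, for example) to find where the problem first appears. The stage names and the null-rate metric below are illustrative assumptions.

```python
from typing import Dict, List, Optional


def null_rate(column: List[Optional[float]]) -> float:
    return sum(v is None for v in column) / len(column) if column else 0.0


def first_bad_stage(staged_outputs: Dict[str, List[Optional[float]]],
                    threshold: float = 0.05) -> Optional[str]:
    """Return the earliest stage whose saved output violates the metric."""
    for stage, column in staged_outputs.items():  # dicts preserve insertion order
        if null_rate(column) > threshold:
            return stage
    return None


if __name__ == "__main__":
    staged = {
        "ingestion": [1.0, 2.0, 3.0, 4.0],
        "cleaning": [1.0, 2.0, 3.0, 4.0],
        "enrichment": [1.0, None, None, 4.0],   # corruption introduced here
        "warehouse": [1.0, None, None, 4.0],
    }
    print(first_bad_stage(staged))  # -> "enrichment"
```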

These additional steps are required because the data was assumed to be fine as it passed through all the previous pipeline stages without any essential monitoring and quality checks. Such assumptions can prove to be expensive and time-consuming as end-stage problems are diagnosed and fixed.

As you attempt to identify the source of the problem, the corrupted data already in the warehouse also needs to be handled. This means the data must also be fixed everywhere it’s used downstream, such as analytical reports, data consumers, and ML models. To resolve the problem, the transformations carried out on the data must be reversed, and this is often a nontrivial operation that requires lineage analysis.

One way to write corrected versions of the data back to the warehouse is to retrace the faulty transformation path. To achieve this, you will have to create a reverse transform for each relevant step, then send the data through the corrected transformation steps and save it to the data warehouse. In another scenario, if data from a point before the erroneous step was cached or saved (in a data lake, for instance) and the specific incorrect data points can be identified, you can send that data through the corrected transformations and save the updated values in the data warehouse.
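Here is a minimal sketch of that second scenario: replaying data that was cached before the faulty step through a corrected transformation and overwriting the affected rows in the warehouse. The snapshot loader, the fixed transform, and the key-based upsert are simplified stand-ins for real storage and SQL.

```python
from typing import Dict, List

Row = Dict[str, float]


def load_snapshot_before_faulty_step() -> List[Row]:
    # In practice: read the data-lake copy persisted before the bad transform.
    return [{"id": 1, "amount_usd": 100.0}, {"id": 2, "amount_usd": 250.0}]


def corrected_transform(rows: List[Row], fx_rate: float) -> List[Row]:
    # Suppose the original (buggy) step applied the wrong exchange rate.
    return [{"id": r["id"], "amount_eur": r["amount_usd"] * fx_rate} for r in rows]


def upsert_into_warehouse(table: Dict[int, Row], rows: List[Row]) -> None:
    for row in rows:
        table[int(row["id"])] = row  # replace the contaminated versions by key


if __name__ == "__main__":
    warehouse_table: Dict[int, Row] = {}
    fixed = corrected_transform(load_snapshot_before_faulty_step(), fx_rate=0.92)
    upsert_into_warehouse(warehouse_table, fixed)
    print(warehouse_table)
```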

Shifting Left

The solution to this problem is to change how you build data pipelines and monitoring systems. You should shift your quality practices to earlier stages in the pipeline, such as imposing rules and quality requirements for data as it leaves various services. Deviations will trigger alerts at these earlier stages so that they can be investigated and handled.

This works because when a service processes data that doesn’t satisfy the set standards, that service identifies the problem and alerts the data owners. Allowing for such actions creates a “fail-fast” system in which problems stop a process immediately rather than allowing it to proceed to the next level of operation. You can more easily identify the source of a problem and devise a solution before this vital information gets lost in a sea of transformations, many of which may not be easy to reverse-engineer.
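A minimal sketch of such a fail-fast check placed between stages: each service validates its output against a rule before handing it to the next stage, and raises after alerting instead of letting bad data flow downstream. The validation rule and the alerting function are illustrative assumptions.

```python
from typing import Dict, List


class DataQualityError(Exception):
    pass


def alert_data_owner(stage: str, message: str) -> None:
    print(f"ALERT [{stage}]: {message}")  # stand-in for a real notification channel


def validate_output(stage: str, rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    bad = [r for r in rows if r.get("amount") is None or r["amount"] < 0]
    if bad:
        alert_data_owner(stage, f"{len(bad)} rows violate the amount >= 0 rule")
        raise DataQualityError(f"{stage}: halting pipeline before the next stage")
    return rows


def enrichment_service(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    enriched = [{**r, "amount_with_tax": r["amount"] * 1.2} for r in rows]
    return validate_output("enrichment", enriched)  # fail fast, right here


if __name__ == "__main__":
    try:
        enrichment_service([{"amount": 10.0}, {"amount": -3.0}])
    except DataQualityError as exc:
        print(exc)
```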

If you detect and fix problems earlier in the process, you’ll reduce the amount of damaged data that users interact with, which can help avoid damage to your business’s reputation. You’ll also gain more visibility and insight into the inner workings of your data pipeline. You can use that information to generate analytical dashboards that track performance, which can help you further refine your monitoring strategy or make other business-related decisions.

As seen in the first image below, data monitoring is traditionally performed on the data already persisted in the data warehouse. The defined metrics and data are used to generate analytical dashboards to help other technical and non-technical personnel understand the state of the system.

Pipeline with data monitoring

The second image shows a sample data pipeline in which data monitoring systems and quality checks have been shifted left to the data transformation services in the pipeline. This means you’re not just monitoring the final output but also intermediate stages in the process.

Pipeline with data monitoring shifted left

Conclusion

Data monitoring is a crucial component in ensuring that your data is accurate and free of errors, but if you only implement it on data that’s already in your data warehouse, you may face bigger challenges in detecting and fixing contaminated data.

The best way to address this issue is to shift your data monitoring to earlier stages of your data pipeline. You’ll gain multiple benefits by doing so, including faster detection of issues, faster investigation and resolution, and lower downstream impact. Shifting your data monitoring left will give you more control over your data pipeline and your overall workflow.

Article first posted on Telm.ai by Fortune Adekogbe
