Use Data Contracts To Improve Data Quality And For Better Analytics

A software engineer’s primary job is to build operational systems and keep them running; an analytics team’s needs are rarely on their mind. So software engineers change table schemas without thinking about downstream impacts. When a schema change lands in a transactional system, the ETL pipelines break, and the data team is left to fix a problem caused upstream.

Consider an ML model that predicts user churn. It relies on NPS (Net Promoter Score) values pulled from an upstream table. As part of routine cleanup, the developers delete that field because it looks redundant, the model breaks, and a data scientist scrambles to get it back up. Scenarios like this are common in companies, and they are exactly what data contracts address.

A data contract is an agreement between upstream data providers (software engineers) and downstream data consumers. The contract states what data the consumer ingests and how often, specifies the ingestion method (full or incremental), and identifies the contacts for both the source systems and the data consumers.
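A minimal contract covering the churn example above might look like the following JSON. All field names, teams, and addresses here are illustrative assumptions, not a standard format:

```json
{
  "contract_name": "user_churn_features",
  "provider": { "team": "payments-engineering", "contact": "eng-oncall@example.com" },
  "consumer": { "team": "data-science", "contact": "ds-team@example.com" },
  "ingestion": { "method": "incremental", "frequency": "hourly" },
  "schema": {
    "user_id":    { "type": "string",  "required": true },
    "nps_score":  { "type": "integer", "required": true },
    "signup_date": { "type": "string", "required": false }
  }
}
```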

Data teams typically use ingestion tools such as Fivetran or Stitch to load data into the data warehouse. A data contract lets you put an abstraction layer between the transactional systems and downstream consumers, for example by publishing contract-conforming events to Kafka. With a contract in place, data consumers define the schema they need rather than accepting whatever the transactional systems happen to expose.
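Enforcing the consumer-defined schema at the ingestion boundary can be as simple as validating each record against the contract before it is published. Here is a minimal, stdlib-only sketch; the schema and records are illustrative assumptions, and a production version would publish passing records to a Kafka topic and route failures to a dead-letter queue:

```python
# Consumer-defined schema from the data contract (illustrative assumption).
SCHEMA = {
    "user_id": str,
    "nps_score": int,
}

def validate(record: dict, schema: dict) -> list:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errors

good = {"user_id": "u-42", "nps_score": 9}
bad = {"user_id": "u-43"}  # nps_score was dropped upstream

print(validate(good, SCHEMA))  # → []
print(validate(bad, SCHEMA))   # → ['missing field: nps_score']
```

The key design point is that the schema lives with the consumer's contract, not with the producer's table, so upstream changes surface as explicit violations instead of silent pipeline breakage.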

The data contract can be written in JSON and stored in GitHub. In the JSON, you define the properties of the data and the ingestion frequency. Once the contract is established, you can add a Git hook (or a CI check) that monitors changes to the source system tables, so that both the software engineering and data engineering teams are notified of violations whenever a new pull request is opened.
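The hook's core logic is a diff between the columns the contract expects and the columns the migration leaves behind. A sketch of that check, with the contract and live columns stubbed out as illustrative assumptions:

```python
import json

def contract_violations(contract: dict, table_columns: set) -> list:
    """Compare the columns a contract expects against the live table schema."""
    expected = set(contract["schema"].keys())
    missing = expected - table_columns
    return [f"column removed upstream: {c}" for c in sorted(missing)]

# Contract as it might be loaded from a JSON file in the repo (illustrative).
contract = json.loads('{"schema": {"user_id": {}, "nps_score": {}}}')

# Columns discovered in the source table after a schema migration
# (in a real hook, read these from the migration diff or the database catalog).
live_columns = {"user_id", "signup_date"}

violations = contract_violations(contract, live_columns)
for v in violations:
    print(v)
# A CI check would exit non-zero here, blocking the pull request
# and notifying both teams before the change ships.
```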

Bottom Line

With data contracts, data teams take an active approach: they define the data they need for analytics rather than simply accepting whatever the source systems send. You can create an abstraction between source and consumer systems using tools such as Kafka, define the contract in JSON stored in GitHub, and notify the relevant parties whenever data quality issues arise.
