Data Quality Frameworks

Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines

Source: wshobson/agents

What Is This

Data Quality Frameworks are a set of production-ready patterns and practices for validating data quality in data engineering workflows. This skill uses open-source tools such as Great Expectations and dbt tests, together with data contracts, to automate, monitor, and enforce the quality of data as it flows through pipelines. Adopting these frameworks helps teams keep their data assets reliable and fit for downstream analytics, machine learning, and business intelligence applications.

Data Quality Frameworks are essential for teams that need to define, validate, and monitor data quality across multiple dimensions, including completeness, uniqueness, validity, accuracy, consistency, and timeliness. This skill provides templates and best practices for integrating data quality checks directly into ETL pipelines, establishing clear data contracts between producers and consumers, and automating validation in CI/CD workflows.

Why Use It

Data quality issues are a common cause of broken dashboards, failed machine learning models, and lost business trust. Undetected problems such as missing values, duplicates, or schema drift can propagate through pipelines and affect every downstream consumer. Implementing Data Quality Frameworks offers several benefits:

Early Detection: Identify data issues at ingestion or transformation time, not after they reach critical systems.
Reliability: Ensure data meets agreed-upon quality standards before it is published or consumed.
Automation: Integrate testing and validation into CI/CD so that data quality checks run automatically.
Accountability: Use data contracts to formalize expectations and responsibilities between data producers and consumers.
Observability: Monitor data quality metrics over time and set up timely alerts for anomalies.

Data Quality Frameworks help operationalize trust in data and enable a scalable approach to maintaining high standards across complex pipelines.

How to Use It

1. Define Data Quality Dimensions

Begin by specifying which data quality dimensions are important for your use case. Common dimensions include:

Dimension	Description	Example Check
Completeness	No missing values	`expect_column_values_to_not_be_null`
Uniqueness	No duplicates	`expect_column_values_to_be_unique`
Validity	Values in an expected set or range	`expect_column_values_to_be_in_set`
Accuracy	Data matches reality	Cross-reference with external sources
Consistency	No contradictions	`expect_column_pair_values_A_to_be_greater_than_B`
Timeliness	Data is recent	Check timestamps against current date
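
To make the dimensions concrete, here is a minimal, library-free sketch of four of them (completeness, uniqueness, validity, timeliness) as plain Python functions. The sample rows, column names, and the one-day freshness threshold are all illustrative assumptions, not part of any tool's API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows; field names are illustrative only.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "user_id": 10, "status": "active", "updated_at": now - timedelta(hours=1)},
    {"order_id": 2, "user_id": None, "status": "pending", "updated_at": now - timedelta(hours=2)},
    {"order_id": 2, "user_id": 12, "status": "archived", "updated_at": now - timedelta(days=3)},
]

def completeness(rows, column):
    """Fraction of rows where the column is not None."""
    return sum(r[column] is not None for r in rows) / len(rows)

def duplicates(rows, column):
    """Values that appear more than once in the column."""
    seen, dupes = set(), set()
    for r in rows:
        value = r[column]
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes

def invalid_values(rows, column, allowed):
    """Values that fall outside the allowed set."""
    return {r[column] for r in rows if r[column] not in allowed}

def stale_rows(rows, column, max_age, now):
    """Rows whose timestamp is older than max_age relative to now."""
    return [r for r in rows if now - r[column] > max_age]

report = {
    "completeness_user_id": completeness(rows, "user_id"),
    "duplicate_order_ids": duplicates(rows, "order_id"),
    "invalid_statuses": invalid_values(rows, "status", {"active", "inactive", "pending"}),
    "stale_row_count": len(stale_rows(rows, "updated_at", timedelta(days=1), now)),
}
print(report)
```

Accuracy and consistency are left out because they usually need an external reference or a cross-column rule specific to the dataset.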

2. Implement Validation with Great Expectations

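Great Expectations (GE) is a popular open-source tool for expressing and validating data quality rules as code. The example below uses the legacy pandas-style API (`ge.read_csv`) found in older releases of the library; newer releases use a context-based API instead. Each `expect_*` call runs against the loaded data and returns a validation result: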

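import great_expectations as ge

# Legacy pandas-style API: the returned DataFrame also exposes expect_* methods.
df = ge.read_csv("data/my_table.csv")

# Completeness: every row must have a user_id.
df.expect_column_values_to_not_be_null("user_id")
# Uniqueness: no order_id may appear twice.
df.expect_column_values_to_be_unique("order_id")
# Validity: status must be one of the known values.
df.expect_column_values_to_be_in_set("status", ["active", "inactive", "pending"])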

You can automate GE validation in your data pipeline and fail the pipeline if expectations are not met.
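
One way to fail the pipeline is to collect the outcome of each check and raise if any did not pass. The sketch below is library-free and assumes each result has been reduced to a `(check_name, success)` pair; `DataQualityError`, `enforce`, and the check names are hypothetical, and how you extract the success flag depends on the GE version you run.

```python
class DataQualityError(Exception):
    """Raised when one or more data quality checks fail."""

def enforce(results):
    """Raise DataQualityError naming every failed check; return None otherwise.

    `results` is a list of (check_name, success) pairs. This shape is an
    assumption for illustration, not a Great Expectations data structure.
    """
    failed = [name for name, success in results if not success]
    if failed:
        raise DataQualityError("failed checks: " + ", ".join(failed))

# Hypothetical results collected earlier in the pipeline.
results = [
    ("user_id_not_null", True),
    ("order_id_unique", False),
    ("status_in_set", True),
]

try:
    enforce(results)
    outcome = "passed"
except DataQualityError as exc:
    outcome = str(exc)

print(outcome)
```

Raising an exception, rather than logging and continuing, stops downstream steps from consuming data that failed validation.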

3. Build dbt Test Suites

dbt (data build tool) allows you to define tests as part of your data transformation models. Example schema.yml for dbt:

version: 2

models:
  - name: users
    columns:
      - name: user_id
        tests:
          - not_null
          - unique
      - name: email
        tests:
          - not_null
          - unique
      - name: signup_date
        tests:
          - not_null

Run `dbt test` as part of your deployment pipeline to ensure your data model meets quality criteria before publishing.
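
A dbt test is essentially a query for offending rows, and the test passes when that query finds none. The sketch below mimics the idea behind `not_null` and `unique` with Python's built-in `sqlite3` module. It is an approximation for illustration, not dbt's actual compiled SQL, and the table contents and helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, email TEXT, signup_date TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        (1, "a@example.com", "2024-01-01"),
        (2, "b@example.com", None),          # violates not_null on signup_date
        (3, "a@example.com", "2024-01-03"),  # violates unique on email
    ],
)

# Table and column names are interpolated directly, so they must come from
# trusted configuration, never from user input.
def not_null_failures(conn, table, column):
    """Count rows where the column is NULL."""
    query = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchone()[0]

def unique_failures(conn, table, column):
    """Count distinct non-NULL values that occur more than once."""
    query = (
        f"SELECT COUNT(*) FROM ("
        f"SELECT {column} FROM {table} WHERE {column} IS NOT NULL "
        f"GROUP BY {column} HAVING COUNT(*) > 1)"
    )
    return conn.execute(query).fetchone()[0]

failures = {
    "user_id.not_null": not_null_failures(conn, "users", "user_id"),
    "user_id.unique": unique_failures(conn, "users", "user_id"),
    "email.unique": unique_failures(conn, "users", "email"),
    "signup_date.not_null": not_null_failures(conn, "users", "signup_date"),
}
print(failures)
```

A count of zero corresponds to a passing test; any positive count is a failure.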

4. Establish Data Contracts

A data contract is a formal agreement about the schema, semantics, and quality requirements for a dataset. Data contracts provide:

Schema definitions (fields, types, nullability)
Expected ranges or enumerations for values
Quality thresholds (e.g., 99 percent completeness)
Clear change management for schema evolution

A simple data contract example (YAML):

dataset: users
fields:
  - name: user_id
    type: integer
    required: true
    unique: true
  - name: email
    type: string
    required: true
    unique: true
  - name: status
    type: string
    allowed_values: [active, inactive, pending]

Automated tools can enforce these contracts during pipeline runs.
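
As a rough illustration of what enforcement can look like, the sketch below checks rows against the contract above, written as a Python dict to avoid a YAML dependency. The `validate` function, the type mapping, and the sample rows are hypothetical and do not correspond to any particular contract tool.

```python
CONTRACT = {
    "dataset": "users",
    "fields": [
        {"name": "user_id", "type": "integer", "required": True, "unique": True},
        {"name": "email", "type": "string", "required": True, "unique": True},
        {"name": "status", "type": "string", "allowed_values": ["active", "inactive", "pending"]},
    ],
}

# Assumed mapping from contract type names to Python types.
PYTHON_TYPES = {"integer": int, "string": str}

def validate(rows, contract):
    """Return a list of human-readable contract violations (empty if none)."""
    violations = []
    for field in contract["fields"]:
        name = field["name"]
        seen = set()
        for index, row in enumerate(rows):
            value = row.get(name)
            if value is None:
                # Missing values only matter for required fields.
                if field.get("required"):
                    violations.append(f"row {index}: {name} is required")
                continue
            if not isinstance(value, PYTHON_TYPES[field["type"]]):
                violations.append(f"row {index}: {name} is not {field['type']}")
            allowed = field.get("allowed_values")
            if allowed is not None and value not in allowed:
                violations.append(f"row {index}: {name}={value!r} not allowed")
            if field.get("unique"):
                if value in seen:
                    violations.append(f"row {index}: {name}={value!r} is duplicated")
                seen.add(value)
    return violations

# Hypothetical batch with three violations in the second row.
rows = [
    {"user_id": 1, "email": "a@example.com", "status": "active"},
    {"user_id": 1, "email": None, "status": "deleted"},
]
violations = validate(rows, CONTRACT)
for line in violations:
    print(line)
```

Returning every violation, instead of stopping at the first, gives producers a complete picture of what to fix in a single run.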

5. Monitor and Automate in CI/CD

Integrate data quality checks into your CI/CD process so that every deployment runs validation tests. For example, add dbt and Great Expectations tests as GitHub Actions or other CI jobs. Fail deployments if critical data quality rules are violated.
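
CI systems generally treat a non-zero exit code as a failed job, so a gate script only needs to map check results to an exit code. The sketch below fails on critical checks and merely warns on the rest. The result shape (`name`, `passed`, `severity`) and the check names are assumptions for illustration.

```python
import sys

def exit_code(results):
    """Return 1 if any critical check failed, else 0. Other failures only warn."""
    code = 0
    for check in results:
        if check["passed"]:
            continue
        if check["severity"] == "critical":
            print(f"ERROR: {check['name']} failed", file=sys.stderr)
            code = 1
        else:
            print(f"WARNING: {check['name']} failed", file=sys.stderr)
    return code

# Hypothetical results gathered from earlier validation steps.
results = [
    {"name": "users.user_id.not_null", "passed": True, "severity": "critical"},
    {"name": "users.email.unique", "passed": False, "severity": "critical"},
    {"name": "users.signup_date.freshness", "passed": False, "severity": "warning"},
]

code = exit_code(results)
print(code)
# In a real CI job the script would end with: sys.exit(code)
```

Separating critical checks from warnings keeps deployments from being blocked by rules that are still being tuned.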

When to Use It

When building or refactoring data pipelines that require high trust and reliability.
When onboarding new data sources whose incoming data quality needs to be validated.
When establishing clear interfaces between data engineering and analytics or data science teams.
When automating data quality validation as part of CI/CD workflows.
When you need to provide SLAs or data quality guarantees to downstream consumers.
When monitoring and alerting on key data quality metrics is required.

Important Notes

Data quality frameworks are not a one-time setup; they require ongoing maintenance as data evolves.
Start with critical tables and columns, then expand coverage over time.
Combine automated tests with periodic manual reviews for edge cases.
Communicate data quality expectations clearly across teams using data contracts.
Monitor alerts and track historical data quality trends to identify recurring issues.
Integrate with observability platforms to connect data quality events to overall pipeline health.
Regularly review and update data contracts and test suites as business requirements change.

By combining Great Expectations, dbt tests, and data contracts, teams can systematically check that their data is accurate, reliable, and ready for use in production systems.
