Data Quality Frameworks
Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines
What Is This
Data Quality Frameworks are a set of production-ready patterns and practices for implementing data quality validation in modern data engineering workflows. This skill focuses on using the open-source tools Great Expectations and dbt tests, together with data contracts, to automate, monitor, and enforce the quality of data as it flows through pipelines. By adopting these frameworks, teams help ensure that their data assets are reliable and fit for downstream analytics, machine learning, and business intelligence applications.
Data Quality Frameworks are essential for teams that need to define, validate, and monitor data quality across multiple dimensions, including completeness, uniqueness, validity, accuracy, consistency, and timeliness. This skill provides templates and best practices for integrating data quality checks directly into ETL pipelines, establishing clear data contracts between producers and consumers, and automating validation in CI/CD workflows.
Why Use It
Data quality issues are a leading cause of broken dashboards, failed machine learning models, and lost business trust. Undetected problems such as missing values, duplicates, or schema drift can propagate through pipelines and create significant downstream impact. By implementing Data Quality Frameworks, organizations benefit in several ways:
- Early Detection: Identify data issues at ingestion or transformation time, not after they reach critical systems.
- Reliability: Ensure data meets agreed-upon quality standards before it is published or consumed.
- Automation: Integrate testing and validation into CI/CD so that data quality checks run automatically.
- Accountability: Use data contracts to formalize expectations and responsibilities between data producers and consumers.
- Observability: Monitor data quality metrics over time and set up timely alerts for anomalies.
Data Quality Frameworks help operationalize trust in data and enable a scalable approach to maintaining high standards across complex pipelines.
How to Use It
1. Define Data Quality Dimensions
Begin by specifying which data quality dimensions are important for your use case. Common dimensions include:
| Dimension | Description | Example Check |
|---|---|---|
| Completeness | No missing values | expect_column_values_to_not_be_null |
| Uniqueness | No duplicates | expect_column_values_to_be_unique |
| Validity | Values in an expected set or range | expect_column_values_to_be_in_set |
| Accuracy | Data matches reality | Cross-reference with external sources |
| Consistency | No contradictions | expect_column_pair_values_A_to_be_greater_than_B |
| Timeliness | Data is recent | Check timestamps against current date |
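To make the dimensions concrete, here is a minimal plain-Python sketch of four of them (completeness, uniqueness, validity, timeliness). It uses only the standard library rather than Great Expectations; the sample rows, field names, and helper names are all hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sample rows; field names are illustrative only.
rows = [
    {"order_id": 1, "status": "active",
     "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
    {"order_id": 2, "status": "pending",
     "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)},
    {"order_id": 2, "status": "unknown", "updated_at": None},
]


def completeness(rows, column):
    """Completeness: fraction of rows where the column is not None."""
    if not rows:
        return 1.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def duplicates(rows, column):
    """Uniqueness: non-null values that occur more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value is None:
            continue
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return dupes


def invalid_values(rows, column, allowed):
    """Validity: non-null values that fall outside the allowed set."""
    return [r[column] for r in rows
            if r.get(column) is not None and r[column] not in allowed]


def is_fresh(rows, column, now, max_age):
    """Timeliness: True if the newest timestamp is within max_age of now."""
    stamps = [r[column] for r in rows if r.get(column) is not None]
    return bool(stamps) and now - max(stamps) <= max_age
```

Accuracy and consistency are left out because they usually need an external reference or a rule spanning several columns, which depends on the dataset.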
2. Implement Validation with Great Expectations
Great Expectations (GE) is a popular open-source tool for expressing and validating data quality rules as code. Here’s an example of running expectations with its legacy pandas-style API (newer releases replace it with a context-based API):

    import great_expectations as ge

    # Load the CSV into a GE-wrapped DataFrame (legacy API).
    df = ge.read_csv("data/my_table.csv")
    # Each call validates immediately and returns a result with a success flag.
    df.expect_column_values_to_not_be_null("user_id")
    df.expect_column_values_to_be_unique("order_id")
    df.expect_column_values_to_be_in_set("status", ["active", "inactive", "pending"])

You can automate GE validation in your data pipeline and fail the pipeline if expectations are not met.
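One way to fail a pipeline step on unmet expectations is a small gate that raises before anything is published. The sketch below is library-agnostic: it assumes each validation result has already been reduced to a boolean keyed by check name, and `DataQualityError`, `enforce`, and `run_step` are hypothetical names, not part of Great Expectations.

```python
class DataQualityError(Exception):
    """Raised when one or more checks fail (hypothetical name)."""


def enforce(results):
    """Raise DataQualityError if any check in {name: passed} failed."""
    failed = sorted(name for name, passed in results.items() if not passed)
    if failed:
        raise DataQualityError("failed checks: " + ", ".join(failed))


def run_step(results, publish):
    """Call publish() only when every check passed."""
    enforce(results)
    return publish()
```

Raising an exception, rather than returning a flag, means an orchestrator that treats uncaught exceptions as task failures stops the downstream steps without extra wiring.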
3. Build dbt Test Suites
dbt (data build tool) allows you to define tests as part of your data transformation models. Example schema.yml for dbt:

    version: 2
    models:
      - name: users
        columns:
          - name: user_id
            tests:
              - not_null
              - unique
          - name: email
            tests:
              - not_null
              - unique
          - name: signup_date
            tests:
              - not_null

Run dbt test as part of your deployment pipeline to ensure your data model meets quality criteria before publishing.
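If the deployment pipeline is driven from Python, the dbt step could be wrapped roughly as follows. This is a sketch that assumes the `dbt` CLI is on the PATH and exits non-zero when a test fails; `dbt_test_command` and `run_dbt_tests` are hypothetical helper names, and the `runner` parameter exists only so the wrapper can be exercised without dbt installed.

```python
import subprocess


def dbt_test_command(models=None):
    """Build the argv for `dbt test`, optionally limited with --select."""
    cmd = ["dbt", "test"]
    if models:
        cmd += ["--select", *models]
    return cmd


def run_dbt_tests(models=None, runner=subprocess.run):
    """Run dbt tests; return True when the process exits with code 0."""
    completed = runner(dbt_test_command(models))
    return completed.returncode == 0
```

Calling `run_dbt_tests(["users"])` would run only the tests attached to the `users` model from the schema.yml above.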
4. Establish Data Contracts
A data contract is a formal agreement about the schema, semantics, and quality requirements for a dataset. Data contracts provide:
- Schema definitions (fields, types, nullability)
- Expected ranges or enumerations for values
- Quality thresholds (e.g., 99 percent completeness)
- Clear change management for schema evolution
A simple data contract example (YAML):

    dataset: users
    fields:
      - name: user_id
        type: integer
        required: true
        unique: true
      - name: email
        type: string
        required: true
        unique: true
      - name: status
        type: string
        allowed_values: [active, inactive, pending]

Automated tools can enforce these contracts during pipeline runs.
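As a rough illustration of what enforcement can look like, the sketch below checks a batch of records against the `users` contract. The contract is written as a Python dict mirroring the YAML so the example needs no YAML parser; the `validate` function, the `PY_TYPES` mapping, and the violation message format are assumptions made for this example, not a standard contract format or tool.

```python
# The users contract from the YAML example, expressed as a dict.
CONTRACT = {
    "dataset": "users",
    "fields": [
        {"name": "user_id", "type": "integer", "required": True, "unique": True},
        {"name": "email", "type": "string", "required": True, "unique": True},
        {"name": "status", "type": "string",
         "allowed_values": ["active", "inactive", "pending"]},
    ],
}

# Hypothetical mapping from contract type names to Python types.
PY_TYPES = {"integer": int, "string": str}


def validate(records, contract=CONTRACT):
    """Return a list of violation messages (empty means compliant)."""
    violations = []
    for field in contract["fields"]:
        name = field["name"]
        expected = PY_TYPES[field["type"]]
        seen = set()
        for i, rec in enumerate(records):
            value = rec.get(name)
            if value is None:
                if field.get("required"):
                    violations.append(f"row {i}: {name} is required")
                continue
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                violations.append(f"row {i}: {name} is not {field['type']}")
                continue
            allowed = field.get("allowed_values")
            if allowed is not None and value not in allowed:
                violations.append(f"row {i}: {name}={value!r} not allowed")
            if field.get("unique"):
                if value in seen:
                    violations.append(f"row {i}: {name}={value!r} is duplicated")
                seen.add(value)
    return violations
```

Returning every violation, instead of stopping at the first, gives producers a complete report for a batch in a single run.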
5. Monitor and Automate in CI/CD
Integrate data quality checks into your CI/CD process so that every deployment runs validation tests. For example, add dbt and Great Expectations tests as GitHub Actions or other CI jobs. Fail deployments if critical data quality rules are violated.
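The "fail only on critical rules" policy can be sketched as a function that turns check results into a process exit code. The `(name, severity, passed)` tuple shape, the severity labels, and the function names are assumptions for illustration; a real job would build the results from its dbt and Great Expectations runs.

```python
def summarize(results):
    """Split failed checks into (critical, warnings) name lists."""
    critical = [name for name, severity, passed in results
                if not passed and severity == "critical"]
    warnings = [name for name, severity, passed in results
                if not passed and severity != "critical"]
    return critical, warnings


def ci_exit_code(results):
    """Print a short report; return 1 if any critical check failed, else 0."""
    critical, warnings = summarize(results)
    for name in warnings:
        print(f"WARN  {name}")
    for name in critical:
        print(f"FAIL  {name}")
    return 1 if critical else 0
```

A CI job would end with `sys.exit(ci_exit_code(results))`, since most CI systems mark a step as failed when its process exits with a non-zero code.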
When to Use It
- When building or refactoring data pipelines that require high trust and reliability.
- When onboarding new data sources whose incoming data quality needs to be validated.
- When establishing clear interfaces between data engineering and analytics or data science teams.
- When automating data quality validation as part of CI/CD workflows.
- When you need to provide SLAs or data quality guarantees to downstream consumers.
- When monitoring and alerting on key data quality metrics is required.
Important Notes
- Data quality frameworks are not a one-time setup; they require ongoing maintenance as data evolves.
- Start with critical tables and columns, then expand coverage over time.
- Combine automated tests with periodic manual reviews for edge cases.
- Communicate data quality expectations clearly across teams using data contracts.
- Monitor alerts and track historical data quality trends to identify recurring issues.
- Integrate with observability platforms to connect data quality events to overall pipeline health.
- Regularly review and update data contracts and test suites as business requirements change.
By implementing Data Quality Frameworks with tools like Great Expectations, dbt, and data contracts, teams can systematically ensure that their data is accurate, reliable, and ready for use in production systems.