Data Quality Auditor

Audit datasets for completeness, consistency, accuracy, and validity. Profile data distributions, detect anomalies and outliers, surface structural issues, and produce an actionable remediation plan.

What Is Data Quality Auditor?

The Data Quality Auditor is a productivity-focused skill for systematically assessing the health of your datasets. It supports both exploratory and targeted investigations: data engineers, analysts, and scientists can use it to audit datasets for completeness, consistency, accuracy, and validity. It profiles data distributions, detects anomalies and outliers, surfaces structural issues, and produces an actionable remediation plan. Integrating it into your data workflows helps keep downstream analytics, models, and business decisions grounded in trustworthy data.

The skill is open-source and available at https://github.com/alirezarezvani/claude-skills/tree/main/engineering/data-quality-auditor.

Why Use Data Quality Auditor?

Poor data quality can silently undermine even the most sophisticated analytics and machine learning initiatives. Issues like missing values, inconsistent types, duplicate records, and undetected outliers can lead to misleading insights or model degradation. Manual data checks are slow, error-prone, and often miss subtle issues that only robust profiling can uncover.

The Data Quality Auditor addresses these challenges by:

  • Automating Data Audits: Quickly surface hidden issues across multiple dimensions of data quality.
  • Providing Quantitative Scores: Assign a Data Quality Score (DQS) to objectively measure and track dataset health.
  • Generating Actionable Remediation Plans: Move beyond diagnostics to prescribe prioritized, practical fixes.
  • Supporting Both Full and Targeted Audits: Adapt to new datasets or investigate specific columns or pipeline stages with equal rigor.
  • Reducing Downstream Risks: Ensure that models and dashboards are built on reliable, clean data.

By integrating this tool into ETL processes or ad-hoc analyses, organizations can catch data quality issues before they affect critical business outcomes.

How to Get Started

Getting started with the Data Quality Auditor is straightforward. The repository provides Python scripts for various aspects of data quality assessment. Here’s a quick guide:

  1. Clone the Repository:

    git clone https://github.com/alirezarezvani/claude-skills.git
    cd claude-skills/engineering/data-quality-auditor
  2. Install Dependencies: Make sure you have Python 3.8+ and install required packages:

    pip install -r requirements.txt
  3. Run a Full Audit on a New Dataset: Suppose you have a CSV dataset named data.csv.

    python data_profiler.py --input data.csv
    python missing_value_analyzer.py --input data.csv
    python outlier_detector.py --input data.csv
    # Run cross-column checks and scoring as per the documentation
  4. Perform a Targeted Scan: If you suspect issues in a specific column, say age, you can run:

    python outlier_detector.py --input data.csv --columns age
  5. Review Reports and Remediation Plans: Each script outputs findings and suggested next steps. Integrate these into your data remediation workflow.
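
The full-audit and targeted-scan steps above can also be driven from a single Python entry point. The following is a minimal sketch, not part of the repository: build_audit_commands and run_audit are hypothetical helpers, and the script names and the --input and --columns flags are simply the ones shown in the steps above, so confirm them against the repository documentation first.

```python
import subprocess
import sys

# Script names and flags are copied from the steps above (assumed to be accurate).
AUDIT_SCRIPTS = [
    "data_profiler.py",
    "missing_value_analyzer.py",
    "outlier_detector.py",
]


def build_audit_commands(input_path, column=None):
    """Return one argv list per audit script (hypothetical helper)."""
    commands = []
    for script in AUDIT_SCRIPTS:
        cmd = [sys.executable, script, "--input", input_path]
        # --columns is only shown for outlier_detector.py in the steps above.
        if column and script == "outlier_detector.py":
            cmd += ["--columns", column]
        commands.append(cmd)
    return commands


def run_audit(input_path, column=None):
    """Run each script in order; stop at the first non-zero exit code."""
    for cmd in build_audit_commands(input_path, column):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


# Dry run: print the commands without executing anything.
for cmd in build_audit_commands("data.csv", column="age"):
    print(" ".join(cmd[1:]))
```

Calling run_audit("data.csv") from inside the data-quality-auditor directory would then execute the three scripts in sequence and halt as soon as one of them fails.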

Key Features

The Data Quality Auditor skill provides a comprehensive set of features:

  • Profiling: Uses data_profiler.py to summarize dataset shape, data types, completeness, and statistical distributions. Example:

    # data_profiler.py snippet
    import pandas as pd
    
    df = pd.read_csv('data.csv')
    df.info()  # prints shape, dtypes, and non-null counts itself (returns None)
    print(df.describe(include='all'))  # summary statistics for every column
  • Missing Value Analysis: missing_value_analyzer.py classifies missingness patterns as MCAR (Missing Completely at Random), MAR (Missing at Random), or MNAR (Missing Not at Random) to inform the choice of imputation strategy.

  • Outlier Detection: outlier_detector.py flags anomalies using IQR and Z-score methods. For example:

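    # outlier_detector.py snippet
    import numpy as np
    import pandas as pd
    
    df = pd.read_csv('data.csv')
    col = df['age']
    # Z-score: distance from the mean in standard deviations (NaN rows are never flagged)
    z_scores = np.abs((col - col.mean()) / col.std())
    outliers = df[z_scores > 3]  # common rule of thumb: |z| > 3
    print(outliers)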
  • Cross-column Integrity Checks: Inspects referential integrity, detects duplicate rows, and validates logical constraints (e.g., start_date < end_date).

  • Data Quality Scoring and Reporting: Aggregates findings into a Data Quality Score (DQS) and generates a prioritized remediation plan.
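
The features above mention the IQR method and cross-column integrity checks without showing them. The sketch below illustrates both with plain pandas. It is not taken from the repository scripts: the helper names, the 1.5 × IQR fence, and the orders and customers tables are assumptions made for the example.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


def date_order_violations(df, start="start_date", end="end_date"):
    """Rows where start is not strictly before end (missing dates are flagged too)."""
    starts = pd.to_datetime(df[start])
    ends = pd.to_datetime(df[end])
    return df[~(starts < ends)]


def orphan_keys(child, parent, key):
    """Child rows whose key has no match in the parent table."""
    return child[~child[key].isin(parent[key])]


# Hypothetical sample data for illustration only.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 9],
    "amount": [20.0, 22.0, 22.0, 500.0],
    "start_date": ["2024-01-01", "2024-02-01", "2024-02-01", "2024-03-10"],
    "end_date": ["2024-01-05", "2024-02-03", "2024-02-03", "2024-03-01"],
})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

print(orders[iqr_outliers(orders["amount"])])          # IQR outliers
print(orders[orders.duplicated(keep=False)])           # fully duplicated rows
print(date_order_violations(orders))                   # start_date >= end_date
print(orphan_keys(orders, customers, "customer_id"))   # broken references
```

Unlike the Z-score, the IQR fence is built from quartiles, so a single extreme value does not inflate the threshold it is tested against. This makes the two methods useful as a cross-check on small or skewed columns.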

Best Practices

To maximize the value of the Data Quality Auditor, consider these best practices:

  • Automate Regular Audits: Integrate the auditor into your ETL pipelines to catch issues early, before they propagate.
  • Baseline and Monitor: Establish a data quality baseline and monitor for deviations over time using DQS.
  • Iterate on Remediation: Use the actionable recommendations to drive continuous improvement—fix the most impactful issues first.
  • Document Data Quality: Always attach audit reports and remediation plans to datasets for transparency.
  • Investigate Root Causes: For issues flagged by targeted scans, trace problems upstream to prevent recurrence.
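
Baselining and monitoring, as recommended above, amount to computing a score per dataset snapshot and comparing it with a stored reference value. The following sketch shows the idea under loudly stated assumptions: the two-component formula, the equal weights, and the 2-point tolerance are hypothetical and are not the skill's actual DQS definition.

```python
import pandas as pd

# Hypothetical weights; the skill's real scoring formula may differ.
WEIGHTS = {"completeness": 0.5, "uniqueness": 0.5}


def quality_score(df):
    """Illustrative 0-100 score built from completeness and row uniqueness."""
    if df.empty:
        return 0.0
    completeness = 1.0 - df.isna().to_numpy().mean()  # share of non-null cells
    uniqueness = 1.0 - df.duplicated().mean()          # share of non-duplicate rows
    score = (WEIGHTS["completeness"] * completeness
             + WEIGHTS["uniqueness"] * uniqueness)
    return round(float(100 * score), 1)


def regressed(current, baseline, tolerance=2.0):
    """True when the score fell more than `tolerance` points below the baseline."""
    return current < baseline - tolerance


# Hypothetical snapshots of the same table at two points in time.
baseline = pd.DataFrame({"id": [1, 2, 3, 4, 5], "age": [30, 41, 25, 37, 52]})
current = pd.DataFrame({"id": [1, 2, 2, 4, 5], "age": [30, 41, 41, None, 52]})

base_score = quality_score(baseline)
curr_score = quality_score(current)
print(base_score, curr_score, regressed(curr_score, base_score))
```

In a pipeline, the baseline score would typically be persisted alongside the dataset, and a regression would fail the run or raise an alert instead of printing.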

Important Notes

  • Scalability: For very large datasets, consider sampling or running checks in distributed environments to maintain performance.
  • Interpretation of DQS: The Data Quality Score is a composite metric; always review underlying issues and context before making decisions.
  • Customization: Scripts can be extended for domain-specific rules (e.g., custom validation logic).
  • Data Privacy: Ensure compliance with data privacy regulations when profiling sensitive datasets.
  • Skill Limitations: Automated detection catches many, but not all, data issues. Manual review remains essential for edge cases and business-specific checks.
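
For the scalability note above, one option that needs no distributed environment is to stream a large CSV in chunks and aggregate as you go. This is a minimal sketch using the chunksize parameter of pandas read_csv. The helper name null_rates_chunked and the inline sample data are assumptions for illustration, and the repository scripts may handle large files differently.

```python
import io

import pandas as pd


def null_rates_chunked(source, chunksize=100_000):
    """Per-column share of missing values, computed one chunk at a time."""
    nulls, rows = None, 0
    for chunk in pd.read_csv(source, chunksize=chunksize):
        counts = chunk.isna().sum()  # missing cells per column in this chunk
        nulls = counts if nulls is None else nulls + counts
        rows += len(chunk)
    if rows == 0:
        return pd.Series(dtype=float)
    return nulls / rows


# Tiny in-memory CSV standing in for a file too large to load at once.
csv_text = "id,age\n1,30\n2,\n3,41\n4,\n"
rates = null_rates_chunked(io.StringIO(csv_text), chunksize=2)
print(rates)
```

Counts and sums aggregate cleanly across chunks. Statistics such as quantiles do not, which is where sampling becomes the more practical choice.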

For more details, usage documentation, and contributions, visit the official repository. By systematically auditing and improving your data, you can safeguard the integrity of all downstream analytics and machine learning initiatives.