Hypothesis Generation
Accelerate scientific discovery by automating hypothesis generation and experimental data integration
Hypothesis Generation is a community skill for systematically formulating research hypotheses from data analysis, covering statistical pattern detection, anomaly identification, causal inference framing, experiment design, and research question development for data-driven research.
What Is This?
Overview
Hypothesis Generation provides patterns for transforming exploratory data analysis findings into formal research hypotheses. It covers statistical pattern detection (identifying significant trends and associations in datasets), anomaly identification (flagging unexpected observations worth investigating), causal inference framing (structuring hypotheses as testable cause-effect propositions), experiment design templates (specifying variables, controls, and expected outcomes), and effect size estimation for power analysis and sample size planning. The skill enables researchers to move systematically from data exploration to formal hypothesis testing.
Who Should Use This
This skill serves researchers translating exploratory findings into formal testable hypotheses, data scientists identifying patterns that warrant further investigation, and teams designing experiments to validate data-driven observations.
Why Use It?
Problems It Solves
Exploratory analysis produces many patterns but lacks a systematic framework for prioritizing which to investigate further. Converting observed correlations into causal hypotheses requires careful framing to avoid logical errors. Planning experiments to test hypotheses needs power calculations that estimate required sample sizes. Documenting the reasoning chain from data to hypothesis ensures reproducibility.
Core Highlights
Pattern detector identifies statistically significant trends and correlations in datasets. Anomaly flagger highlights observations that deviate from expected distributions. Hypothesis formatter structures findings into null and alternative hypothesis pairs. Power calculator estimates sample sizes needed to detect specified effect sizes.
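The anomaly flagger mentioned above is not implemented in the code samples below; a minimal z-score sketch of the idea (the threshold value and function name are illustrative assumptions, not part of the skill):

```python
import numpy as np

def flag_anomalies(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Return indices of observations more than `threshold` sample
    standard deviations from the mean (a simple z-score rule)."""
    z = (values - values.mean()) / values.std(ddof=1)
    return np.flatnonzero(np.abs(z) > threshold)

data = np.array([1.0, 1.2, 0.9, 1.1, 8.5, 1.0])
# With few observations an outlier inflates the standard deviation,
# so a looser threshold is used here.
print(flag_anomalies(data, threshold=2.0))  # flags index 4 (the 8.5)
```

Flagged observations are candidates for investigation, not errors; whether a deviation is hypothesis-worthy remains a domain judgment.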
How to Use It?
Basic Usage
```python
from dataclasses import dataclass

import numpy as np
from scipy import stats


@dataclass
class HypothesisPair:
    null: str
    alternative: str
    variable: str
    effect_size: float = 0.0
    p_value: float = 1.0


class PatternDetector:
    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha

    def test_correlation(
        self, x: np.ndarray, y: np.ndarray,
        x_name: str, y_name: str,
    ) -> HypothesisPair:
        r, p = stats.pearsonr(x, y)
        return HypothesisPair(
            null=f"No linear relationship between {x_name} and {y_name}",
            alternative=f"{x_name} is correlated with {y_name}",
            variable=f"{x_name}_vs_{y_name}",
            effect_size=round(r, 4),
            p_value=round(p, 6),
        )

    def test_group_difference(
        self, group_a: np.ndarray, group_b: np.ndarray, name: str,
    ) -> HypothesisPair:
        t_stat, p = stats.ttest_ind(group_a, group_b)
        # Cohen's d using the pooled *sample* standard deviation (ddof=1).
        pooled_sd = np.sqrt(
            (group_a.std(ddof=1) ** 2 + group_b.std(ddof=1) ** 2) / 2
        )
        cohen_d = (group_a.mean() - group_b.mean()) / pooled_sd
        return HypothesisPair(
            null=f"No difference in {name} between groups",
            alternative=f"Groups differ in {name}",
            variable=name,
            effect_size=round(cohen_d, 4),
            p_value=round(p, 6),
        )
```
Real-World Examples
```python
import numpy as np
from scipy.stats import norm


class ResearchDesigner:
    def power_analysis(
        self, effect_size: float,
        alpha: float = 0.05,
        power: float = 0.8,
    ) -> dict:
        # Per-group sample size for a two-sided, two-sample comparison:
        # n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / d) ** 2
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
        return {
            "effect_size": effect_size,
            "alpha": alpha,
            "power": power,
            "sample_size_per_group": int(np.ceil(n)),
        }

    def prioritize_hypotheses(
        self, hypotheses: list[HypothesisPair],
    ) -> list[dict]:
        scored = []
        for h in hypotheses:
            # Fall back to a medium effect (0.5) when none was estimated.
            feasibility = self.power_analysis(
                abs(h.effect_size) if h.effect_size != 0 else 0.5
            )
            scored.append({
                "hypothesis": h.alternative,
                "effect_size": h.effect_size,
                "p_value": h.p_value,
                "sample_needed": feasibility["sample_size_per_group"],
            })
        return sorted(
            scored,
            key=lambda x: abs(x["effect_size"]),
            reverse=True,
        )


designer = ResearchDesigner()
power = designer.power_analysis(effect_size=0.5)
print(f"Need {power['sample_size_per_group']} per group")
```
Advanced Tips
Apply multiple testing correction when screening many variables to control false discovery rates. Document the exploratory analysis that led to each hypothesis for transparent reporting. Use effect size rather than p-value alone to prioritize hypotheses by practical significance.
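One way to apply the multiple-testing advice above is the Benjamini-Hochberg step-up procedure; a self-contained sketch, not part of the skill's own code (the p-values are illustrative):

```python
import numpy as np

def benjamini_hochberg(p_values: list[float], fdr: float = 0.05) -> list[int]:
    """Return indices of hypotheses rejected while controlling the
    false discovery rate with the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                 # ascending p-values
    ranked = p[order]
    # Find the largest k with p_(k) <= (k/m) * fdr; reject ranks 1..k.
    below = ranked <= (np.arange(1, m + 1) / m) * fdr
    if not below.any():
        return []
    k = int(np.max(np.flatnonzero(below)))
    return sorted(order[: k + 1].tolist())

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(p_vals, fdr=0.05))  # → [0, 1]
```

Note that 0.039 is below the uncorrected 0.05 threshold yet is not rejected here, which is exactly the screening discipline the tip calls for.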
When to Use It?
Use Cases
Build an automated hypothesis screening tool that tests all pairwise variable relationships in a dataset. Create an experiment planner that computes sample sizes for testing observed effects. Implement a research prioritization dashboard that ranks hypotheses by effect size and feasibility.
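The first use case, screening all pairwise variable relationships, could be sketched as a loop over column pairs; the variable names and synthetic data here are illustrative assumptions:

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
data = {
    "dose": rng.normal(size=100),
    "age": rng.normal(size=100),
}
# "response" is constructed to correlate with "dose".
data["response"] = 0.7 * data["dose"] + rng.normal(scale=0.5, size=100)

results = []
for a, b in combinations(data, 2):
    r, p = stats.pearsonr(data[a], data[b])
    results.append({"pair": f"{a}_vs_{b}", "r": round(r, 3), "p": round(p, 5)})

# Rank candidate hypotheses by absolute effect size, as the skill recommends.
results.sort(key=lambda row: abs(row["r"]), reverse=True)
for row in results:
    print(row["pair"], row["r"], row["p"])
```

In a real screening tool the resulting p-values should then go through a multiple-testing correction before any pair is promoted to a formal hypothesis.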
Related Topics
Statistical hypothesis testing, experimental design, power analysis, exploratory data analysis, and research methodology.
Important Notes
Requirements
Python with NumPy and SciPy for statistical calculations. Pandas for data manipulation and pattern detection. Understanding of statistical testing principles for valid hypothesis formulation.
Usage Recommendations
Do: clearly distinguish between exploratory findings and confirmatory hypotheses. Report effect sizes alongside p-values for meaningful interpretation. Pre-register hypotheses before conducting confirmatory experiments.
Don't: treat exploratory correlations as confirmed causal relationships without further testing. Perform multiple hypothesis tests without correcting for family-wise error rate. Design experiments with insufficient power to detect the expected effect.
Limitations
Statistical significance does not imply practical importance or causation. Power calculations require effect size estimates that may be uncertain from exploratory data. Automated hypothesis generation can produce many candidates that require domain expertise to evaluate.