An open-source, multi-agent research platform that surfaces hidden correlations between health outcomes and environmental, occupational, and geographic factors. Integrates public datasets, scientific literature, and structured causal analysis to identify leads that warrant formal investigation.
The Unlikely Correlations Engine
Systematic detection of hidden health-environment associations
Epidemiological research identified elevated Parkinson's disease rates among populations living near golf courses. Investigation revealed the underlying factor was chronic pesticide exposure from course maintenance. Residential proximity to treated land — not the recreational facility itself — was the relevant variable. The association was detectable in existing public datasets, but required cross-referencing sources that are not typically analyzed together.
This project applies the same cross-referencing approach systematically. The platform correlates health outcome data with geographic, occupational, environmental, and demographic variables at the US county level. When it detects a statistically significant association, the platform searches the published literature for supporting or contradicting evidence, evaluates causal plausibility, and generates structured study proposals for further investigation.
Correlation Detection
Cross-reference health outcomes with geographic, occupational, environmental, and demographic data at the US county level. Compute statistical correlations, run sweep mode across all available variables, and identify spatial clusters using local indicators of spatial association (LISA), a spatial autocorrelation method.
- CDC Wonder — mortality by ICD-10 code
- CMS Chronic Conditions — disease prevalence by county
- CDC PLACES — chronic disease prevalence estimates
- County Health Rankings — composite health indicators
- EPA — pesticide use, toxic release inventory, Superfund sites, EJScreen
- EPA AQS — air quality monitoring (PM2.5, ozone, HAPs)
- USGS — land use, water quality, pesticide application estimates
- USDA CropScape — crop-specific land cover by county
- OpenStreetMap — golf courses, industrial sites, facility locations
- Census ACS — demographic and socioeconomic confounders
- CDC Social Vulnerability Index
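A sweep across many variables runs many hypothesis tests, so some false positives are expected by chance. The following is a minimal sketch of what a sweep could look like, not the platform's actual implementation. It assumes a county-level table with one row per county, uses Spearman rank correlation, and applies a Benjamini-Hochberg adjustment; the function name, column names, and the choice of both methods are assumptions made for illustration, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sweep_correlations(df, outcome, alpha=0.05):
    """Correlate `outcome` with every other numeric column (hypothetical helper).

    Returns one row per variable with Spearman rho, raw p-value, and a
    Benjamini-Hochberg adjusted q-value.
    """
    rows = []
    for col in df.select_dtypes("number").columns:
        if col == outcome:
            continue
        pair = df[[outcome, col]].dropna()
        if len(pair) < 3:  # too few counties to correlate
            continue
        rho, p = stats.spearmanr(pair[outcome], pair[col])
        rows.append({"variable": col, "rho": rho, "p": p, "n": len(pair)})
    if not rows:
        return pd.DataFrame(columns=["variable", "rho", "p", "n", "q", "significant"])

    out = pd.DataFrame(rows).sort_values("p").reset_index(drop=True)
    m = len(out)
    # Benjamini-Hochberg: q_i = min over j >= i of (p_j * m / j), capped at 1
    scaled = out["p"].to_numpy() * m / np.arange(1, m + 1)
    out["q"] = np.minimum.accumulate(scaled[::-1])[::-1].clip(max=1.0)
    out["significant"] = out["q"] < alpha
    return out


# Synthetic counties: the outcome is built to depend on one variable only.
rng = np.random.default_rng(0)
exposure = rng.normal(size=200)
counties = pd.DataFrame({
    "outcome_rate": 0.6 * exposure + rng.normal(scale=0.5, size=200),
    "exposure_index": exposure,
    "unrelated_index": rng.normal(size=200),
})
sweep_result = sweep_correlations(counties, "outcome_rate")
```

Ranking by adjusted q-value rather than raw p-value keeps the list of leads from growing simply because more variables were added to the sweep.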
Literature Mining
Search PubMed, Semantic Scholar, and other academic databases for published evidence related to detected correlations. Synthesize findings across studies, identify contradictions, and flag evidence gaps. Co-designed with a domain expert.
- PubMed / NCBI E-utilities
- Semantic Scholar API
- openFDA FAERS — adverse event reports
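As an illustration of how a detected correlation might be turned into a literature query, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint. The helper name and the query shape (both terms restricted to title and abstract) are assumptions; the snippet only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_pubmed_search_url(exposure, outcome, max_results=20):
    """Build an ESearch URL for papers mentioning both terms (hypothetical helper)."""
    term = f'("{exposure}"[Title/Abstract]) AND ("{outcome}"[Title/Abstract])'
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the boolean query
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of returned IDs
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


search_url = build_pubmed_search_url("pesticide exposure", "Parkinson disease")
```

The response to such a request contains PubMed IDs, which a later step could pass to other E-utilities endpoints to retrieve abstracts.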
Causation Analysis
Evaluate detected associations against Bradford Hill criteria. Assess confounders via partial correlation, generate alternative explanations (ecological fallacy, selection bias, shared upstream cause), and produce an overall causal plausibility rating. Co-designed with a domain expert.
- Correlation and literature results from upstream agents
- Bradford Hill criteria framework
- Confounder adjustment via partial correlation
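Partial correlation can be computed by regressing both variables on the confounders and correlating the residuals. The sketch below shows that residual approach with NumPy; the function name and the synthetic data are assumptions for illustration, and the project may compute it differently.

```python
import numpy as np


def partial_correlation(x, y, confounders):
    """Correlation between x and y after linearly removing the confounders.

    `confounders` may be a 1-D array (one confounder) or an (n, k) array.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix: intercept column plus one column per confounder
    Z = np.column_stack([np.ones(len(x)), np.asarray(confounders, dtype=float)])
    # Residuals of x and y after least-squares regression on Z
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


# Synthetic example: x and y are related only through a shared upstream cause.
rng = np.random.default_rng(0)
shared_cause = rng.normal(size=2000)
x = shared_cause + rng.normal(scale=0.5, size=2000)
y = shared_cause + rng.normal(scale=0.5, size=2000)

raw_r = float(np.corrcoef(x, y)[0, 1])                # strong raw correlation
adjusted_r = partial_correlation(x, y, shared_cause)  # shrinks toward zero
```

A raw correlation that collapses after adjustment is the pattern expected under the "shared upstream cause" alternative explanation listed above.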
Study Design
Generate two output types: a Rapid Investigation Brief for operational decision-making, and a Full Research Proposal for formal study design. Both address specific evidence gaps identified by the Causation Agent. Co-designed with a domain expert.
- Pipeline output from all upstream agents
- NIH and WHO study design frameworks
- Precedent studies from literature results
Learning Path
The 10 Projects
#1 Unlikely Correlations Engine
Status: active
Multi-agent pipeline for detecting hidden associations between health outcomes and environmental, occupational, or geographic factors. Integrates 20+ public data sources, searches published literature for supporting evidence, evaluates causal plausibility, and generates structured study proposals for further investigation.
- CDC Wonder / CMS Chronic Conditions — health outcomes by county
- EPA — pesticide use, toxic releases, Superfund sites, air quality
- USGS — land use, water quality, pesticide application estimates
- PubMed / Semantic Scholar — academic literature
- Census ACS / County Health Rankings — demographic confounders
- openFDA FAERS — adverse event reports
End-to-end demonstration using the Parkinson's disease / pesticide exposure case. Reproduce the known correlation, retrieve supporting literature, and generate a causation assessment.
Multi-agent architecture, data APIs, geospatial analysis, literature synthesis
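For the geospatial analysis component, the LISA statistic most commonly used is the local Moran's I. The sketch below computes it with NumPy from a vector of county values and a neighbor matrix; the function name, the row-standardized weights, and the four-region toy example are assumptions for illustration. Significance testing (usually done by permutation) is left out.

```python
import numpy as np


def local_morans_i(values, weights):
    """Local Moran's I for each region, using row-standardized weights.

    values:  length-n array of the variable of interest (e.g. a county rate)
    weights: (n, n) neighbor matrix, nonzero where two regions are neighbors
    """
    x = np.asarray(values, dtype=float)
    W = np.asarray(weights, dtype=float)
    # Row-standardize so each region's neighbor weights sum to 1
    row_sums = W.sum(axis=1, keepdims=True)
    W = np.divide(W, row_sums, out=np.zeros_like(W), where=row_sums > 0)
    z = x - x.mean()       # deviations from the mean
    m2 = (z ** 2).mean()   # second moment, used for scaling
    # I_i = (z_i / m2) * sum_j w_ij * z_j
    return z * (W @ z) / m2


# Toy example: four regions in a row; each is a neighbor of the adjacent ones.
toy_values = [10.0, 9.0, 1.0, 2.0]
toy_neighbors = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
toy_lisa = local_morans_i(toy_values, toy_neighbors)
```

Positive values mark regions whose neighbors deviate from the mean in the same direction (high-high or low-low clusters); negative values mark spatial outliers.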
#2 Occupational Disease Pattern Miner
Status: planned
Analyze health outcome data by occupation to identify statistically unusual illness rates. Surface occupations with elevated risk for specific disease categories relative to population baselines.
- NIOSH occupational health datasets
- BLS occupational data
- Census County Business Patterns
Ranked table of occupations with the highest deviation from baseline rates for a user-selected disease category, with confidence intervals.
Statistical z-scores, pandas dataframes, Streamlit data display
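A ranked table of this kind could be produced along the following lines. This is a sketch under stated assumptions: the input has one row per occupation with `cases` and `workers` columns (hypothetical names), the baseline is the pooled rate across all rows, the z-score uses a normal approximation to the binomial, and the interval is a 95% Wald interval. The counts are made up purely to exercise the code.

```python
import numpy as np
import pandas as pd


def rank_occupations(df):
    """Rank occupations by z-score of their case rate against the pooled baseline."""
    out = df.copy()
    baseline = out["cases"].sum() / out["workers"].sum()
    out["rate"] = out["cases"] / out["workers"]
    # z-score: standard error evaluated under the baseline rate
    se_baseline = np.sqrt(baseline * (1 - baseline) / out["workers"])
    out["z"] = (out["rate"] - baseline) / se_baseline
    # 95% Wald confidence interval around each occupation's own rate
    se_rate = np.sqrt(out["rate"] * (1 - out["rate"]) / out["workers"])
    out["ci_low"] = (out["rate"] - 1.96 * se_rate).clip(lower=0)
    out["ci_high"] = (out["rate"] + 1.96 * se_rate).clip(upper=1)
    return out.sort_values("z", ascending=False).reset_index(drop=True)


# Illustrative counts only; these are not real occupational data.
occupations = pd.DataFrame({
    "occupation": ["occupation_a", "occupation_b", "occupation_c"],
    "workers": [10_000, 20_000, 15_000],
    "cases": [150, 200, 150],
})
ranked_occupations = rank_occupations(occupations)
```

For small occupations or rare diseases the normal approximation is poor, and an exact or Wilson interval would be a safer choice.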
#3 FDA Adverse Event Anomaly Detector
Status: planned
Analyze FDA FAERS reports to identify unexpected co-occurrences between drugs and health conditions not listed in current labeling.
- openFDA FAERS API
Input a drug name and return a ranked list of reported conditions, sorted by divergence from the expected adverse event profile.
REST API integration, statistical anomaly detection, AI-assisted interpretation
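The project description does not say how divergence would be measured. One common disproportionality measure in pharmacovigilance is the proportional reporting ratio (PRR), shown below as a stand-in. The function names, the input shape (event name mapped to report count), and the example counts are all assumptions; the sketch also simplifies by treating each report as carrying a single event.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of report counts.

    a: reports with the drug and the event     b: drug, other events
    c: other drugs with the event              d: other drugs, other events
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def rank_events(drug_counts, background_counts):
    """Rank events for one drug by PRR against all other drugs (hypothetical helper)."""
    drug_total = sum(drug_counts.values())
    background_total = sum(background_counts.values())
    scored = []
    for event, a in drug_counts.items():
        c = background_counts.get(event, 0)
        if c == 0:
            continue  # no background reports: PRR undefined, skip in this sketch
        prr = proportional_reporting_ratio(a, drug_total - a, c, background_total - c)
        scored.append((event, prr))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up counts for illustration; not real FAERS data.
drug_reports = {"nausea": 50, "tremor": 30, "rash": 20}
other_reports = {"nausea": 5000, "tremor": 300, "rash": 2000, "headache": 2700}
ranked_events = rank_events(drug_reports, other_reports)
```

A PRR near 1 means the event is reported for the drug about as often as for other drugs; larger values flag candidates for review rather than establishing causation.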
#4 Policy Brief Summarizer
Status: planned
Structured summarization of health policy documents, academic papers, and institutional reports. Extracts key findings, recommendations, and evidence quality, and identifies gaps requiring further research.
- User-uploaded PDFs or pasted text
Single-page application accepting document text and returning a structured summary with cited findings and evidence assessment.
Claude API integration, prompt engineering, Streamlit interface design
#5 Multi-Document Research Q&A
Status: planned
Retrieval-augmented generation system for querying across a library of research documents. Synthesizes answers from multiple sources with citations to specific documents and passages.
- User-uploaded PDF library
Upload three documents, submit a natural language query, receive a synthesized answer with source attribution.
RAG architecture, vector embeddings, document indexing
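The retrieval step of a RAG system ranks stored passages by similarity to the query before anything is sent to a language model. The sketch below shows that step using only the standard library, with word-count vectors standing in for learned embeddings. That substitution is an assumption made to keep the example self-contained; a real system would use an embedding model and a vector index, and the function names and passages here are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query as (score, doc_id, text)."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id, text) for doc_id, text in passages]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical passages, each tagged with the document it came from.
library = [
    ("doc1", "Pesticide exposure was associated with Parkinson's disease in rural counties."),
    ("doc2", "Out-of-pocket health expenditure rose across low-income countries."),
    ("doc3", "Air quality monitoring shows ozone trends."),
]
top_passages = retrieve("pesticide exposure and Parkinson's disease", library, k=2)
```

Keeping the document ID alongside each passage is what makes source attribution possible in the generated answer.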
#6 Cross-Study Contradiction Finder
Status: planned
Comparative analysis of multiple studies on the same health topic. Identifies agreements, contradictions, and methodological differences. Assesses weight of evidence accounting for study design, population, and potential sources of bias.
- User-uploaded PDFs or PubMed IDs
Input two paper abstracts and receive a structured comparison of findings, methodology, and identified discrepancies.
Structured prompting, comparative analysis, evidence synthesis
#7 Health System Scorecard Dashboard
Status: planned
Comparative visualization of health system indicators across countries. Covers UHC service coverage index, health expenditure, out-of-pocket costs, workforce density, and related metrics.
- WHO Global Health Observatory API
- World Bank Health Nutrition and Population API
Select countries and indicators, then generate a side-by-side comparison dashboard with trend lines.
Data API integration, interactive charting, dashboard layout
#8 Global Health Funding Tracker
Status: planned
Analysis of global health funding flows from major donors. Identifies trends, gaps, and alignment opportunities across disease areas and geographic regions.
- IATI (International Aid Transparency Initiative) open data
- Gates Foundation grant database
Filter grants by disease area and year, then generate a funding trend analysis that highlights gaps.
Data filtering, AI-assisted synthesis, export functionality
#9 Stakeholder Interview Synthesizer
Status: planned
Automated extraction of themes, key findings, notable statements, and action items from interview transcripts and stakeholder consultation records.
- User-uploaded .txt or .docx transcripts
Process a single transcript, extract structured themes and action items with source attribution.
File handling, structured extraction, JSON output formatting
#10 Health Policy Diffusion Tracker
Status: planned
Analysis of how specific health policies propagate across countries over time. Identifies contextual factors (economic, political, demographic) that correlate with adoption timing.
- WHO policy databases
- Academic literature via OpenAlex API
- World Bank governance indicators
Map the adoption timeline of a specific policy globally and generate a hypothesis about factors correlated with early adoption.
Timeline visualization, multi-source data joins, hypothesis generation