An open-source, multi-agent research platform that surfaces hidden correlations between health outcomes and environmental, occupational, and geographic factors. Integrates public datasets, scientific literature, and structured causal analysis to identify leads that warrant formal investigation.
The Unlikely Correlations Engine
Systematic detection of hidden health-environment associations
Epidemiological research identified elevated Parkinson's disease rates among populations living near golf courses. Follow-up analysis pointed to chronic pesticide exposure from course maintenance as the likely underlying factor: residential proximity to treated land, not the recreational facility itself, appears to be the relevant variable. The association was detectable in existing public datasets, but only by cross-referencing sources that are not typically analyzed together.
This project applies that same cross-referencing approach systematically. The platform correlates health outcome data with geographic, occupational, environmental, and demographic variables at the US county level. When a statistically significant association is detected, it searches the published literature for supporting or contradicting evidence, evaluates causal plausibility, and generates structured study proposals for further investigation.
Correlation Detection
Cross-reference health outcomes with geographic, occupational, environmental, and demographic data at the US county level. Compute statistical correlations and identify spatial clusters using local indicators of spatial association (LISA), a spatial autocorrelation method.
- CDC Wonder — mortality by ICD-10 code
- CMS Chronic Conditions — disease prevalence by county
- EPA — pesticide use, toxic release inventory, Superfund sites
- USGS — land use, water quality, pesticide application
- Census ACS — demographic and socioeconomic confounders
- County Health Rankings — composite health indicators
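The two statistics named above can be sketched in a few lines. This is a minimal illustration, not the platform's implementation: it assumes county rates and an adjacency list are already loaded, uses row-standardized binary weights, and omits the permutation test that LISA normally uses to assess significance. The function names and the four-county toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sx * sy)


def local_morans_i(values, neighbors):
    """Local Moran's I per county, using row-standardized binary weights.

    values:    dict of county id -> outcome rate
    neighbors: dict of county id -> list of adjacent county ids
    """
    n = len(values)
    mean = sum(values.values()) / n
    m2 = sum((v - mean) ** 2 for v in values.values()) / n
    result = {}
    for county, v in values.items():
        adjacent = neighbors.get(county, [])
        if not adjacent or m2 == 0:
            result[county] = 0.0
            continue
        # Spatial lag: average deviation from the mean among neighbors.
        lag = sum(values[j] - mean for j in adjacent) / len(adjacent)
        result[county] = (v - mean) / m2 * lag
    return result


# Hypothetical toy data: four counties in a row (A-B-C-D), not real figures.
rates = {"A": 10.0, "B": 12.0, "C": 2.0, "D": 1.0}
exposure = {"A": 5.0, "B": 6.5, "C": 1.0, "D": 0.8}
adjacency = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}

counties = sorted(rates)
r = pearson_r([exposure[c] for c in counties], [rates[c] for c in counties])
lisa = local_morans_i(rates, adjacency)

print(f"Pearson r = {r:.3f}")
for c in counties:
    print(c, round(lisa[c], 3))
```

A positive local value marks a county whose rate deviates from the mean in the same direction as its neighbors (a high-high or low-low cluster). A negative value marks a county that differs from its surroundings.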
Literature Mining
Search PubMed, Semantic Scholar, and other academic databases for published evidence related to detected correlations. Synthesize findings across studies, identify contradictions, and flag evidence gaps.
- PubMed / NCBI E-utilities
- Semantic Scholar API
- FDA FAERS — adverse event reports
- WHO Global Health Observatory
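As an illustration of the literature search step, the sketch below builds a PubMed query URL for the NCBI E-utilities ESearch endpoint. It only constructs the URL and makes no network request. The helper name and the choice to restrict both terms to the title and abstract are assumptions for the example, not details taken from the project.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(outcome, exposure, max_results=20):
    """Return an ESearch URL for papers mentioning both outcome and exposure.

    Hypothetical helper: requires both phrases in the title or abstract.
    """
    term = f'("{outcome}"[Title/Abstract]) AND ("{exposure}"[Title/Abstract])'
    params = {
        "db": "pubmed",         # database to search
        "term": term,           # query string
        "retmode": "json",      # response format
        "retmax": max_results,  # maximum number of IDs returned
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = build_pubmed_query("Parkinson disease", "pesticides", max_results=5)
print(url)
```

The returned URL can be fetched with any HTTP client. The response lists matching PubMed IDs, which a later step could resolve to abstracts for synthesis.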
Causation Analysis
Evaluate detected associations against established causal frameworks. Assess confounders, dose-response relationships, biological plausibility, temporality, and consistency across studies.
- Cross-study comparison and contradiction analysis
- Exposure pathway modeling
- Bradford Hill criteria assessment
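One way to keep a Bradford Hill assessment structured is to record a rating and a short note per viewpoint, then report which viewpoints remain unassessed. The sketch below is an assumption about how such a record could look. The class name, rating labels, and example notes are hypothetical placeholders, and the tally is a bookkeeping aid rather than a causal score.

```python
from dataclasses import dataclass, field

# The nine Bradford Hill viewpoints.
HILL_VIEWPOINTS = (
    "strength", "consistency", "specificity", "temporality",
    "biological_gradient", "plausibility", "coherence",
    "experiment", "analogy",
)
RATINGS = ("supported", "not_supported", "unclear")


@dataclass
class HillAssessment:
    """Hypothetical record of viewpoint ratings for one association."""
    association: str
    ratings: dict = field(default_factory=dict)  # viewpoint -> (rating, note)

    def rate(self, viewpoint, rating, note=""):
        if viewpoint not in HILL_VIEWPOINTS:
            raise ValueError(f"unknown viewpoint: {viewpoint}")
        if rating not in RATINGS:
            raise ValueError(f"unknown rating: {rating}")
        self.ratings[viewpoint] = (rating, note)

    def summary(self):
        """Count ratings per label, plus viewpoints not yet assessed."""
        counts = {label: 0 for label in RATINGS}
        for rating, _note in self.ratings.values():
            counts[rating] += 1
        counts["not_assessed"] = len(HILL_VIEWPOINTS) - len(self.ratings)
        return counts


# Placeholder ratings for illustration only, not findings.
assessment = HillAssessment("pesticide exposure and Parkinson's disease")
assessment.rate("temporality", "unclear", "placeholder note")
assessment.rate("plausibility", "supported", "placeholder note")
summary = assessment.summary()
print(summary)
```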
Study Design
For associations with sufficient supporting evidence, generate structured research proposals including study population, methodology, controls, outcome measures, and timeline.
- Existing study methodologies as templates
- NIH and WHO study design frameworks
Learning Path
The 10 Projects
#1 Unlikely Correlations Engine
Status: active
Multi-agent pipeline for detecting hidden associations between health outcomes and environmental, occupational, or geographic factors. Integrates 20+ public data sources, searches published literature for supporting evidence, evaluates causal plausibility, and generates structured study proposals for further investigation.
- CDC Wonder / CMS Chronic Conditions — health outcomes by county
- EPA — pesticide use, toxic releases, Superfund sites, air quality
- USGS — land use, water quality, pesticide application estimates
- PubMed / Semantic Scholar — academic literature
- Census ACS / County Health Rankings — demographic confounders
- openFDA FAERS — adverse event reports
End-to-end demonstration using the Parkinson's disease / pesticide exposure case. Reproduce the known correlation, retrieve supporting literature, and generate a causation assessment.
Multi-agent architecture, data APIs, geospatial analysis, literature synthesis
#2 Occupational Disease Pattern Miner
Status: planned
Analyze health outcome data by occupation to identify statistically unusual illness rates. Surface occupations with elevated risk for specific disease categories relative to population baselines.
- NIOSH occupational health datasets
- BLS occupational data
- Census County Business Patterns
Ranked table of occupations with the highest deviation from baseline rates for a user-selected disease category, with confidence intervals.
Statistical z-scores, pandas dataframes, Streamlit data display
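The ranked table described for this project could be produced along the following lines. This is a sketch under stated assumptions: each occupation has a case count and a worker count, the z-score uses a one-sample proportion test against the baseline rate, and the interval is a 95% Wald interval. The function name and the three example records are hypothetical, and the sketch uses plain Python rather than pandas to stay self-contained.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% normal quantile


def rank_occupations(records, baseline_rate):
    """Rank occupations by deviation of their rate from a baseline.

    records: iterable of (occupation, cases, workers) tuples
    Returns rows sorted by z-score, highest first.
    """
    rows = []
    for name, cases, workers in records:
        rate = cases / workers
        # Standard error under the null hypothesis (rate equals baseline).
        se_null = sqrt(baseline_rate * (1 - baseline_rate) / workers)
        z = (rate - baseline_rate) / se_null
        # Wald interval around the observed rate, clipped to [0, 1].
        se_obs = sqrt(rate * (1 - rate) / workers)
        rows.append({
            "occupation": name,
            "rate": rate,
            "z": z,
            "ci_low": max(0.0, rate - Z_95 * se_obs),
            "ci_high": min(1.0, rate + Z_95 * se_obs),
        })
    return sorted(rows, key=lambda row: row["z"], reverse=True)


# Hypothetical counts for illustration only.
records = [
    ("occupation_a", 30, 1000),
    ("occupation_b", 12, 1000),
    ("occupation_c", 5, 1000),
]
ranked = rank_occupations(records, baseline_rate=0.01)
for row in ranked:
    print(f'{row["occupation"]}: z={row["z"]:.2f}, '
          f'rate={row["rate"]:.3f} ({row["ci_low"]:.3f} to {row["ci_high"]:.3f})')
```

With many occupations compared at once, some large z-scores will occur by chance, so a real ranking would also need a multiple-comparison correction.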
#3 FDA Adverse Event Anomaly Detector
Status: planned
Analyze FDA FAERS reports to identify unexpected co-occurrences between drugs and health conditions not listed in current labeling.
- openFDA FAERS API
Input a drug name, return a ranked list of reported conditions sorted by divergence from expected adverse event profile.
REST API integration, statistical anomaly detection, AI-assisted interpretation
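The project description does not name a divergence metric. As one possible choice, the sketch below uses the proportional reporting ratio (PRR), a common disproportionality measure in pharmacovigilance. It assumes event counts for the drug and for all other drugs have already been retrieved, and it treats each report as carrying one event, which is a simplification. All names and counts are hypothetical.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: target drug, target event      b: target drug, other events
    c: other drugs, target event      d: other drugs, other events
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def rank_events(drug_counts, background_counts):
    """Rank events for one drug by PRR against all other drugs.

    Both arguments map event name -> number of reports.
    """
    drug_total = sum(drug_counts.values())
    background_total = sum(background_counts.values())
    ranked = []
    for event, a in drug_counts.items():
        c = background_counts.get(event, 0)
        if c == 0:
            continue  # no background reports, so PRR cannot be computed
        ranked.append((event, prr(a, drug_total - a, c, background_total - c)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical report counts for illustration only.
drug_counts = {"event_x": 20, "event_y": 80}
background_counts = {"event_x": 100, "event_y": 9800}
ranked_events = rank_events(drug_counts, background_counts)
for event, value in ranked_events:
    print(f"{event}: PRR = {value:.2f}")
```

A PRR well above 1 means the event is reported proportionally more often for this drug than for others. It is a signal for review, not evidence of causation.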
#4 Policy Brief Summarizer
Status: planned
Structured summarization of health policy documents, academic papers, and institutional reports. Extracts key findings, recommendations, and evidence quality, and identifies gaps requiring further research.
- User-uploaded PDFs or pasted text
Single-page application accepting document text and returning a structured summary with cited findings and evidence assessment.
Claude API integration, prompt engineering, Streamlit interface design
#5 Multi-Document Research Q&A
Status: planned
Retrieval-augmented generation system for querying across a library of research documents. Synthesizes answers from multiple sources with citations to specific documents and passages.
- User-uploaded PDF library
Upload three documents, submit a natural language query, receive a synthesized answer with source attribution.
RAG architecture, vector embeddings, document indexing
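The retrieval half of a RAG pipeline can be illustrated without any model dependency. The sketch below ranks passages by cosine similarity over bag-of-words counts. This is a stand-in for learned vector embeddings, chosen only so the example runs with the standard library. The function names and the three sample passages are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, passages, k=2):
    """Return the k best (score, doc_id, text) tuples for a query.

    passages: list of (doc_id, text); doc_id is kept for source attribution.
    """
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine(query_vec, Counter(tokenize(text))), doc_id, text)
        for doc_id, text in passages
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Invented sample passages for illustration only.
passages = [
    ("doc_a", "Pesticide exposure near treated land was associated with Parkinson disease risk."),
    ("doc_b", "Out-of-pocket health expenditure varies widely across countries."),
    ("doc_c", "Interview transcripts were coded for themes and action items."),
]
top = retrieve("Is pesticide exposure linked to Parkinson disease?", passages)
for score, doc_id, _text in top:
    print(doc_id, round(score, 3))
```

In a full pipeline, the retrieved passages and their document IDs would be passed to the language model so that the synthesized answer can cite its sources.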
#6 Cross-Study Contradiction Finder
Status: planned
Comparative analysis of multiple studies on the same health topic. Identifies agreements, contradictions, and methodological differences. Assesses weight of evidence accounting for study design, population, and potential sources of bias.
- User-uploaded PDFs or PubMed IDs
Input two paper abstracts, receive a structured comparison of findings, methodology, and identified discrepancies.
Structured prompting, comparative analysis, evidence synthesis
#7 Health System Scorecard Dashboard
Status: planned
Comparative visualization of health system indicators across countries. Covers UHC service coverage index, health expenditure, out-of-pocket costs, workforce density, and related metrics.
- WHO Global Health Observatory API
- World Bank Health Nutrition and Population API
Select countries and indicators, generate a side-by-side comparison dashboard with trend lines.
Data API integration, interactive charting, dashboard layout
#8 Global Health Funding Tracker
Status: planned
Analysis of global health funding flows from major donors. Identifies trends, gaps, and alignment opportunities across disease areas and geographic regions.
- IATI (International Aid Transparency Initiative) open data
- Gates Foundation grant database
Filter grants by disease area and year, generate funding trend analysis with identified gaps.
Data filtering, AI-assisted synthesis, export functionality
#9 Stakeholder Interview Synthesizer
Status: planned
Automated extraction of themes, key findings, notable statements, and action items from interview transcripts and stakeholder consultation records.
- User-uploaded .txt or .docx transcripts
Process a single transcript, extract structured themes and action items with source attribution.
File handling, structured extraction, JSON output formatting
#10 Health Policy Diffusion Tracker
Status: planned
Analysis of how specific health policies propagate across countries over time. Identifies contextual factors (economic, political, demographic) that correlate with adoption timing.
- WHO policy databases
- Academic literature via OpenAlex API
- World Bank governance indicators
Map the adoption timeline of a specific policy globally and generate a hypothesis about factors correlated with early adoption.
Timeline visualization, multi-source data joins, hypothesis generation