Tuva Project: Open-Source Healthcare Modeling

Welcome to the Tuva ML Models Hub — an open-source ecosystem for healthcare risk prediction, cost benchmarking, and expected value modeling.

Mission

The Tuva Project is dedicated to democratizing healthcare knowledge.
We believe that access to robust models should not be locked behind paywalls or proprietary systems.

These models are typically:

Expensive to build and maintain
Trained on complex healthcare data
Essential for policy, research, and actuarial strategy

By open-sourcing these tools, we empower health systems, researchers, and startups to build with transparency and scale with trust.

What You'll Find Here

This hub is a growing library of machine learning models designed to support:

Cost prediction
Encounter forecasting
Risk stratification
Benchmarking for Medicare, Medicaid, and commercial populations

Each model includes:

Trained model artifacts (e.g., .pkl, .joblib)
Scripts for running predictions
Complete documentation and evaluation metrics

Quick Start: End-to-End Workflow

This section provides high-level instructions for running a model with the Tuva Project. The workflow involves preparing benchmark data using dbt, running a Python prediction script, and optionally ingesting the results back into dbt for analysis.

1. Configure Your dbt Project

You need to enable the correct variables in your dbt_project.yml file to control the workflow.

A. Enable Benchmark Marts

These two variables control which parts of the Tuva Project are active. They are false by default.

# in dbt_project.yml
vars:
  benchmarks_train: true
  benchmarks_already_created: true

benchmarks_train: Set to true to build the datasets that the ML models will use for making predictions.
benchmarks_already_created: Set to true to ingest model predictions back into the project as a new dbt source.

B. (Optional) Set Prediction Source Locations

If you plan to bring predictions back into dbt for analysis, you must define where dbt can find the prediction data.

# in dbt_project.yml
vars:
  predictions_person_year: "{{ source('benchmark_output', 'person_year') }}"
  predictions_inpatient: "{{ source('benchmark_output', 'inpatient') }}"
  predictions_inpatient_prospective: "{{ source('benchmark_output', 'inpatient_predictions_prospective') }}"
  predictions_person_year_prospective: "{{ source('benchmark_output', 'pmpm_predictions_prospective') }}"

C. Configure `sources.yml`

Ensure your sources.yml file includes a definition for the source you referenced above (e.g., benchmark_output) that points to the database and schema where your model's prediction outputs are stored.

2. The 3-Step Run Process

This workflow can be managed by any orchestration tool (e.g., Airflow, Prefect, Fabric Notebooks) or run manually from the command line.

Step 1: Generate the Training & Benchmarking Data

Run the Tuva Project with benchmarks_train enabled. This creates the input data required by the ML model.

dbt build --vars '{benchmarks_train: true}'

To run only the benchmark mart:

dbt build --select tag:benchmarks_train --vars '{benchmarks_train: true}'

Step 2: Run the Prediction Python Code

Execute the Python script to generate predictions. This script will read the data created in Step 1 and write the prediction outputs to a persistent location (e.g., a table in your data warehouse).

We have provided example Snowflake Notebook code within each model's repository that was used in Tuva's environment.

Step 3: (Optional) Analyze Predictions in dbt

To bring the predictions back into the Tuva Project for analysis, run dbt again with benchmarks_already_created enabled. This populates the analytics marts.

dbt build --vars '{benchmarks_already_created: true, benchmarks_train: false}'

To run only the analysis models:

dbt build --select tag:benchmarks_analysis --vars '{benchmarks_already_created: true, benchmarks_train: false}'

Current Focus: Medicare (CMS)

Our initial models use de-identified CMS data to calculate:

Expected values for paid amounts and encounter counts at the member-year level
Readmission rate
Discharge location
Length of stay

Models like the Encounter Cost Prediction Model are trained on the 2022/23 Medicare Standard Analytic Files (SAF), using standardized preprocessing and evaluation pipelines.

What's Next

We are expanding to include:

Commercial claims models (e.g., ESI, employer-based populations)
Medicaid utilization and cost models

Contribute

This hub is open to community contributions.

If you're working on a healthcare machine learning model and want to share it:

Fork one of our repositories
Upload your trained model and code
Document your inputs, outputs, and evaluation
Open a pull request or reach out to our team

We believe risk modeling should be open infrastructure.
Help us build a future where healthcare knowledge is free and shared.