MLOps · Insurance AI · Risk Modeling · Senior Design · 2024–2025

Homeowner Loss History Prediction

A production-style MLOps system for homeowner loss prediction that turns a manual actuarial modeling workflow into an automated, auditable, human-in-the-loop AI pipeline. Built with Grange Insurance × University of Toledo.

R² 0.982

Model accuracy

RMSE 7216

Prediction error

26%

RMSE improvement vs GLM baseline

75%

Manual review reduction

MAE $29.10

Avg prediction error

100+

Engineered features


Overview

From notebook workflow to governed, continuously operating AI platform

Homeowner Loss History Prediction was a senior design project developed in partnership with Grange Insurance to modernize how homeowner risk and pure premium predictions are modeled, validated, monitored, and updated. Instead of treating machine learning as a one-time notebook workflow, the project reframed risk modeling as a continuously operating MLOps system.

The system automates the full lifecycle: raw data ingestion, schema validation, preprocessing, feature engineering, drift detection, model training, hyperparameter tuning, model evaluation, experiment tracking, monitoring, alerting, and human-approved deployment.

The key contribution was not just improving model accuracy. The bigger engineering contribution was building a governed ML operating system around the model: every dataset, schema, model run, drift event, metric, and manual decision can be tracked, reviewed, and reproduced.

The Problem

Risk patterns shift faster than manual modeling can adapt

Insurance companies need accurate homeowner risk models for pricing, underwriting, and financial stability. However, many modeling workflows are still too manual to keep pace: claim volatility, inflation, weather events, and regional property-risk trends all create windows in which stale models drive pricing decisions.

Manual rebuilds are slow

Model updates required manual data pulls, notebook-based preprocessing, and repeated rebuilds — consuming actuarial time that should be focused on interpretation.

Silent schema failures

Data schema changes could silently break downstream training: no validation layer caught type mismatches or missing columns before they corrupted model runs.

No drift detection

Model drift was difficult to detect early without automated monitoring. Stale models could price risk for weeks before a human noticed performance degradation.

No experiment lineage

Without tracking, experiments and model versions were hard to reproduce. There was no audit trail for when a model was promoted, what data it saw, or who approved it.

Interpretability gap

Business stakeholders and actuaries needed explainability, not just predictions. A black-box model creates regulatory and trust problems in insurance.

No governance layer

High-stakes insurance decisions require human oversight, auditability, and the ability to roll back. A notebook-only workflow provides none of these guarantees.

System Architecture

Modular MLOps pipeline — end-to-end

The architecture was built as a modular MLOps workflow instead of a single monolithic script. Each stage is independently testable, observable, and replaceable.

Raw Dataset → AWS S3

Raw homeowner loss data stored in S3 and pulled into the pipeline by Airflow.

Data Ingestion → Airflow

Airflow schedules ingestion jobs, handles file staging, and coordinates downstream tasks.

Schema Validation → Pandera

Validates required columns, data types, and constraints. Logs and schema snapshots stored for auditability.
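
As a rough illustration of the three checks named above (required columns, data types, constraints), here is a minimal plain-Python sketch. It is a stand-in for the idea, not the project's Pandera schema: the column names, value ranges, and the `Column` and `validate_rows` helpers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class Column:
    """One column rule: expected Python type plus an optional value check."""
    dtype: type
    check: Optional[Callable[[Any], bool]] = None
    nullable: bool = False


# Hypothetical schema; the real column names are not given in this write-up.
SCHEMA = {
    "policy_id": Column(str),
    "year_built": Column(int, check=lambda v: 1800 <= v <= 2025),
    "prior_claim_count": Column(int, check=lambda v: v >= 0),
    "incurred_loss": Column(float, check=lambda v: v >= 0.0),
}


def validate_rows(rows, schema):
    """Return a list of readable errors; an empty list means the batch passed."""
    errors = []
    for i, row in enumerate(rows):
        for name, rule in schema.items():
            if name not in row:
                errors.append(f"row {i}: missing column '{name}'")
                continue
            value = row[name]
            if value is None:
                if not rule.nullable:
                    errors.append(f"row {i}: '{name}' is null")
                continue
            if not isinstance(value, rule.dtype):
                errors.append(
                    f"row {i}: '{name}' expected {rule.dtype.__name__}, "
                    f"got {type(value).__name__}"
                )
                continue
            if rule.check is not None and not rule.check(value):
                errors.append(f"row {i}: '{name}' failed constraint with value {value!r}")
    return errors
```

A pipeline task could refuse to start training whenever the returned list is non-empty and store the errors next to the schema snapshot for later audit.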

Preprocessing → Pandas / sklearn

Missing values, outliers, encoding, feature transformations, and train/test splitting.

Drift Detection

Statistical drift checks against reference distributions. Detected drift triggers alerts and proposed remediation actions.
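
The write-up does not say which statistical test the drift stage uses. One common choice for comparing a feature against a reference distribution is the Population Stability Index (PSI), sketched below as an assumption. The function names, bin count, and the 0.10 / 0.25 cutoffs (a widely quoted rule of thumb) are illustrative, not taken from the project.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges come from the reference sample's range; eps avoids log(0)
    when a bin is empty in one of the samples.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference column

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values to edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_status(score, warn=0.10, alert=0.25):
    """Map a PSI score to a status using rule-of-thumb cutoffs."""
    if score >= alert:
        return "alert"
    if score >= warn:
        return "warn"
    return "stable"
```

In a governed setup, an "alert" status would raise a notification and open a remediation proposal rather than retrain automatically.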

Model Training → XGBoost

XGBoost models trained on 100+ engineered loss-history features with monotonic constraints.
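
XGBoost exposes monotonic constraints through its `monotone_constraints` training parameter, which accepts a string such as "(1,0,-1)" with one entry per feature in column order. The sketch below shows one way such a string could be built from a readable mapping. The feature names, their directions, and the `reg:squarederror` objective are assumptions for illustration; the project's real feature list and objective are not given here.

```python
# Hypothetical feature names and directions, for illustration only.
# +1: prediction may not decrease as the feature increases
# -1: prediction may not increase as the feature increases
#  0: unconstrained
FEATURE_DIRECTIONS = {
    "prior_claim_count": 1,
    "roof_age_years": 1,
    "deductible_amount": -1,
    "square_footage": 0,
}


def monotone_constraints(feature_order, directions):
    """Build the "(1,0,-1)" style string used for `monotone_constraints`.

    Features missing from `directions` are left unconstrained (0).
    """
    values = []
    for name in feature_order:
        d = directions.get(name, 0)
        if d not in (-1, 0, 1):
            raise ValueError(f"direction for '{name}' must be -1, 0, or 1, got {d}")
        values.append(str(d))
    return "(" + ",".join(values) + ")"


# The order must match the column order of the training matrix.
feature_order = ["square_footage", "prior_claim_count", "deductible_amount", "roof_age_years"]
params = {
    "objective": "reg:squarederror",  # assumed objective
    "monotone_constraints": monotone_constraints(feature_order, FEATURE_DIRECTIONS),
}
```

Keeping the directions in a named mapping makes each constraint reviewable by an actuary, which fits the governance goal better than a bare tuple of integers.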

Hyperparameter Optimization → Hyperopt

Bayesian optimization searches for improved configurations and logs the best parameters to MLflow.

Model Tracking → MLflow

Tracks experiments, metrics, artifacts, model versions, and registry transitions across every run.

Monitoring & Alerts → Prometheus + Slack

Prometheus tracks system metrics. Slack sends alerts for drift, failures, retraining events, and approval requests.
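
For the Slack side, a minimal sketch of an alert sender is shown below, assuming the alerts go through a Slack incoming webhook, which accepts a JSON body with a `text` field. Whether the project used webhooks or another Slack API method is not stated, and the message format, function names, and webhook URL are placeholders.

```python
import json
import urllib.request


def build_alert(event, severity, detail):
    """Build the JSON body for a Slack incoming-webhook message."""
    return {"text": f"[{severity.upper()}] {event}: {detail}"}


def send_alert(webhook_url, payload, timeout=5):
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status
```

Separating message construction from delivery keeps the alert text testable without network access.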

Human-in-the-Loop Governance

Manual approval gates allow analysts to approve retraining, reject proposed fixes, override decisions, or promote models.
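
The approval gate can be pictured as a small state machine: the pipeline files a proposal, nothing runs while it is pending, and every human decision is appended to an audit log. The sketch below is a hypothetical model of that idea; the class, field names, and decision vocabulary are assumptions rather than the project's actual interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

DECISION_TO_STATUS = {
    "approve": "approved",
    "reject": "rejected",
    "override": "overridden",
}


@dataclass
class Proposal:
    """A pipeline-generated action that waits for a human decision."""
    proposal_id: str
    action: str   # e.g. "retrain" or "promote"
    reason: str
    status: str = "pending"
    audit_log: list = field(default_factory=list)

    def decide(self, reviewer, decision, note=""):
        """Record a human decision exactly once and append it to the audit log."""
        if decision not in DECISION_TO_STATUS:
            raise ValueError(f"unknown decision: {decision}")
        if self.status != "pending":
            raise RuntimeError(f"proposal already {self.status}")
        self.status = DECISION_TO_STATUS[decision]
        self.audit_log.append({
            "reviewer": reviewer,
            "decision": decision,
            "note": note,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self.status


def may_execute(proposal):
    """Only explicitly approved proposals may run; everything else is blocked."""
    return proposal.status == "approved"
```

Blocking by default means a missed notification delays a retrain instead of letting an unreviewed model reach production.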

Deployment → AWS EC2/S3 + CI/CD

Containerized services deployed on AWS EC2/S3, with CI/CD support for repeatable builds and updates.

Technical Stack

Purpose-selected tools for each layer

Machine Learning

XGBoost

Hyperopt / Bayesian Opt.

scikit-learn

SHAP explainability

Decile lift charts

RMSE · MAE · R²

MLOps & Automation

Apache Airflow

MLflow

Pandera

Prometheus

Slack API

GitHub Actions

Docker

Cloud & Storage

AWS EC2

AWS S3

IAM-style access control

Containerized deployment

CI/CD pipeline

Data & Features

Pandas

NumPy

100+ engineered features

Schema snapshots

Drift reference distributions

Monotonic constraints

Model Design

Why XGBoost — and what it took to govern it

XGBoost was chosen because it offered the best balance between predictive performance, explainability, speed, and insurance-industry practicality — not just the highest raw metric.

Why XGBoost

✓ Strong performance on structured/tabular insurance data

✓ Supports monotonic constraints for logically consistent risk modeling

✓ Works well with SHAP for feature-level explainability

✓ More governable than deep neural networks

✓ Scales to production without infrastructure overhead

Alternatives Considered

GLM Pipeline: baseline; XGBoost improved on its RMSE by 26%

Random Forest: slower, less interpretable

LightGBM: strong, but less insurance-standard

CatBoost: good, but SHAP integration less mature

Deep Learning / MLP: too opaque for actuarial review

Hyperparameter Optimization

Hyperopt's Bayesian optimization searched over learning rate, max depth, subsample, colsample, and regularization parameters. Every tuning run was tracked in MLflow together with its training dataset snapshot, which made experiments reproducible and comparable and avoided the common question of "which run was the best one?"
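
To make the link between a tuning run and its dataset snapshot concrete, here is a minimal in-memory sketch. It is a stand-in for the idea only and does not use the MLflow API: `dataset_fingerprint`, `RunLog`, and the sample parameters are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Stable SHA-256 hash of the training rows, used to tie a run to its data."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class RunLog:
    """In-memory stand-in for an experiment tracker."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, rmse, data_hash):
        """Store the hyperparameters, the metric, and the dataset hash together."""
        self.runs.append({"params": dict(params), "rmse": rmse, "data_hash": data_hash})

    def best_run(self, data_hash):
        """Lowest-RMSE run among runs trained on the same dataset snapshot."""
        candidates = [r for r in self.runs if r["data_hash"] == data_hash]
        if not candidates:
            raise LookupError("no runs logged for this dataset snapshot")
        return min(candidates, key=lambda r: r["rmse"])
```

Filtering by dataset hash before comparing metrics avoids ranking runs that were trained on different data against each other.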

Results & Impact

Strong predictive performance and meaningful automation gains

R² 0.982

Final model fit (coefficient of determination)

RMSE 7216

Root mean square error

MAE $29.10

Mean absolute error

26%

RMSE improvement vs GLM

75%

Manual review reduction

100+

Engineered + validated features
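
For reference, the standard definitions of the three reported metrics are given below, with y_i the observed value, ŷ_i the prediction, ȳ the mean of the observed values, and n the number of records. The last line is the usual reading of a relative RMSE improvement; that the 26% figure was computed this way is an assumption.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
\qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}

\text{RMSE improvement} = \frac{\mathrm{RMSE}_{\mathrm{GLM}} - \mathrm{RMSE}_{\mathrm{XGBoost}}}{\mathrm{RMSE}_{\mathrm{GLM}}}
```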

Automation Gains

- 75% reduction in manual iteration / review burden

- 3× improvement in retraining responsiveness

- Schema snapshots and model lineage tracked for auditability

- Drift detection connected to self-healing retraining workflows

- Slack alerts added for operational visibility

- Manual override dashboard for human governance

Business Impact

- Reduced manual validation effort for actuarial team

- Faster retraining response to market changes

- Improved reproducibility across all model runs

- Better model governance with full decision audit trail

- More transparent actuarial review process

- Stronger deployment readiness than notebook-only model

Key Insights & Lessons

The real value: automation with governance

The 75% reduction didn't happen because humans were removed from the workflow. It happened because the system moved humans to the right point in the loop — intervening only where judgment matters, not at every routine validation step.

“Instead of asking actuaries or analysts to manually rebuild and inspect every step, the pipeline handles routine validation, training, logging, and alerting automatically. Humans intervene only when judgment matters: drift remediation, suspicious model behavior, hyperparameter override, rollback, or production promotion.”

Core Design Principle

Schema validation is not optional in production ML

It prevents silent failures from cascading into bad model training. Data schema changes can break downstream training without any visible error until the model produces wrong outputs.

Monitoring matters as much as modeling

Prometheus-style metrics and alerting helped expose system-level problems earlier. A model that works but can't be observed isn't production-ready.

Human oversight increases trust

Actuary involvement and manual approval gates made the automation more credible to stakeholders, not less. Governance isn't overhead — it's what makes automation trustworthy.

Experiment tracking is essential

MLflow made it easier to compare runs, preserve model lineage, and reproduce results. Without it, the question of "which configuration produced this model?" has no good answer.

Drift response must be governed

Automatic retraining is powerful, but unsafe without approval, rollback, and explanation. Self-healing workflows need human checkpoints before they touch production.

Future Work

Where the platform goes next

Real-time pipeline

Event-driven architecture for faster scoring and streaming updates rather than batch ingestion cycles.

Multi-model peril registry

Separate model tracks for water, wind/hail, fire, and property-loss categories rather than a single monolithic model.

AWS SageMaker integration

Managed training, hosted endpoints, model monitoring, and registry workflows through SageMaker to reduce operational overhead.

Fairness automation

AIF360 or Fairlearn integration to continuously audit model outputs for demographic fairness and flag bias drift.

AI fix-proposal sandbox

Secure sandbox where AI agents can propose drift fixes before human approval — explanation tracing included.

Cohort-level dashboard

Actual-vs-predicted premium overlays and cohort-level variance views to give actuaries richer inspection of model behavior by risk segment.

Grange Insurance × University of Toledo · 2024–2025

Read the full technical report

The final report covers methodology, feature engineering details, model evaluation, pipeline architecture diagrams, and governance framework in full.
