MLOps · Insurance AI · Risk Modeling · Senior Design · 2024–2025

Homeowner Loss History Prediction

A production-style MLOps system for homeowner loss prediction that turns a manual actuarial modeling workflow into an automated, auditable, human-in-the-loop AI pipeline. Built with Grange Insurance × University of Toledo.

R² 0.982 · Coefficient of determination
RMSE 7216 · Root mean square error
26% · Lower RMSE than the GLM baseline
75% · Less manual review
MAE $29.10 · Mean absolute error
100+ · Engineered features
Overview

From notebook workflow to governed, continuously operating AI platform

Homeowner Loss History Prediction was a senior design project developed in partnership with Grange Insurance to modernize how homeowner risk and pure premium predictions are modeled, validated, monitored, and updated. Instead of treating machine learning as a one-time notebook workflow, the project reframed risk modeling as a continuously operating MLOps system.

The system automates the full lifecycle: raw data ingestion, schema validation, preprocessing, feature engineering, drift detection, model training, hyperparameter tuning, model evaluation, experiment tracking, monitoring, alerting, and human-approved deployment.

The key contribution was not only better model accuracy. The larger engineering contribution was the governed ML operating system built around the model: every dataset, schema, model run, drift event, metric, and manual decision can be tracked, reviewed, and reproduced.

Figure: MLOps pipeline architecture
The Problem

Risk patterns shift faster than manual modeling can adapt

Insurance companies need accurate homeowner risk models for pricing, underwriting, and financial stability. However, many modeling workflows are still too manual to keep pace with modern claim volatility: inflation, weather events, and regional property-risk trends all create windows in which stale models drive pricing decisions.

Manual rebuilds are slow

Model updates required manual data pulls, notebook-based preprocessing, and repeated rebuilds — consuming actuarial time that should be focused on interpretation.

Silent schema failures

Data schema changes could silently break downstream training with no validation layer catching type mismatches or missing columns before they corrupted model runs.

No drift detection

Model drift was difficult to detect early without automated monitoring. Stale models could price risk for weeks before a human noticed performance degradation.

No experiment lineage

Without tracking, experiments and model versions were hard to reproduce. There was no audit trail for when a model was promoted, what data it saw, or who approved it.

Interpretability gap

Business stakeholders and actuaries needed explainability, not just predictions. A black-box model creates regulatory and trust problems in insurance.

No governance layer

High-stakes insurance decisions require human oversight, auditability, and the ability to roll back. A notebook-only workflow provides none of these guarantees.

System Architecture

Modular MLOps pipeline — end-to-end

The architecture was built as a modular MLOps workflow instead of a single monolithic script. Each stage is independently testable, observable, and replaceable.

01
Raw Dataset → AWS S3
Raw homeowner loss data stored in S3 and pulled into the pipeline by Airflow.
02
Data Ingestion → Airflow
Airflow schedules ingestion jobs, handles file staging, and coordinates downstream tasks.
03
Schema Validation → Pandera
Validates required columns, data types, and constraints. Logs and schema snapshots stored for auditability.
04
Preprocessing → Pandas / sklearn
Missing values, outliers, encoding, feature transformations, and train/test splitting.
05
Drift Detection
Statistical drift checks against reference distributions. Detected drift triggers alerts and proposes remediation actions.
06
Model Training → XGBoost
XGBoost models trained on 100+ engineered loss-history features with monotonic constraints.
07
Hyperparameter Optimization → Hyperopt
Bayesian optimization searches for improved configurations and logs the best parameters to MLflow.
08
Model Tracking → MLflow
Tracks experiments, metrics, artifacts, model versions, and registry transitions across every run.
09
Monitoring & Alerts → Prometheus + Slack
Prometheus tracks system metrics. Slack sends alerts for drift, failures, retraining events, and approval requests.
10
Human-in-the-Loop Governance
Manual approval gates allow analysts to approve retraining, reject proposed fixes, override decisions, or promote models.
11
Deployment → AWS EC2/S3 + CI/CD
Containerized services deployed on AWS EC2, with S3 for storage and CI/CD support for repeatable builds and updates.
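The stages above can be sketched as small functions that share a context and run in order, so each one can be unit-tested or swapped on its own and any stage can halt the run before training. This is a minimal stdlib sketch rather than the project's Airflow DAG: the stage names, the `ctx` keys, the `StageHalt` exception, and the mean-loss placeholder model are all hypothetical, and in the real system Airflow handles scheduling, retries, and task state.

```python
from typing import Callable


class StageHalt(Exception):
    """Raised by a stage to stop the run before later stages (e.g. training) execute."""


Stage = Callable[[dict], dict]


def validate_schema(ctx: dict) -> dict:
    # Hypothetical rule: every record must carry these columns.
    required = {"policy_id", "loss_amount"}
    for row in ctx["rows"]:
        missing = required.difference(row)
        if missing:
            raise StageHalt(f"schema: missing columns {sorted(missing)}")
    return ctx


def preprocess(ctx: dict) -> dict:
    # Hypothetical cleaning rule: drop records with no recorded loss amount.
    ctx["rows"] = [r for r in ctx["rows"] if r["loss_amount"] is not None]
    if not ctx["rows"]:
        raise StageHalt("preprocess: no rows left to train on")
    return ctx


def train(ctx: dict) -> dict:
    # Placeholder "model" (the mean loss) standing in for XGBoost training.
    losses = [r["loss_amount"] for r in ctx["rows"]]
    ctx["model"] = sum(losses) / len(losses)
    return ctx


def run_pipeline(ctx: dict, stages: list[Stage]) -> tuple[dict, list[str]]:
    """Run stages in order and record which ones completed.

    A StageHalt propagates to the caller so the orchestrator can mark the run failed.
    """
    completed: list[str] = []
    for stage in stages:
        ctx = stage(ctx)
        completed.append(stage.__name__)
    return ctx, completed
```

Because each stage is a plain function of the context, a test can call `validate_schema` alone with a hand-built batch; in Airflow, each function would map naturally onto its own task.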
Technical Stack

Purpose-selected tools for each layer

Machine Learning
XGBoost
Hyperopt / Bayesian Opt.
scikit-learn
SHAP explainability
Decile lift charts
RMSE · MAE · R²
MLOps & Automation
Apache Airflow
MLflow
Pandera
Prometheus
Slack API
GitHub Actions
Docker
Cloud & Storage
AWS EC2
AWS S3
IAM-style access control
Containerized deployment
CI/CD pipeline
Data & Features
Pandas
NumPy
100+ engineered features
Schema snapshots
Drift reference distributions
Monotonic constraints
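Decile lift charts, listed above under Machine Learning, are a standard actuarial check: sort policies by predicted loss, cut them into ten equal-size buckets, and compare average predicted and actual loss per bucket; a model that ranks risk well shows actual loss rising from the first decile to the tenth. A minimal stdlib sketch, assuming equal-count deciles and plain lists of predicted and actual values; it computes the chart's data only, not the project's plotting code.

```python
from typing import Sequence


def decile_lift(predicted: Sequence[float], actual: Sequence[float],
                n_bins: int = 10) -> list[dict]:
    """Group records into n_bins equal-count buckets by predicted value and
    report mean predicted vs. mean actual per bucket, lowest risk first."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if len(predicted) < n_bins:
        raise ValueError("need at least one record per bin")
    pairs = sorted(zip(predicted, actual), key=lambda p: p[0])
    n = len(pairs)
    table = []
    for b in range(n_bins):
        # Integer boundaries spread any remainder across buckets; none is empty.
        lo, hi = b * n // n_bins, (b + 1) * n // n_bins
        bucket = pairs[lo:hi]
        table.append({
            "decile": b + 1,
            "mean_predicted": sum(p for p, _ in bucket) / len(bucket),
            "mean_actual": sum(a for _, a in bucket) / len(bucket),
        })
    return table
```

Plotting `mean_predicted` and `mean_actual` against `decile` gives the lift chart; a flat or non-increasing `mean_actual` curve signals that the model is not separating low- and high-risk policies.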
Model Design

Why XGBoost — and what it took to govern it

XGBoost was chosen because it offered the best balance between predictive performance, explainability, speed, and insurance-industry practicality — not just the highest raw metric.

Why XGBoost
Strong performance on structured/tabular insurance data
Supports monotonic constraints for logically consistent risk modeling
Works well with SHAP for feature-level explainability
More governable than deep neural networks
Scales to production without infrastructure overhead
Alternatives Considered
GLM pipeline: baseline; XGBoost reduced RMSE by 26%
Random Forest: slower, less interpretable
LightGBM: strong, but less of an insurance-industry standard
CatBoost: good, but SHAP integration less mature
Deep Learning / MLP: too opaque for actuarial review
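The monotonic-constraint point can be made concrete. XGBoost's `monotone_constraints` parameter takes one entry per feature in training-column order (1 = increasing, -1 = decreasing, 0 = unconstrained), for example `"(1,0,-1)"`. The sketch below builds that string and adds a simple perturbation test a reviewer could run against any scoring function; the feature names and directions are hypothetical, not the project's actual constraints.

```python
from typing import Callable, Sequence


def monotone_constraint_string(features: Sequence[str], directions: dict[str, int]) -> str:
    """Build XGBoost's tuple-style monotone_constraints value, one entry per
    feature in training-column order: 1 = increasing, -1 = decreasing, 0 = free."""
    for name, d in directions.items():
        if name not in features:
            raise KeyError(f"constraint on unknown feature {name!r}")
        if d not in (-1, 0, 1):
            raise ValueError(f"direction for {name!r} must be -1, 0, or 1")
    return "(" + ",".join(str(directions.get(f, 0)) for f in features) + ")"


def is_monotone(predict: Callable[[list[float]], float], row: list[float],
                index: int, direction: int, steps: Sequence[float]) -> bool:
    """Check that predictions move in `direction` as feature `index` takes the
    ascending values in `steps`, holding the rest of `row` fixed."""
    preds = []
    for value in steps:
        probe = list(row)
        probe[index] = value
        preds.append(predict(probe))
    pairs = zip(preds, preds[1:])
    if direction == 1:
        return all(a <= b for a, b in pairs)
    if direction == -1:
        return all(a >= b for a, b in pairs)
    return True  # unconstrained features pass trivially
```

The returned string can be passed as `XGBRegressor(monotone_constraints=...)`; the perturbation check is model-agnostic, so it can also confirm the trained model's behavior on sample policies during review.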
Hyperparameter Optimization

Hyperopt with Bayesian optimization searched across learning rate, max depth, subsample, colsample, and regularization parameters. Every tuning run was tracked in MLflow with the associated training dataset snapshot, enabling reproducible comparison across experiments and preventing the common problem of "which run was the best one?"
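The lineage idea in this paragraph, tying every tuning run to the exact data it saw, can be sketched without MLflow by fingerprinting the training snapshot and storing it with each run's parameters and score. This is a hypothetical stdlib stand-in: plain random search replaces Hyperopt's Bayesian search, and the `RunRecord` fields, the search ranges, and the toy objective are illustrative assumptions. In MLflow, the same fields would be recorded per run with `mlflow.log_params`, `mlflow.log_metric`, and `mlflow.set_tag`.

```python
import hashlib
import json
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunRecord:
    run_id: int
    params: dict
    rmse: float
    data_sha256: str  # fingerprint of the exact training snapshot this run saw


def fingerprint(rows: list[dict]) -> str:
    """Stable SHA-256 of a dataset snapshot; key order inside rows is normalized."""
    blob = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


# Hypothetical search ranges over a subset of the parameters named above.
SPACE = {
    "max_depth": [3, 4, 5, 6, 8],
    "learning_rate": [0.01, 0.03, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}


def tune(rows: list[dict], objective: Callable[[dict], float],
         n_trials: int = 20, seed: int = 0) -> tuple[RunRecord, list[RunRecord]]:
    """Random search (a stand-in for Hyperopt's Bayesian search) that stamps
    every run with the data fingerprint, so the best run is traceable to its data."""
    rng = random.Random(seed)  # seeded so the search itself is reproducible
    snapshot = fingerprint(rows)
    runs = []
    for i in range(n_trials):
        params = {name: rng.choice(values) for name, values in SPACE.items()}
        runs.append(RunRecord(i, params, objective(params), snapshot))
    best = min(runs, key=lambda r: r.rmse)
    return best, runs
```

With the fingerprint stored on every run, "which run was the best one?" becomes a query over recorded runs filtered by the snapshot hash, rather than a question of memory.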

Results & Impact

Strong predictive performance and meaningful automation gains

R² 0.982 · Final model coefficient of determination
RMSE 7216 · Root mean square error
MAE $29.10 · Mean absolute error
26% · RMSE improvement vs GLM
75% · Manual review reduction
100+ · Engineered and validated features
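The headline metrics follow standard definitions, sketched below in plain Python for lists of actual and predicted loss. The `relative_improvement` helper assumes the 26% figure is measured relative to the GLM's RMSE, i.e. (GLM RMSE − XGBoost RMSE) / GLM RMSE; the values in any example call are illustrative, not project data.

```python
import math
from typing import Sequence


def _check(y: Sequence[float], yhat: Sequence[float]) -> None:
    if len(y) != len(yhat) or not y:
        raise ValueError("need equal-length, non-empty inputs")


def rmse(y: Sequence[float], yhat: Sequence[float]) -> float:
    """Root mean square error: sqrt(mean((y - yhat)^2))."""
    _check(y, yhat)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))


def mae(y: Sequence[float], yhat: Sequence[float]) -> float:
    """Mean absolute error: mean(|y - yhat|)."""
    _check(y, yhat)
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)


def r_squared(y: Sequence[float], yhat: Sequence[float]) -> float:
    """1 - SS_res / SS_tot: share of the variance in y explained by the predictions."""
    _check(y, yhat)
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fractional error reduction vs. a baseline, e.g. 0.26 for a 26% lower RMSE."""
    return (baseline_error - model_error) / baseline_error
```

RMSE squares each error before averaging, so a few very large losses can push it far above MAE; reporting both shows typical error and tail error side by side.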
Automation Gains
75% reduction in manual iteration / review burden
3× improvement in retraining responsiveness
Schema snapshots and model lineage tracked for auditability
Drift detection connected to self-healing retraining workflows
Slack alerts added for operational visibility
Manual override dashboard for human governance
Business Impact
Reduced manual validation effort for actuarial team
Faster retraining response to market changes
Improved reproducibility across all model runs
Better model governance with full decision audit trail
More transparent actuarial review process
Stronger deployment readiness than notebook-only model
Key Insights & Lessons

The real value: automation with governance

The 75% reduction didn't happen because humans were removed from the workflow. It happened because the system moved humans to the right point in the loop — intervening only where judgment matters, not at every routine validation step.

“Instead of asking actuaries or analysts to manually rebuild and inspect every step, the pipeline handles routine validation, training, logging, and alerting automatically. Humans intervene only when judgment matters: drift remediation, suspicious model behavior, hyperparameter override, rollback, or production promotion.”
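The split between routine work and judgment calls quoted above can be written down as an explicit routing table, so every event type has a documented owner. A hypothetical sketch: the human-review event names mirror the interventions listed in the quote, while the `route` function, the `Handler` enum, and the auto-handled event names are illustrative assumptions rather than the project's configuration.

```python
from enum import Enum


class Handler(Enum):
    AUTO = "auto"    # pipeline handles it and logs the outcome
    HUMAN = "human"  # pipeline pauses and requests approval (e.g. via a Slack alert)


# Routine validation, training, logging, and alerting run unattended.
AUTO_EVENTS = {"schema_validated", "training_completed", "metrics_logged", "alert_sent"}

# Judgment calls named in the quote; these always go to a person.
HUMAN_EVENTS = {
    "drift_remediation",
    "suspicious_model_behavior",
    "hyperparameter_override",
    "rollback",
    "production_promotion",
}


def route(event: str) -> Handler:
    """Decide who handles a pipeline event."""
    if event in HUMAN_EVENTS:
        return Handler.HUMAN
    if event in AUTO_EVENTS:
        return Handler.AUTO
    # Unknown events fail safe: escalate to a person rather than act silently.
    return Handler.HUMAN
```

Defaulting unknown events to a human keeps newly added pipeline behavior from bypassing review until someone deliberately classifies it as routine.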
Core Design Principle
Schema validation is not optional in production ML

It prevents silent failures from cascading into bad model training. Data schema changes can break downstream training without any visible error until the model produces wrong outputs.
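The kind of check Pandera performs can be sketched in plain Python to show why failing fast matters: every violation in a batch is collected and reported before training runs, instead of surfacing later as wrong predictions. This is a stdlib stand-in, not Pandera's API; the column names, types, and bounds are hypothetical. With Pandera, the same rules would be expressed as a `DataFrameSchema` of `Column` objects with `Check`s and validated with `lazy=True` to gather all errors at once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnRule:
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None


# Hypothetical schema for illustration only.
SCHEMA = {
    "policy_id": ColumnRule(int),
    "dwelling_age": ColumnRule(int, min_value=0),
    "loss_amount": ColumnRule(float, nullable=True, min_value=0.0),
}


def validate(rows: list[dict], schema: dict[str, ColumnRule]) -> list[str]:
    """Return every schema violation found; an empty list means the batch passes."""
    errors = []
    for i, row in enumerate(rows):
        for col, rule in schema.items():
            if col not in row:
                errors.append(f"row {i}: missing column {col!r}")
                continue
            value = row[col]
            if value is None:
                if not rule.nullable:
                    errors.append(f"row {i}: {col!r} is null")
                continue
            # bool is a subclass of int in Python, so reject it explicitly.
            if not isinstance(value, rule.dtype) or isinstance(value, bool):
                errors.append(f"row {i}: {col!r} expected {rule.dtype.__name__}, "
                              f"got {type(value).__name__}")
                continue
            if rule.min_value is not None and value < rule.min_value:
                errors.append(f"row {i}: {col!r}={value} below {rule.min_value}")
    return errors
```

A pipeline step would raise when `validate` returns anything, stopping the run with the full list of problems rather than the first one.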

Monitoring matters as much as modeling

Prometheus-style metrics and alerting helped expose system-level problems earlier. A model that works but can't be observed isn't production-ready.
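Prometheus collects metrics by scraping a plain-text endpoint. A minimal sketch of rendering pipeline metrics in the Prometheus text exposition format follows; the metric and label names are hypothetical, and a real service would normally use the official `prometheus_client` library rather than formatting text by hand.

```python
def escape_label_value(value: str) -> str:
    # Label values must escape backslash, double quote, and newline.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one gauge metric family in Prometheus text exposition format."""
    help_escaped = help_text.replace("\\", "\\\\").replace("\n", "\\n")
    lines = [f"# HELP {name} {help_escaped}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            body = ",".join(f'{k}="{escape_label_value(v)}"'
                            for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{body}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

Exposing drift scores and stage outcomes as gauges lets Prometheus alert rules fire on thresholds, which is the hook for routing alerts to Slack.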

Human oversight increases trust

Actuary involvement and manual approval gates made the automation more credible to stakeholders, not less. Governance isn't overhead — it's what makes automation trustworthy.

Experiment tracking is essential

MLflow made it easier to compare runs, preserve model lineage, and reproduce results. Without it, the question of "which configuration produced this model?" has no good answer.

Drift response must be governed

Automatic retraining is powerful, but unsafe without approval, rollback, and explanation. Self-healing workflows need human checkpoints before they touch production.
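A governed drift response can be sketched in two parts: a statistic that flags drift, and a gate that lets drift propose retraining but never promote a model on its own. The project does not name its drift statistic; the Population Stability Index (PSI) used here, the assumed 0.2 alert threshold, the fixed bin edges, and the `Proposal` workflow are illustrative assumptions.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


def psi(reference: Sequence[float], current: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index: sum((a - e) * ln(a / e)) over bins, where e and a
    are the reference and current share of values in each bin (edges sorted)."""
    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the log stays finite.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(reference), shares(current)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


@dataclass
class Proposal:
    """A drift-triggered retraining proposal; it cannot promote itself."""
    feature: str
    score: float
    approved_by: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        self.approved_by = reviewer


def check_drift(feature: str, reference: Sequence[float], current: Sequence[float],
                edges: Sequence[float], threshold: float = 0.2) -> Optional[Proposal]:
    score = psi(reference, current, edges)
    return Proposal(feature, score) if score > threshold else None


def promote(proposal: Proposal) -> str:
    # Governance gate: without a named approver, nothing reaches production.
    if proposal.approved_by is None:
        raise PermissionError("retraining proposal needs human approval before promotion")
    return f"promoted retrained model for {proposal.feature} (approved by {proposal.approved_by})"
```

Recording the approver on the proposal also leaves an audit trail of who authorized each drift-driven change.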

Future Work

Where the platform goes next

Real-time pipeline

Event-driven architecture for faster scoring and streaming updates rather than batch ingestion cycles.

Multi-model peril registry

Separate model tracks for water, wind/hail, fire, and property-loss categories rather than a single monolithic model.

AWS SageMaker integration

Managed training, hosted endpoints, model monitoring, and registry workflows through SageMaker to reduce operational overhead.

Fairness automation

AIF360 or Fairlearn integration to continuously audit model outputs for demographic fairness and flag bias drift.

AI fix-proposal sandbox

Secure sandbox where AI agents can propose drift fixes, with explanation tracing, ahead of human approval.

Cohort-level dashboard

Actual-vs-predicted premium overlays and cohort-level variance views to give actuaries richer inspection of model behavior by risk segment.

Grange Insurance × University of Toledo · 2024–2025

Read the full technical report

The final report covers methodology, feature engineering details, model evaluation, pipeline architecture diagrams, and governance framework in full.
