Model Architecture

Factor analysis uses XGBoost gradient boosting models to predict RoL and SHAP values to attribute predictions to individual features. This page documents the model architecture, training process, and key design decisions.

Model Type

The system trains three separate XGBoost regression models; currently, only the Multi-Family asset class is supported:

| Model | Coverage Types | Additional Features |
| --- | --- | --- |
| Package Model | Package policies | Property + Liability features |
| Property Model | Property-only policies | Property-specific features |
| Liability Model | Liability-only policies | Liability-specific features |

Separate models allow coverage-type-specific features to be included only where relevant, improving both prediction accuracy and attribution clarity.

Why XGBoost?

XGBoost was selected for several reasons:

  • Mixed feature handling: Naturally handles numeric, categorical, and boolean features without extensive preprocessing
  • Non-linear relationships: Decision trees capture threshold effects and interactions that linear models miss
  • Interaction detection: Tree ensembles automatically discover feature interactions
  • SHAP compatibility: TreeExplainer provides exact, efficient SHAP value computation
  • Industry standard: Well-established for tabular data with extensive validation

Target Variable

Log-Transformed RoL

Models predict log(RoL) rather than raw RoL. Log transformation is used because:

  • Distribution normalization: RoL has a right-skewed distribution; log transform produces approximately normal residuals
  • Improved fit: Log transformation reduces the influence of outliers and improves model accuracy
  • Multiplicative interpretation: SHAP values in log space translate to percentage effects, which are more intuitive than absolute dollar impacts

When reporting, SHAP values are converted back to percentage impacts using: % change = (exp(SHAP) - 1) × 100
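As a minimal sketch of that conversion (the function name `shap_to_pct` is illustrative, not part of the system):

```python
import math

def shap_to_pct(shap_log: float) -> float:
    """Convert a SHAP value in log(RoL) space to a percentage impact on RoL."""
    return (math.exp(shap_log) - 1) * 100

# A SHAP value of +0.10 in log space is roughly a +10.5% effect on RoL
print(round(shap_to_pct(0.10), 1))  # → 10.5
```

Note that the mapping is multiplicative, so effects compose: summing SHAP values in log space before exponentiating gives the combined percentage effect.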

Feature Engineering

Numeric Features

Numeric features are used directly without transformation. The model handles scale differences internally through tree splits.

| Feature | Source |
| --- | --- |
| Unit Count | Direct from data |
| Year Built | Direct from data |
| Stories | Direct from data |
| Annual Gross Rent | Direct from data |
| RCV per Door | Calculated: RCV / Unit Count |
| EGI per Door | Calculated: EGI / Unit Count |
| RCV of Existing Structure | Direct from data (property models) |
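The two calculated features can be derived with a couple of vectorized divisions. A minimal sketch with pandas (the frame and column names here are hypothetical, not the system's actual schema):

```python
import pandas as pd

# Hypothetical policy frame; column names are illustrative only
policies = pd.DataFrame({
    "rcv": [12_000_000, 8_500_000],
    "egi": [1_400_000, 900_000],
    "unit_count": [120, 85],
})

# Per-door features normalize building value and income by unit count,
# making policies of different sizes comparable
policies["rcv_per_door"] = policies["rcv"] / policies["unit_count"]
policies["egi_per_door"] = policies["egi"] / policies["unit_count"]
```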

Categorical Features

Categorical features are label-encoded into integer codes for XGBoost consumption. Label encoding imposes an arbitrary ordinal order on the categories, but tree-based learning handles this well in practice by learning splits that isolate the relevant categories.

| Feature | Cardinality | Encoding |
| --- | --- | --- |
| State | ~50 | Label encoded |
| ZIP Prefix | ~900 | Label encoded |
| Carrier | Variable | Label encoded |
| Brokerage | Variable | Label encoded |
| Policy Structure | ~10 | Label encoded |
| AM Best Rating | ~15 | Label encoded |
| Product Type | Variable | Label encoded |
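Label encoding simply maps each distinct category to an integer. A dependency-free sketch of the idea (equivalent to scikit-learn's LabelEncoder, which assigns codes in sorted order):

```python
states = ["TX", "FL", "TX", "CA"]

# Build a code for each distinct category, in sorted order: CA=0, FL=1, TX=2
mapping = {value: code for code, value in enumerate(sorted(set(states)))}
codes = [mapping[s] for s in states]  # → [2, 1, 2, 0]
```

The same mapping must be persisted and reused at prediction time so that a given category always receives the same code.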

Boolean Features

Coverage flags are encoded as 0/1 integers:

  • Wind/Hail Coverage
  • Named Storm Coverage
  • Terrorism Property Coverage
  • SFHA (Special Flood Hazard Area)
  • General Liability Coverage
  • Umbrella Liability Coverage
  • Terrorism Liability Coverage

Regularization Parameters

Models use aggressive regularization to prevent overfitting to noise. These parameters are more conservative than typical XGBoost defaults, prioritizing model stability and generalization over pure predictive performance:

| Parameter | Value | Purpose |
| --- | --- | --- |
| n_estimators | 2,000 | Maximum boosting rounds; actual count determined by early stopping |
| max_depth | 3 | Very shallow trees limit complexity; captures at most 3-way interactions |
| min_child_weight | 10 | Requires minimum 10 samples per leaf; prevents fitting noise in small subgroups |
| learning_rate | 0.02 | Very conservative learning rate; combined with early stopping finds stable optimum |
| subsample | 0.6 | 60% row subsampling adds randomness and reduces variance |
| colsample_bytree | 0.6 | 60% column subsampling forces feature diversity across trees |
| gamma | 1.0 | Minimum loss reduction required for splits; penalizes complex trees |
| reg_alpha | 0.5 | L1 regularization promotes feature sparsity |
| reg_lambda | 3.0 | L2 regularization smooths leaf weights, reducing extreme predictions |
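Collected as keyword arguments, the table above maps directly onto an `xgboost.XGBRegressor` constructor. A sketch (the dict name `XGB_PARAMS` is illustrative; the xgboost package is assumed available in the training environment):

```python
# Regularization settings from the table above, as XGBRegressor keyword arguments
XGB_PARAMS = {
    "n_estimators": 2000,       # upper bound; early stopping picks the real count
    "max_depth": 3,
    "min_child_weight": 10,
    "learning_rate": 0.02,
    "subsample": 0.6,
    "colsample_bytree": 0.6,
    "gamma": 1.0,
    "reg_alpha": 0.5,
    "reg_lambda": 3.0,
}

# import xgboost
# model = xgboost.XGBRegressor(objective="reg:squarederror", **XGB_PARAMS)
```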

Statistical Rationale

The combination of shallow trees (max_depth=3), aggressive subsampling (60% rows and columns), and strong regularization (reg_alpha=0.5, reg_lambda=3.0) produces a model that:

  • Generalizes well to unseen peer groups
  • Produces stable SHAP attributions that don’t change dramatically with small data changes
  • Avoids overfitting to noise in smaller peer groups
  • Captures the primary pricing relationships without memorizing idiosyncratic patterns

Training Process

Data Splitting

Training data is split into training (80%) and validation (20%) sets. The validation set is used for early stopping to find optimal model complexity.

Early Stopping

Training continues until validation loss stops improving for 40 consecutive rounds. This prevents overfitting by stopping before the model memorizes training data noise.

After early stopping identifies the optimal number of boosting rounds, the model is retrained on the full dataset using that round count to maximize data utilization.
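The early-stopping rule can be made concrete with a small helper that scans per-round validation losses. This is an illustrative reimplementation of the logic, not the system's code; in practice XGBoost does this internally when given an early-stopping round count of 40:

```python
def best_round(val_losses, patience=40):
    """Return the 1-based round with the lowest validation loss, stopping the
    scan once `patience` consecutive rounds fail to improve on the best loss."""
    best_loss, best_i, rounds_since_improve = float("inf"), 0, 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_i, rounds_since_improve = loss, i, 0
        else:
            rounds_since_improve += 1
            if rounds_since_improve >= patience:
                break  # early stop: no improvement for `patience` rounds
    return best_i

# Loss improves through round 3, then stalls → optimal round count is 3
print(best_round([0.50, 0.40, 0.30, 0.35, 0.34], patience=2))  # → 3
```

The round count returned here is what the final retrain on the full dataset uses as its fixed number of boosting rounds.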

Output

For each peer group query, the trained model produces:

  1. Predicted RoL for each policy in the peer group
  2. SHAP values for each feature for each policy
  3. Aggregated importance scores across the peer group
  4. Value-level breakdowns for categorical and binned numeric features

These outputs are then transformed into the factor metrics displayed in the application.
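Output 3 above, aggregated importance, is conventionally computed as the mean absolute SHAP value per feature across the peer group. A sketch with hypothetical numbers (the matrix and feature names are illustrative; per-policy SHAP values would come from `shap.TreeExplainer` in practice):

```python
import numpy as np

# Hypothetical SHAP matrix: rows = policies in the peer group, cols = features
shap_values = np.array([
    [0.12, -0.05, 0.02],
    [0.08, -0.10, 0.01],
])
features = ["rcv_per_door", "year_built", "stories"]

# Peer-group importance: mean absolute SHAP value per feature, so positive
# and negative per-policy effects don't cancel out
importance = dict(zip(features, np.abs(shap_values).mean(axis=0)))
```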

Last updated on