Model Architecture
Factor analysis uses XGBoost gradient-boosted models to predict rate on line (RoL), and SHAP values to attribute each prediction to individual features. This page documents the model architecture, the training process, and key design decisions.
Model Type
The system trains three separate XGBoost regression models; Multi-Family is currently the only supported line of business:
| Model | Coverage Types | Additional Features |
|---|---|---|
| Package Model | Package policies | Property + Liability features |
| Property Model | Property-only policies | Property-specific features |
| Liability Model | Liability-only policies | Liability-specific features |
Separate models allow coverage-type-specific features to be included only where relevant, improving both prediction accuracy and attribution clarity.
Why XGBoost?
XGBoost was selected for several reasons:
- Mixed feature handling: Naturally handles numeric, categorical, and boolean features without extensive preprocessing
- Non-linear relationships: Decision trees capture threshold effects and interactions that linear models miss
- Interaction detection: Tree ensembles automatically discover feature interactions
- SHAP compatibility: TreeExplainer provides exact, efficient SHAP value computation
- Industry standard: Well-established for tabular data with extensive validation
Target Variable
Log-Transformed RoL
Models predict log(RoL) rather than raw RoL. Log transformation is used because:
- Distribution normalization: RoL has a right-skewed distribution; log transform produces approximately normal residuals
- Improved fit: Log transformation reduces the influence of outliers and improves model accuracy
- Multiplicative interpretation: SHAP values in log space translate to percentage effects, which are more intuitive than absolute dollar impacts
When reporting, SHAP values are converted back to percentage impacts using: % change = (exp(SHAP) - 1) × 100
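The conversion above can be expressed as a one-line helper (a sketch; the function name `shap_to_pct` is illustrative, not from the production code):

```python
import math

def shap_to_pct(shap_value: float) -> float:
    """Convert a SHAP value in log(RoL) space to a percentage impact on RoL."""
    return (math.exp(shap_value) - 1) * 100

# Note the asymmetry of log-space effects when expressed in percent:
# +0.10 in log space is about +10.5% on RoL, while -0.10 is about -9.5%.
```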
Feature Engineering
Numeric Features
Numeric features are used directly without transformation. The model handles scale differences internally through tree splits.
| Feature | Source |
|---|---|
| Unit Count | Direct from data |
| Year Built | Direct from data |
| Stories | Direct from data |
| Annual Gross Rent | Direct from data |
| RCV per Door | Calculated: RCV / Unit Count |
| EGI per Door | Calculated: EGI / Unit Count |
| RCV of Existing Structure | Direct from data (property models) |
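The two calculated per-door features normalize building value and income by unit count, making policies of different sizes comparable. A minimal sketch with hypothetical column names (`rcv`, `egi`, `unit_count`; the production schema may differ):

```python
import pandas as pd

policies = pd.DataFrame({
    "unit_count": [120, 48],
    "rcv": [24_000_000, 7_200_000],  # replacement cost value
    "egi": [1_800_000, 540_000],     # effective gross income
})

# Derived per-door features from the table above.
policies["rcv_per_door"] = policies["rcv"] / policies["unit_count"]
policies["egi_per_door"] = policies["egi"] / policies["unit_count"]
```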
Categorical Features
Categorical features are label-encoded for XGBoost consumption. Label encoding imposes an arbitrary ordering on unordered categories, but tree-based learning handles this well: an ensemble can isolate individual categories by combining multiple splits.
| Feature | Cardinality | Encoding |
|---|---|---|
| State | ~50 | Label encoded |
| ZIP Prefix | ~900 | Label encoded |
| Carrier | Variable | Label encoded |
| Brokerage | Variable | Label encoded |
| Policy Structure | ~10 | Label encoded |
| AM Best Rating | ~15 | Label encoded |
| Product Type | Variable | Label encoded |
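One simple way to produce a label encoding is via pandas category codes (a sketch; the production pipeline may use a different encoder):

```python
import pandas as pd

# Hypothetical categorical column; converting to the pandas "category"
# dtype assigns each distinct value a stable integer code.
states = pd.Series(["TX", "CA", "TX", "FL"], dtype="category")
state_codes = states.cat.codes  # integer labels suitable as XGBoost input
```

For string data, pandas orders the inferred categories alphabetically, so here CA=0, FL=1, TX=2.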
Boolean Features
Coverage flags are encoded as 0/1 integers:
- Wind/Hail Coverage
- Named Storm Coverage
- Terrorism Property Coverage
- SFHA (Special Flood Hazard Area)
- General Liability Coverage
- Umbrella Liability Coverage
- Terrorism Liability Coverage
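The 0/1 encoding of the flags is a direct cast (column names here are illustrative):

```python
import pandas as pd

# Coverage flags arrive as booleans and are cast to 0/1 integers.
flags = pd.DataFrame({"wind_hail": [True, False], "sfha": [False, False]})
encoded = flags.astype(int)
```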
Regularization Parameters
Models use aggressive regularization to prevent overfitting to noise. These parameters are more conservative than typical XGBoost defaults, prioritizing model stability and generalization over pure predictive performance:
| Parameter | Value | Purpose |
|---|---|---|
| n_estimators | 2,000 | Maximum boosting rounds; actual count determined by early stopping |
| max_depth | 3 | Very shallow trees limit complexity; captures at most 3-way interactions |
| min_child_weight | 10 | Requires minimum 10 samples per leaf; prevents fitting noise in small subgroups |
| learning_rate | 0.02 | Very conservative learning rate; combined with early stopping finds stable optimum |
| subsample | 0.6 | 60% row subsampling adds randomness and reduces variance |
| colsample_bytree | 0.6 | 60% column subsampling forces feature diversity across trees |
| gamma | 1.0 | Minimum loss reduction required for splits; penalizes complex trees |
| reg_alpha | 0.5 | L1 regularization promotes feature sparsity |
| reg_lambda | 3.0 | L2 regularization smooths leaf weights, reducing extreme predictions |
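Collected as an XGBoost parameter dictionary, the table reads as follows (a sketch; the production configuration may carry additional options, and the objective shown is an assumption):

```python
# Regularization-heavy XGBoost configuration from the table above.
# n_estimators is a cap; early stopping picks the actual round count.
XGB_PARAMS = {
    "n_estimators": 2000,
    "max_depth": 3,
    "min_child_weight": 10,
    "learning_rate": 0.02,
    "subsample": 0.6,
    "colsample_bytree": 0.6,
    "gamma": 1.0,
    "reg_alpha": 0.5,
    "reg_lambda": 3.0,
    "objective": "reg:squarederror",  # assumed; the target is log(RoL)
}
```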
Statistical Rationale
The combination of shallow trees (max_depth=3), aggressive subsampling (60% rows and columns), and strong regularization (reg_alpha=0.5, reg_lambda=3.0) produces a model that:
- Generalizes well to unseen peer groups
- Produces stable SHAP attributions that don’t change dramatically with small data changes
- Avoids overfitting to noise in smaller peer groups
- Captures the primary pricing relationships without memorizing idiosyncratic patterns
Training Process
Data Splitting
Training data is split into training (80%) and validation (20%) sets. The validation set is used for early stopping to find optimal model complexity.
Early Stopping
Training continues until validation loss stops improving for 40 consecutive rounds. This prevents overfitting by stopping before the model memorizes training data noise.
After early stopping identifies the optimal number of boosting rounds, the model is retrained on the full dataset using that round count to maximize data utilization.
Output
For each peer group query, the trained model produces:
- Predicted RoL for each policy in the peer group
- SHAP values for each feature for each policy
- Aggregated importance scores across the peer group
- Value-level breakdowns for categorical and binned numeric features
These outputs are then transformed into the factor metrics displayed in the application.