Model Architecture
Factor analysis uses XGBoost gradient-boosted models to predict rate on line (RoL), and SHAP values to attribute each prediction to individual features. This page documents the model architecture, the training process, and key design decisions.
Model Type
The system trains three separate XGBoost regression models; Multi-Family is currently the only supported line of business:
| Model | Coverage Types | Additional Features |
|---|---|---|
| Package Model | Package policies | Property + Liability features |
| Property Model | Property-only policies | Property-specific features |
| Liability Model | Liability-only policies | Liability-specific features |
Separate models allow coverage-type-specific features to be included only where relevant, improving both prediction accuracy and attribution clarity.
Why XGBoost?
XGBoost was selected for several reasons:
- Mixed feature handling: Naturally handles numeric, categorical, and boolean features without extensive preprocessing
- Non-linear relationships: Decision trees capture threshold effects and interactions that linear models miss
- Interaction detection: Tree ensembles automatically discover feature interactions
- SHAP compatibility: TreeExplainer provides exact, efficient SHAP value computation
- Industry standard: Well-established for tabular data with extensive validation
Target Variable
Log-Transformed RoL
Models predict log(RoL) rather than raw RoL. Log transformation is used because:
- Distribution normalization: RoL has a right-skewed distribution; log transform produces approximately normal residuals
- Improved fit: Log transformation reduces the influence of outliers and improves model accuracy
- Multiplicative interpretation: SHAP values in log space translate to percentage effects, which are more intuitive than absolute dollar impacts
When reporting, SHAP values are converted back to percentage impacts using: % change = (exp(SHAP) - 1) × 100
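The conversion above can be expressed as a one-line helper (a sketch; the function name `shap_to_pct` is illustrative, not from the production code):

```python
import math

def shap_to_pct(shap_value: float) -> float:
    """Convert a SHAP value in log(RoL) space to a percentage impact on RoL."""
    return (math.exp(shap_value) - 1) * 100

# Note the asymmetry of log-space effects when expressed in percent:
# +0.10 in log space is about +10.5% on RoL, while -0.10 is about -9.5%.
```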
Feature Engineering
Numeric Features
Numeric features are used directly without transformation. The model handles scale differences internally through tree splits.
| Feature | Source |
|---|---|
| Unit Count | Direct from data |
| Year Built | Direct from data |
| Stories | Direct from data |
| Annual Gross Rent | Direct from data |
| RCV per Door | Calculated: RCV / Unit Count |
| EGI per Door | Calculated: EGI / Unit Count |
| RCV of Existing Structure | Direct from data (property models) |
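The two calculated per-door features normalize building value and income by unit count, making policies of different sizes comparable. A minimal sketch with hypothetical column names (`rcv`, `egi`, `unit_count`; the production schema may differ):

```python
import pandas as pd

policies = pd.DataFrame({
    "unit_count": [120, 48],
    "rcv": [24_000_000, 7_200_000],  # replacement cost value
    "egi": [1_800_000, 540_000],     # effective gross income
})

# Derived per-door features from the table above.
policies["rcv_per_door"] = policies["rcv"] / policies["unit_count"]
policies["egi_per_door"] = policies["egi"] / policies["unit_count"]
```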
Categorical Features
Categorical features are label-encoded for XGBoost consumption. Label encoding imposes an arbitrary ordering on unordered categories, but tree-based learning handles this well: an ensemble can isolate individual categories by combining multiple splits.
| Feature | Cardinality | Encoding |
|---|---|---|
| State | ~50 | Label encoded |
| ZIP Prefix | ~900 | Label encoded |
| Carrier | Variable | Label encoded |
| Brokerage | Variable | Label encoded |
| Policy Structure | ~10 | Label encoded |
| AM Best Rating | ~15 | Label encoded |
| Product Type | Variable | Label encoded |
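One simple way to produce a label encoding is via pandas category codes (a sketch; the production pipeline may use a different encoder):

```python
import pandas as pd

# Hypothetical categorical column; converting to the pandas "category"
# dtype assigns each distinct value a stable integer code.
states = pd.Series(["TX", "CA", "TX", "FL"], dtype="category")
state_codes = states.cat.codes  # integer labels suitable as XGBoost input
```

For string data, pandas orders the inferred categories alphabetically, so here CA=0, FL=1, TX=2.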
Boolean Features
Coverage flags are encoded as 0/1 integers:
- Wind/Hail Coverage
- Named Storm Coverage
- Terrorism Property Coverage
- SFHA (Special Flood Hazard Area)
- General Liability Coverage
- Umbrella Liability Coverage
- Terrorism Liability Coverage
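The 0/1 encoding of the flags is a direct cast (column names here are illustrative):

```python
import pandas as pd

# Coverage flags arrive as booleans and are cast to 0/1 integers.
flags = pd.DataFrame({"wind_hail": [True, False], "sfha": [False, False]})
encoded = flags.astype(int)
```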
Regularization Parameters
Models use aggressive regularization to prevent overfitting to noise. These parameters are more conservative than typical XGBoost defaults, prioritizing model stability and generalization over pure predictive performance:
| Parameter | Value | Purpose |
|---|---|---|
| n_estimators | 2,000 | Maximum boosting rounds; actual count determined by early stopping |
| max_depth | 3 | Very shallow trees limit complexity; captures at most 3-way interactions |
| min_child_weight | 10 | Requires minimum 10 samples per leaf; prevents fitting noise in small subgroups |
| learning_rate | 0.02 | Very conservative learning rate; combined with early stopping finds stable optimum |
| subsample | 0.6 | 60% row subsampling adds randomness and reduces variance |
| colsample_bytree | 0.6 | 60% column subsampling forces feature diversity across trees |
| gamma | 1.0 | Minimum loss reduction required for splits; penalizes complex trees |
| reg_alpha | 0.5 | L1 regularization promotes feature sparsity |
| reg_lambda | 3.0 | L2 regularization smooths leaf weights, reducing extreme predictions |
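Collected as an XGBoost parameter dictionary, the table reads as follows (a sketch; the production configuration may carry additional options, and the objective shown is an assumption):

```python
# Regularization-heavy XGBoost configuration from the table above.
# n_estimators is a cap; early stopping picks the actual round count.
XGB_PARAMS = {
    "n_estimators": 2000,
    "max_depth": 3,
    "min_child_weight": 10,
    "learning_rate": 0.02,
    "subsample": 0.6,
    "colsample_bytree": 0.6,
    "gamma": 1.0,
    "reg_alpha": 0.5,
    "reg_lambda": 3.0,
    "objective": "reg:squarederror",  # assumed; the target is log(RoL)
}
```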
Statistical Rationale
The combination of shallow trees (max_depth=3), aggressive subsampling (60% rows and columns), and strong regularization (reg_alpha=0.5, reg_lambda=3.0) produces a model that:
- Generalizes well to unseen peer groups
- Produces stable SHAP attributions that don’t change dramatically with small data changes
- Avoids overfitting to noise in smaller peer groups
- Captures the primary pricing relationships without memorizing idiosyncratic patterns
Training Process
Data Splitting
Training data is split into training (80%) and validation (20%) sets. The validation set is used for early stopping to find optimal model complexity.
Early Stopping
Training continues until validation loss stops improving for 40 consecutive rounds. This prevents overfitting by stopping before the model memorizes training data noise.
After early stopping identifies the optimal number of boosting rounds, the model is retrained on the full dataset using that round count to maximize data utilization.
Output
For each peer group query, the trained model produces:
- Predicted RoL for each policy in the peer group
- SHAP values for each feature for each policy
- Aggregated importance scores across the peer group
- Value-level breakdowns for categorical and binned numeric features
These outputs are then transformed into the factor metrics displayed in the application.