Feature Selection
The factors included in the pricing attribution model were selected based on four criteria: data availability, predictive power, business interpretability, and non-redundancy.
Selection Criteria
Data Availability
Only features consistently captured across policies in the terminal data can be included. Features with high missing rates or inconsistent definitions across data sources are excluded to ensure model reliability.
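The availability screen can be sketched as a missing-rate filter. This is a minimal illustration, not the project's actual pipeline: the 20% cutoff, the field names, and the `roof_type` column are assumptions made up for the example.

```python
def missing_rate(records, field):
    """Share of records where `field` is absent, None, or an empty string."""
    if not records:
        return 1.0  # no data at all: treat the field as fully missing
    missing = sum(1 for r in records if r.get(field) in (None, ""))
    return missing / len(records)


def available_features(records, candidates, max_missing=0.20):
    """Keep candidate fields whose missing rate is at or below the cutoff.

    The 0.20 default is an assumed threshold, not one stated in this document.
    """
    return [f for f in candidates if missing_rate(records, f) <= max_missing]


# Toy policy records; "roof_type" is a hypothetical, sparsely captured field.
policies = [
    {"unit_count": 120, "year_built": 1998, "roof_type": None},
    {"unit_count": 48, "year_built": 2005, "roof_type": "shingle"},
    {"unit_count": 200, "year_built": None, "roof_type": None},
    {"unit_count": 75, "year_built": 1987, "roof_type": ""},
    {"unit_count": 310, "year_built": 2011, "roof_type": None},
]

kept = available_features(policies, ["unit_count", "year_built", "roof_type"])
print(kept)  # ['unit_count', 'year_built']
```

A check for inconsistent definitions across data sources would need source-specific rules and is not covered by this sketch.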
Predictive Power
Features must show a measurable correlation with rate on line (RoL) in exploratory analysis. Non-predictive features add noise without improving explanatory power.
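One way such a screen could look for numeric features is a simple Pearson correlation against RoL. The document does not say which statistic the exploratory analysis used, so Pearson, the 0.3 cutoff, and all values below are assumptions for illustration. Categorical features would need a different measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


# Toy values invented for this example.
rol = [0.10, 0.12, 0.15, 0.18, 0.22]
candidates = {
    "year_built": [2015, 2008, 1999, 1990, 1975],  # older buildings, higher RoL
    "constant_flag": [3, 3, 3, 3, 3],              # hypothetical no-signal column
}

scores = {name: pearson(values, rol) for name, values in candidates.items()}

MIN_ABS_CORR = 0.3  # assumed cutoff, not taken from the document
screened = [name for name, r in scores.items() if abs(r) >= MIN_ABS_CORR]
print(screened)  # ['year_built']
```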
Business Interpretability
Users must be able to understand and act on factor results. Abstract or highly technical features that users cannot relate to their own properties or decisions are excluded.
Non-Redundancy
Each feature should capture unique information. When two features are highly correlated (e.g., total RCV and property size), either a derived ratio or a single representative feature is used to avoid double-counting their effect.
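The effect of replacing a raw value with a ratio can be illustrated with a small pairwise correlation check. This is a sketch under assumptions: unit count stands in for property size, the 0.9 threshold is arbitrary, and the numbers are constructed so the contrast is easy to see.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """Return feature pairs whose absolute correlation meets the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Constructed toy data: total RCV grows with unit count.
unit_count = [50, 100, 150, 200, 250]
rcv = [5.5e6, 9.0e6, 16.5e6, 18.0e6, 27.5e6]
rcv_per_door = [value / units for value, units in zip(rcv, unit_count)]

raw_flags = redundant_pairs({"unit_count": unit_count, "rcv": rcv})
ratio_flags = redundant_pairs({"unit_count": unit_count, "rcv_per_door": rcv_per_door})

print(raw_flags)    # [('unit_count', 'rcv')]
print(ratio_flags)  # []
```

In this toy data the raw pair is flagged, while the per-door ratio is not, which is the behavior the non-redundancy criterion is after.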
Common Features (All Coverage Types)
These features are included in all three model variants (package, property, liability).
Numeric Features
| Feature | Description | Why Included |
|---|---|---|
| Unit Count | Number of units in the property | Property size is a primary pricing driver; larger properties have different risk profiles |
| Year Built | Construction year | Building age correlates with construction quality, code compliance, and maintenance |
| Stories | Number of floors | Vertical density affects fire risk, evacuation complexity, and liability exposure |
| Annual Gross Rent | Total annual rental income | Income proxy indicating property quality and tenant profile |
| RCV per Door | Replacement Cost Value divided by unit count | Normalized replacement cost; controls for property value independent of size |
| EGI per Door | Effective Gross Income divided by unit count | Normalized income; indicates revenue intensity per unit |
Categorical Features
| Feature | Description | Why Included |
|---|---|---|
| State | Property state location | Geographic location determines regulatory environment, weather exposure, and litigation climate |
| ZIP Prefix | First 3 digits of ZIP code | Regional grouping (~900 regions) captures local market conditions without overfitting to individual ZIPs |
| Carrier | Insurance carrier name | Different carriers have distinct pricing strategies and risk appetites |
| Brokerage | Placing brokerage | Brokerages negotiate different rates and have varying market access |
| Policy Structure | Policy structure classification | How coverage is structured affects pricing |
| AM Best Rating | Carrier financial strength rating | Carrier financial strength may correlate with pricing discipline |
| Product Type | Insurance product classification | Product type affects base pricing |
Coverage-Specific Features
Property Model Features
In addition to common features, property and package models include:
| Feature | Type | Description |
|---|---|---|
| RCV of Existing Structure | Numeric | Total replacement cost value; fundamental to property pricing |
| Wind/Hail Coverage | Boolean | Indicates coverage for major property peril; CAT exposure marker |
| Named Storm Coverage | Boolean | Coverage for named storms; coastal exposure indicator |
| Terrorism Property Coverage | Boolean | Property terrorism coverage; urban/high-value property indicator |
| Special Flood Hazard Area (SFHA) | Boolean | FEMA flood zone designation; flood risk indicator |
Liability Model Features
In addition to common features, liability and package models include:
| Feature | Type | Description |
|---|---|---|
| General Liability Coverage | Boolean | Core liability coverage presence |
| Umbrella Liability Coverage | Boolean | Excess coverage indicates higher limits or exposure |
| Terrorism Liability Coverage | Boolean | Liability terrorism coverage; additional exposure indicator |
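The way the three variants share features can be written down as a small configuration. The snake_case column names below are hypothetical stand-ins for the features in the tables above; the actual field names in the data are not given in this document.

```python
# Common features used by every variant (hypothetical column names).
COMMON_NUMERIC = [
    "unit_count", "year_built", "stories",
    "annual_gross_rent", "rcv_per_door", "egi_per_door",
]
COMMON_CATEGORICAL = [
    "state", "zip_prefix", "carrier", "brokerage",
    "policy_structure", "am_best_rating", "product_type",
]

# Coverage-specific additions.
PROPERTY_FEATURES = [
    "rcv_existing_structure", "wind_hail_coverage", "named_storm_coverage",
    "terrorism_property_coverage", "sfha",
]
LIABILITY_FEATURES = [
    "general_liability_coverage", "umbrella_liability_coverage",
    "terrorism_liability_coverage",
]


def features_for(variant):
    """Return the feature list for 'property', 'liability', or 'package'."""
    features = COMMON_NUMERIC + COMMON_CATEGORICAL
    if variant in ("property", "package"):
        features = features + PROPERTY_FEATURES
    if variant in ("liability", "package"):
        features = features + LIABILITY_FEATURES
    if variant not in ("property", "liability", "package"):
        raise ValueError(f"unknown model variant: {variant!r}")
    return features


print(len(features_for("property")))   # 18
print(len(features_for("liability")))  # 16
print(len(features_for("package")))    # 21
```

The package variant takes the union of both coverage-specific sets, matching the two "property and package" and "liability and package" statements above.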
Derived Features
Several features are derived from raw data fields to improve model performance:
RCV per Door
Calculated as RCV / Unit Count. Normalizing RCV by unit count lets the model assess whether pricing reflects value intensity rather than absolute property size.
EGI per Door
Calculated as EGI / Unit Count. Normalizes income by property size to capture revenue quality independent of scale.
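Both per-door ratios follow the same pattern, so a single guarded helper can cover them. This is a minimal sketch: the field names `rcv`, `egi`, and `unit_count` are assumed, and returning `None` for a missing or non-positive unit count is one possible convention rather than the documented behavior.

```python
def per_door(total, unit_count):
    """Divide a property-level total by unit count; None when undefined."""
    if total is None or not unit_count or unit_count <= 0:
        return None  # missing total, or missing/zero/negative unit count
    return total / unit_count


def add_per_door_features(policy):
    """Return a copy of the policy record with the two derived ratios added."""
    out = dict(policy)
    out["rcv_per_door"] = per_door(policy.get("rcv"), policy.get("unit_count"))
    out["egi_per_door"] = per_door(policy.get("egi"), policy.get("unit_count"))
    return out


example = add_per_door_features(
    {"rcv": 12_000_000, "egi": 1_800_000, "unit_count": 100}
)
print(example["rcv_per_door"])  # 120000.0
print(example["egi_per_door"])  # 18000.0

# A record with no usable unit count yields None instead of raising.
bad = add_per_door_features({"rcv": 5_000_000, "egi": 600_000, "unit_count": 0})
print(bad["rcv_per_door"])  # None
```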
ZIP Prefix
The first 3 digits of the ZIP code. Full 5-digit ZIP codes would create thousands of categories, leading to data sparsity and overfitting. The 3-digit prefix provides approximately 900 geographic regions, sufficient granularity to capture regional variation while maintaining adequate sample sizes per category.
Validation testing found that ZIP prefix pricing patterns correlate with external risk indicators, including local crime data and crime scores. This suggests the feature captures genuine geographic risk variation rather than spurious patterns.
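Deriving the prefix itself is straightforward, but raw ZIP fields often need cleaning first. The sketch below assumes that ZIPs may arrive as integers with leading zeros dropped or in ZIP+4 form; those input shapes are assumptions about the data, not something this document states.

```python
import re


def zip_prefix(raw_zip):
    """Return the 3-digit ZIP prefix, or None if the value is not a usable ZIP."""
    if raw_zip is None:
        return None
    text = str(raw_zip).strip()
    # Accept "12345" or ZIP+4 "12345-6789"; keep only the 5-digit part.
    match = re.fullmatch(r"(\d{1,5})(?:-\d{4})?", text)
    if match is None:
        return None
    # Assumption: a short all-digit value lost its leading zeros when stored
    # as an integer, so pad it back to five digits before slicing.
    return match.group(1).zfill(5)[:3]


print(zip_prefix("10001"))       # 100
print(zip_prefix("10001-1234"))  # 100
print(zip_prefix(2134))          # 021
print(zip_prefix("N/A"))         # None
```

Keeping the prefix as a string preserves leading zeros, so "021" and "210" remain distinct categories.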