Feature Selection

The factors included in the pricing attribution model were selected based on four criteria: data availability, predictive power, business interpretability, and non-redundancy.

Selection Criteria

Data Availability

Only features consistently captured across policies in the terminal data can be included. Features with high missing rates or inconsistent definitions across data sources are excluded to ensure model reliability.
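
As a rough illustration of the availability screen, the sketch below drops candidate features whose missing rate exceeds a cutoff. The 25% cutoff, the record layout, and the `roof_type` field are assumptions made for the example; they are not taken from the model's actual pipeline.

```python
def missing_rate(records, feature):
    """Fraction of records where `feature` is absent or None."""
    if not records:
        return 1.0
    missing = sum(1 for r in records if r.get(feature) is None)
    return missing / len(records)


def available_features(records, candidates, max_missing=0.25):
    """Keep candidates whose missing rate is at or below max_missing.

    max_missing=0.25 is an illustrative cutoff, not a documented one.
    """
    return [f for f in candidates if missing_rate(records, f) <= max_missing]


# Hypothetical policy records; roof_type stands in for a sparsely captured field.
policies = [
    {"unit_count": 120, "year_built": 1998, "roof_type": None},
    {"unit_count": 48, "year_built": 2005, "roof_type": None},
    {"unit_count": 200, "year_built": None, "roof_type": "flat"},
    {"unit_count": 75, "year_built": 1987, "roof_type": None},
    {"unit_count": 64, "year_built": 2011, "roof_type": None},
]

kept = available_features(policies, ["unit_count", "year_built", "roof_type"])
```

Here `roof_type` is missing in four of five records and is dropped, while `year_built` (one missing value) stays under the cutoff. Inconsistent definitions across sources still need a manual check; a missing-rate screen cannot detect them.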

Predictive Power

Features must demonstrate a measurable correlation with rate on line (RoL) in exploratory analysis. Non-predictive features add noise without improving explanatory power.
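
One simple way to run such a screen on numeric features is a Pearson correlation against RoL. The sketch below is an assumption about how the exploratory step could look, not the documented method; the 0.1 cutoff, the column values, and the `noise` column are illustrative. Pearson correlation only measures linear association, so categorical features would need a different measure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)


def screen_features(columns, target, min_abs_corr=0.1):
    """Return {feature: r} for features with |r| >= min_abs_corr (assumed cutoff)."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return {name: r for name, r in scores.items() if abs(r) >= min_abs_corr}


# Illustrative values only.
rol = [0.010, 0.014, 0.018, 0.022, 0.026]
columns = {
    "year_built": [2015, 2005, 1995, 1985, 1975],  # older buildings, higher RoL
    "noise": [2, 1, 5, 1, 2],                      # unrelated to RoL by construction
}

kept = screen_features(columns, rol)
```

In this toy data `year_built` is strongly negatively correlated with RoL and is kept, while `noise` is screened out.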

Business Interpretability

Users must be able to understand and act on factor results. Abstract or highly technical features that users cannot relate to their own properties or decisions are excluded.

Non-Redundancy

Each feature should capture unique information. When two features are highly correlated (e.g., total RCV and property size), either a derived ratio or a single representative feature is used to avoid double-counting their effects.

Common Features (All Coverage Types)

These features are included in all three model variants (package, property, liability).

Numeric Features

Feature	Description	Why Included
Unit Count	Number of units in the property	Property size is a primary pricing driver; larger properties have different risk profiles
Year Built	Construction year	Building age correlates with construction quality, code compliance, and maintenance
Stories	Number of floors	Vertical density affects fire risk, evacuation complexity, and liability exposure
Annual Gross Rent	Total annual rental income	Income proxy indicating property quality and tenant profile
RCV per Door	Replacement Cost Value divided by unit count	Normalized replacement cost; controls for property value independent of size
EGI per Door	Effective Gross Income divided by unit count	Normalized income; indicates revenue intensity per unit

Categorical Features

Feature	Description	Why Included
State	Property state location	Geographic location determines regulatory environment, weather exposure, and litigation climate
ZIP Prefix	First 3 digits of ZIP code	Regional grouping (~900 regions) captures local market conditions without overfitting to individual ZIPs
Carrier	Insurance carrier name	Different carriers have distinct pricing strategies and risk appetites
Brokerage	Placing brokerage	Brokerages negotiate different rates and have varying market access
Policy Structure	Policy structure classification	How coverage is structured affects pricing
AM Best Rating	Carrier financial strength rating	Carrier financial strength may correlate with pricing discipline
Product Type	Insurance product classification	Product type affects base pricing

Coverage-Specific Features

Property Model Features

In addition to the common features, the property and package models include:

Feature	Type	Description
RCV of Existing Structure	Numeric	Total replacement cost value; fundamental to property pricing
Wind/Hail Coverage	Boolean	Indicates coverage for a major property peril; catastrophe (CAT) exposure marker
Named Storm Coverage	Boolean	Coverage for named storms; coastal exposure indicator
Terrorism Property Coverage	Boolean	Property terrorism coverage; urban/high-value property indicator
Special Flood Hazard Area (SFHA)	Boolean	FEMA flood zone designation; flood risk indicator

Liability Model Features

In addition to the common features, the liability and package models include:

Feature	Type	Description
General Liability Coverage	Boolean	Core liability coverage presence
Umbrella Liability Coverage	Boolean	Excess coverage indicates higher limits or exposure
Terrorism Liability Coverage	Boolean	Liability terrorism coverage; additional exposure indicator
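
The way feature sets combine across the three model variants can be summarized in a small configuration sketch. The identifier names below are hypothetical snake_case renderings of the features listed in the tables; the actual column names in the data may differ.

```python
# Hypothetical identifiers for the features listed in the tables above.
COMMON_FEATURES = [
    # numeric
    "unit_count", "year_built", "stories", "annual_gross_rent",
    "rcv_per_door", "egi_per_door",
    # categorical
    "state", "zip_prefix", "carrier", "brokerage",
    "policy_structure", "am_best_rating", "product_type",
]

PROPERTY_FEATURES = [
    "rcv_existing_structure", "wind_hail_coverage", "named_storm_coverage",
    "terrorism_property_coverage", "sfha",
]

LIABILITY_FEATURES = [
    "general_liability_coverage", "umbrella_liability_coverage",
    "terrorism_liability_coverage",
]


def features_for(variant):
    """Return the feature list for 'property', 'liability', or 'package'."""
    if variant == "property":
        return COMMON_FEATURES + PROPERTY_FEATURES
    if variant == "liability":
        return COMMON_FEATURES + LIABILITY_FEATURES
    if variant == "package":
        # The package model receives both coverage-specific groups.
        return COMMON_FEATURES + PROPERTY_FEATURES + LIABILITY_FEATURES
    raise ValueError(f"unknown model variant: {variant!r}")
```

Under this layout the property model sees 18 features, the liability model 16, and the package model 21.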

Derived Features

Several features are derived from raw data fields to improve model performance:

RCV per Door

Calculated as RCV / Unit Count. Using this ratio instead of raw RCV allows the model to assess whether pricing reflects value intensity rather than absolute property size.

EGI per Door

Calculated as EGI / Unit Count. Normalizes income by property size to capture revenue quality independent of scale.
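
Both per-door ratios follow the same calculation, sketched below. Returning `None` when the unit count is missing or non-positive is an assumption about how undefined ratios might be handled; the source does not specify it.

```python
def per_door(total, unit_count):
    """Return total / unit_count, or None when the ratio is undefined."""
    if total is None or unit_count is None or unit_count <= 0:
        return None  # assumed handling for missing or invalid inputs
    return total / unit_count


# Illustrative property: 120 units, $18.0M RCV, $2.16M EGI.
rcv_per_door = per_door(18_000_000, 120)   # 150000.0
egi_per_door = per_door(2_160_000, 120)    # 18000.0
```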

ZIP Prefix

The first 3 digits of the ZIP code. Full 5-digit ZIP codes would create thousands of categories, leading to data sparsity and overfitting. The 3-digit prefix yields approximately 900 geographic regions, which is granular enough to capture regional variation while keeping sample sizes per category adequate.
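
A minimal sketch of the prefix extraction follows. It assumes ZIP codes may arrive as strings, as ZIP+4 values, or as integers that have lost their leading zeros; those input cases, and returning `None` for unparseable values, are assumptions rather than documented behavior.

```python
def zip_prefix(zip_code):
    """Return the 3-digit prefix of a US ZIP code, or None if it cannot be parsed."""
    if zip_code is None:
        return None
    text = str(zip_code).strip()
    text = text.split("-")[0]      # "02139-4307" -> "02139"
    if not text.isdigit() or len(text) > 5:
        return None
    return text.zfill(5)[:3]       # restore leading zeros dropped by numeric storage


zip_prefix("90210")        # "902"
zip_prefix(2139)           # "021" (integer input lost its leading zero)
zip_prefix("02139-4307")   # "021"
```

Zero-padding matters for ZIP codes in the Northeast: without it, 02139 stored as the integer 2139 would be assigned the prefix "213" instead of "021".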

Validation testing confirmed that ZIP prefix pricing patterns correlate with external risk indicators, including local crime data and crime scores. This suggests the feature captures meaningful geographic risk variation rather than spurious patterns.