The modern headache of categorical data
Categorical features, such as city names or policy types, pose a significant challenge for model builders. Because mathematical models, including GLMs, cannot interpret text directly, these values must be translated into numerical formats.
The most common method, One-Hot Encoding, translates each category into a separate binary feature, greatly increasing model complexity and risking overfitting.
For instance, two cities with similar insurance claim statistics might each receive their own coefficient, producing misleading coefficient values. Merging them could simplify the model and improve accuracy, but deciding which categories to group is a complex task.
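To make the encoding step concrete, here is a minimal sketch of One-Hot Encoding in plain Python. The city names and the `one_hot` helper are made up for illustration and are not taken from Earnix's tooling.

```python
def one_hot(values):
    """Return (levels, rows): one 0/1 column per distinct category."""
    levels = sorted(set(values))
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical policy records, one city per policy.
cities = ["Leeds", "York", "Leeds", "Bath", "York"]
columns, encoded = one_hot(cities)

print(columns)
for row in encoded:
    print(row)
```

Every distinct city becomes its own column, so the number of coefficients a GLM has to estimate grows with the number of categories.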
Why classic regularisation falls short
One approach is regularisation, which helps reduce complexity by penalising certain model coefficients.
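However, traditional GLM techniques using One-Hot Encoding allow merging only with a reference level: each coefficient measures a category's difference from the reference, so shrinking it to zero folds that category into the reference rather than into a similar neighbour. This creates problems when the chosen reference, such as a large city with unusually high claims, doesn’t represent the average case.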
As a result, merging becomes biased or ineffective, and alternative reference choices offer only limited improvements.
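The reference-level limitation can be illustrated with a rough sketch. It assumes made-up claim frequencies and uses simple thresholding of log-rate differences as a crude stand-in for a lasso-style penalty; it does not fit a real GLM, and the function name and threshold are hypothetical.

```python
import math

# Made-up claim frequencies per city (claims per policy-year).
rates = {"Metropolis": 0.30, "Ashford": 0.101, "Bexley": 0.099, "Corby": 0.15}

def merged_with_reference(rates, reference, threshold):
    """List the cities whose coefficient would be shrunk to zero.

    Each coefficient is log(rate) - log(reference rate). Treating any
    coefficient smaller than `threshold` in absolute value as zero is an
    assumption standing in for an L1 penalty: a zeroed city becomes
    indistinguishable from the reference.
    """
    base = math.log(rates[reference])
    merged = []
    for city, rate in rates.items():
        if city == reference:
            continue
        coef = math.log(rate) - base
        if abs(coef) < threshold:
            merged.append(city)
    return sorted(merged)

outlier_ref = merged_with_reference(rates, "Metropolis", 0.1)
typical_ref = merged_with_reference(rates, "Ashford", 0.1)
print(outlier_ref)
print(typical_ref)
```

With the high-claims city as reference, nothing merges, even though Ashford and Bexley are almost identical. Those two only merge when one of them happens to be chosen as the reference.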
Inside Smart Grouping
Earnix addresses these limitations through Smart Grouping, a feature built into its Auto-GLM tool within the Model Accelerator suite.
Smart Grouping clusters the levels of a categorical variable using a two-step regularisation process. First, the algorithm ranks the categories using a regularised multivariate GLM. Second, it merges adjacent categories in that ranking using variable fusion, akin to how numeric variables are binned.
By turning the categorical variable into an ordinal feature for merging, and then translating it back to grouped binary categories, this method produces clearer, more accurate models.
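The rank-then-fuse idea described above can be sketched in a few lines. This is not Earnix's algorithm: the data are invented, shrinking each city's rate toward the overall rate stands in for the regularised GLM ranking, and a greedy gap threshold stands in for variable fusion. The `prior_weight` and `tolerance` parameters are hypothetical.

```python
# Made-up data: city -> (claims, policy-years of exposure).
data = {
    "Ashford": (10, 100.0),
    "Bexley": (11, 100.0),
    "Corby": (30, 200.0),
    "Dover": (1, 2.0),  # tiny exposure, so the raw rate is noisy
    "Metropolis": (90, 300.0),
}

def rank_categories(data, prior_weight=50.0):
    """Step 1 stand-in: shrink each rate toward the overall rate, then sort."""
    total_claims = sum(c for c, _ in data.values())
    total_exposure = sum(e for _, e in data.values())
    overall = total_claims / total_exposure
    scores = {
        city: (claims + prior_weight * overall) / (exposure + prior_weight)
        for city, (claims, exposure) in data.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1])

def fuse_adjacent(ranked, tolerance=0.01):
    """Step 2 stand-in: open a new group when the gap to the previous score exceeds tolerance."""
    groups = [[ranked[0][0]]]
    for (_, prev), (city, score) in zip(ranked, ranked[1:]):
        if score - prev <= tolerance:
            groups[-1].append(city)
        else:
            groups.append([city])
    return groups

ranked = rank_categories(data)
groups = fuse_adjacent(ranked)

# Translate back to a grouped categorical label per city.
group_of = {city: "+".join(g) for g in groups for city in g}
print(groups)
print(group_of)
```

Once the cities are ordered, only neighbours in the ranking can merge, which is what makes the problem look like binning a numeric variable. The grouped labels can then be one-hot encoded again with far fewer columns.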
For example, in predicting claim frequency based on city and age, Smart Grouping intelligently determines which cities behave similarly and groups them accordingly.
Smart Grouping offers key benefits over traditional methods. It improves interpretability by clearly defining how categories relate to the outcome variable.
It also enhances multivariate compatibility, as groupings are determined with respect to the full set of covariates in the model. Earnix also tackles overfitting through regularised model ranking and validation schemes when forming category groups.
This results in models that are more accurate and more interpretable, key advantages for insurers and banks seeking both compliance and clarity in their analytical tools.
Earnix’s Smart Grouping is now fully operational in Auto-GLM, giving data professionals immediate access to this enhanced functionality.
Read the full blog from Earnix here.
Copyright © 2025 FinTech Global


