Generalised linear models (GLMs) remain a core tool in insurance and banking analytics, but building accurate, interpretable models is not always straightforward. One of the biggest challenges is handling hierarchical categorical data, such as Location, which is common in insurance tables. Using the wrong level of detail can lead to overfitting, multicollinearity, slow training, and unstable coefficients.
At Earnix, tackling these issues starts with asking the right questions and analysing the structure of the data. For example, when predicting claim costs for vehicles, should the model use Car Brand, Car Model, or both?
Using only the top-level variable may miss critical signals, while including every level creates a large number of categories, increasing memory requirements and training time. Furthermore, GLMs can struggle when hierarchical variables are one-hot encoded, producing convergence errors or unstable coefficient estimates.
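To make the multicollinearity problem concrete, here is a minimal sketch in plain Python. The data and the helper names (one_hot, brand_is_redundant) are hypothetical and are not taken from Earnix. It shows that when both Car Brand and Car Model are one-hot encoded, each brand column is exactly the sum of the model columns nested under it, so the design matrix carries redundant columns.

```python
# Hypothetical toy data: one (brand, model) pair per policy.
rows = [
    ("Toyota", "Corolla"),
    ("Toyota", "Yaris"),
    ("Toyota", "Corolla"),
    ("BMW", "X5"),
    ("BMW", "320i"),
]

def one_hot(values):
    """Return {category: indicator column} for a list of labels."""
    return {c: [1 if v == c else 0 for v in values] for c in sorted(set(values))}

brand_cols = one_hot([b for b, _ in rows])
model_cols = one_hot([m for _, m in rows])

# Map each brand to the models nested under it.
children = {}
for b, m in rows:
    children.setdefault(b, set()).add(m)

def brand_is_redundant(brand):
    """True if the brand indicator equals the sum of its model indicators."""
    summed = [
        sum(model_cols[m][i] for m in children[brand])
        for i in range(len(rows))
    ]
    return summed == brand_cols[brand]

# Total number of indicator columns when both levels are encoded.
n_columns = len(brand_cols) + len(model_cols)
# Each brand column is a linear combination of its model columns.
redundant = {b: brand_is_redundant(b) for b in brand_cols}
```

With only two brands and four models the encoding already produces six columns, two of which add no information. On a real portfolio with hundreds of models the same redundancy is what drives the unstable estimates described above.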
To address this, Earnix developed the Hierarchical Level Selector, part of the Preprocessing Hub lab.
This algorithm automatically evaluates each level of a hierarchical feature and decides whether lower levels provide meaningful information or should be merged into their parent category.
By reducing unnecessary granularity, it mitigates overfitting, improves interpretability, and accelerates model training, all of which are critical for GLMs in insurance and banking.
The process is target-based and specific to each model. Categories are represented as a hierarchical tree, and the algorithm works from the bottom up.
At each level, the data is split into training and validation sets. A grid of category subsets is generated based on credibility, which accounts for both the number of observations and variability of the target variable.
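The article does not state how credibility is computed, so the following is only a sketch under an assumption: it uses a Bühlmann-style factor Z = n / (n + k), where n is the number of observations in a category and k is the ratio of average within-category variance to the variance of the category means. The data, the function names (credibility, subset_grid) and the thresholds are all hypothetical. The grid is built by keeping, for each threshold, the categories whose credibility is at least that high.

```python
from statistics import mean, pvariance

# Hypothetical claim costs per car model (toy numbers, not real data).
claims = {
    "Corolla": [100.0, 120.0, 110.0, 130.0, 90.0, 110.0],
    "Yaris": [105.0, 115.0],
    "X5": [300.0, 500.0, 100.0, 700.0],
}

def credibility(claims_by_cat):
    """Buhlmann-style credibility Z = n / (n + k) per category (assumed formula)."""
    # Average within-category variance of the target.
    within = mean(pvariance(v) for v in claims_by_cat.values())
    # Variance of the category means around their average.
    between = pvariance([mean(v) for v in claims_by_cat.values()])
    if between == 0:
        # Categories are indistinguishable: none of them is credible on its own.
        return {c: 0.0 for c in claims_by_cat}
    k = within / between
    return {c: len(v) / (len(v) + k) for c, v in claims_by_cat.items()}

def subset_grid(z, thresholds=(0.25, 0.5, 0.75)):
    """For each threshold, the categories credible enough to stay distinct."""
    return {t: {c for c, zc in z.items() if zc >= t} for t in thresholds}

z = credibility(claims)
grid = subset_grid(z)
```

In this simplified form the variability of the target enters through the shared constant k, and the number of observations separates the categories, so Corolla (six claims) is more credible than Yaris (two claims). Each subset in the grid is a candidate to be scored on the validation set.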
The algorithm tests predictive performance on the validation set, determining which categories remain distinct and which merge into their parent. Categories merged up to the root are labelled as “Other.”
For example, when predicting vehicle claim costs, most Toyota models may merge into “Toyota” if their claims behave similarly, while BMW models might remain separate due to greater variability. Smaller brands with similar behaviour, such as Skoda and Seat, could merge into “Other.” The result is a single new column, Vehicle Definition, that captures the optimal granularity for modelling.
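The merge decision can be illustrated with a heavily simplified sketch. It is not the Earnix algorithm: it handles a single level, skips the credibility grid and the final merge to “Other”, and uses group means instead of a GLM. All data and names (group_means, choose_levels, vehicle_definition, min_gain) are hypothetical. For each brand, models stay distinct only if predicting with model-level means lowers the squared error on the validation set by at least min_gain relative to the brand-level mean.

```python
from statistics import mean

# Hypothetical toy rows: (brand, model, claim_cost). Not real data.
train = [
    ("Toyota", "Corolla", 100.0), ("Toyota", "Corolla", 110.0),
    ("Toyota", "Yaris", 105.0), ("Toyota", "Yaris", 95.0),
    ("BMW", "X5", 400.0), ("BMW", "X5", 420.0),
    ("BMW", "320i", 200.0), ("BMW", "320i", 220.0),
]
valid = [
    ("Toyota", "Corolla", 98.0), ("Toyota", "Yaris", 108.0),
    ("BMW", "X5", 410.0), ("BMW", "320i", 210.0),
]

def group_means(rows, key):
    """Mean claim cost per group, where key(brand, model) names the group."""
    groups = {}
    for b, m, y in rows:
        groups.setdefault(key(b, m), []).append(y)
    return {k: mean(v) for k, v in groups.items()}

def choose_levels(train, valid, min_gain=0.05):
    """For each brand, keep models distinct only if validation error drops enough."""
    by_model = group_means(train, lambda b, m: (b, m))
    by_brand = group_means(train, lambda b, m: b)
    keep_split = {}
    for brand in by_brand:
        rows = [r for r in valid if r[0] == brand]
        err_brand = sum((y - by_brand[b]) ** 2 for b, m, y in rows)
        # Models unseen in training fall back to the brand mean.
        err_model = sum(
            (y - by_model.get((b, m), by_brand[b])) ** 2 for b, m, y in rows
        )
        keep_split[brand] = err_model < (1 - min_gain) * err_brand
    return keep_split

def vehicle_definition(brand, model, keep_split):
    """Label a row at model level if its brand stays split, else at brand level."""
    return f"{brand} {model}" if keep_split.get(brand, False) else brand

keep_split = choose_levels(train, valid)
labels = [vehicle_definition(b, m, keep_split) for b, m, _ in valid]
```

On this toy data the Toyota models are too similar for the split to help on validation, so they collapse to “Toyota”, while the BMW models differ enough to stay separate. The labels list plays the role of the Vehicle Definition column.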
In the Preprocessing Hub, the Hierarchical Level Selector also provides visualisations, such as pie charts showing the proportion of categories retained at each level, allowing analysts to see how the data has been optimised. The result is a faster, cleaner, and more interpretable GLM, with reduced dimensionality and more stable coefficients.
In conclusion, improving GLMs in insurance and banking requires more than just adding variables. By leveraging tools like Earnix’s Hierarchical Level Selector, analysts can automatically optimise hierarchical features, enhancing predictive performance, reducing overfitting, and improving explainability. For insurers and banks, this is a critical step towards building more reliable and actionable models.
Copyright © 2025 FinTech Global


