[CLC Number] F842.634 [Document Code] A [Article ID] 1004-3306(2024)09-0026-18 DOI: 10.13497/j.cnki.is.2024.09.003
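[Abstract] Since September 2020, the comprehensive reform of auto insurance has imposed increasingly strict requirements on precise pricing. In the big data era in particular, the growing complexity of data features and the increasing number of levels in categorical variables pose major challenges for traditional actuarial statistical methods such as generalized linear models. Based on a fleet (group) auto insurance dataset from a domestic insurance company, this paper applies four types of machine learning methods and four categorical variable encoding schemes, uses several evaluation metrics to compare how the encoding schemes perform in predicting auto insurance claim counts, and uses SHAP to improve the interpretability of the machine learning models. The empirical results show the following. First, different machine learning models may be suited to different categorical variable encodings, so the encoding should be chosen to match the characteristics of the model. Second, compared with one-hot encoding, categorical embedding significantly reduces model runtime and improves computational efficiency. Third, according to the interpretability results produced by SHAP, the fleet's average compulsory traffic insurance loss ratio over the past three years is the most important factor affecting the number of vehicle claims. Fourth, the embedding vectors generated by categorical embedding correspond to the different levels of a categorical variable, and the distances between embedding vectors can be used to segment and risk-rate insured entities. This paper extends the application of categorical embedding in auto insurance pricing, can tangibly improve prediction accuracy and computational efficiency, and contributes to more precise and differentiated auto insurance pricing.
[Keywords] auto insurance pricing; claim counts; categorical variables; embedding; SHAP
[Funding] This work was supported by the Major Project of the Key Research Institutes of Humanities and Social Sciences of the Ministry of Education, "Research on Risk Management and Actuarial Models in the Digital Era" (Project No. 22JJD910003), and by the Tianjin Graduate Research and Innovation Project (No. 2022SKY038).
[About the authors] ZHANG Lian-zeng, Professor and doctoral supervisor, School of Finance, Nankai University; LUO Lai-juan (corresponding author), PhD candidate, School of Finance, Nankai University; XIAO Yu-gu, Research Fellow, Center for Applied Statistics, Renmin University of China, and Professor (PhD), School of Statistics, Renmin University of China; LI Hao-nan, PhD candidate, School of Finance, Nankai University.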
Application of Categorical Embedding in Predicting Automobile Insurance Claim Frequency
ZHANG Lian-zeng, LUO Lai-juan, XIAO Yu-gu, LI Hao-nan
Abstract: Since September 2020, the comprehensive reform of auto insurance has set higher requirements for precise pricing, especially in the context of the big data era. The increasing complexity of data features and the large number of levels in categorical variables pose significant challenges to traditional actuarial statistical methods such as Generalized Linear Models (GLMs). Using a vehicle group insurance dataset from a domestic insurance company, this paper examines how four machine learning methods and four categorical variable encoding techniques affect the prediction of automobile insurance claim frequency. Different evaluation metrics are used to analyze the performance of categorical embedding in various scenarios, and SHAP is used to enhance the interpretability of the machine learning models. The empirical results are as follows. First, different machine learning models may require different encoding methods for categorical variables, so the encoding method should be selected according to the characteristics of the model. Second, compared with one-hot encoding, categorical embedding can significantly reduce model runtime and improve efficiency. Third, according to the interpretability results output by SHAP, the average cumulative claims ratio of group vehicle compulsory insurance over the past three years (AveCumpRatio) is the most important factor influencing vehicle claim frequency. Fourth, the embedding vectors generated by categorical embedding correspond to the different levels of the categorical variables, and the distances between embedding vectors can be applied to the segmentation and risk rating of policyholders. This article refines the application of categorical embedding in automobile insurance pricing, improving prediction accuracy and operational efficiency and contributing to more precise and differentiated automobile insurance pricing.
Key words: automobile insurance; claim frequency; categorical variables; embedding; SHAP
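The contrast the abstract draws between one-hot encoding and categorical embedding, and its use of distances between embedding vectors, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the level names (`fleet_A` and so on), the embedding dimension of 2, and above all the embedding matrix, which is filled with random numbers as a stand-in for weights that would normally be learned while training a model. It is not the authors' implementation.

```python
import math
import random

# Hypothetical categorical variable with six levels (names are invented).
levels = ["fleet_A", "fleet_B", "fleet_C", "fleet_D", "fleet_E", "fleet_F"]
level_index = {name: i for i, name in enumerate(levels)}
embedding_dim = 2  # assumed; much smaller than the number of levels in practice


def one_hot(code, n_levels):
    """Return an indicator vector of length n_levels with a 1.0 at position code."""
    return [1.0 if j == code else 0.0 for j in range(n_levels)]


# Stand-in for a trained embedding matrix: one row per level.
# Random values are used purely for illustration, not learned weights.
random.seed(0)
embedding_matrix = [
    [random.gauss(0.0, 1.0) for _ in range(embedding_dim)] for _ in levels
]


def embed(code):
    """Embedding lookup: select the row of the matrix for this level."""
    return embedding_matrix[code]


def one_hot_times_matrix(vec):
    """Multiply a one-hot row vector by the embedding matrix."""
    return [
        sum(v * row[k] for v, row in zip(vec, embedding_matrix))
        for k in range(embedding_dim)
    ]


def distance(a, b):
    """Euclidean distance between the embedding vectors of two levels."""
    return math.dist(embed(level_index[a]), embed(level_index[b]))


def nearest_level(name):
    """Return the other level whose embedding vector is closest to name's."""
    others = [other for other in levels if other != name]
    return min(others, key=lambda other: distance(name, other))


code = level_index["fleet_C"]
print(len(one_hot(code, len(levels))), "one-hot columns vs", len(embed(code)), "embedding columns")
print("closest level to fleet_A:", nearest_level("fleet_A"))
```

The lookup returns the same vector as multiplying the one-hot vector by the matrix, which is why an embedding can be read as a compressed one-hot representation with far fewer input columns. With trained weights, a distance table like the one `distance` produces is the kind of quantity that could feed a grouping or risk-rating step.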
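Insurance Studies 20240904 - "The Digital Divide and Household Commercial Insurance Decisions" (YANG Bi-yun, LI Zhuo-yan, YI Xing-jian, ZHANG Ling-shuang)
Insurance Studies 20240901 - "A Study of the Horse-Racing Mechanism in the Selection of Agricultural Insurance Underwriters" (TAN Li, DING Shao-qun, WANG Yang)
Insurance Studies 20240902 - "Has Policy-Based Agricultural Insurance Promoted Inclusive Growth in Rural Areas? An Income Redistribution Perspective" (LING Tao, ZHAO Gui-qin, ZHANG Zong-jun)
Insurance Studies 20240903 - "Application of Categorical Embedding in Predicting Automobile Insurance Claim Frequency" (ZHANG Lian-zeng, LUO Lai-juan, XIAO Yu-gu, LI Hao-nan)
Insurance Studies 20240905 - "The Impact of Third-Order Risk Attitudes on Insurance Demand: Theoretical Analysis and Experimental Verification" (GUO Zhen-hua, REN Zhao-hong, NI Hong-xia)
Insurance Studies 20240906 - "An ESG Investment System for Insurance Companies with Chinese Characteristics" (BAI Xue-shi, GUO Liang, SUN Kai-jian)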