Affiliation:
1. University of Huelva, Huelva, paola.morales@alu.uhu.es
2. University of Huelva, Huelva, antonio.tallon@diesia.uhu.es
Abstract
This paper addresses a real-world classification problem concerning the management of claims received by an insurance company. Obtaining the classifier is not straightforward because of the large proportion of missing values and the inherent imbalance among class labels. Once the data have been partitioned, the training set undergoes an intensive double grid search. The first grid search identifies the most promising missing value imputation approach; the best method is then applied, and the second round of data mining strategies, which falls under the data rebalancing umbrella, begins. Here a grid search over a family of undersampling and oversampling methods with different settings is carried out, again using only seen data. The training data obtained after the first grid search are thus passed to the second step, which yields the training set ready for classifier training. The main objective of the work is to find the combination of data mining techniques that best suits the data set, using a pipeline that contains two types of data preparation methods from different families. Accordingly, the problem of missing values is addressed first, and the data rebalancing techniques are applied afterwards. The study focuses on classifiers based on Bayesian and lazy approaches as well as decision trees, evaluated with metrics such as the area under the ROC curve (AUC), Cohen’s kappa, accuracy and the F-measure, among others. Imputation by the mean or the mode is preferable to Expectation Maximization imputation in the scenario faced in this paper, given that the proportion of missing values exceeds forty percent for many features.
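The two preparation stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (impute_mean_mode, random_oversample, cohen_kappa) are hypothetical, missing values are assumed to be encoded as None, and plain random oversampling stands in for the rebalancing family, since the abstract does not name the specific methods searched. The grid searches and the classifiers themselves are omitted. The sketch only shows that the imputation statistics are computed from the training data alone, that rebalancing is applied to the imputed training set, and how Cohen's kappa is computed.

```python
import random
from collections import Counter
from statistics import mean


def impute_mean_mode(rows, fit_rows):
    """Fill None entries with the column mean (numeric) or mode (nominal).

    The fill values are computed from fit_rows only, so unseen data never
    influences the imputation. Assumes every column of fit_rows has at
    least one observed value.
    """
    fills = []
    for j in range(len(fit_rows[0])):
        seen = [r[j] for r in fit_rows if r[j] is not None]
        if all(isinstance(v, (int, float)) for v in seen):
            fills.append(mean(seen))
        else:
            fills.append(Counter(seen).most_common(1)[0][0])
    return [[fills[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(list(rng.choice(pool)))
            out_labels.append(cls)
    return out_rows, out_labels


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1 - expected)


# Toy training set (made up for illustration): one numeric and one nominal feature.
train_rows = [[1.0, "a"], [None, "a"], [3.0, None], [5.0, "b"]]
train_labels = [0, 0, 0, 1]

# Stage 1: imputation fitted on the training data only.
imputed = impute_mean_mode(train_rows, train_rows)

# Stage 2: rebalancing applied to the imputed training data.
balanced_rows, balanced_labels = random_oversample(imputed, train_labels)
```

In a full experiment each stage would be wrapped in its own search over candidate methods and settings, with the candidates scored on the training partition only, as the abstract describes.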
Publisher
Oxford University Press (OUP)