Big data refers to data of enormous size, diverse variety, and high velocity of arrival. This information is of little use unless it is examined to uncover new correlations, customer insights, and other useful patterns that can help an organization make more informed business decisions. Big data is widely applied across sectors such as healthcare, insurance, and finance, and the insurance sector is one of the most promising. The traditional marketing system of insurance is an offline sales business: policies are generally sold by calling on and visiting customers. This fixed marketing system achieved good results in the past.
Currently, however, many new private insurance companies have entered the marketplace, creating healthier competition. On the other hand, people's willingness to pay for insurance services has also increased. Therefore, understanding the needs and purchase plans of clients is essential for insurance companies to raise sales volume, and big data technology supports this transformation. The lack of principle and innovation in traditional marketing, badly structured insurance data, and unclear customer purchasing characteristics lead to imbalanced data, which makes classifying users and recommending insurance products difficult. Decision making is hard under an imbalanced data distribution. To solve this problem, resampling methods are commonly used to construct balanced training datasets, which improves the performance of the predictive model.
The main purpose of this paper is to identify potential customers with the help of big data technology. The paper not only provides a good strategy for identifying potential clients but also serves as a reference for classification problems. We propose a supervised learning approach, an ensemble of decision trees, to analyze potential customers and their major characteristics. This paper is organized as follows. Section II introduces the current research status of machine learning; Section III puts forward the classification model and intelligent recommendation algorithm based on the XGBoost algorithm for insurance business data, and analyzes its efficiency; Section IV presents the experiment and results; Section V puts forward the analysis of the results; Section VI gives the conclusion and future work.
II. RELATED WORK

The classification problem for the US bank insurance business data has an imbalanced data distribution. This means the ratio between the positive and negative classes is extremely unbalanced, so prediction models generated directly by supervised learning algorithms such as SVM or Logistic Regression are biased toward the majority class. For example, the ratio between the positive and negative classes may be 100:1.
Such a model is therefore of little help in prediction. An imbalanced class distribution degrades the performance of a classifier, so techniques should be applied to deal with this problem. One approach is to use sampling techniques to rebalance the dataset. Sampling techniques are broadly classified into two types: under sampling and over sampling. Under sampling reduces the majority class (e.g., Random Under Sampling), while over sampling adds samples to the minority class (e.g., Random Over Sampling). The drawback of ROS is redundancy in the dataset: the classifier may still fail to recognize the minority class. To overcome this problem, SMOTE (Synthetic Minority Over-sampling Technique) is used. SMOTE creates additional synthetic samples that are close and similar to minority-class samples and their near neighbors, found with the help of K-Nearest Neighbors (KNN), to rebalance the dataset [2]. Sampling methods can also be divided into non-heuristic and heuristic methods. Non-heuristic methods randomly remove samples from the majority class to reduce the degree of imbalance.
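The paper's experiments are carried out in R; purely as a language-agnostic illustration, the following Python sketch shows the core SMOTE idea: for each synthetic point, pick a minority sample, find its k nearest minority neighbors, and interpolate a new point on the segment between the sample and one neighbor. The data and parameters here are hypothetical.

```python
import math
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by interpolating between a
    minority sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (excluding x itself)
        neighbors = sorted(
            (m for m in minority if m != x),
            key=lambda m: math.dist(m, x),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # random position along the segment [x, nb]
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote(minority, k=2, n_new=4)
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies inside the region already occupied by the minority class.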
Heuristic sampling, in contrast, distinguishes samples based on a nearest-neighbor algorithm [7]. Another difficulty in classification, and one of the main challenges of big data, is data quality, in particular the existence of missing data. Frequent missing data gives biased results and can mislead predictions. Dataset attributes are often dependent on each other, so identifying correlations between attributes can help recover missing values. Replacing missing values with probable values is called imputation [6].
Imputation reconstructs missing data with estimated values. It has the advantage of handling missing data without the help of learning algorithms, and it allows the researcher to select a suitable imputation method for the particular circumstance [3]. There are many imputation methods for missing-value treatment; widely used ones include case substitution, mean and mode imputation, and predictive-model imputation. In this paper we build a predictive model for missing-value treatment. There are various machine learning algorithms for solving both classification and regression problems.
Machine learning is the process of designing systems that can automatically learn and improve from experience without being explicitly programmed. Machine learning algorithms are classified into three types: supervised learning, unsupervised learning, and reinforcement learning. In this paper, we use supervised machine learning algorithms to build the model. Some supervised learning algorithms are Regression, Decision Tree, Random Forest, KNN, and Logistic Regression [8]. A decision tree can be used for both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. The tree has two important entities: decision nodes and leaves.
The leaves are the decisions or final outcomes, and the decision nodes are where the data is split. A classification tree is a type of decision tree whose outcome is a categorical variable such as 'fit' or 'unfit'. One of the best ensemble methods is the random forest, which is used for both classification and regression [5].
A Random Forest is a collection of many fully grown decision trees, and it has the advantage of automatic feature selection, among others [4]. Gradient Boosting seeks to consecutively reduce the error with each successive model until one final model is produced. The aim of all machine learning algorithms is to build the strongest predictive model while also accounting for computational efficiency. This is where the XGBoost algorithm plays an important role. XGBoost (eXtreme Gradient Boosting) is a direct application of Gradient Boosting to decision trees. It provides a more regularized model formalization to control over-fitting, which gives better performance [8].
III. METHODOLOGY

Classification Model: The traditional sales approach for insurance products is an offline process and has the following disadvantages: (1) the lack of a customer evaluation system means the influence weights of the characteristics of potential customers are unknown; (2) the data accumulated in this way is usually seriously corrupted, which indirectly affects the accuracy of the classification model [4]. For many classification models, the class distribution and the correlation of features affect the forecast results. Imbalanced data and dependent attributes of the insurance dataset cause serious deviations in the results of the classification model.
We can handle these problems with different sampling methods and supervised learning algorithms. In this article, we use an oversampling approach with supervised learning algorithms on a car insurance dataset to build the best predictive model. The imbalanced-data problem is resolved with an oversampling method, and the model is then built with supervised learning algorithms on the training dataset. Finally, the predictive model is validated with the test dataset, and the performance of the algorithms is evaluated using the confusion matrix on the test dataset. Precision, recall, and F-measure are also calculated as performance metrics.
The taxonomy of the proposed classification model is given below.

Figure 1. Taxonomy of proposed methodology

A. Dataset: The key objective of this paper is to analyze and understand the needs and purchase plans of customers, that is, to find who will buy the car insurance service offered in the campaign. We therefore apply different classification algorithms in R on large-scale insurance data to improve performance and predictive modeling.
We collected the data from Kaggle Datasets: The Home of Data Science & Machine Learning. It is a dataset from a bank in the United States. Besides its usual services, this bank also provides car insurance services, and it organizes regular campaigns to attract new clients. The bank has potential customers' data, and bank employees call them to advertise the available car insurance options. We are provided with general information about clients (age, job, etc.) as well as more specific information about the current insurance-selling campaign (communication, last contact day) and previous campaigns (attributes such as previous attempts and outcome). The dataset covers 4000 customers who were contacted during the last campaign and for whom the result of the campaign (did the customer buy insurance or not) is known.
B. Preprocessing: Data is usually collected for unspecified applications, and data quality is one of the major issues that must be addressed in big data analytics. Problems that affect data quality include: 1. noise and outliers; 2. missing values; 3. duplicate data. Preprocessing is a set of techniques that make data more appropriate for analytics. Data cleaning handles missing data values. We used predictive-model imputation to predict the missing values from the non-missing data.
Specifically, we used the KNN algorithm to estimate the missing data: each missing value is estimated from the values of its nearest neighbors. Data transformation is another preprocessing method, used to normalize the data. Normalization is a process in which we transform a complex dataset into a simpler one. Here, we used Min-Max normalization, which scales the data into the range 0 to 1:

x' = (x - min(x)) / (max(x) - min(x))

where x is the vector to be normalized and min(x) and max(x) are the minimum and maximum values in its range. After the dataset is pre-processed, it is ready for data partition.
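The paper performs these preprocessing steps in R; the following Python sketch is only an illustration of the two techniques described above, KNN-based imputation of a missing numeric value and min-max normalization. The toy records and the choice of k are hypothetical.

```python
import math

def knn_impute(rows, target_col, k=2):
    """Fill missing values (None) in target_col with the mean of that
    column over the k nearest complete rows (distance on other columns)."""
    complete = [r for r in rows if r[target_col] is not None]
    for r in rows:
        if r[target_col] is None:
            # rank complete rows by Euclidean distance on non-target columns
            ranked = sorted(
                complete,
                key=lambda c: math.dist(
                    [v for i, v in enumerate(c) if i != target_col],
                    [v for i, v in enumerate(r) if i != target_col],
                ),
            )
            neighbors = ranked[:k]
            r[target_col] = sum(n[target_col] for n in neighbors) / k
    return rows

def min_max(values):
    """Scale a numeric vector into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# toy records: [age, account balance]; one balance is missing
rows = [[25, 1200.0], [30, None], [29, 1500.0], [52, 4000.0]]
rows = knn_impute(rows, target_col=1, k=2)
ages = min_max([r[0] for r in rows])
```

The missing balance is filled with the mean balance of the two customers closest in age, after which every column can be scaled to [0, 1] independently.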
C. Data Partition: In this step, we split the data into separate roles for training and testing rather than working with all of the data at once. The training data contains the input data with the expected output; roughly 60% of the original dataset constitutes the training set, and the other 40% is the testing set, which validates the model and checks its accuracy. Here, we partitioned the original insurance dataset into train and test sets with a 60/40 split.
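The 60/40 partition described above can be sketched as a simple shuffled split (an illustration only; the paper uses R, and the seed and data here are hypothetical):

```python
import random

def train_test_split(rows, train_frac=0.6, seed=42):
    """Shuffle the rows and split them into train/test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(4000))        # stand-in for the 4000 customer records
train, test = train_test_split(data)
```

Shuffling before splitting matters: if the records are ordered (e.g., by campaign date), a naive head/tail split would put systematically different customers in the two sets.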
D. Supervised Learning Algorithms: Supervised learning is a machine learning technique that infers a function from training data without being explicitly programmed. Learning is said to be supervised when the desired outcome is already known. After partitioning, the next step is to build the model with the training sample set.
First, the target variable is chosen. We selected car insurance as our target variable, and the other attributes in the dataset are taken as predictors to develop the predictive model. We want a model that predicts who will buy the car insurance service during the campaign. For this, we need to segregate the clients who buy car insurance in the campaign based on the highly significant input variables.
In this paper, we use the Random Forest and Extreme Gradient Boosting algorithms to build the model, and we evaluate which algorithm gives better performance. Random Forest: Before we move to random forests, it is necessary to have a look at decision trees.
What is a Decision Tree? A decision tree can be used for both classification and regression. Unlike linear models, tree-based predictive models can give high accuracy. Decision trees are mostly used for classification problems. The tree segregates the clients based on the values of the predictor variables and identifies the variable that creates the most homogeneous sets of clients.
In our case, the decision variable is categorical. Why Random Forest? The random forest is one of the most frequently used predictive modeling and machine learning techniques. In a normal decision tree model, one decision tree is built to identify the potential client; in a random forest, a number of decision trees are built during the process, and a vote from each of the decision trees is considered in deciding the final class of an object. Model Description: Sampling is one of the preprocessing methods; it selects a subset of the original samples and is mainly used to balance the data for classification.
In our model, we used an under-sampling approach to balance the data. Under sampling reduces the majority class to make its frequency closer to that of the rarest class. The original insurance data is balanced with under sampling, and this balanced sample is then used by the Random Forest algorithm to build the model, which randomly generates n trees to build an effective model. Extreme Gradient Boosting: Another classifier is extreme gradient boosting. XGBoost yields an immensely strong predictive model, and the algorithm is reported to run as much as ten times faster than existing implementations.
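The random under-sampling step used before the Random Forest can be sketched as follows (an illustration in Python rather than the paper's R; the class sizes are hypothetical):

```python
import random

def random_undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until every class has as
    many samples as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_min = min(len(group) for group in by_class.values())
    xs, ys = [], []
    for y, group in by_class.items():
        for x in rng.sample(group, n_min):   # keep n_min per class
            xs.append(x)
            ys.append(y)
    return xs, ys

X = list(range(100))
y = [1] * 90 + [0] * 10                      # 9:1 imbalance
Xb, yb = random_undersample(X, y)
```

The cost of this approach is that most majority-class examples are discarded, which is why the paper also tries over sampling for the XGBoost model.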
XGBoost can be used for regression, classification, and also ranking. One of the most interesting things about XGBoost is that it is also called a regularized boosting technique; the regularization helps to reduce over-fitting. Over-fitting is the phenomenon in which the learned model fits the given training data so tightly that it becomes inaccurate at predicting the outcomes of unseen data. Model Description: In our model, we first used an over-sampling method to balance the classes.
Sampling techniques can be used to obtain better prediction performance in the case of imbalanced classes using R and caret. Over sampling randomly duplicates samples from the class with few instances. Here, we applied over sampling to the training set to improve the performance of the model. The balanced samples are then passed to XGBoost as the training set, and XGBoost builds a binary classification model on the insurance data. Afterwards, the model is validated with the test set.
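The random over-sampling step (done in the paper with caret in R) amounts to duplicating minority samples; a minimal Python sketch with hypothetical class sizes:

```python
import random

def random_oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until every class is
    as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    n_max = max(len(group) for group in by_class.values())
    xs, ys = [], []
    for y, group in by_class.items():
        xs.extend(group)
        ys.extend([y] * len(group))
        for _ in range(n_max - len(group)):  # top up with duplicates
            xs.append(rng.choice(group))
            ys.append(y)
    return xs, ys

y = [1] * 90 + [0] * 10                      # 9:1 imbalance
Xb, yb = random_oversample(list(range(100)), y)
```

Note that over sampling should be applied only to the training set, as the text states; duplicating samples before the train/test split would leak copies of training points into the test set.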
This produces much better prediction performance than the random forest algorithm.

E. Model Evaluation: Performance analysis of a classification problem is based on the confusion matrix of the predicted results. In this paper, we use the following metrics to evaluate the performance of the classification algorithms: precision, recall, and F-measure. Precision is the fraction of predicted positive instances that are actually relevant; it is also called the positive predictive value. Recall is the fraction of relevant instances that are retrieved out of the total number of relevant instances. F1-measure is the harmonic mean of precision and recall and represents the overall performance:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 * Precision * Recall / (Precision + Recall)

where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives.
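These metrics follow directly from the confusion-matrix counts; a small self-contained Python sketch (illustrative labels, not the paper's results):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical campaign outcomes
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]   # hypothetical model predictions
p, r, f = precision_recall_f1(y_true, y_pred)
```

On imbalanced data such as this insurance campaign, these metrics are far more informative than plain accuracy, which a model can maximize by always predicting the majority class.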