A huge amount of data is generally referred to as Big Data. It
is enormous in volume, diverse in variety, and arrives at high velocity.
Such data is useless unless it is examined to uncover new
correlations, customer experiences, and other useful information that can help
an organization make more informed business decisions. Big data is widely
applied in sectors such as healthcare, insurance, and finance.

Big data in the insurance sector is one of the most promising applications.
The traditional marketing system of insurance is an offline sales business:
companies generally sell insurance policies by calling and visiting customers.
This fixed marketing system achieved good results in the past. Currently,
however, many new private insurance companies have entered the marketplace,
which creates healthy competition. On the other hand, people's willingness to
pay for insurance services has also increased. Therefore, understanding the
needs and purchase plans of clients is essential for insurance companies to
raise their sales volume.

Big data technology supports the transformation of insurance companies. The
lack of principle and innovation in traditional marketing, badly structured
insurance data, and unclear customer purchasing characteristics lead to
imbalanced data, which makes user classification and insurance product
recommendation difficult.

Decision making is difficult with an imbalanced data
distribution. To solve this problem, resampling techniques are usually used
to construct balanced training datasets, which improves
the performance of the predictive model.

The main purpose of this paper is to identify potential customers with the
help of big data technology. The paper not only provides a good strategy for
identifying potential clients but also acts as a reference for classification
problems.

We propose a supervised learning algorithm called ensemble
decision tree to analyze potential customers and their major

This paper is organized as follows. Section II introduces
the current research status of machine learning. Section III puts forward the
classification model and intelligent recommendation algorithm based on the
XGBoost algorithm for insurance business data, and analyzes its efficiency.
Section IV gives the experiments and results. Section V presents the analysis
of the results. Section VI gives the conclusion and future work.


The classification problem for US bank insurance business
data has an imbalanced data distribution. This means that the ratio between
the positive and negative classes is extremely unbalanced, so prediction
models generated directly by supervised learning algorithms such as SVM and
Logistic Regression are biased toward the larger class. For example, the ratio
between the positive and negative classes may be 100:1. A model trained
directly on such data does not help in prediction, because it tends to
predict the larger class almost always.
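
To make the 100:1 example concrete, the following minimal Python sketch counts what happens when a hypothetical classifier always predicts the larger class. It is an illustration only (the paper's experiments were carried out in R), and the class counts are made-up numbers chosen to match the 100:1 ratio, not values from the insurance dataset.

```python
# Hypothetical class counts matching a 100:1 ratio (not taken from the paper's dataset).
n_majority = 10_000
n_minority = 100

# A degenerate model that always predicts the majority class gets every
# majority sample right and every minority sample wrong.
correct = n_majority
total = n_majority + n_minority

accuracy = correct / total        # high, even though the model learned nothing
minority_recall = 0 / n_minority  # no minority sample is ever identified
```

The accuracy is about 99% while the recall on the smaller class is zero, which is why accuracy alone is a misleading measure on imbalanced data.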

An imbalanced class distribution affects the performance of a classifier, so
some technique should be applied to deal with it. One approach is sampling,
which rebalances the dataset. Sampling techniques are broadly classified into
two types: under sampling and over sampling. Under sampling reduces the
majority class (e.g. Random Under Sampling, RUS), while over sampling adds
samples to the minority class (e.g. Random Over Sampling, ROS). The drawback
of ROS is redundancy in the dataset, which can again cause a classification
problem: the classifier may not recognize the minority class well. To overcome
this problem, SMOTE (Synthetic Minority Over-sampling Technique) is used. With
the help of K-Nearest Neighbors (KNN), it creates additional synthetic samples
that lie close to existing minority class samples and their nearest neighbors
in order to rebalance the dataset [2].

Sampling methods are also divided into non-heuristic and heuristic methods.
Non-heuristic sampling randomly removes samples from the majority class in
order to reduce the degree of imbalance. Heuristic sampling distinguishes
samples based on a nearest neighbor algorithm [7].

Another difficulty in classification is data quality, in particular the
existence of missing data. Frequent occurrence of missing data gives biased
results. Dataset attributes are mostly dependent on each other, so the
correlation between attributes can be used to estimate the missing values.
Replacing missing values with probable values is called imputation [6].
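
The SMOTE idea described above, creating a synthetic minority sample between an existing sample and one of its nearest neighbors, can be sketched in a few lines of Python. This is a simplified illustration, not the implementation used in the paper (which works in R); the function name `smote_sample` and the parameter `k` are hypothetical, and all features are assumed to be numeric.

```python
import math
import random

def smote_sample(minority, k=3, rng=random):
    """Create one synthetic point between a minority sample and one of
    its k nearest minority-class neighbors (numeric features assumed)."""
    base = rng.choice(minority)
    # Nearest minority-class neighbors of the chosen sample, excluding itself.
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    # Interpolate at a random position on the segment base -> neighbor.
    gap = rng.random()
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
```

Because each new point lies on a line segment between two real minority samples, the result is similar to the existing data without being an exact duplicate, which is the redundancy problem of ROS that SMOTE is meant to avoid.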

One of the challenges in big data is data quality. The quality of
the data must be ensured; otherwise it can lead to wrong predictions.
One significant data quality problem is missing data.

Imputation is a method for handling missing data: it replaces the missing
values with estimated ones. Imputation has the advantage of handling missing
data independently of the learning algorithm, which also allows the researcher
to select the imputation method most suitable for a particular
circumstance [3].


There are many imputation methods for missing value
treatment. Some widely used ones are case substitution, mean
and mode imputation, and predictive models. In this paper we have built a
predictive model for missing value treatment.
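
The paper's preprocessing step names KNN as the estimator for missing values. A minimal Python sketch of that idea is given here under stated assumptions: the function name `knn_impute` is hypothetical, all columns are numeric, only one column contains missing entries (marked as `None`), and the estimate is the plain mean over the `k` nearest complete rows. The paper does not specify these details.

```python
import math

def knn_impute(rows, target_col, k=3):
    """Fill None values in target_col with the mean of that column
    over the k nearest rows that have the value present."""
    complete = [r for r in rows if r[target_col] is not None]
    filled = []
    for row in rows:
        if row[target_col] is not None:
            filled.append(list(row))
            continue

        # Distance uses every column except the one being imputed.
        def dist(other):
            return math.sqrt(sum((a - b) ** 2
                                 for i, (a, b) in enumerate(zip(row, other))
                                 if i != target_col))

        nearest = sorted(complete, key=dist)[:k]
        estimate = sum(r[target_col] for r in nearest) / len(nearest)
        new_row = list(row)
        new_row[target_col] = estimate
        filled.append(new_row)
    return filled
```

For example, with rows `[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, None]]` and `k=1`, the missing entry is filled from the row whose first column is closest to 2.1.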

There are various machine learning algorithms for solving both
classification and regression problems. Machine learning is the process of
designing a system that can learn and improve automatically
without being explicitly programmed. Machine learning algorithms are classified
into three types: supervised learning, unsupervised learning, and reinforcement
learning. In this paper, we use supervised machine learning algorithms to
build the model. Some supervised learning algorithms are Regression, Decision
Tree, Random Forest, KNN, and Logistic Regression [8]. A decision tree can be
used for both classification and regression. In decision analysis, a decision
tree can be used to visually and explicitly represent decisions and decision
making. The tree has two important entities: decision nodes and leaves. The
decision nodes are where the data is split, and the leaves are the decisions
or final outcomes. A classification tree is a type of decision tree where the
outcome is a categorical variable such as ‘fit’ or ‘unfit’.
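
How a decision node chooses where to split can be illustrated with a short Python sketch. The paper does not state which split criterion it uses; Gini impurity is assumed here because it is a common choice for classification trees, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split
    'value <= threshold' on a single numeric feature."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best
```

A full tree repeats this search over every feature at every decision node and stops at the leaves; the sketch shows only the single-feature step.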

Random forest is one of the best ensemble methods. It is used for both
classification and regression [5]. A random forest is a collection of many
decision trees, each grown to its full depth. It also has the advantage of
automatic feature selection [4].

Gradient Boosting looks to reduce the error with each consecutive model,
until one final model is produced. The main aim of all machine learning
algorithms is to build the strongest predictive model while also accounting
for computational efficiency. This is where the XGBoost algorithm plays an
important role. XGBoost (eXtreme Gradient Boosting) is a direct application of
Gradient Boosting to decision trees. It uses a more regularized model
formalization to control over-fitting, which gives better performance [8].
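The idea of reducing the error with each consecutive model can be sketched as follows. This is a deliberately simplified illustration of plain gradient boosting for squared error with one-split trees (stumps) on a single numeric feature; it is not XGBoost, which adds regularization and other refinements, and all names and default values are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that minimizes squared error
    on the residuals (requires at least two distinct x values)."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return (predict, training_errors) for a boosted ensemble of stumps."""
    base = sum(ys) / len(ys)  # initial model: the mean of the targets
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    errors = []
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        errors.append(sum(r * r for r in residuals) / len(ys))
        stumps.append(fit_stump(xs, residuals))
    return predict, errors
```

The list `errors` records the mean squared training error before each round, so it shows the error shrinking as stumps are added.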



Model: The traditional sales approach for insurance products is an offline
process, and it has the following disadvantages: (1) there is no customer
evaluation system, so the characteristics of potential customers and their
influence weights are unknown; (2) the data accumulated in this way is usually
of seriously poor quality, which indirectly affects the accuracy of the
classification model [4].

For many classification models, the class distribution
and the correlation between features affect the forecast results. Imbalanced
class data and the independent attributes of the insurance dataset can cause
serious deviation in the results of the classification model. We can handle
these problems with different sampling methods and supervised learning
algorithms.

In this article, we use an over sampling approach with supervised learning
algorithms on a car insurance dataset to build the best predictive
model. The imbalanced data classification problem is resolved with the over
sampling method, and the model is then built with supervised learning
algorithms using the training dataset. Finally, the predictive model is
validated with the test dataset, and the performance of the algorithms is
evaluated on the test dataset using the confusion matrix. Precision, Recall,
and F-measure are additional performance metrics calculated for the
algorithms. The taxonomy of the proposed classification
model is given below:

Figure 1. Taxonomy of proposed classification model


The key objective of this paper is to analyze and understand the needs and
purchase plans of clients and to find who will buy the car insurance service
during the campaign. To do so, we apply different classification algorithms in
R to large-scale insurance data to improve predictive modeling performance. We
collected the data from Kaggle datasets (Kaggle: The Home of Data Science &
Machine Learning).

This is a dataset from one bank in the United States.
Besides usual services, this bank also provides car insurance services. The
bank organizes regular campaigns to attract new clients. The bank has potential
customers’ data, and bank’s employees call them for advertising available car
insurance options. We are provided with general information about clients (age,
job, etc.) as well as more specific information about the current insurance
sell campaign (communication, last contact day) and previous campaigns
(attributes like previous attempts, outcome). We have data about 4000 customers
who were contacted during the last campaign and for whom the results of
campaign (did the customer buy insurance or not) are known.


Data is usually collected for unspecified applications. Data
quality is one of the major issues that need to be addressed in big data
analytics. Problems that affect data quality include: (1) noise and outliers,
(2) missing values, and (3) duplicate data. Preprocessing is a
technique used to make data more appropriate for analysis. Data cleaning
is the step that handles missing data values. We used predictive model
imputation to predict the missing values from the non-missing data.
Here, we used the KNN algorithm, which estimates a missing value with the help
of the values of its nearest neighbors. Data transformation is another
preprocessing method, used to normalize data. Normalization is a process that
rescales the attributes of the dataset to a common range. Here, we used Min-Max
normalization, which scales the data between 0 and 1:

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where x is the vector that we are going to normalize, and min and max are
the minimum and maximum values in x. Once the
dataset is pre-processed, it is ready for data partition.
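
Min-Max normalization maps the smallest value of a vector to 0 and the largest to 1. A minimal Python sketch follows; the function name is hypothetical, and the handling of a constant vector (where the denominator would be zero) is an assumption added for the example, since the paper does not discuss that case.

```python
def min_max_normalize(x):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # All values equal: the formula would divide by zero,
        # so every value is mapped to 0.0 (an assumed convention).
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]
```

For instance, an age column with the hypothetical values 20, 30, 40 and 60 becomes 0.0, 0.25, 0.5 and 1.0.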

Data Partition:

In this step, we split the data into separate training and test sets rather
than working with all of the data at once. The training
data contains the input data together with the expected output. About 60% of
the original dataset constitutes the training dataset, and the other 40% is
used as the testing dataset, which validates the model and checks
its accuracy. Here, we partitioned the original insurance dataset
into train and test sets with a 60/40 split.
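
A 60/40 partition can be sketched as below. The sketch assumes a simple shuffle followed by a cut at exactly 60% of the rows; the paper's wording suggests rows may instead have been assigned to each set with probabilities 0.6 and 0.4 in R, which gives only approximately that ratio. The function name and the seed are hypothetical.

```python
import random

def train_test_split(rows, train_fraction=0.6, seed=42):
    """Shuffle a copy of the rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

With the 4000 customers in the dataset, this yields 2400 training rows and 1600 test rows, and no row appears in both sets.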

Supervised Learning Algorithms:

Supervised learning is a machine learning technique that
infers a function from training data without being explicitly programmed.
Learning is said to be supervised when the desired outcome is already known.

After partitioning, the next step is to build the model with the
training sample set. The target variable is chosen first. We selected car
insurance as our target variable, and the other attributes in the dataset are
taken as predictors to develop the predictive model. We want to create a model
that predicts who will buy the car insurance service during the campaign. In
this problem, we need to segregate the clients who buy car insurance in the
campaign based on highly significant input variables.

In this paper, we use the Random Forest and Extreme
Gradient Boosting algorithms to build the predictive model, and we evaluate
which algorithm gives better performance.


Random Forest: 

Before we move to random forest, it is necessary to look at the decision
tree.

What is Decision Tree?

A decision tree can be used for both classification and
regression, but it is mostly used in classification problems. Unlike linear
models, tree-based predictive models can capture non-linear relationships,
which often gives high accuracy. The tree segregates the clients based on the
values of the predictor variables and identifies the variable that creates the
most homogeneous sets of clients. Here, our decision variable is categorical.

Why Random Forest?

Random forest is one of the most frequently used predictive modeling
and machine learning techniques.

In a normal decision tree, a single tree is built to identify the potential
client, whereas in the random forest algorithm a number of decision trees are
built during the process. A vote from each of the decision trees is
considered in deciding the final class of an object.
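
The voting step can be sketched in Python as follows. Only the aggregation of the per-tree predictions is shown, not the training of the trees; the function name and the example labels are hypothetical.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, if five trees predict 'buy', 'no', 'buy', 'buy' and 'no' for one client, the forest assigns that client to the class 'buy'.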

Model Description:

Sampling is one of the methods in preprocessing: it selects a subset of the
original samples. It is mainly used to balance the classes. In our model,
we used an under sampling approach to balance the data. It reduces the
majority class to bring its frequency closer to that of the rarest class. The
original insurance data is balanced with under sampling, and the resulting
sample is then used in the Random Forest algorithm to build the model.
The algorithm randomly generates n trees to build an effective model.
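
Random under sampling can be sketched as below. The sketch assumes that every class is reduced to exactly the size of the rarest class; the paper only says the frequencies are brought "closer", so the exact target ratio is an assumption, as are the function name and the seed.

```python
import random

def random_under_sample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class
    has as many rows as the rarest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = min(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample without replacement, so no row is duplicated.
        for row in rng.sample(group, target):
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

All rows of the rarest class are kept, and information is discarded only from the larger classes.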

Extreme Gradient Boosting:

Another classifier is extreme gradient boosting. XGBoost
has high predictive power. The algorithm is said to run ten percent faster
than existing algorithms. It can be used for regression, classification,
and ranking.

One of the most interesting things about XGBoost is that
it is also known as a regularized boosting technique, which helps to reduce
over-fitting. Over-fitting occurs when the learning model
fits the given training data so tightly that it becomes inaccurate in
predicting the outcomes of unseen data.

Model Description:

In our model, we first used the over sampling method to balance
the classes. Sampling techniques can be used to get better prediction
performance in the case of imbalanced classes, using R and caret. Over sampling
randomly duplicates samples from the class with few instances. Here,
we applied over sampling to the train set to improve the performance of the
model. The balanced samples are then passed to XGBoost as the train set to
build a binary classification model on the insurance data. After this, the
model is validated with the test set. This produces much better prediction
performance than the random forest algorithm.
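
The over sampling step described above can be sketched in Python as follows. The paper performs this step with R and caret; this sketch only illustrates the idea, assumes that each smaller class is grown to the size of the largest class, and uses hypothetical names throughout.

```python
import random

def random_over_sample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until
    every class matches the size of the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Draw with replacement to fill the gap to the largest class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

As in the text, the function is meant to be applied to the train set only, so that the test set keeps the original class distribution and contains no copies of training rows.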

Model Evaluation:

Performance analysis of classification problems includes
confusion matrix analysis of the predicted results. In this paper, we used the
following metrics to evaluate the performance of the classification
algorithms: Precision, Recall, and F-measure.

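Precision is the fraction of predicted positive instances that are
relevant. It is also called positive predictive value.
Recall is the fraction of relevant instances that have been retrieved over the
total number of relevant instances. F1-measure is the
harmonic mean of precision and recall and represents the overall performance.

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where TP is True Positive, FP is False Positive, TN is True Negative, and FN
is False Negative.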
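
These metrics can be computed directly from the confusion matrix counts, as in the following minimal Python sketch. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention, not something the paper specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

As a made-up example, a model with 40 true positives, 10 false positives and 40 false negatives has precision 0.8 and recall 0.5; the F1 value lies between the two and closer to the lower one.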