Abstract Web pages along with Naive Bayes clustering

Abstract

Personalized webpage recommendation for individual users is now commonplace. Web servers host recommendation systems that analyse and recommend webpages for their users.

These systems use data obtained implicitly from users' Web browsing patterns to recommend webpages. The existing system collects Web logs, generates clusters of similar users, and recommends pages to the user by analysing the logs online. However, online analysis is time-consuming. To reduce this cost and increase the accuracy of recommendation, a method is designed that applies a Firefly-based algorithm for recommending webpages together with Naive Bayes clustering.

User Web logs are first clustered offline using the Naive Bayes clustering technique. A Firefly-algorithm-based similarity measure is then used to find the similarity between the active user's queries and those of the other users in the cluster. The proposed approach uses probability-based clustering, which eliminates odd records while forming clusters.

The Firefly algorithm searches the Web logs present in the active user's cluster and recommends the top pages. Because the Firefly algorithm uses time efficiently, it is used for online processing. Once pages are obtained, they are ranked, and the top pages most relevant to the query are recommended. The efficiency of the system can be evaluated using measures such as precision, recall, F-score, Matthews correlation and fallout rate. The proposed approach is expected to improve time utilization in the online process and to recommend more accurate webpages.

I. Introduction

A webpage recommendation system is a sub-domain of recommendation systems that recommends a set of Web pages to users based on their past browsing patterns. It does so by applying mining techniques to data previously gathered from the users, which in turn discover and extract information from Web documents and services. The major concern is to find reliable and efficient recommendation algorithms.


A recommendation system typically produces its results in one of two ways: collaborative filtering or content-based filtering.

A. Collaborative Filtering

Most recommendation systems make wide use of collaborative filtering for recommending items. This method relies on collecting and processing information on users' behaviours or activities and then predicting items based on their similarity to other users.

The collaborative filtering approach builds a model from the user's past behaviour and the decisions of other similar users. This model is used to predict the items a user is interested in. Since collaborative filtering is independent of machine-analysable content, it is capable of accurately recommending complex items without “understanding” the items themselves.

B. Content Based Filtering

Content-based filtering is a widely used approach for designing recommendation systems.

This technique relies on a description of the item along with a profile of the user's preferences. In a content-based recommendation system, keywords are treated as the user's interests. It uses a set of distinct properties of an item to find and recommend items with the same properties. The two approaches are often combined as hybrid recommendation systems. These algorithms try to recommend items by examining the items that a user liked in the past or likes at present.

In general, the items of a candidate set are compared with items previously rated by the user, and the best-matching items are recommended.

II. Literature survey

A recommendation system plays a major role in recommending personalized items to the users of a web service based on their interests. The web contains rich and dynamic information. The amount of information on the web is growing continuously, as are the number of web sites and the number of webpages per site. Predicting the behaviour and needs of a web user has therefore gained importance. Many webpage recommendation systems have been developed in the past; since they compute recommendations online, their time utilization should be efficient. A system [4] that uses a support vector machine (SVM) learning-based model was developed for computing the similarity between two items, and it performed better than the latent factor approach for group recommendations. Since a matrix representation was used, the data sparsity problem was solved.

However, the system did not scale stably when the size of the group increased dynamically.

Hybrid recommender systems, which combine several recommendation techniques, were designed in [5]. Combining techniques offsets the weaknesses that exist when only one recommender system is used. There are several ways in which the systems can be combined, such as a weighted hybrid recommender, in which the score of a recommended item is computed from the results of all the recommendation techniques available in the system. However, data sparseness was still a problem: the system may generate weak recommendations if few users have rated the same items, and it does not overcome the cold start problem.

Hyperspectral sensors can acquire hundreds of contiguous bands over a wide electromagnetic spectrum for each pixel. To reduce computational cost and avoid using an actual classifier within the band searching process, an improved firefly algorithm based band selection method [8] was used. The Firefly algorithm is an evolutionary optimization algorithm proposed by Yang [13]. After the parameters are initialized, the brightness is calculated with the objective function. Then the movement states are evaluated and the bands are selected. The Firefly algorithm also converged quickly even when the data size was large.

Further, to improve the accuracy of similarity measures, firefly algorithm based similarity measures were also introduced [10]. This work considered separate effects for the ratings of users with similar opinions and of users with conflicting opinions. To generate the initial population of fireflies, both halves of the population were generated randomly.

Mean absolute error, obtained from the difference between the predicted rating and the real rating, was chosen as the objective function to measure recommendation accuracy. An optimal similarity measure formed as a simple linear combination of values and ratios of ratings for user-based collaborative filtering provided better results. It increased the speed of finding the nearest neighbours of the active user and reduced the computation time. The similarity function based on the Firefly algorithm was simpler than the equations used in traditional metrics; therefore, the method provided recommendations faster than traditional metrics.

Graph colouring problems are generally discrete.

Algorithms for discrete problems are quite complex. A new algorithm that is based on similarity and discretizes the firefly algorithm directly, without any other hybrid algorithm, was developed [11]. It was adaptable to dynamic graph sizes.

A system was developed [6] for assigning an electronic document to one or more predefined categories or classes based on its textual content, using an agglomerative clustering algorithm. This type of clustering, with the sample correlation coefficient as the similarity measure, allowed a high indexing-term space reduction factor together with a gain in classification accuracy.

In order to minimize noise and outlier data, a modified DBSCALE algorithm using Naïve Bayes has been designed [7]. This algorithm is essentially a probability-based utility function. The function is used to estimate the outlier cluster data and to increase the accuracy of the algorithm for a given threshold value. Since Naïve Bayes is a probability-based function, it removes outlier cluster data and increases the accuracy according to the threshold value. It also computes the maximum a posteriori hypothesis for outlier data.

The memory-based collaborative system uses matrix-based computation and solves the data sparsity problem, but the system does not scale stably when the size of the group increases dynamically. A hybrid system could help overcome the scalability issue, but it again leads to the cold start problem. To eliminate outliers as well as to overcome the other two problems, Naive Bayes clustering, a probability-based method, has been used in the past.

The Firefly algorithm converges quickly and searches all possible subsets with good time utilization. Thus, to design an efficient recommendation system, the Naïve Bayes method can be used for offline clustering. Since the online time complexity should be low, the Firefly algorithm, which is more efficient in terms of time utilization, can be used for calculating similarity online. Combining these two techniques might increase the accuracy of the recommendation system and result in efficient time utilization.

III. Overview of the proposed work

Initially, the web log files are obtained from America Online Inc. [1].

The log files consist of five fields: an anonymous ID for each individual user, the query of each user along with the query time, the URL that the user followed, and its rank in the result. These logs are collected and grouped by anonymous ID.
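The grouping step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a tab-separated file with a header row naming the five fields (`AnonID`, `Query`, `QueryTime`, `ItemRank`, `ClickURL`), and the rows in `SAMPLE_LOG` are made up for the example.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows; the tab-separated layout and header names are assumptions.
SAMPLE_LOG = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tfirefly algorithm\t2006-03-01 10:00:00\t1\thttp://a.example\n"
    "2\tnaive bayes\t2006-03-01 10:02:00\t3\thttp://b.example\n"
    "1\tweb logs\t2006-03-01 10:05:00\t\t\n"
)

def group_by_user(log_text):
    """Return a dict mapping each anonymous ID to the list of its log rows."""
    reader = csv.DictReader(
        io.StringIO(log_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    groups = defaultdict(list)
    for row in reader:
        groups[row["AnonID"]].append(row)
    return dict(groups)

logs_by_user = group_by_user(SAMPLE_LOG)
for anon_id, rows in logs_by_user.items():
    print(anon_id, [r["Query"] for r in rows])
```

Rows without a click (empty `ItemRank` and `ClickURL`) are kept here; whether to drop them is a design choice the text does not specify.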

The URLs visited by all the users are obtained, and their content is downloaded and processed. Processing includes removal of stop words from the URL data and keyword extraction. Similar users are clustered based on the fetched keywords using the Naïve Bayes clustering technique, which provides more efficient clusters than clustering by association rules. The created clusters are passed to the online component. In the online process, when an active user issues a query, the keywords are extracted from the query. The similarity between the extracted keywords and the other users in the active user's cluster is calculated using the Firefly similarity measure. The similarity values are sorted along with the web pages browsed by similar users in the cluster.

The top k web pages are recommended to the active user as a result.

IV. The proposed work

The proposed system follows a linear process: it first collects and processes the web logs, then clusters similar users with the Naïve Bayes clustering technique, and finally generates recommendations based on a similarity measure derived from the firefly algorithm.

A. Pre-processing of Web Logs

The web logs are collected from AOL Inc. [1]. The data set consists of 20 million web queries from 650 thousand real users over 3 months.

The data set includes the anonymous ID, query, query time, item rank and click URL. The log file contains many users along with the web pages visited by them. It is validated and split by anonymous ID, so that each user's records are stored in an individual file. The content of each URL is fetched and downloaded. The downloaded text then undergoes stop-word removal and stemming, and the final keywords are extracted. Features such as keywords, timing, frequency, click URL and revisit are fetched.
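Stop-word removal, stemming and keyword extraction could look roughly like the sketch below. The stop-word list and the suffix-stripping rule in `naive_stem` are deliberately tiny stand-ins chosen for illustration; the source does not say which stop list or stemmer was used.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not the list used in the paper).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
              "is", "are", "for", "on", "with"}

def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text, top_n=5):
    """Return the top_n most frequent stemmed, non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return [word for word, _ in Counter(stems).most_common(top_n)]

page_text = "The recommended pages are recommending pages for the users"
print(extract_keywords(page_text, top_n=2))
```

In practice an established stemmer would replace `naive_stem`; the point here is only the order of the steps: tokenize, drop stop words, stem, then count.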

The user profile is constructed from these features, which are taken from the user log files:

Keywords: the keywords extracted from the content of the URL.
Timing: the time that the user spent on a particular URL.
Frequency: the number of times the user visited the URL.
Clickstream: the number of clickstream pages visited by the user.
Revisit: whether the user revisited the web page.

The keywords are generated from the data fetched from the URL. The timing for each URL is estimated from the given date and time by calculating the difference between successive URLs searched within a single day, subject to some time constraints. Frequency is calculated as the number of times the user clicked the URL. The clickstreams are the links clicked by the user for additional information. The timing of revisits is used to decide whether or not the user strongly preferred the page.
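As a rough sketch of how the timing, frequency and revisit features might be derived for one user: the 30-minute cap `MAX_DWELL` is a hypothetical value standing in for the unspecified "time constraints", and the event list is made up.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

MAX_DWELL = timedelta(minutes=30)  # hypothetical cap; the source gives no value

def build_profile(events):
    """events: list of (timestamp string, url) pairs for a single user."""
    parsed = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url) for ts, url in events
    )
    frequency = Counter(url for _, url in parsed)
    timing = defaultdict(float)
    # Time on a URL is estimated as the gap to the next event on the same day.
    for (t0, url), (t1, _) in zip(parsed, parsed[1:]):
        if t0.date() == t1.date():
            timing[url] += min(t1 - t0, MAX_DWELL).total_seconds()
    revisit = {url: count > 1 for url, count in frequency.items()}
    return {"frequency": dict(frequency), "timing": dict(timing), "revisit": revisit}

profile = build_profile([
    ("2006-03-01 10:00:00", "a.example"),
    ("2006-03-01 10:05:00", "b.example"),
    ("2006-03-01 11:00:00", "a.example"),
])
print(profile)
```

The last event of a day has no following event, so it contributes no time; how the original system treats that case is not stated.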

The information from the URL is thus collected and processed to obtain the features of the user.

B. Naïve Bayes Clustering

Clustering, also known as unsupervised classification, is a descriptive task with many applications. Clustering is the decomposition or partition of a data set into groups such that the objects in one group are similar to each other but as different as possible from the objects in other groups. The three main approaches to clustering are partition-based clustering, hierarchical clustering and probabilistic model-based clustering. Probabilistic model-based clustering is a soft clustering in which an object can belong to several clusters according to a probability distribution.

A clustering is useful if it produces some interesting insight into the problem being analysed. Naïve Bayes clustering is a probabilistic clustering technique based on Bayes' theorem with a strong independence assumption between features. The feature variables can be discrete or continuous. This probabilistic clustering handles both nominal and numeric variables in the data set, and its novelty lies in the use of mixtures of truncated exponentials (MTE) densities to model the numeric variables. In Naïve Bayes clustering, the class is the only root variable and all the attributes are conditionally independent given the class. The clustering problem reduces to taking a data set of instances and a previously specified number of clusters (k), and working out each cluster's distribution and the population distribution between the clusters. To obtain these parameters, the expectation maximization (EM) algorithm is used. Naïve Bayes clustering is thus a probability-based technique.
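To make the EM step concrete, here is a minimal sketch of Naïve Bayes clustering for binary features (for example, keyword present or absent). It is a simplification made for illustration: it uses Bernoulli features with Laplace smoothing rather than the MTE densities mentioned above, and the function name, tolerance and toy data are assumptions.

```python
import numpy as np

def naive_bayes_em(X, k, n_iter=200, tol=1e-9, seed=0):
    """Soft-cluster the binary rows of X into k clusters using EM."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    resp = rng.dirichlet(np.ones(k), size=n)  # random initial soft assignments
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: cluster priors and per-cluster feature probabilities.
        weights = resp.sum(axis=0)
        prior = weights / n
        theta = (resp.T @ X + 1.0) / (weights[:, None] + 2.0)  # Laplace smoothing
        # E-step: features are conditionally independent given the cluster.
        log_p = (np.log(prior + 1e-12)
                 + X @ np.log(theta).T
                 + (1.0 - X) @ np.log(1.0 - theta).T)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        ll = log_norm.sum()
        if ll - prev_ll < tol:  # stop once the log-likelihood no longer improves
            break
        prev_ll = ll
    return resp.argmax(axis=1), resp, ll

# Toy data: two groups of users with disjoint keyword sets.
X = np.array([[1, 1, 0, 0]] * 3 + [[0, 0, 1, 1]] * 3, dtype=float)
labels, resp, ll = naive_bayes_em(X, k=2)
print(labels)
```

The rows of `resp` are the soft memberships; a record whose largest membership stays low could be flagged as an outlier, although the threshold for that is not given in the text.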

An item belongs to a cluster if and only if it has a relation to it. This helps in eliminating outlier data during clustering. It also provides proper clustering with fewer computations. The given dataset is divided into two parts, one for training and the other for testing. For each record in the test and train databases, the distribution of the class variable is computed. According to the obtained distribution, a value for the class variable is simulated and the record is inserted in the corresponding cluster. The log-likelihood of the new model is then computed.

If it is higher than that of the initial model, the process is repeated. Otherwise, the process is stopped and the obtained clusters are returned.

C. Optimisation Using Firefly Algorithm

The Firefly algorithm is an evolutionary algorithm based on the behaviour of fireflies. Fireflies live in colonies and cooperate for the survival of the colony.

Generally, in order to model the behaviour of fireflies, three assumptions are made: all fireflies are homogeneous; the attractiveness of each firefly is related to its level of brightness; and the brightness of a firefly is determined by an exponential objective function. Each firefly emits a kind of light by which it attracts other fireflies.

The amount of light received depends on parameters such as the distance and the absorption coefficient of the surroundings. The longer the distance, the less light is received. Likewise, in surroundings with a high light absorption coefficient, such as foggy weather, the intensity of the light decreases. The key point is that every firefly, regardless of its gender, is attracted to and moves toward a brighter firefly.
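The dependence on distance and absorption can be illustrated with the attractiveness rule commonly used in the standard firefly algorithm, beta(r) = beta0 * exp(-gamma * r^2). The sketch below assumes that standard form; the source does not state the exact variant or parameter values it uses, so `beta0`, `gamma` and `alpha` are illustrative.

```python
import math
import random

def attractiveness(beta0, gamma, distance):
    """Attractiveness seen at a given distance; gamma is the absorption coefficient."""
    return beta0 * math.exp(-gamma * distance ** 2)

def move_towards(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.0, rng=random):
    """Move firefly i towards a brighter firefly j, with optional random step alpha."""
    r = math.dist(x_i, x_j)
    beta = attractiveness(beta0, gamma, r)
    return [a + beta * (b - a) + alpha * (rng.random() - 0.5)
            for a, b in zip(x_i, x_j)]

# Attraction weakens with distance and with a higher absorption coefficient.
print(attractiveness(1.0, 1.0, 0.5), attractiveness(1.0, 1.0, 2.0))
print(attractiveness(1.0, 0.1, 2.0), attractiveness(1.0, 5.0, 2.0))
print(move_towards([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

With `alpha=0.0` the move is deterministic, which makes the effect of `gamma` easy to inspect; a non-zero `alpha` adds the random exploration term.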

Each firefly has a light intensity of its own. The key concept is that a firefly with low light intensity is always attracted to a firefly with high light intensity. This concept can be used for calculating similarity. A firefly-based similarity measure yields unique and distinguishable results, which is a useful feature for ranking. The algorithm can deal with highly non-linear, multi-modal optimization problems naturally and efficiently. It does not use velocities, so it avoids the problems associated with velocity in particle swarm optimization (PSO). Its speed of convergence is high, as is its probability of finding the global optimum.

It can be flexibly integrated with other optimization techniques to form hybrid tools. It does not require a good initial solution to start its iteration process. Each web page visited by user i is considered a firefly. The number of users who visited a particular page is taken as the light intensity of the firefly.

The objective function is formulated based on frequency and duration. Frequency is calculated as the ratio of the number of visits to a page to the average number of visits over all pages. Duration is the ratio of the time spent on a page to the total time spent on all the pages visited by the user. Thus, the objective function can be defined as in equation (5.1):

\[ \mathrm{Interest}(i) = \frac{2 \cdot \mathrm{Frequency}(i) \cdot \mathrm{Duration}(i)}{\mathrm{Frequency}(i) + \mathrm{Duration}(i)} \tag{5.1} \]

The interest of all users in the cluster is calculated.
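Equation 5.1 is the harmonic mean of the normalized frequency and duration, and can be computed as in the sketch below. The function name and the input dictionaries are illustrative; the code assumes at least one page and a positive total duration.

```python
def interest_scores(visits, durations):
    """visits: page -> visit count; durations: page -> seconds spent, for one user."""
    mean_visits = sum(visits.values()) / len(visits)
    total_duration = sum(durations.values())
    scores = {}
    for page in visits:
        freq = visits[page] / mean_visits        # visits relative to the average
        dur = durations[page] / total_duration   # share of the total time
        scores[page] = 2 * freq * dur / (freq + dur) if freq + dur > 0 else 0.0
    return scores

scores = interest_scores(
    visits={"a.example": 3, "b.example": 1},
    durations={"a.example": 60, "b.example": 60},
)
print(scores)
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a page needs both frequent visits and a meaningful share of the time to score highly.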

The pages to be recommended are then found by applying the page rank algorithm [2] to the obtained result. The output of the page rank algorithm is given to the user as the recommended web pages.

D. Ranking the Web Pages

The resulting set of web pages should be ranked in the order in which the user is likely to have the highest interest.

Thus, they are sorted based on the interest of the active user. The association rule checks the maximum possible combinations, which provides more accurate pages.

E. Recommendation Process

The URLs to be recommended are identified based on the ranking and the similarity measure. The similarity measure is calculated among the users by comparing their similar interests. From the resulting pages, the page rank algorithm is used to rank the pages most relevant to the user.

Thus, the resulting URLs are recommended to the users, so the recommended web pages will be more relevant. The use of Naive Bayes clustering will eliminate the outliers, and the Firefly-based similarity calculation will check all the subsets of the clusters.
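The final selection of the top k pages could be sketched as below. This is only an illustration under stated assumptions: each neighbour in the active user's cluster comes with a precomputed similarity value, and a page's score is the sum of the similarities of the neighbours who visited it. The source does not specify this aggregation, and the page rank step is omitted here.

```python
import heapq
from collections import defaultdict

def recommend_top_k(neighbours, k=3, already_seen=()):
    """neighbours: list of (similarity, pages visited) pairs from the user's cluster."""
    scores = defaultdict(float)
    for similarity, pages in neighbours:
        for page in pages:
            if page not in already_seen:
                scores[page] += similarity  # assumed aggregation: sum of similarities
    return heapq.nlargest(k, scores, key=scores.get)

neighbours = [
    (0.9, ["p1", "p2"]),
    (0.5, ["p2", "p3"]),
    (0.1, ["p4"]),
]
top_pages = recommend_top_k(neighbours, k=2)
print(top_pages)
```

Passing the active user's own history as `already_seen` keeps the recommendations to pages the user has not yet visited.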