Data Mining for Web Personalization
Patrick Dudas

Outline
  Personalization
  Data mining
  Examples
  Web mining
  MapReduce
  Data Preprocessing
  Knowledge Discovery
  Evaluation
  Information High

Personalization
  The goal of the data mining approach is automatic personalization
  Automatic personalization:
    Content-based
    Collaborative
    Rule-based

Rule-based (Brief overview)
  Create decision rules
    Implicitly/Explicitly
  Highly domain dependent
    Rules nontransferable
  Profiles are based on user input
    Biased
    Static
    Degrade over time

Content-based (Brief overview)
  Profile based on users' past experiences and their interests (ratings)
  Think Amazon, Pandora, eBay...
  Vector similarities based on cosine similarity
  Bayesian classification
  Remember: Ratings = Profile = Recommendation

Collaborative (Brief overview)
  Creating groups of users based on ratings
  Nearest neighbor approach
  Once users are grouped, recommendations based on the other neighbors are presented
  More users or items = more dimensions of data
  Dynamic or real-time not applicable

Data Mining
  Data-rich descriptions
  Large volumes of data
  Reliable models
  Automated data collection
  Evaluate results/make decisions
  Integration with existing data sources
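
The cosine similarity and nearest neighbor ideas from the content-based and collaborative overviews can be illustrated with a small sketch. The user names, rating vectors, and function names below are hypothetical and chosen only for illustration; a real system would work with much larger and sparser rating data.

```python
import math

# Hypothetical rating profiles: one rating per item, 0 means "not rated".
RATINGS = {
    "alice": [5, 3, 0, 1],
    "bob": [4, 3, 0, 1],
    "carol": [0, 0, 5, 4],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(target, profiles, k=2):
    """Return the k users whose profiles are most similar to the target's."""
    scored = [
        (cosine_similarity(profiles[target], vector), user)
        for user, vector in profiles.items()
        if user != target
    ]
    scored.sort(reverse=True)
    return [user for _, user in scored[:k]]
```

Calling `nearest_neighbors("alice", RATINGS, k=1)` picks the single most similar user; items that neighbor rated highly would then be candidates for recommendation.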

Examples of Large Datasets
  http://aws.amazon.com/datasets
  Featured data sets:
    Illumina - Jay Flatley (CEO of Illumina) Human Genome Data Set, Science 315(5814): 972. (350 GB)
    YRI Trio Dataset (700 GB)
    Sloan Digital Sky Survey DR6 Subset (160 GB)
  Genome, survey data, Google Books n-gram corpuses, traffic statistics, OpenStreetMap dataset, Wikipedia traffic

Data Mining Web Personalization
  Recommendations based on Web objects:
    Items
    Pages
    Documents
  Navigation by links

Web mining
  Pros: Personalization (duh.), real-time, more enriched datasets
  Cons: Privacy issues, building complex systems that misrepresent the individual
  Extend the Data Mining Paradigm

Data Preparation and Transformation
  Web logs
    Date/time usage
    Site information
    Resource requested (image, video, etc.)
  Site files/meta-data
  The power of the cookie
    Server-side cookies!

Data Preparation and Transformation (cont.)
  Pageview:
    User actions (where they clicked and the path)
    User events (what they are trying to accomplish)
  Session:
    Sequence of page views
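
As a rough sketch of how page views from a web log might be grouped into sessions: the log records, field layout, and the 30-minute inactivity timeout below are all assumptions made for illustration, not something the slides specify.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical log records: (user id from a cookie, timestamp, resource requested).
LOG = [
    ("u1", "2011-03-01 10:00:00", "/home"),
    ("u1", "2011-03-01 10:05:00", "/products"),
    ("u2", "2011-03-01 10:06:00", "/home"),
    ("u1", "2011-03-01 11:30:00", "/home"),
]


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group each user's page views into sessions split by inactivity gaps."""
    by_user = defaultdict(list)
    for user, stamp, resource in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_user[user].append((when, resource))

    sessions = []
    for user, views in by_user.items():
        views.sort()  # chronological order per user
        current, last_time = [], None
        for when, resource in views:
            # Assumed rule: a gap longer than the timeout starts a new session.
            if last_time is not None and when - last_time > timeout:
                sessions.append((user, current))
                current = []
            current.append(resource)
            last_time = when
        if current:
            sessions.append((user, current))
    return sessions
```

Each returned session is a user id paired with that user's sequence of page views, which is the unit the later pattern discovery steps operate on.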

MapReduce
  Designed by Google
  Implemented by Hadoop
  Implementations in C++, C#, Erlang, Java, OCaml, Perl, Python, Ruby, F#, R...
Example
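
A minimal, single-machine sketch of the MapReduce pattern, counting page views per URL. This is not Hadoop's API; the function names and the `(user, url)` record layout are assumptions made for illustration, and a real framework would distribute the map, shuffle, and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (key, value) pair for each input record."""
    _user, url = record
    yield (url, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For example, `map_reduce([("u1", "/home"), ("u2", "/home"), ("u1", "/cart")])` yields one count per distinct URL.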
Usage Data Pre-Processing

Pattern Discovery
  We have data! Now what?
    Clustering
    Classification
    Association Rule Discovery
    Sequential Pattern Discovery
    Markov Models
    Latent Variable Model

Clustering
  Partitioning
    Split your data into groups
    K-means
  Hierarchical
    Divisive (top-down)
      Start with everything, find groups
    Agglomerative (bottom-up)
      Start with each item as its own cluster and merge the most similar clusters
  Model-based
    Building a model for the data (best fit)

K-means

User-Based Clustering
  Start with the user profile
  Partition into k groups of profiles
  Based on similarity
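
A minimal k-means sketch for partitioning user profiles into k groups. It assumes profiles are equal-length numeric vectors, uses squared Euclidean distance as the similarity measure, and seeds the centroids with the first k profiles; all three choices are simplifications for illustration rather than anything the slides prescribe.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def k_means(profiles, k, iterations=10):
    """Assign each profile to one of k clusters; returns (assignment, centroids)."""
    # Assumed initialization: the first k profiles act as starting centroids.
    centroids = [list(profile) for profile in profiles[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each profile joins its closest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(profile, centroids[c]))
            for profile in profiles
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return assignment, centroids
```

With hypothetical two-item rating profiles such as `[[5, 1], [1, 5], [4, 2], [2, 4]]` and `k=2`, users with similar tastes end up sharing a cluster index.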

Association Discovery
  Support
    min(support)
  Confidence
    min(confidence)
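
Support and confidence can be computed directly from sessions treated as sets of pages. The sketch below uses the usual definitions (support is the fraction of transactions containing an itemset; confidence of A -> B is support(A and B) / support(A)); the sessions, page names, and threshold values are hypothetical.

```python
# Hypothetical sessions, each reduced to the set of pages visited.
SESSIONS = [
    {"/home", "/products"},
    {"/home", "/products", "/cart"},
    {"/home"},
    {"/products", "/cart"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    items = set(itemset)
    hits = sum(1 for transaction in transactions if items <= set(transaction))
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    combined = set(antecedent) | set(consequent)
    return support(combined, transactions) / base


def passes_thresholds(antecedent, consequent, transactions,
                      min_support=0.4, min_confidence=0.6):
    """Keep a rule only if it meets both minimum thresholds (assumed values)."""
    both = set(antecedent) | set(consequent)
    return (support(both, transactions) >= min_support
            and confidence(antecedent, consequent, transactions) >= min_confidence)
```

A rule such as `{"/products"} -> {"/cart"}` is kept only if it clears both min(support) and min(confidence).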

Evaluation (Personalization Model)
  Challenges:
    Recommendation algorithms may require a unique set of evaluation metrics
    Personalization actions may be different
      Domain
      Intended application
      Data gathered
  Check for overfitting data
    Training set

ROC Curve
  TPR = TP / (TP + FN)
  FPR = FP / (FP + TN)
  Example table (labels "No", "Yes", S, N, N): .95, .05, .20, .80
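
The TPR and FPR formulas above can be turned into ROC points by sweeping a decision threshold over the model's scores. The labels and scores below are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical ground truth (1 = relevant) and model scores for six items.
LABELS = [1, 1, 0, 1, 0, 0]
SCORES = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]


def rates(labels, scores, threshold):
    """Return (TPR, FPR) when items scoring >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def roc_points(labels, scores):
    """One (FPR, TPR) point per distinct score used as the threshold."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tpr, fpr = rates(labels, scores, threshold)
        points.append((fpr, tpr))
    return points
```

Plotting the points from `roc_points(LABELS, SCORES)` with FPR on the x-axis and TPR on the y-axis gives the ROC curve; a curve closer to the top-left corner indicates a better model.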

Information High
  Information is addictive
  Information can be misleading
  Information ethics
  Information is power, sometimes too powerful

Personal Suggestions
  Develop a hypothesis
  Figure out what data is needed
  Make informed decisions
    Don't trust just your judgment
    Experts are experts for a reason!
  Develop a way to validate based on experience
  Then get more data if needed

Thank you!
Questions?