Data Mining MTAT.03.183 (4AP = 6EAP)
Introduction
Jaak Vilo
2009 Fall
Lecturer
• 1986‐1991 U Tartu
• 1991‐1999 U Helsinki (sequence pattern discovery)
• 1999‐2002 EMBL‐EBI, UK (bioinformatics)
• 2002‐ EGeen ‐> Quretec (Biobank and Data Mgmnt)
• U Tartu, professor (Bioinformatics) 2007
– EXCS – Center of Excellence
– STACC – Software Technologies and Applications Competence Center (Tarkvara TAK)
– research projects
Jaak Vilo and other authors, UT: Data Mining 2009
Students
• >80 registered
• Estonian vs Foreign
• MSc 1st y / 2nd y ?
• BSc , PhD ?
• Non IT/CS ?
• Why this class? Expectations? (ESSCaSS’08,09…)
Course
• http://courses.cs.ut.ee/2009/dm/
• List: [email protected]
• Lectures: 10:15, Liivi 2‐403
• Seminars: 12:15, Liivi 2‐403
• Prof. Jaak Vilo, vilo@ut.ee
• http://www.quicktopic.com/43/H/eWqhydvFpUN
• Other? Skype?
Seminars
• Three types:
1. Homework: presentations/discussions
2. Guest lectures, visitors
3. Practical labs/training (no concrete plans yet)
• Participation is obligatory (>75%)
Grading requirements
• Participation! >75% of seminars
• Homeworks (30%) (min 50% of assignments)
• Projects/essays (30%)
• Exam (40%)
• Total: 100% + thresholds
• All deadlines are stringent.
Homework
• Tasks/assignments
– 5 tasks/week + possibly bonuses
– About every 2 weeks (irregular)
• Report/mark all completed tasks
– written reports on tasks
– ready to present fully to class
– there will be some uploading system
– and/or paper sheets in class
• Deadline always before class start (Thu, 12:15)
4AP = 6EAP
• 4 weeks (4x40h=160h) of intensive work
– assuming basic knowledge of BSc material
• 1/3 in class
• 1/3 reading, homeworks
• 1/3 projects, writing, …
What is Data Mining?
• Data ‐> Information, Knowledge, Insight
– new, interesting, nontrivial, useful …
• Data size ‐> Algorithmic challenge
• Predictive, useful ‐> theoretical and economic challenge
• Why? By practical demand and need…
Textbooks
• Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems) Google Books, web
• Chakrabarti et al.: Data Mining: Know It All. Morgan Kaufmann 2008 @Elsevier @Amazon @Google
• Bramer: Principles of Data Mining (Springer, 2007) @Amazon @Springer @Google
• David J. Hand, Heikki Mannila and Padhraic Smyth: Principles of Data Mining (MIT Press, 2001) @MIT Press @Google
• Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009) @Tibshirani @Amazon
• Han, Kamber: Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems)
• TOC: http://www.cs.uiuc.edu/homes/hanj/bk2/toc.pdf
What’s it all about?
Data DB
• Statistics
• Patterns in data
• Learning
• Classification
• Knowledge / Information
• Algorithms
• Prediction
• …
Sources of data (growth)
• devices
• net/web
• logs
• transactional db
• consumer
• multimedia(!)
• science
• cheaper storage, compute power
• …
Why Data Mining?
• The explosive growth of data: from terabytes to petabytes
– Data collection and data availability: automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data: business (Web, e-commerce, transactions, stocks, …); science (remote sensing, bioinformatics, scientific simulation, …); society and everyone (news, digital cameras, YouTube)
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets
Jiawei Han, Micheline Kamber, and Jian Pei Data Mining: Concepts and Techniques 16
Evolution of Sciences
• Before 1600, empirical science
• 1600-1950s, theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
• 1950s-1990s, computational science
– Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
– Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now, data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
• Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11): 50-54, Nov. 2002
Evolution of Database Technology
• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s: Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
Examples from Machine Learning
• 1950s – checkers (Arthur Samuel, 1959)
• 1960s – NN – perceptron and its limitations
• 1970s – expert systems, decision trees (ID3), …
• 1980s – Neural Networks, PAC learning, …
• 1990s – Data mining, ILP, Ensembles
• 2000s – SVM, Kernels, Graphical Models, …
Chapter 1. Introduction
• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
• Data Mining: On What Kind of Data?
• Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
• Structure and Network Analysis
• Evaluation of Knowledge
• Applications of Data Mining
• Major Challenges in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
– Data mining: a misnomer?
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
Figure: the KDD process, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation
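The stages of the KDD process can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the toy records, the function names and the trivial "mining" step (counting items) are not from the slides, and a real pipeline would use a database or data warehouse rather than in-memory lists.

```python
from collections import Counter

# Two hypothetical "databases" (toy data, invented for this sketch).
sales = [
    {"customer": "a", "item": "milk"},
    {"customer": "b", "item": "bread"},
    {"customer": "a", "item": "bread"},
    {"customer": None, "item": "milk"},  # dirty record: missing customer
    {"customer": "c", "item": "milk"},
]
customers = {"a": "Tartu", "b": "Tartu", "c": "Helsinki"}


def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(records, lookup):
    """Data integration: attach the city from the second source."""
    return [{**r, "city": lookup[r["customer"]]}
            for r in records if r["customer"] in lookup]


def select(records, city):
    """Selection: keep only the task-relevant data."""
    return [r["item"] for r in records if r["city"] == city]


def mine(items):
    """Data mining (trivial stand-in): count how often each item occurs."""
    return Counter(items)


def evaluate(counts, min_count):
    """Pattern evaluation: keep only patterns that are frequent enough."""
    return {item: n for item, n in counts.items() if n >= min_count}


warehouse = integrate(clean(sales), customers)
patterns = evaluate(mine(select(warehouse, "Tartu")), min_count=2)
print(patterns)
```

Each function corresponds to one box of the figure, so a stage can be swapped out without touching the others.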
Example: A Web Mining Framework
• Web mining usually involves
– Data cleaning
– Data integration from multiple sources
– Warehousing the data
– Data cube construction
– Data selection for data mining
– Data mining
– Presentation of the mining results
– Patterns and knowledge to be used or stored into knowledge-base
Data Mining in Business Intelligence
Figure: pyramid of layers with increasing potential to support business decisions, from bottom to top (typical role in parentheses):
– Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
– Data Preprocessing/Integration, Data Warehouses (DBA)
– Data Exploration: statistical summary, querying, and reporting (Data Analyst)
– Data Mining: information discovery (Data Analyst)
– Data Presentation: visualization techniques (Business Analyst)
– Decision Making (End User)
Collaborative filtering
– Amazon, Netflix
• Collaborative filtering systems usually take two steps:
1. Look for users who share the same rating patterns with the active user (the user whom the prediction is for).
2. Use the ratings from those like‐minded users found in step 1 to calculate a prediction for the active user.
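The two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ratings are invented, cosine similarity over co-rated items is one possible similarity measure among several (Pearson correlation is another common choice), and the names `similarity` and `predict` are hypothetical.

```python
from math import sqrt

# Toy ratings (user -> {item: rating}); all names and values are made up.
ratings = {
    "ann": {"m1": 5, "m2": 4, "m3": 1},
    "bob": {"m1": 5, "m2": 5, "m3": 1, "m4": 5},
    "cat": {"m1": 1, "m2": 2, "m3": 5, "m4": 1},
}


def similarity(u, v):
    """Cosine similarity over the items that both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(active, item, k=2):
    """Step 1: find the k most similar users who have rated `item`.
    Step 2: return the similarity-weighted average of their ratings."""
    neighbours = sorted(
        ((similarity(active, u), u) for u in ratings
         if u != active and item in ratings[u]),
        reverse=True,
    )[:k]
    total = sum(s for s, _ in neighbours)
    if total == 0:
        return None  # no usable neighbours for this item
    return sum(s * ratings[u][item] for s, u in neighbours) / total


print(predict("ann", "m4"))
```

Here "ann" rates like "bob" and unlike "cat", so the prediction for "m4" is pulled towards bob's rating of 5.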
Netflix prize
• http://www.netflixprize.com/
• http://en.wikipedia.org/wiki/Netflix_Prize
• 18K movies, 480K customers, ~100M ratings
• Test on 2.8M withheld ratings
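Testing on withheld ratings means fitting a model on the known ratings and measuring its error on ratings it has never seen. The Netflix Prize scored submissions by root mean squared error (RMSE). The sketch below assumes a deliberately naive baseline model (the mean training rating of each movie) and invented data, purely to show the evaluation loop.

```python
from math import sqrt

# Toy (user, movie, rating) triples; made-up data, not Netflix data.
train = [("u1", "m1", 5), ("u2", "m1", 3), ("u1", "m2", 2), ("u3", "m2", 4)]
withheld = [("u3", "m1", 4), ("u2", "m2", 1)]


def movie_means(triples):
    """Baseline model: the average training rating of each movie."""
    sums, counts = {}, {}
    for _, movie, rating in triples:
        sums[movie] = sums.get(movie, 0) + rating
        counts[movie] = counts.get(movie, 0) + 1
    return {m: sums[m] / counts[m] for m in sums}


def rmse(model, triples):
    """Root mean squared error of the model on the given ratings."""
    errors = [(model[movie] - rating) ** 2 for _, movie, rating in triples]
    return sqrt(sum(errors) / len(errors))


model = movie_means(train)
print(rmse(model, withheld))
```

Any better predictor, such as a collaborative filtering model, can be compared against this baseline by passing its predictions through the same `rmse` function.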
Social network
• Graph of connections
• Social network mining
Web
• Interlinked web sites and pages
• Directed Graph of links
• Information Retrieval, PageRank
• Web mining
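PageRank treats the web as a directed graph of links and ranks pages by repeatedly passing rank along those links. A minimal power-iteration sketch follows. The four-page graph is hypothetical, the damping factor 0.85 is the commonly quoted default, and the code assumes every link target is also a key of the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration on a directed link graph given as {page: [targets]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page site.
links = {
    "home": ["about", "news"],
    "about": ["home"],
    "news": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The page that nobody links to ("orphan") keeps only the base share, while "home", which every other page links to, ends up with the highest rank.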
Web usage mining
• Software and web usage logs
• Typical use patterns
• User groups, their preferences, behavior
• Can you predict their goals and help to achieve them?
– distributed online transactions, queries, … (Google, etc)
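One simple way to look for typical use patterns in a log is to count which page-to-page transitions occur for many users. The sketch below assumes a made-up log format (`<user> <page>` per line, already in time order) and hypothetical function names. Real web logs need parsing, session splitting and much more cleaning.

```python
from collections import Counter, defaultdict

# Hypothetical log lines: "<user> <page>", already in time order.
log = """\
u1 /home
u2 /home
u1 /search
u2 /search
u1 /cart
u2 /product
u3 /home
u3 /search
"""


def sessions(log_text):
    """Group page visits by user, preserving their order."""
    visits = defaultdict(list)
    for line in log_text.splitlines():
        user, page = line.split()
        visits[user].append(page)
    return dict(visits)


def frequent_transitions(user_sessions, min_users=2):
    """Count page-to-page transitions; keep those seen for enough users."""
    counts = Counter()
    for pages in user_sessions.values():
        # set() so that each user counts a given transition at most once
        counts.update(set(zip(pages, pages[1:])))
    return {t: c for t, c in counts.items() if c >= min_users}


print(frequent_transitions(sessions(log)))
```

A transition that most users follow (here from "/home" to "/search") is a candidate for prediction, for example by prefetching or by offering a shortcut.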
Biomedical data mining
• Analyse:
– DNA
– Genotype information
– disease histories
– find associated genes
– predict and classify diseases and outcomes
– discover “how biology works”
– …
Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)
• Basic ideas and techniques
– How to find frequent sets in databases
– How to find frequent motifs in sequences
• Algorithmic problems
– Depth‐first vs breadth‐first search
– How to avoid combinatorial explosion
• Interpretation of results
– Which patterns are important enough?
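As an illustration of frequent set mining, here is a minimal level-wise (breadth-first) search in the style of the Apriori algorithm. The slides do not say which algorithm the seminar covers, so this choice, the basket data and the function name are all assumptions. The pruning step shows one standard way to limit combinatorial explosion: a set can only be frequent if all of its subsets are frequent.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (breadth-first) search in the style of Apriori.
    Returns {itemset (frozenset): support count}."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    level = {frozenset([i]) for i in items}
    level = {s for s in level if support(s) >= min_support}
    result = {}
    k = 1
    while level:
        for s in level:
            result[s] = support(s)
        k += 1
        # Candidate generation: join frequent (k-1)-sets into k-sets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune every candidate that has an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return result


# Toy market baskets, invented for this sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "cola"},
]
found = frequent_itemsets(baskets, min_support=3)
for itemset, count in found.items():
    print(sorted(itemset), count)
```

A depth-first variant would instead extend one itemset as far as possible before backtracking. It explores the same search space in a different order and with different memory behaviour.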
Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)
• Other important aspects
– How to handle noisy data
– Random sampling vs linear scan
• Applications and extensions
– Association rules in practice
– Log analysis. Episode rules and usability
– Graph mining and biochemistry
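An association rule X -> Y is usually judged by its support (the fraction of transactions that contain both X and Y) and its confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both with a full linear scan. The basket data and the function name `rule_stats` are assumptions made for illustration.

```python
def rule_stats(transactions, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n_both = sum(1 for t in transactions if antecedent | consequent <= t)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Toy market baskets, invented for this sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "cola"},
]
print(rule_stats(baskets, {"bread"}, {"milk"}))
```

On large databases the same counts can be estimated from a random sample of transactions instead of a full scan, at the cost of some statistical error.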
Combinatorial Data Mining Algorithms (research seminar, Sven Laur, PhD)
Administrative details
• Combinatorial Data Mining Algorithms
• Gives 3 EAP (2 old AP)
• Takes place on Wednesdays in L122
• First seminar is on 16th of September
• Each participant has to give a presentation
• Project work is combined with DM course
• http://courses.cs.ut.ee/2009/fast‐counting/
Research at U Tartu
• BIIT – http://biit.cs.ut.ee/
• STACC – Software Technologies and Applications Competence Center
– companies and universities
– Skype, Regio, Delfi, Quretec, …
– Research problems, topics, scholarships
Research topics
• Publications => Projects, funding
• Relevant to STACC, companies
• Can lead to job offers
UT CS department
• Job offers:
• courses.cs.ut.ee ‐ web site development
– UT CS department courses web development
– Other sysadmin and Department development tasks
– …