
1 Introduction

Date post: 04-Dec-2015

CpE 615: Machine learning

Dr. Mohammad A. Alzubaidi

Department of Computer Engineering

Yarmouk University

Brief Introduction

Dr. Mohammad A. Alzubaidi

Assistant Professor at CE Dept.

Research interests: machine learning, data mining, image processing, and their applications to bioinformatics

Outline of lecture

Course information

Introduction to Machine Learning (ML)

Tentative Course schedule

Survey

Course Information

Instructor: Dr. Mohammad A. Alzubaidi

Office: Assistant Dean Office, H-205

Phone: 02/7211111 x4440

Email: [email protected]

Web: elearning.yu.edu.jo

Time: Wed 5:00pm – 8:00pm

Office hours: Sun – Thu 8:00am – 4:00pm

Location: HN-401

Course textbook: No textbook is required. (Materials will be available at the class web page.)

Topics: Data types and representation, classification, evaluation, preprocessing, clustering, semi-supervised learning, advanced topics, etc.

Reference books

Introduction to Data Mining. Tan, et al., 2005.

Pattern Classification. Duda, et al., 2000.

The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Hastie, et al., 2001.

Kernel Methods in Computational Biology. Schölkopf, et al., editors. 2004.

Kernel Methods for Pattern Analysis. Shawe-Taylor and Cristianini, 2004.

Grading

Midterm Exam: 30%

Project, class participation, and seminars: 30%

Two to three students form a group to carry out a small research project. Possible project types:

A survey of the state of the art in an area related to this course

Machine learning techniques for specific applications

A comparative study of several well-known algorithms

Design of a novel algorithm related to this course

Students are required to attend lectures and participate in class discussion.

Students might be asked to give a seminar.

Final Exam: 40%

Programming language

Matlab

Tutorials:

http://www.math.ufl.edu/help/matlab-tutorial/

http://www.math.mtu.edu/~msgocken/intro/node1.html

R language

Or other languages

What is machine learning?

Machine learning is the study of computer systems that improve their performance through experience.

Learn existing and known structures and rules.

Discover new findings and structures.

Face recognition

Bioinformatics

Supervised learning vs. unsupervised learning

Semi-supervised learning

Machine learning versus data mining

Data mining is:

the extraction of useful patterns from data sources, e.g., databases, texts, the web, and images;

the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.

The two fields share many topics: clustering, classification, etc.

Machine learning versus data mining (continued)

Different focuses:

ML focuses more on theory (statistics).

DM focuses more on applications.

In this course I will try to balance the two.

Clustering

Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.

Inter-cluster distances are maximized.

Intra-cluster distances are minimized.
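The two distance criteria can be checked numerically. The following is a minimal sketch, assuming six hypothetical 2-D points that have already been assigned to two groups; the points, the group names, and both helper functions are illustrative and not part of the lecture material.

```python
import math
from itertools import combinations, product

# Hypothetical 2-D points, already assigned to two clusters.
clusters = {
    "A": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)],
    "B": [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    dists = [math.dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    dists = [math.dist(p, q)
             for (_, pts_a), (_, pts_b) in combinations(clusters.items(), 2)
             for p, q in product(pts_a, pts_b)]
    return sum(dists) / len(dists)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"mean intra-cluster distance: {intra:.2f}")
print(f"mean inter-cluster distance: {inter:.2f}")
```

For a good grouping the first number should be small relative to the second. A clustering algorithm searches for the assignment of points to groups that achieves this; the sketch only evaluates one fixed assignment.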

Applications of Cluster Analysis

Understanding: Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.

Summarization: Reduce the size of large data sets.

Figure: Clustering precipitation in Australia

Classification: Definition

Given a collection of records (training set):

Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
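The train/test procedure can be sketched in a few lines. This is a minimal illustration, assuming synthetic records with a single numeric attribute and a very simple model (one learned threshold); the data generator, the 70/30 split, and all function names are assumptions made for the example, not the course's method.

```python
import random

def make_records(n, seed=0):
    """Generate hypothetical records: (numeric attribute, class label)."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = rng.uniform(0, 100)
        records.append((x, "high" if x >= 50 else "low"))
    return records

def split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into a training and a test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def classify(threshold, x):
    return "high" if x >= threshold else "low"

def accuracy(threshold, records):
    """Fraction of records whose predicted class matches the true class."""
    correct = sum(classify(threshold, x) == y for x, y in records)
    return correct / len(records)

def learn_threshold(train):
    """Build the model: pick the training value that best separates the classes."""
    return max((x for x, _ in train), key=lambda t: accuracy(t, train))

records = make_records(200)
train, test = split(records)
threshold = learn_threshold(train)      # built from the training set only
test_accuracy = accuracy(threshold, test)  # validated on unseen records
print(f"learned threshold: {threshold:.1f}")
print(f"test accuracy: {test_accuracy:.2f}")
```

The key point is that `learn_threshold` never sees the test records, so `test_accuracy` estimates how the model behaves on previously unseen data.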

Classification Example

Training set (attribute types: Refund is categorical, Marital Status is categorical, Taxable Income is continuous, Cheat is the class)

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test set

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

Figure: Training Set, Learn Classifier, Model, Test Set
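To make the example concrete, the records can be written out as data and passed through a small classifier. The sketch below uses a hand-written rule rather than a learned model; the rule and its 80K income threshold are assumptions chosen because they agree with every training record, and the function name `predict_cheat` is hypothetical.

```python
# Training records from the example: (refund, marital status, income in K, cheat)
training_set = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records from the example: the class is unknown.
test_set = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k, income_threshold_k=80):
    """Hand-written rule (an assumption, not a learned model)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or divorced, no refund: decide on income.
    return "Yes" if income_k >= income_threshold_k else "No"

training_accuracy = sum(
    predict_cheat(r, m, i) == cheat for r, m, i, cheat in training_set
) / len(training_set)
predictions = [predict_cheat(*record) for record in test_set]

print(f"training accuracy: {training_accuracy:.2f}")
for record, label in zip(test_set, predictions):
    print(record, "->", label)
```

A learning algorithm such as a decision tree would derive a rule of this kind from the training set automatically. Agreement with the training records alone does not show how well the rule generalizes; that requires records whose true class is known but withheld from training.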

Classification: Application

Fraud Detection

Goal: Predict fraudulent cases in credit card transactions.

Approach:

Use credit card transactions and the information on the account holder as attributes (when the customer buys, what the customer buys, how often the customer pays on time, etc.).

Label past transactions as fraud or fair transactions. This forms the class attribute.

Learn a model for the class of the transactions.

Use this model to detect fraud by observing credit card transactions on an account.

Character Recognition

Given a digit representation, what is its class?

Inputs are 28x28 greyscale images.

Researchers have used neural networks, support vector machines, etc.

Other applications

Face recognition

Protein function prediction

Cancer detection

Document categorization

Data representation

Traditional algorithms work on vectors.

Images can be represented as matrices or vectors.
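A minimal sketch of the matrix-to-vector conversion, assuming an image stored as a list of rows of greyscale values; the tiny 3x3 image and the helper names `flatten` and `unflatten` are illustrative only.

```python
def flatten(image):
    """Concatenate the rows of a matrix (list of lists) into one vector."""
    return [pixel for row in image for pixel in row]

def unflatten(vector, width):
    """Rebuild the matrix from the vector, given the number of columns."""
    return [vector[i:i + width] for i in range(0, len(vector), width)]

# Hypothetical 3x3 greyscale image (0 = black, 255 = white).
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
vector = flatten(image)
print(vector)

# A 28x28 image becomes a vector with 28 * 28 = 784 components.
blank_digit = [[0] * 28 for _ in range(28)]
print(len(flatten(blank_digit)))
```

Row-by-row flattening discards the explicit two-dimensional layout, which is one reason some methods prefer to work on the matrix form directly.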

Kernel Methods: Basic ideas

Figure: Original Space, Feature Space
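One common way to illustrate the relation between the two spaces is a kernel function that returns the inner product in the feature space without constructing the feature vectors. The sketch below uses the degree-2 polynomial kernel on 2-D inputs as an example; the choice of kernel, the explicit map `phi`, and the sample points are assumptions made for illustration, not taken from the slides.

```python
import math

def dot(u, v):
    """Inner product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly2_kernel(x, z):
    """k(x, z) = (x . z)^2, computed entirely in the original space."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, z)          # original space
feature_space_value = dot(phi(x), phi(z))  # feature space
print(kernel_value, feature_space_value)
```

Both computations give the same number (up to floating-point rounding), which is why an algorithm that only needs inner products can work in the feature space while evaluating the kernel in the original space.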

Data integration

Genome-wide data: mRNA expression data, protein-protein interaction data, hydrophobicity data, sequence data (gene, protein)

Curse of dimensionality

Large sample size is required for high-dimensional data.

Query accuracy and efficiency degrade rapidly as the dimension increases.

Strategies: feature reduction, feature selection, kernel learning
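One symptom of the problem can be seen in a small experiment: for random points, the nearest and the farthest neighbor of a query become almost equally distant as the dimension grows, which makes nearest-neighbor queries less meaningful. This is a rough sketch under assumed settings (uniform random points in the unit hypercube, 200 points, a fixed seed); the function name is hypothetical.

```python
import math
import random

def nearest_to_farthest_ratio(dim, n_points=200, seed=0):
    """Ratio of the nearest to the farthest distance from a random query
    point to n_points random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return min(distances) / max(distances)

ratios = {dim: nearest_to_farthest_ratio(dim) for dim in (2, 10, 100, 1000)}
for dim, ratio in ratios.items():
    print(f"dimension {dim:4d}: nearest/farthest = {ratio:.3f}")
```

A ratio close to 1 means the nearest point is barely closer than the farthest one. Feature reduction and feature selection attack this by lowering the dimension before distances are computed.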

Model selection

Choose the best model from a set of different models to fit to the data, e.g., Support Vector Machines (SVM), Linear Discriminant Analysis (LDA).

Models are specified by certain parameters. How to choose the best parameters?

Cross-validation (leave-one-out, k-fold CV)
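Cross-validation can be sketched as follows. The example assumes a 1-D nearest-neighbor classifier whose parameter is the number of neighbors, and a small hypothetical data set; SVM and LDA are not implemented here, and all names (`k_fold_indices`, `knn_predict`, `cv_accuracy`) are illustrative.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; yield (train, test) index lists."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def knn_predict(train_points, x, k_neighbors):
    """Majority label among the k training points nearest to x (1-D data)."""
    nearest = sorted(train_points, key=lambda p: abs(p[0] - x))[:k_neighbors]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

def cv_accuracy(data, k_neighbors, n_folds):
    """Average accuracy over the folds: each record is tested exactly once."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(data), n_folds):
        train_points = [data[i] for i in train_idx]
        for i in test_idx:
            x, y = data[i]
            correct += knn_predict(train_points, x, k_neighbors) == y
    return correct / len(data)

# Hypothetical 1-D records: (attribute value, class label).
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (2.5, "a"), (3.0, "a"), (3.5, "a"),
        (6.0, "b"), (6.5, "b"), (7.0, "b"), (7.5, "b"), (8.0, "b"), (8.5, "b")]

candidates = [1, 3, 5]  # odd values avoid ties between the two classes
scores = {k: cv_accuracy(data, k, n_folds=4) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

# Leave-one-out is k-fold CV with as many folds as records.
loo_score = cv_accuracy(data, 1, n_folds=len(data))
print(scores, best_k, loo_score)
```

The parameter with the highest cross-validated accuracy is selected. On data this cleanly separated the candidates are hard to distinguish, so the sketch shows the mechanics rather than a realistic comparison.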

Machine learning applications

Computer vision, information retrieval, image processing, bioinformatics, text mining, web mining, etc.

Course schedule

Weeks 1 – 6: Introduction, Data Types, Classification, Evaluation, Preprocessing

Week 7: Midterm Exam

Weeks 8 – 11: Clustering, Semi-supervised Learning, Advanced Topics

Weeks 12 – 14: Presentations

Week 15: Final Exam

Survey

Why are you taking this course?

What would you like to gain from this course?

What topics are you most interested in learning about from this course?

Any other suggestions?

