Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP .

Date post: 24-Feb-2016
Intelligent Database Systems Lab

Presenter : CHANG, SHIH-JIE

Authors : ADNAN YAHYA and ALI SALHI

2014. ACM TALIP.

Arabic Text Categorization Based on Arabic Wikipedia

Outlines

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Motivation

Arabic text categorization is a challenge due to the correlation between certain subcategories and the overlap between main categories.

EX:

Objectives

• To solve this, we use a categorization algorithm and further adopt two approaches.

CATEGORIZATION CORPORA - Training Data

Related Tags Approach

Testing Data

10 categories with 40 documents in each category

Methodology - PREPROCESSING TECHNIQUES

• Root Extraction (RE)
• Light Stemming (LS)
• Special Expressions Extraction
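As a rough illustration of what light stemming does, the sketch below strips one common prefix and one common suffix from an Arabic word. The affix lists, the `MIN_STEM_LEN` threshold, and the function name `light_stem` are assumptions made for this example; they are not the tool used by the authors.

```python
# Toy light stemmer. The affix lists below are illustrative assumptions,
# not the lists used in the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # hypothetical guard so very short words are left alone


def light_stem(word: str) -> str:
    """Strip at most one listed prefix and one listed suffix from `word`."""
    # Prefixes are tried in list order; longer ones come first.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    # Suffixes are tried the same way, two-letter ones before one-letter ones.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[: -len(suffix)]
            break
    return word


# Example: a word built from a prefix, a stem, and a suffix.
stem = "معلم"
example = light_stem(PREFIXES[4] + stem + SUFFIXES[3])
```

Unlike root extraction, this kind of stemming only trims affixes and does not try to recover the underlying root.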

Methodology - CATEGORIZATION PROCESS

Categorize the input text in two phases.

Phase one: we categorize the text into one of the main categories.

Phase two: we further categorize the input text based on subcategories.
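The two-phase process can be sketched as follows. This is a minimal illustration, assuming that phase two only considers the subcategories of the main category chosen in phase one. The taxonomy layout, the English sample words, and the placeholder scorer `overlap_score` are hypothetical and stand in for whatever categorization algorithm is actually used.

```python
def overlap_score(tokens, vocabulary):
    """Placeholder scorer: how many input tokens appear in the vocabulary."""
    return sum(1 for token in tokens if token in vocabulary)


def categorize_two_phase(tokens, taxonomy, score=overlap_score):
    """Return (main category, subcategory) for a tokenized input text.

    `taxonomy` maps each main category to a dict of
    subcategory name -> set of characteristic words.
    """
    # Phase one: score each main category on the union of its subcategory words.
    main_scores = {
        main: score(tokens, set().union(*subs.values()))
        for main, subs in taxonomy.items()
    }
    best_main = max(main_scores, key=main_scores.get)

    # Phase two: score only the subcategories of the chosen main category.
    sub_scores = {
        sub: score(tokens, vocab) for sub, vocab in taxonomy[best_main].items()
    }
    best_sub = max(sub_scores, key=sub_scores.get)
    return best_main, best_sub


# Hypothetical toy taxonomy, in English for readability.
taxonomy = {
    "sports": {"football": {"goal", "striker"}, "tennis": {"racket", "serve"}},
    "economy": {"banking": {"bank", "loan"}, "trade": {"export", "tariff"}},
}
result = categorize_two_phase(["goal", "striker", "bank"], taxonomy)
```

Because phase two is restricted to one main category, an error in phase one cannot be corrected later.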

Methodology - Basic Categorization Algorithm (BCA)

Methodology - Percentage and Difference Categorization (PDC) Algorithm

has frequency 7 in the 300-word

Methodology - Percentage and Difference Categorization (PDC) Algorithm

The category with the highest sum of flag values is considered to be the best match for the input text.
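The selection step can be sketched as below. Only the final rule, choosing the category with the highest sum of flag values, comes from the slide. The way a flag is computed here (1 when a word's percentage in the input text is within `tolerance` points of its percentage in the category, else 0), the `tolerance` value, and the English sample data are assumptions made for illustration. For scale, a word with frequency 7 in a 300-word text has a percentage of 7/300, about 2.33%.

```python
from collections import Counter


def word_percentages(tokens):
    """Map each word to its share of the token list, in percent."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: 100.0 * count / total for word, count in counts.items()}


def flag(text_pct, category_pct, tolerance=0.5):
    """Hypothetical flag rule: 1 if the two percentages are close, else 0."""
    return 1.0 if abs(text_pct - category_pct) <= tolerance else 0.0


def best_category(tokens, category_profiles, tolerance=0.5):
    """Pick the category whose flags sum highest; also return all sums."""
    text_pct = word_percentages(tokens)
    scores = {}
    for name, profile in category_profiles.items():
        # Words missing from the category profile contribute nothing.
        scores[name] = sum(
            flag(pct, profile[word], tolerance)
            for word, pct in text_pct.items()
            if word in profile
        )
    return max(scores, key=scores.get), scores


# Hypothetical profiles: word -> percentage within the category corpus.
profiles = {
    "sports": {"goal": 50.0, "match": 33.0},
    "economy": {"bank": 17.0, "goal": 10.0},
}
tokens = ["goal"] * 3 + ["match"] * 2 + ["bank"]
best, scores = best_category(tokens, profiles)
```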

Methodology – PDC Algorithm vs. BCA Algorithm

Methodology – Enhancing Main/Subcategories Grouping

(1) Overlapping Main Categories for Phase Two

Problem: the possibly high correlation between subcategories of different main categories.

Methodology – Enhancing Main/Subcategories Grouping

(2) Replacing Main Categories by Groups of Related Categories

Methodology – Enhancing Main/Subcategories Grouping

Methodology - Word Filtration Techniques within Categories

Methodology - The result of applying the three techniques

Modified PDC with N Scales

Define a scaling of flag values:

• 3 scales: 1, 0.5, 0
• 5 scales: 1, 0.75, 0.5, 0.25, 0
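The two scalings shown are consistent with N evenly spaced values from 1 down to 0. A minimal sketch under that assumption follows; the helper name `scale_levels` is hypothetical, and how a word is mapped to a particular level is not covered here.

```python
def scale_levels(n):
    """Return n evenly spaced flag values from 1 down to 0."""
    if n < 2:
        raise ValueError("at least two levels are needed")
    return [1 - i / (n - 1) for i in range(n)]


three = scale_levels(3)  # matches the 3-value scaling above
five = scale_levels(5)   # matches the 5-value scaling above
```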

Further Testing on the PDC Algorithm

• Tool: Root Extraction
• Tool: Light Stemming & Light10
• Tool: Double Words
• Tool: Expressions Extraction

Using Testing Data from the Reference Categories

Training Data Characteristics

COMPARISON WITH RELATED WORK

Using Testing Data from the Reference Categories

Conclusions

– One method is to use training and testing data from the same source by splitting the corpus into test and training components. This consistently gives better results.

– However, we believe that the second method (different sources) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
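The same-source setup from the first point, splitting one corpus into training and test components, can be sketched as below. The 20% test fraction, the fixed seed, and the function name `split_corpus` are assumptions for illustration; the slides do not state the split that was used.

```python
import random


def split_corpus(documents, test_fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and split it into (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(documents)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


corpus = [f"doc{i}" for i in range(100)]
train, test = split_corpus(corpus)
```

In the different-source setup, no split is needed: the whole corpus is used for training and the test documents come from elsewhere.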

Comments

• Advantages

– To.

• Applications

– Arabic text categorization.

