Date post: 20-Jan-2016
Challenges with XML Challenges with Semi-Structured collections Ludovic Denoyer University of Paris 6 Bridging the gap between research communities
Transcript
Page 1:

Challenges with XML
Challenges with Semi-Structured collections

Ludovic Denoyer
University of Paris 6

Bridging the gap between research communities

Page 2:

Outline

Motivations

XML Mining Challenge

Graph Labelling/WebSpam Challenge

Conclusion and future work

Page 3:

General Idea

The two challenges have been proposed to try to attract researchers from different domains:
◦Mainly Machine Learning and Information Retrieval

Show IR researchers that ML methods are able to solve some of their problems

Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms

Page 4:

General Idea

Find generic tasks that correspond to:
◦New real applications in IR
◦New generic problems in ML

To work together…
To pool efforts…
To solve these tasks faster…
To compare the approaches…

Page 5:

Open questions in ML

Structure+content classification
Classification of inter-dependent variables
Structured output classification

Page 6:

Open questions in IR

Structure+content classification
Classification of inter-dependent variables
Structured output classification

Semi-structured documents (XML)
Interconnected documents
Heterogeneous collections

Page 7:

Motivations

Structured input classification
Classification of inter-dependent variables
Structured output classification

Semi-structured documents (XML)
Hyperlinked documents
Heterogeneous collections

XML Mining Challenge

Page 8:

Motivations

Structured input classification
Classification of inter-dependent variables
Structured output classification

Semi-structured documents (XML)
Hyperlinked documents
Heterogeneous collections

WebSpam Challenge
XML Mining Challenge

Page 9:

Motivations

Information Retrieval
Machine Learning
Data Mining
Web

Proposed Challenges

Page 10:

Challenges

XML Mining Challenge
◦« Bridging the gap between Machine Learning and Information Retrieval »

Graph Labelling Challenge
◦Application to WebSpam detection

Page 11:

Outline

Motivations

XML Mining Challenge

WebSpam Challenge

Conclusion and future work

Page 12:

XML Mining Challenge

Launched in 2005
◦ PASCAL (Network of Excellence in ML)
◦ DELOS (Network of Excellence in Digital Libraries)

Organized as an INEX Track
◦ INEX: Initiative for the Evaluation of XML IR

More than 50 different institutes involved

One event each year at INEX (December)
Biggest INEX Track (after ad-hoc retrieval)
We are currently launching the 4th XML Mining track

Page 13:

XML Mining Challenge

ML Goal
◦Classification of large collections of structures

IR Goal
◦Classification of semi-structured collections, using both structure and content

Page 14:

Underlying idea

Using structure and content information
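
One possible reading of "structure and content" is that each XML document yields two feature sets: one describing its tag tree and one describing its text. The sketch below illustrates that reading only. The function name `structure_content_features`, the choice of root-to-node tag paths as structure features, and the sample document are assumptions made for illustration and do not come from the challenge itself.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Return (tag-path counts, word counts) for one XML document.

    Hypothetical helper: tag paths stand in for "structure",
    lower-cased whitespace-separated tokens stand in for "content".
    """
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(elem, prefix):
        # Record the path from the root down to this element.
        path = prefix + "/" + elem.tag
        paths[path] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    # itertext() yields every text fragment in document order.
    words = Counter(" ".join(root.itertext()).lower().split())
    return paths, words


# Made-up sample document, not taken from any challenge collection.
doc = "<article><title>XML mining</title><body><p>Mining XML trees</p></body></article>"
paths, words = structure_content_features(doc)
```

Either feature set, or their concatenation, could then be handed to a standard classifier; which combination works best is exactly the question the track asks.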

Page 15:

Collections

Different collections have been used:
◦2005: Artificial collection; Movie collection
◦2006: Scientific articles; Wikipedia XML-based collection
◦2007: Wikipedia XML-based collection

96,000 documents in XML, 21 categories

Page 16:

Submitted papers

[Bar chart: number of papers per year (2005, 2006, 2007), split into IR papers, ML papers and DM papers]

Page 17:

Large variety of models

Different existing ML methods have been applied:
◦Self Organizing Map
◦SVM
◦(Graph) Neural Network
◦CRF
◦Incremental Models
◦…

Some new models have been developed

Page 18:

Short Typology

See Report on the XML Mining track – SIGIR Forum

Page 19:

Results - 2007

Classification

Authors | Method | Micro recall | Macro recall
Zhang et al. | Kernel+SVM | 0.87 | 0.83
L. M. de Campos et al. | Graphical Models – Bayesian networks | 0.78 | 0.76
Meenakshi et al. | Negative Category Document Frequency | 0.78 | 0.75
…
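
The slides do not define the two recall measures. The sketch below assumes the usual definitions for single-label classification: micro recall pools all documents together, while macro recall averages the per-category recall so that small categories weigh as much as large ones. The function name and the toy labels are illustrative only and are not the track's evaluation script.

```python
from collections import Counter


def micro_macro_recall(y_true, y_pred):
    """Return (micro_recall, macro_recall) for single-label predictions."""
    support = Counter(y_true)  # number of documents in each true category
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    # Micro: correctly labelled documents over all documents.
    micro = sum(hits.values()) / len(y_true)
    # Macro: unweighted mean of the per-category recalls.
    macro = sum(hits[c] / support[c] for c in support) / len(support)
    return micro, macro


# Toy data: category "a" is large and easy, category "b" is small and hard.
y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]
micro, macro = micro_macro_recall(y_true, y_pred)
```

On this toy data the micro recall is 5/6 and the macro recall is 0.75, which shows how an error on a small category lowers the macro figure more than the micro one.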

Page 20:

XML Structure Mapping task

Proposed in 2006

ML task: Structured output classification
◦Learning to transform trees

IR application: Dealing with heterogeneous collections
◦Learning to transform heterogeneous documents to a mediated schema

Page 21:

XML Structure Mapping

A generic ML model able to solve this task has a lot of potential applications:
◦Conversion between file formats
◦Automatic translation
◦Natural Language processing
◦…

Page 22:

Conclusion

Existing structured input models (kernel,…) have been tested on this task
New specific models have been developed

Difficult to know which model is the best
◦ Need to wait one more year

The challenge has attracted researchers from different communities
◦ Each year, ML researchers come to INEX, discover a new domain, and present advanced ML models to other researchers

The collections are freely available and have been downloaded a hundred times
◦ …some articles are starting to appear in different conferences…

Page 23:

WebSpam Challenge

PASCAL « Graph Labelling Challenge »

Organized by:
◦ Ricardo BAEZA-YATES (Yahoo! Research Barcelona)
◦ Carlos CASTILLO (Yahoo! Research Barcelona)
◦ Brian DAVISON (Lehigh University, USA)
◦ Ludovic DENOYER (University Paris 6, France)
◦ Patrick GALLINARI (University Paris 6, France)

The Web Spam Challenge 2007 was supported by PASCAL

The Web Spam Challenge 2007 was also supported by the DELIS EU - FET research project

Page 24:

WebSpam Challenge

Three Events:

◦AirWeb workshop 2007 (WWW’07), May 2007: Web-oriented part
◦GraphLab workshop 2007 (PKDD/ECML), September 2007: ML-oriented part
◦AirWeb workshop 2008 (WWW’08 ?)

Page 25:

WebSpam Challenge

IR (Web) Task:
◦Detection of web spam

Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”

Page 26:

Example of spam

Page 27:

WebSpam Challenge

ML task:
◦Graph labelling
◦Classification of inter-dependent variables
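
To make "inter-dependent variables" concrete: in a host graph, the label of a node is not independent of the labels of the nodes it links to. The toy sketch below mixes each node's own spam score with the average score of its out-neighbours. It illustrates the dependency only and is not the method of any challenge participant; the function names, the `weight` parameter and the three-host graph are all made up.

```python
def neighbor_average(scores, edges):
    """Average the base score of each node's out-neighbours.

    Nodes without out-links keep their own score.
    """
    neighbours = {node: [] for node in scores}
    for src, dst in edges:
        neighbours[src].append(dst)
    return {
        node: sum(scores[n] for n in nbrs) / len(nbrs) if nbrs else scores[node]
        for node, nbrs in neighbours.items()
    }


def smoothed_scores(scores, edges, weight=0.5):
    """Blend each node's own score with its neighbourhood average."""
    avg = neighbor_average(scores, edges)
    return {
        node: (1 - weight) * scores[node] + weight * avg[node]
        for node in scores
    }


# Hypothetical graph: host "c" looks normal on its own but links to two spammy hosts.
base = {"a": 0.9, "b": 0.8, "c": 0.1}
links = [("c", "a"), ("c", "b")]
smoothed = smoothed_scores(base, links)
```

After smoothing, host "c" moves from 0.1 towards its neighbours' scores, while "a" and "b", which have no out-links here, are unchanged.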

Page 28:

Collection

A collection of interconnected Web pages
◦77 million pages
◦About 11,000 hosts
◦Manually labeled as spam or normal (host level)

Blinded evaluation of models

Page 29:

Participants

[Bar chart: number of participants at WWW 07 and GraphLab 07, split into ML participants, Web/IR participants and industrial participants]

Page 30:

Participants

Why such an increase in ML participants during GraphLab?

Page 31:

GraphLab workshop at ECML/PKDD 2007

The collection has been fully preprocessed by the organizers
Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page
The contingency matrix has been built

One small collection with 9,000 nodes
One large collection with 400,000 nodes

10% for train/20% for validation/70% for test

You can easily apply your « relational » models on this corpus without knowing anything about text processing
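
As a rough illustration of what working with such a preprocessed corpus involves, the sketch below parses one line in the basic SVMLight layout (`<label> <index>:<value> ... # comment`) and cuts a list of node identifiers into 10%/20%/70% parts. The helper names, the sample line and the random shuffle are assumptions: the slides do not say how the organizers drew the split, and the optional `qid:` field of the format is not handled here.

```python
import random


def parse_svmlight_line(line):
    """Parse '<label> <index>:<value> ...' into (label, {index: value})."""
    line = line.split("#", 1)[0].strip()  # drop the optional trailing comment
    fields = line.split()
    label = int(float(fields[0]))  # accepts "+1", "-1", "1", "0"
    features = {}
    for field in fields[1:]:
        index, value = field.split(":", 1)
        features[int(index)] = float(value)
    return label, features


def split_nodes(node_ids, seed=0):
    """Shuffle node ids and cut them into 10% train, 20% validation, 70% test."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)  # assumption: a uniform random split
    n_train = len(ids) // 10
    n_valid = len(ids) * 2 // 10
    return ids[:n_train], ids[n_train:n_train + n_valid], ids[n_train + n_valid:]


# Made-up example line and a collection the size of the small corpus.
label, features = parse_svmlight_line("+1 3:0.5 10:1.0 # hypothetical host")
train, valid, test = split_nodes(range(9000))
```

With only 10% of the nodes labelled for training, models that exploit the link structure between nodes have most to gain, which fits the « relational » framing of the workshop.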

Page 32:

Results

Small collection (9,000 nodes)

Participants | Methods | AUC
Abernethy et al. | Semi supervised learning | 95.2
Tang et al. | SVM | 95.1
Filoche et al. | Stacked Learning | 92.7
Csalogany et al. | C4.5 | 87.7
Tian et al. | Semi Supervised | 86.3
… | … | …
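
AUC, the measure reported in the table above, can be read as the probability that a randomly chosen spam node receives a higher score than a randomly chosen normal node. The figures in the table appear to be that probability expressed as a percentage, which is an assumption on our part. The sketch below computes it by direct pairwise comparison on made-up scores; it is quadratic in the number of nodes and is meant only to show the definition, not to reproduce the challenge's evaluation code.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count one half).

    labels: 1 for spam, 0 for normal; scores: higher means more spam-like.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up scores for five nodes: two spam, three normal.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
example_auc = auc(labels, scores)
```

Here five of the six spam/normal pairs are ordered correctly, so the value is 5/6. Because AUC depends only on the ranking of the scores, it stays meaningful when spam hosts are much rarer than normal ones.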

Page 33:

Results

Large collection (400,000 nodes)

Participants | Methods | AUC
Weiss et al. | Semi supervised learning | 99.8
Filoche et al. | Stacked Learning | 99.1
Tang et al. | SVM | 98.9
… | … | …

Page 34:

Conclusion on WebSpam

Different pure ML methods used « as is »
◦Semi supervised methods
◦Stacked Learning
◦…

Very good performance of ML models (equivalent to Web « hand-made » models)

Page 35:

Conclusion on WebSpam

Development of an ML benchmark for graph labelling

WebSpam also proposes interesting ML challenges that could be integrated in the challenge
◦Learning with a few examples
◦Large scale problems
◦Adversarial Machine Learning
◦…

Page 36:

Conclusion

The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems

It is possible to mix researchers from different communities

ML researchers dislike cleaning real collections
◦ you have to preprocess the collections

ML researchers dislike large collections
◦ but this is changing…

Page 37:

Future work

XML Mining will continue this year
◦See http://xmlmining.lip6.fr
◦The corpus will be preprocessed ?

WebSpam challenge will also continue
◦See http://webspam.lip6.fr
◦We will see after WWW’08 if we propose another GraphLab workshop (see http://graphlab.lip6.fr)
◦Note that a new larger corpus has been developed in 2008

Page 38:

Thank you for your attention

(Thank you to the participants of the different challenges that are in the room)

