Challenges with XML
Challenges with Semi-Structured collections
Ludovic Denoyer
University of Paris 6
Bridging the gap between research communities
Outline
Motivations
XML Mining Challenge
Graph Labelling/WebSpam Challenge
Conclusion and future work
General Idea
The two challenges were proposed to attract researchers from different domains:
◦ Mainly Machine Learning and Information Retrieval
Show IR researchers that ML methods can solve some of their problems
Show ML researchers that IR tasks provide an interesting context for developing new general Machine Learning algorithms
General Idea
Find generic tasks that correspond to:
◦ New real IR applications
◦ New generic ML problems
To work together…
To pool efforts…
To solve these tasks faster…
To compare the approaches…
Open questions in ML
Structure + content classification
Classification of interdependent variables
Structured output classification
Open questions in IR
Structure + content classification
Classification of interdependent variables
Structured output classification
Semi-structured documents (XML)
Interconnected documents
Heterogeneous collections
Motivations
Structured input classification
Classification of interdependent variables
Structured output classification
Semi-structured documents (XML)
Hyperlinked documents
Heterogeneous collections
XML Mining Challenge
Motivations
Structured input classification
Classification of interdependent variables
Structured output classification
Semi-structured documents (XML)
Hyperlinked documents
Heterogeneous collections
WebSpam Challenge
XML Mining Challenge
Motivations
Information Retrieval
Machine Learning
Data Mining
Web
Proposed Challenges
Challenges
XML Mining Challenge
◦ « Bridging the gap between Machine Learning and Information Retrieval »
Graph Labelling Challenge
◦ Application to WebSpam detection
Outline
Motivations
XML Mining Challenge
WebSpam Challenge
Conclusion and future work
XML Mining Challenge
Launched in 2005
◦ PASCAL (Network of Excellence in ML)
◦ DELOS (Network of Excellence in Digital Libraries)
Organized as an INEX track
◦ INEX: Initiative for the Evaluation of XML IR
More than 50 different institutes involved
One event each year at INEX (December)
Biggest INEX track (after ad-hoc retrieval)
We are currently launching the 4th XML Mining track
XML Mining Challenge
ML Goal
◦ Classification of large collections of structures
IR Goal
◦ Classification of semi-structured collections
Using both structure and content
Underlying idea
Using structure and content information
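As a rough illustration of what "using structure and content information" can mean in practice, the sketch below turns one XML document into a bag of features that mixes tag paths (structure), words (content), and path-word pairs (both). The feature naming scheme and the toy document are assumptions made for this example; the slides do not prescribe any particular representation.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def structure_content_features(xml_text):
    """Count structure, content, and combined features for one XML document.

    Hypothetical feature scheme (not taken from the slides):
      'path=/a/b'   one count per element found at that tag path
      'word=w'      one count per occurrence of word w anywhere
      '/a/b:w'      one count per occurrence of word w under path /a/b
    """
    features = Counter()

    def add_words(text, path):
        for word in (text or "").lower().split():
            features[f"word={word}"] += 1
            features[f"{path}:{word}"] += 1

    def visit(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        features[f"path={path}"] += 1
        add_words(elem.text, path)
        for child in elem:
            visit(child, path)
            # Text following a child element still belongs to the parent.
            add_words(child.tail, path)

    visit(ET.fromstring(xml_text), "")
    return features


doc = "<movie><title>Alien</title><plot>A crew meets an alien</plot></movie>"
feats = structure_content_features(doc)
```

Such a feature dictionary can be fed to any standard classifier; a content-only baseline would keep just the `word=` features, which makes the contribution of structure easy to measure.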
Collections
Different collections have been used:
◦ 2005: Artificial collection; Movie collection
◦ 2006: Scientific articles; Wikipedia XML-based collection
◦ 2007: Wikipedia XML-based collection
96,000 documents in XML, 21 categories
Submitted papers
[Chart: number of submitted papers per year (2005, 2006, 2007), split into IR papers, ML papers, and DM papers; vertical axis from 0 to 10]
Large variety of models
Different existing ML methods have been applied:
◦ Self-Organizing Map
◦ SVM
◦ (Graph) Neural Network
◦ CRF
◦ Incremental models
◦ …
Some new models have been developed
Short typology
See Report on the XML Mining track – SIGIR Forum
Results - 2007
Classification
Authors | Method | Micro recall | Macro recall
Zhang et al. | Kernel + SVM | 0.87 | 0.83
L. M. de Campos et al. | Graphical models – Bayesian networks | 0.78 | 0.76
Meenakshi et al. | Negative Category Document Frequency | 0.78 | 0.75
…
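The slides do not define the two measures in the table above. Assuming the usual definitions for single-label classification, micro recall pools all documents (so large categories dominate), while macro recall averages the per-category recalls (so every category counts equally). A minimal sketch on made-up labels:

```python
from collections import Counter


def micro_macro_recall(y_true, y_pred):
    """Return (micro_recall, macro_recall) for single-label predictions.

    Assumes the standard definitions; the toy labels below are invented
    for illustration and are unrelated to the track results.
    """
    support = Counter(y_true)  # number of documents per true category
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / support[c] for c in support) / len(support)
    return micro, macro


y_true = ["a", "a", "a", "a", "b", "b"]
y_pred = ["a", "a", "a", "a", "b", "a"]
micro, macro = micro_macro_recall(y_true, y_pred)
```

On this toy data the large category "a" is fully recovered and the small category "b" only half, so macro recall comes out lower than micro recall, the same direction as the gap seen in the table.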
XML Structure Mapping task
Proposed in 2006
ML task: Structured output classification
◦ Learning to transform trees
IR application: Dealing with heterogeneous collections
◦ Learning to transform heterogeneous documents to a mediated schema
XML Structure Mapping
A generic ML model able to solve this task has many potential applications:
◦ Conversion between file formats
◦ Automatic translation
◦ Natural language processing
◦ …
Conclusion
Existing structured input models (kernel,…) have been tested on this task
New specific models have been developed
Difficult to know which model is the best
◦ Need to wait one more year
The challenge has attracted researchers from different communities
◦ Each year, ML researchers come to INEX and:
Discover a new domain
Present advanced ML models to other researchers
The collections are freely available and have been downloaded a hundred times
◦ …some articles are starting to appear in different conferences…
WebSpam Challenge
PASCAL « Graph Labelling Challenge »
Organized by:
◦ Ricardo BAEZA-YATES (Yahoo! Research Barcelona)
◦ Carlos CASTILLO (Yahoo! Research Barcelona)
◦ Brian DAVISON (Lehigh University, USA)
◦ Ludovic DENOYER (University Paris 6, France)
◦ Patrick GALLINARI (University Paris 6, France)
The Web Spam Challenge 2007 was supported by PASCAL
The Web Spam Challenge 2007 was also supported by the DELIS EU - FET research project
WebSpam Challenge
Three Events:
◦ AirWeb workshop 2007 (WWW’07), May 2007: Web-oriented part
◦ GraphLab workshop 2007 (PKDD/ECML), September 2007: ML-oriented part
◦AirWeb workshop 2008 (WWW’08 ?)
WebSpam Challenge
IR (Web) task:
◦ Detection of web spam
Spam = any attempt to get “an unjustifiably favorable relevance or importance score for some Web pages, considering the page’s true value”
Example of spam
WebSpam Challenge
ML learning task:
◦ Graph labelling
◦ Classification of interdependent variables
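To make "graph labelling" concrete: the label of a node depends on the labels of its neighbours, so predictions cannot be made independently. The sketch below is a deliberately simple baseline (iterative neighbour averaging with the known labels clamped), chosen only for illustration; it is not a method attributed to the organizers or to any participant, and the node names are hypothetical.

```python
def propagate_labels(edges, seed_scores, n_iter=50):
    """Spread spam scores over an undirected graph.

    edges       : iterable of (u, v) links between nodes
    seed_scores : {node: score} for labelled nodes (1.0 = spam, 0.0 = normal)
    Unlabelled nodes start at 0.5 and repeatedly take the mean score of
    their neighbours; labelled nodes keep their known score.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    for node in seed_scores:
        neighbours.setdefault(node, set())

    scores = {n: seed_scores.get(n, 0.5) for n in neighbours}
    for _ in range(n_iter):
        updated = {}
        for node, nbrs in neighbours.items():
            if node in seed_scores:
                updated[node] = seed_scores[node]  # clamp known labels
            elif nbrs:
                updated[node] = sum(scores[m] for m in nbrs) / len(nbrs)
            else:
                updated[node] = scores[node]  # isolated node: no evidence
        scores = updated
    return scores


# Hypothetical chain of hosts: spam_host - a - b - normal_host
edges = [("spam_host", "a"), ("a", "b"), ("b", "normal_host")]
scores = propagate_labels(edges, {"spam_host": 1.0, "normal_host": 0.0})
```

In this chain, host `a` (next to the spam host) ends with a score near 2/3 and host `b` (next to the normal host) near 1/3, which shows how link structure alone can separate otherwise unlabelled nodes.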
Collection
A collection of interconnected Web pages
◦ 77 million pages
◦ About 11,000 hosts
◦ Manually labeled as spam or normal (host level)
Blinded evaluation of models
Participants
[Chart: number of participants at WWW 07 and GraphLab 07, split into ML participants, Web/IR participants, and industrial participants; vertical axis from 0 to 8]
Participants
Why such an increase in ML participants at GraphLab?
GraphLab workshop at ECML/PKDD 2007
The collection has been fully preprocessed by the organizers
Each node corresponds to a vector (in SVMLight format) based on the word distribution in each host/page
The contingency matrix has been built
One small collection with 9,000 nodes
One large collection with 400,000 nodes
10% for training / 20% for validation / 70% for test
You can easily apply your « relational » models to this corpus without knowing anything about text processing
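For readers who have not met the SVMLight format mentioned above: each line holds a target value followed by sparse `index:value` pairs, optionally ending with a `#` comment. The sketch below parses such a line and produces a 10/20/70 split. The sample line, the random shuffling, and the seed are assumptions for illustration; the slides do not say how the organizers drew their split, and the optional `qid:` field of the format is not handled here.

```python
import random


def parse_svmlight_line(line):
    """Parse '<target> <index>:<value> ... # comment' into (target, features)."""
    line = line.split("#", 1)[0].strip()  # drop the optional trailing comment
    tokens = line.split()
    target = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return target, features


def split_nodes(node_ids, seed=0):
    """Shuffle node ids and cut them into 10% train, 20% validation, 70% test."""
    ids = list(node_ids)
    random.Random(seed).shuffle(ids)
    n_train = len(ids) // 10
    n_valid = len(ids) * 2 // 10
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test


# Invented example line: target +1, two non-zero word features.
target, features = parse_svmlight_line("+1 3:0.5 10:2 # hypothetical host")
train, valid, test = split_nodes(range(9000))
```

With 9,000 nodes, as in the small collection, this yields 900 training, 1,800 validation, and 6,300 test nodes.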
Results
Small collection (9,000 nodes)
Participants | Methods | AUC
Abernethy et al. | Semi-supervised learning | 95.2
Tang et al. | SVM | 95.1
Filoche et al. | Stacked learning | 92.7
Csalogany et al. | C4.5 | 87.7
Tian et al. | Semi-supervised | 86.3
… | … | …
Results
Large collection (400,000 nodes)
Participants | Methods | AUC
Weiss et al. | Semi-supervised learning | 99.8
Filoche et al. | Stacked learning | 99.1
Tang et al. | SVM | 98.9
… | … | …
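The AUC values in the result tables appear to be percentages of the area under the ROC curve. Assuming the standard definition, AUC is the probability that a randomly chosen spam node receives a higher score than a randomly chosen normal node, with ties counted as one half. The sketch below computes it directly from that definition on invented scores; the quadratic loop is fine for illustration but a rank-based formula would be preferable at the scale of the large collection.

```python
def auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons.

    scores : predicted spam scores (higher means more likely spam)
    labels : 1 for spam, 0 for normal (encoding assumed for this example)
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one node of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(positives) * len(negatives))


# Invented scores: one spam node (0.4) is ranked below one normal node (0.6).
value = auc([0.9, 0.7, 0.4, 0.6, 0.2], [1, 1, 1, 0, 0])
```

Here five of the six spam/normal pairs are ordered correctly, so the AUC is 5/6, which would be reported as about 83.3 on the scale used in the tables.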
Conclusion on WebSpam
Different pure ML methods used « as is »
◦ Semi-supervised methods
◦ Stacked learning
◦ …
Very good performance of ML models (equivalent to Web « hand-made » models)
Conclusion on WebSpam
Development of an ML benchmark for graph labelling
WebSpam also proposes interesting ML challenges that could be integrated in the challenge
◦ Learning with a few examples
◦ Large-scale problems
◦ Adversarial Machine Learning
◦ …
Conclusion
The two challenges have proposed benchmarks for IR/Web applications and also for generic ML problems
It is possible to mix researchers from different communities
ML researchers dislike cleaning real collections
◦ You have to preprocess the collections
ML researchers dislike large collections
◦ But this is changing…
Future work
XML Mining will continue this year
◦ See http://xmlmining.lip6.fr
◦ The corpus will be preprocessed ?
WebSpam challenge will also continue
◦ See http://webspam.lip6.fr
◦ We will see after WWW’08 whether we propose another GraphLab workshop (see http://graphlab.lip6.fr)
◦ Note that a new, larger corpus has been developed in 2008
Thank you for your attention
(Thank you to the participants of the different challenges that are in the room)