Date posted: 16-Jan-2016
Tokeniser
Francisco Miguel Pérez Romero
University of Sevilla
Roadmap
Introduction
Class Diagram
Libraries
Conclusions
Web Wrapping
Information retrieval
[Diagram labels: Verifier, Ontologiser, Extractor, Query, Navigator, Form Filler, Tokeniser]
- Tokenisation Rules
- Configuration File
- Web Page
- Parser
Tokeniser Usage
- Web Page Classification
- Information Extraction Learners
- Information Extraction
Example
[Diagram labels: Config File, Token List, Web Page, Tokeniser, XML File, Token List]
Concepts
- Configuration File
- Token
- Tokenisation types
Example
3 Token Classes: Word Space Digit Space Digit
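The example above can be sketched in code. This is a minimal illustration, not the Tokeniser's actual design: the class name `TokeniserSketch`, the three regular expressions, and the input string `"Price 25 99"` are all assumptions chosen to reproduce the Word Space Digit Space Digit sequence with `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokeniserSketch {
    // Hypothetical tokenisation rules: token class name -> regular expression.
    // In the slides these would come from the configuration file.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("Word", Pattern.compile("\\p{L}+"));
        RULES.put("Space", Pattern.compile("\\s+"));
        RULES.put("Digit", Pattern.compile("\\d+"));
    }

    // Returns the token class names found in the input, left to right.
    static List<String> tokenise(String input) {
        List<String> classes = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt()) {          // rule matches at the current position
                    classes.add(rule.getKey());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                pos++;                        // skip characters that no rule covers
            }
        }
        return classes;
    }

    public static void main(String[] args) {
        // prints [Word, Space, Digit, Space, Digit]
        System.out.println(tokenise("Price 25 99"));
    }
}
```

The first rule that matches at the current position wins, so rule order matters once two token classes can overlap.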
Class Diagram: Tokenisation
Tokenisation Example
Class Diagram: Tokeniser
Comparison Features 1
Comparison features:
- Javadoc documentation?
- Unicode UTF-8 support
- Unicode UTF-16 support
- Named groups
- Indexable groups > 9
- Negative groups
- Nested groups
- Lazy quantifiers?
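A few of these features can be probed with short checks. The sketch below uses only `java.util.regex` (Java 7 or later for named groups); the class name `FeatureProbe` and the sample patterns are assumptions, and the results say nothing about how the other libraries in the comparison behave.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FeatureProbe {
    // Named groups: retrieve a capture by name instead of by index.
    static String namedGroup() {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher("2016-01");
        return m.matches() ? m.group("year") : null;
    }

    // Lazy quantifier: ".+?" stops at the first closing bracket.
    static String lazy() {
        Matcher m = Pattern.compile("<.+?>").matcher("<b>bold</b>");
        return m.find() ? m.group() : null;
    }

    // Greedy quantifier, for contrast: ".+" runs to the last closing bracket.
    static String greedy() {
        Matcher m = Pattern.compile("<.+>").matcher("<b>bold</b>");
        return m.find() ? m.group() : null;
    }

    // Nested groups: group 1 is the outer group, group 2 the inner one.
    static String nestedInner() {
        Matcher m = Pattern.compile("((\\w+)@example)").matcher("user@example");
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        System.out.println(namedGroup());   // 2016
        System.out.println(lazy());         // <b>
        System.out.println(greedy());       // <b>bold</b>
        System.out.println(nestedInner());  // user
    }
}
```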
Comparison Features 2
Comparison features:
- Fuzzy matching?
- POSIX support?
- Ignore-case support?
- New-line option support?
- Uses a state machine?
- Accent support?
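Some of these options map onto flags and character classes in `java.util.regex`. The sketch below is one possible reading of the list: it assumes that the "new-line option" means `Pattern.MULTILINE` and that "accent support" means matching accented letters. The class name `OptionProbe` and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionProbe {
    // Ignore case: the flag makes the pattern match regardless of letter case.
    static boolean ignoreCase() {
        return Pattern.compile("tokeniser", Pattern.CASE_INSENSITIVE)
                      .matcher("TOKENISER").matches();
    }

    // New-line option (assumed to be MULTILINE): ^ and $ match at every line boundary.
    static int multilineCount() {
        Matcher m = Pattern.compile("^\\w+$", Pattern.MULTILINE).matcher("one\ntwo");
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // POSIX character class: \p{Alpha} covers ASCII letters.
    static boolean posixClass() {
        return Pattern.compile("\\p{Alpha}+").matcher("Sevilla").matches();
    }

    // Accents: \p{L} covers letters outside ASCII, such as e-acute (U+00E9).
    static boolean accent() {
        return Pattern.compile("\\p{L}+").matcher("P\u00e9rez").matches();
    }

    public static void main(String[] args) {
        System.out.println(ignoreCase());      // true
        System.out.println(multilineCount());  // 2
        System.out.println(posixClass());      // true
        System.out.println(accent());          // true
    }
}
```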
Libraries
- Table 1
Libraries
- Table 2
Libraries
- Table 3
Benchmark 1
- Regular expression list
- String list
- Every expression is matched against every string
- Time in ms
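A benchmark of this shape could look roughly like the sketch below, shown for `java.util.regex` only. The expressions, the input strings, the class name `RegexBenchmarkSketch`, and the choice to compile each pattern once outside the timed loop are all assumptions; the slides do not say which lists were used or how compilation was timed.

```java
import java.util.regex.Pattern;

public class RegexBenchmarkSketch {
    // Matches every expression against every string, `iterations` times,
    // and returns the number of successful full matches.
    static long run(String[] expressions, String[] inputs, int iterations) {
        Pattern[] compiled = new Pattern[expressions.length];
        for (int i = 0; i < expressions.length; i++) {
            compiled[i] = Pattern.compile(expressions[i]);
        }
        long hits = 0;
        for (int it = 0; it < iterations; it++) {
            for (Pattern p : compiled) {
                for (String s : inputs) {
                    if (p.matcher(s).matches()) {
                        hits++;
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical lists; the originals are not given in the slides.
        String[] expressions = {"\\d+", "[a-z]+", "\\w+@\\w+\\.com"};
        String[] inputs = {"12345", "tokeniser", "user@example.com"};

        long start = System.nanoTime();
        long hits = run(expressions, inputs, 10000);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(hits + " matches in " + elapsedMs + " ms");
    }
}
```

Returning the hit count keeps the loop from being optimised away. A single cold run like this is also sensitive to JIT warm-up, so timings are only indicative.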
Benchmark 1: 10000 Iterations
- org.apache: 7078 ms
- com.stevesoft: 19782 ms
- kmy.regex: 781 ms
- java.util: 1266 ms
- jregex.Pattern: 1000 ms
- org.apache.oro: 2156 ms
- dk.brics.automaton: 265 ms
- com.karneim.util.collection: 407 ms
Benchmark 1: 20000 Iterations
- org.apache: 11796 ms
- com.stevesoft: 26641 ms
- kmy.regex: 906 ms
- java.util: 1891 ms
- jregex.Pattern: 1422 ms
- org.apache.oro: 3375 ms
- dk.brics.automaton: 312 ms
- com.karneim.util.collection: 610 ms
Benchmark 1: 50000 Iterations
- org.apache: 28656 ms
- com.stevesoft: 63297 ms
- kmy.regex: 1781 ms
- java.util: 4281 ms
- jregex.Pattern: 3219 ms
- org.apache.oro: 7641 ms
- dk.brics.automaton: 531 ms
- com.karneim.util.collection: 1312 ms
Diagram
[Chart: Benchmark 1 times per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 70000; series: 10000, 20000 and 50000 iterations]
Benchmark 2
- Source Code
- Matching tags
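Matching tags in page source can be sketched as follows. The tag pattern, the class name `TagMatchSketch`, and the sample HTML are assumptions; the slides do not show the expression that was actually benchmarked, and a regular expression of this kind is only an approximation of HTML parsing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagMatchSketch {
    // Hypothetical pattern: an opening or closing tag, capturing the tag name.
    static final Pattern TAG = Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Returns the lower-cased names of all tags found in the source, in order.
    static List<String> tagNames(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            names.add(m.group(1).toLowerCase());
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<html><body><a href=\"x\">link</a></body></html>";
        // prints [html, body, a, a, body, html]
        System.out.println(tagNames(html));
    }
}
```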
Benchmark 2: Amazon
- org.apache: 218 ms
- com.stevesoft: 63 ms
- kmy.regex: 94 ms
- java.util: 0 ms
- jregex.Pattern: 93 ms
- org.apache.oro: 32 ms
- dk.brics.automaton: 0 ms
- com.karneim.util.collection: 47 ms
Benchmark 2: Marca
- org.apache: 62 ms
- com.stevesoft: 47 ms
- kmy.regex: 93 ms
- java.util: 0 ms
- jregex.Pattern: 94 ms
- org.apache.oro: 16 ms
- dk.brics.automaton: 0 ms
- com.karneim.util.collection: 62 ms
Benchmark 2: Ebay
- org.apache: 31 ms
- com.stevesoft: 125 ms
- kmy.regex: 266 ms
- java.util: 0 ms
- jregex.Pattern: 156 ms
- org.apache.oro: 47 ms
- dk.brics.automaton: 0 ms
- com.karneim.util.collection: 172 ms
Diagram
[Chart: Benchmark 2 times per library (org.apache, com.stevesoft, kmy.regex, java.util, jregex.Pattern, org.apache.oro, dk.brics, com.karneim); axis from 0 to 300; series: Amazon, Marca and Ebay]
To sum up…
- dk.brics.automaton is the fastest
- dk.brics and com.karneim fail with URLs
- kmy.regex or java.util
Conclusions
- Tokenisation test
- Searching information
- A real project
- Experience
Thanks!