QUANT - Question Answering Benchmark Curator
Ria Hari Gusmita, Rricha Jalota, Daniel Vollmers, Jan Reineke, Axel-Cyrille Ngonga Ngomo, and Ricardo Usbeck
September 10, 2019
Gusmita et al QUANT September 10, 2019 1 / 33
Outline
1 Motivation
2 Approach
3 Evaluation
4 QALD-specific Analysis
5 Conclusion & Future Work
Motivation
Drawback in evaluating Question Answering systems over knowledge bases
Mainly based on benchmark datasets (benchmarks)
Challenge in maintaining high-quality benchmarks
Motivation
Challenge in maintaining high-quality benchmarks
Change of the underlying knowledge base
DBpedia 2016-04                                  DBpedia 2016-10
http://dbpedia.org/resource/Surfing              http://dbpedia.org/resource/Surfer
http://dbpedia.org/ontology/seatingCapacity      http://dbpedia.org/property/capacity
http://dbpedia.org/property/portrayer            http://dbpedia.org/ontology/portrayer
http://dbpedia.org/property/establishedDate      http://dbpedia.org/ontology/foundingDate
Motivation
Challenge in maintaining high-quality benchmarks
Metadata annotation errors
Motivation
Degradation of QALD benchmarks against various versions of DBpedia
Contribution
QUANT, a framework for the intelligent creation and curation of QA benchmarks
Definition
Let B, D, and Q denote a benchmark, a dataset, and a set of questions, respectively
S represents QUANT’s suggestions
The i-th version of a QA benchmark $B_i$ is a pair $(D_i, Q_i)$
Given a query $q_{ij} \in Q_i$ with zero results on $D_k$ with $k > i$, $S : q_{ij} \to q'_{ij}$
QUANT aims
to ensure that queries from $B_i$ can be reused for $B_k$
to speed up the curation process as compared to the existing one
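The definition can be made concrete with a small sketch. It is only an illustration under strong simplifying assumptions: a dataset is a set of triples, a query is a single (subject, predicate) pattern, and the names `Benchmark`, `answers`, `curate`, `suggest`, and the resource `res:Some_Stadium` are hypothetical and not part of QUANT. The renamed predicate is taken from the DBpedia 2016-04 to 2016-10 change shown in the motivation.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Optional, Set, Tuple

Triple = Tuple[str, str, str]
# Simplification: a query is one (subject, predicate) pattern; its objects are the answers.
Query = Tuple[str, str]


@dataclass(frozen=True)
class Benchmark:
    """B_i = (D_i, Q_i): a dataset version paired with its queries."""
    dataset: FrozenSet[Triple]
    queries: Tuple[Query, ...]


def answers(dataset: FrozenSet[Triple], query: Query) -> Set[str]:
    """Return all objects matching the (subject, predicate) pattern."""
    subject, predicate = query
    return {o for (s, p, o) in dataset if s == subject and p == predicate}


def curate(old: Benchmark,
           new_dataset: FrozenSet[Triple],
           suggest: Callable[[Query], Optional[Query]]) -> Benchmark:
    """Carry queries from B_i over to D_k, applying S only to queries with zero results."""
    kept = []
    for query in old.queries:
        if not answers(new_dataset, query):
            query = suggest(query)  # S : q_ij -> q'_ij
        if query is not None and answers(new_dataset, query):
            kept.append(query)
    return Benchmark(frozenset(new_dataset), tuple(kept))


# Toy data: the predicate was renamed between the two dataset versions.
d_i = frozenset({("res:Some_Stadium", "dbo:seatingCapacity", "50000")})
d_k = frozenset({("res:Some_Stadium", "dbp:capacity", "50000")})
b_i = Benchmark(d_i, (("res:Some_Stadium", "dbo:seatingCapacity"),))


def suggest(query: Query) -> Optional[Query]:
    """Hypothetical suggestion function backed by a fixed rename table."""
    subject, predicate = query
    renamed = {"dbo:seatingCapacity": "dbp:capacity"}
    return (subject, renamed[predicate]) if predicate in renamed else None


b_k = curate(b_i, d_k, suggest)
print(b_k.queries)
```

In this sketch a query that still returns results on the newer dataset is reused unchanged, which mirrors the first aim stated above.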
What QUANT supports
1 Creation of SPARQL queries
2 The validity of benchmark metadata
3 Spelling and grammatical correctness of questions
Approach
Architecture
Approach
Smart suggestions
1 SPARQL suggestion
2 Metadata suggestion
3 Multilingual questions and keywords suggestion
Smart suggestion
1. How the SPARQL suggestion module works
1. SPARQL suggestion
Missing prefix
The original SPARQL query
SELECT ?s
WHERE {
  res:New_Delhi dbo:country ?s .
}
The suggested SPARQL query
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX res: <http://dbpedia.org/resource/>
SELECT ?s
WHERE {
  res:New_Delhi dbo:country ?s .
}
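One way such a missing-prefix suggestion could be produced is sketched below. This is not QUANT's implementation: the function `add_missing_prefixes` and the `KNOWN_PREFIXES` table are assumptions made for illustration, and the regular expressions only cover simple queries like the one above.

```python
import re

# Assumed lookup table of well-known prefixes (hypothetical, not taken from QUANT).
KNOWN_PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "res": "http://dbpedia.org/resource/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}


def add_missing_prefixes(query: str, known: dict = KNOWN_PREFIXES) -> str:
    """Prepend PREFIX declarations for prefixes that are used but not declared."""
    # Blank out IRIs and string literals so their colons are not mistaken for prefixes.
    stripped = re.sub(r'<[^>]*>|"[^"]*"', " ", query)
    declared = set(re.findall(r"\bPREFIX\s+([A-Za-z][\w-]*)\s*:", stripped,
                              flags=re.IGNORECASE))
    used = set(re.findall(r"(?<![\w?$])([A-Za-z][\w-]*):", stripped))
    missing = sorted(p for p in used - declared if p in known)
    header = "".join(f"PREFIX {p}: <{known[p]}>\n" for p in missing)
    return header + query


original = "SELECT ?s\nWHERE {\n  res:New_Delhi dbo:country ?s .\n}"
suggested = add_missing_prefixes(original)
print(suggested)
```

Prefixes that are not in the lookup table are left alone, so an unknown prefix would still need manual curation.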
1. SPARQL suggestion
Predicate change
The original SPARQL query
SELECT ?date
WHERE {
  ?website rdf:type onto:Software .
  ?website onto:releaseDate ?date .
  ?website rdfs:label "DBpedia" .
}
The suggested SPARQL query
SELECT ?date
WHERE {
  ?website rdf:type onto:Software .
  ?website rdfs:label "DBpedia" .
  ?website dbp:latestReleaseDate ?date .
}
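A predicate-change suggestion can be pictured as trying alternative predicates until the query returns results again. The sketch below runs against an in-memory toy triple store; the helper names, the candidate list, the resource `res:DBpedia`, and the stored date value are all made up for illustration and do not describe how QUANT selects its candidates.

```python
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def objects(store: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Return the objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in store if s == subject and p == predicate)


def suggest_predicate(store: Set[Triple], subject: str, old_predicate: str,
                      candidates: Iterable[str]) -> Optional[str]:
    """Keep the old predicate if it still yields results, else try the candidates in order."""
    if objects(store, subject, old_predicate):
        return old_predicate
    for candidate in candidates:
        if objects(store, subject, candidate):
            return candidate
    return None


# Toy store: only the newer predicate carries a value (placeholder date).
store = {("res:DBpedia", "dbp:latestReleaseDate", "placeholder-date")}
suggestion = suggest_predicate(
    store, "res:DBpedia", "onto:releaseDate",
    candidates=["dbo:releaseDate", "dbp:latestReleaseDate"],
)

query = '?website onto:releaseDate ?date .'
rewritten = query.replace("onto:releaseDate", suggestion)
print(rewritten)
```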
1. SPARQL suggestion
Predicate missing
The original SPARQL query
SELECT ?uri
WHERE {
  ?subject rdfs:label "Tom Hanks" .
  ?subject foaf:homepage ?uri
}
The suggested SPARQL query
The predicate foaf:homepage is missing in ?subject foaf:homepage ?uri
1. SPARQL suggestion
Entity change
The original SPARQL query
SELECT ?uri WHERE { ?uri rdf:type yago:CapitalsInEurope }
The suggested SPARQL query
SELECT ?uri WHERE { ?uri rdf:type yago:WikicatCapitalsInEurope }
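A renamed entity such as the one above could be found by comparing the old local name against names that exist in the newer knowledge base. The sketch below uses plain string similarity from Python's standard library as a stand-in technique; it is an assumption, not QUANT's documented method, and every class name in `available_classes` other than `WikicatCapitalsInEurope` is hypothetical.

```python
import difflib
from typing import List, Optional


def suggest_entity(old_local_name: str, available: List[str],
                   cutoff: float = 0.6) -> Optional[str]:
    """Return the most similar existing name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(old_local_name, available, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Assumed list of class names present in the newer knowledge base version.
available_classes = [
    "WikicatCapitalsInEurope",
    "WikicatCapitalsInAsia",
    "WikicatRiversOfEurope",
]

new_name = suggest_entity("CapitalsInEurope", available_classes)
query = "SELECT ?uri WHERE { ?uri rdf:type yago:CapitalsInEurope }"
if new_name is not None:
    query = query.replace("yago:CapitalsInEurope", f"yago:{new_name}")
print(query)
```

A similarity cutoff keeps the sketch from proposing unrelated names; a curator would still have to confirm the suggestion.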
2. Metadata suggestion
3. Multilingual questions and keywords suggestion
Question with missing keywords and translations
3. Multilingual questions and keywords suggestion
Generated keywords: state, united, states, america, highest, density
Utilizing Trans Shell tool → Generated keywords translations suggestion
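Keyword generation of this kind can be approximated by lowercasing the question and dropping stop words. The sketch below is a guess at the general idea only: the question text and the stop-word list are assumptions, chosen so that the output matches the keywords listed above, and the translation step is not sketched.

```python
import re
from typing import List

# Assumed, deliberately tiny stop-word list for the example.
STOPWORDS = {"which", "what", "who", "of", "the", "a", "an", "in", "is", "has"}


def generate_keywords(question: str) -> List[str]:
    """Lowercase the question, keep alphabetic tokens, drop stop words and repeats."""
    keywords: List[str] = []
    for token in re.findall(r"[a-z]+", question.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


# Assumed question text; the slide shows the question only as an image.
question = "Which state of the United States of America has the highest density?"
keywords = generate_keywords(question)
print(", ".join(keywords))
```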
3. Multilingual questions and keywords suggestion
Suggested Question Translations
Evaluation
Three goals of the evaluation:
1 QUANT vs manual curation
  Graduate students curated 50 questions using QUANT and another 50 questions manually
  23 minutes vs 278 minutes
2 Effectiveness of smart suggestions
  10 expert users were involved in creating a new joint benchmark, called QALD-9, with 653 questions
3 QUANT’s capability to provide a high-quality benchmark dataset
  The inter-rater agreement between each pair of users is 0.83 on average

Group            Inter-rater Agreement
1st Two-Users    0.97
2nd Two-Users    0.72
3rd Two-Users    0.88
4th Two-Users    0.77
5th Two-Users    0.96
Average          0.83
Evaluation
Users' acceptance rate in %
[Figure: acceptance rate per user. x-axis: list of users (User 1 to User 10); y-axis: acceptance rate in % (0 to 100)]
QUANT provided 2380 suggestions, and the average user acceptance rate is 81%
The top 4 acceptance rates are for QALD-7 and QALD-8
Evaluation
Number of accepted suggestions from all users
[Figure: number of accepted suggestions per user. x-axis: list of users (User 1 to User 10); y-axis: number of accepted suggestions (0 to 500); series: SPARQL Query, Question Translations, Out of Scope, Onlydbo, Keywords Translations, Hybrid, Answer Type, Aggregation]
Most users accepted suggestions for out-of-scope metadata
Keyword and question translation suggestions yielded the second and third highest acceptance rates.
Evaluation
Number of users who accepted QUANT’s suggestions for each question’s attribute.
[Figure: number of users who accepted suggestions, in %, per attribute. x-axis: name of attributes (Aggregation, Answer Type, Hybrid, Keywords Translations, Only Dbo, Out of Scope, Question Translations, SPARQL Query); y-axis: percentage (0 to 110)]
83.75% of the users accepted QUANT’s smart suggestions on average
Hybrid and SPARQL suggestions were only accepted by 2 and 5 users respectively.
Evaluation
Number of suggestions provided by users
[Figure: number of provided suggestions per user. x-axis: list of users (User 1 to User 10); y-axis: number of provided suggestions (0 to 40); series: SPARQL Query, Question Translations, Out of Scope, Onlydbo, Keywords Translations, Hybrid, Answer Type, Aggregation]
Answer type, onlydbo, out-of-scope, and SPARQL query were the attributes whose values were redefined by users
QALD-specific Analysis
There are 1924 questions, of which 1442 are training data and 482 are test data
QALD-specific Analysis
Duplicate removal resulted in 655 unique questions
Removing 2 semantically similar questions produced 653 questions
Using QUANT with 10 expert users, we got 558 total benchmark questions → increases QALD-8 size by 110.6%
The new benchmark forms the QALD-9 dataset
Distribution of unique questions in all QALDversions
Conclusion
QUANT’s evaluation highlights the need for better datasets and their maintenance
QUANT speeds up the curation process by up to 91%.
Smart suggestions motivate users to engage in more attribute corrections than if there were no hints
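The 91% figure is consistent with the curation times reported in the evaluation (23 minutes with QUANT vs 278 minutes manually). The short check below assumes that the speed-up is meant as the relative reduction in curation time, which the slides do not state explicitly.

```python
# Curation times for 50 questions, as reported in the evaluation.
quant_minutes = 23
manual_minutes = 278

# Assumed definition: relative reduction in time compared to manual curation.
reduction = 1 - quant_minutes / manual_minutes
print(f"{reduction:.1%}")  # 91.7%
```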
Future Work
There is a need to invest more time into SPARQL suggestions as only 5 users accepted them
We plan to support more file formats based on our internal library
Thank you for your attention!
Ria Hari Gusmita
[email protected]
https://github.com/dice-group/QUANT
DICE Group at Paderborn University
https://dice-research.org/team/profiles/gusmita/