Vagueness: the Neglected Feature in Big Data
Walther von Hahn, Universität Hamburg • Computer Science Department
E-Mail: vhahn@[email protected] © Walther v. Hahn
Colloque international • Les mégadonnées et les sciences sociales. 22.–24. Sept. 2017 • Bucarest
Contents
• Theoretical Background
• Which Big Data?
• Vagueness
– types
– on several layers
– historical material
• Annotations and Inferencing
• Data collection and interpretation in DH
• Vagueness in Interpretation GUIs
• Lexical and Syntactic Sources of vagueness in the original
• Example for necessary Manual Annotation of Factual uncertainty
• Summary
• References
Overview
In social science (as in other fields of "Digital Humanities") big data projects tend to collect data as facts in a (relational) database. Social science, however, as a humanity in the sense of Wilhelm Dilthey, partakes in a hermeneutic paradigm for establishing social hypotheses. Accordingly, social data often consist either of texts mirroring attitudes, allegations, beliefs, etc., or are reactions of test subjects to verbal stimuli. Such material cannot be treated as facts like numbers or positions. Analysing only formal features in the material rarely contributes to the hermeneutic aims of the sociological quest.
• The talk is about possible ways out of this dilemma. A first solution is the subsequent usage of big data for human reading and interpretation only, which, however, underestimates the scientific power of computing.
• Another solution is a semi-automatic annotation by metadata about the credibility of texts and authors as well as by lexical annotations of vagueness expressions. Occurrences of "perhaps", "mostly" or "to a certain extent" (to name only obvious examples) […]. Annotation supports semantic qualifications and allows for reasoning over vague features in big data.
Empirical Social Science, Humanity or Science
SCIENCE:
– statistics and stochastics
– computerized methods for (semi-automatic) collection, retrieval, annotation and analysis, data exchange, linking among data
HUMANITY:
– Quantitative methods support the humanities' qualitative research, but do not replace it. The main hermeneutic task is left open.
– Computer formalisms typically model facts in databases. However, only few humanistic issues are facts; most are open to interpretation. Standard databases in some way obscure the data by alleging "facts". The main hermeneutic task is still left open.
– Example: Most owners of TV sets have a low IQ.
Background: Both facts co-occur. Who decides that they are not causally related?
© v. Hahn • Uni Hamburg
Science and Humanities
Wilhelm Dilthey (Einleitung in die Geisteswissenschaften, 1922): Dilthey describes history as "a series of world views." Man can only understand himself through what "history can tell him" … […] "intrinsic temporality of all understanding", i.e., that man's understanding is dependent on past world views, interpretations, and a shared world.
Later on Hans-Georg Gadamer (Wahrheit und Methode, 1960) declared that interpreting a text involves a fusion of horizons (Horizontverschmelzung). Both the text and the interpreter find themselves within a particular historical tradition, or "horizon". Each horizon is expressed through the medium of language, and both text and interpreter belong to and participate in history and language.
Jürgen Habermas (Technik und Wissenschaft als Ideologie, 1968; Theorie des kommunikativen Handelns, 1981) distinguishes between purposive rational action and social action, […]. Jürgen Habermas' concept and theory of communicative rationality distinguishes itself from the rationalist tradition by locating rationality in structures of interpersonal linguistic communication rather than in the structure of the cosmos.
Which Big Data?
• Vagueness in social science is an issue for those big data which in the end are evaluated semantically, i.e. by analyses or annotations above the level of formal linguistic structures.
• At least when you use WordNet synsets or, even worse, their translations from English, you have to expect vagueness problems, because you use word senses in a hermeneutic way, not only by measuring.
Dialogue and Semantic Interpretation
Question: "How do you live in Idohalive?"
Answer: "Oh, in general, ok!"
Interpretation: "He is satisfied"
Is this a fact in the database?
Example
• Many social science data come from public opinion polls and are individual responses to verbal stimuli.
• Measuring formal details (e.g. sentence length, response time) is not a hermeneutic activity.
• However, a later attachment of meaning ("A majority of respondents are sceptical about African immigrants") to numerical results is a hermeneutic issue and subject to interpretation.
• To avoid invalid interpretations, users have to include in their evaluation metadata about the survey details, i.e. about the questionnaire and the respondents.
• A linguistic analysis has to check the homogeneity of the random sample or possible interference between the interviewer and the respondents.
• Within big data a user has to merge the metadata into a data reliability info.
M. Pinkal's Schema of Semantic Vagueness
Semantic Vagueness
– Vagueness in a narrow sense: Porosity, Relativity, Inexactness, Borderline Uncertainty
– Ambiguity: Homonymy, Polysemy, Syntactic Ambiguity, Referential Ambiguity, Elliptical Ambiguity, Metaphorical Ambiguity
one-dimensional / many-dimensional
Illocutive Unclarity
Communicative Underspecification
Cimiano, Unger and McCrae
Referential Uncertainty
Factual Uncertainty
– (yet) unexplored facts: "the moon is 384402,56 m distant from the earth"
– range expressions: "The beginning of the 18th century", "Romania in the middle ages"
– uncertain definition: "the northern slope of the mountain"
– inexact measures: "4 Tagereisen", "10 Fuß" ("a 4 days' journey", "10 feet")
– unclear place: "Syrfia"
– unclear facts: "auf Befehl des Sultans" ("by order of the sultan")
– unclear time: "In grauer Vorzeit" ("in prehistoric times")
– unclear person: "der damalige Fürst" ("the former prince")
– unclear action: "Die Unterwerfung der Barbaren" ("the submission of the barbarians")
Challenge for DH: Vagueness on several layers
Examples from historic texts:
• Linguistic vagueness,
• Logical vagueness,
• Fuzzy concepts
– "Before Stephan the Great, all mountains around Moldavia belonged to Transylvania and the country was narrow on this side" …,
• Vague or concurrent ontologies:
– The Turkish and the Moldavian administration,
• Referential vagueness or uncertainty
– The origin of the hill "Chan Tepesi" or "Mogila Rabuy",
• Naïve History (derived from 'naive physics')
– "The Roman Empire conquered Dacia",
• Historical change,
• Vagueness of the sources
The further you go back in history, the vaguer the data become
• Measures
• Time span expressions: in the beginning of the Xth century, shortly later
• Persons: the former prince, the current pope
• Even named entities (NEs) are often vague
• Additionally, changes in writing create artificial vagueness
How to annotate
• In big data you cannot annotate large amounts of text at reasonable cost.
• The only ways out:
– small learning texts and automatic propagation,
– automatic annotation of lexical indicators,
– including metadata for text classes,
– establishing inference rules for "vagueness combinations".
Lexical Vagueness Predictors
• Modal verbs: must, should, will, can ...
• Adverbs: perhaps, for example, so to say, possibly, maybe, by any chance, roughly, rude, coarse, and so on, and so forth, basically, ...
• Adjectives: simplified,
• Comparative degrees: better, more, worse ...
• Vague quantifiers: many, most, mostly, majority, often
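The automatic annotation of lexical indicators could be sketched roughly as follows. This is a minimal illustration, not the method used in the project: the category keys, the `Hit` type and the `annotate` function are hypothetical names, and the word lists are only the examples given above.

```python
import re
from typing import NamedTuple

# Indicator lists taken from the examples above; the category keys are
# hypothetical labels chosen for this sketch.
VAGUENESS_LEXICON = {
    "modal_verb": ["must", "should", "will", "can"],
    "adverb": ["perhaps", "for example", "so to say", "possibly", "maybe",
               "by any chance", "roughly", "and so on", "and so forth",
               "basically"],
    "comparative": ["better", "more", "worse"],
    "vague_quantifier": ["many", "most", "mostly", "majority", "often"],
}


class Hit(NamedTuple):
    start: int      # character offset of the match in the text
    text: str       # matched surface form
    category: str   # key from the lexicon


def build_pattern(lexicon):
    """Compile one case-insensitive regex for all indicators."""
    lookup = {}
    alternatives = []
    for category, phrases in lexicon.items():
        for phrase in phrases:
            lookup[phrase] = category
            # Allow any whitespace between the words of a multi-word phrase.
            words = [re.escape(w) for w in phrase.split()]
            alternatives.append(r"\s+".join(words))
    # Longer alternatives first, so longer phrases win over shorter ones.
    alternatives.sort(key=len, reverse=True)
    regex = re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)
    return regex, lookup


def annotate(text, lexicon=VAGUENESS_LEXICON):
    """Return all lexical vagueness indicators found in the text."""
    regex, lookup = build_pattern(lexicon)
    hits = []
    for match in regex.finditer(text):
        key = " ".join(match.group(0).lower().split())
        hits.append(Hit(match.start(), match.group(0), lookup[key]))
    return hits


for hit in annotate("Perhaps most respondents are more sceptical."):
    print(hit.start, hit.text, hit.category)
```

Pure string matching over-generates: "will" and "can" are also nouns, and "more" is not always a comparative. Such hits are candidates for annotation, not facts.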
Metadata to be included in the GUI
genre (decreasing reliability):
• official document
• letter
• fiction
• fairy tale
• legend
• folk tradition
credibility of author (decreasing credibility):
• politician
• journalist
• fiction writer
historical distance:
• modern
• historical
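One conceivable way to turn these ordered metadata categories into a reliability info is an ordinal score per dimension. The sketch below rests on assumptions that are not part of the slides: the linear spacing of the scores, the reading of "historical" as less reliable than "modern", the "weakest link" combination by minimum, and all function names.

```python
# Category orderings as listed above, from most to least reliable.
GENRE_ORDER = ["official document", "letter", "fiction",
               "fairy tale", "legend", "folk tradition"]
AUTHOR_ORDER = ["politician", "journalist", "fiction writer"]
# Assumption of this sketch: historical texts rank below modern ones.
DISTANCE_ORDER = ["modern", "historical"]


def rank_score(value, ordering):
    """Map a category to [0, 1]: 1.0 for the first entry, 0.0 for the last.

    The linear spacing is an arbitrary assumption; only the order is
    given by the metadata lists.
    """
    index = ordering.index(value)  # raises ValueError for unknown values
    return 1.0 - index / (len(ordering) - 1)


def reliability_info(genre, author, distance):
    """Collect one score per metadata dimension plus a combined value."""
    scores = {
        "genre": rank_score(genre, GENRE_ORDER),
        "author": rank_score(author, AUTHOR_ORDER),
        "distance": rank_score(distance, DISTANCE_ORDER),
    }
    # Hypothetical combination rule: a text is as reliable as its
    # weakest metadata dimension.
    scores["combined"] = min(scores.values())
    return scores


print(reliability_info("letter", "journalist", "modern"))
```

Keeping the individual scores next to the combined value lets an interpretation GUI show why a text was rated as unreliable, instead of hiding the reason in one number.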
Current DH-Approach
[Diagram: DATA; CL-Tools; NE-Recognition; Machine Learning; Manual Annotation (domain specific); Knowledge-Base with entries A, B, C, D, E; Quantitative Measures; Interpretation]
Including Vagueness
[Diagram: DATA; CL-Tools; NE-Recognition; Machine Learning; Manual Domain Annotation; Manual Vagueness Annotation; Fuzzy-KB with entries A, B, C, D, E, some of them marked "?" or "??"; Quantitative Measures; Interpretation]
Lexical and Syntactic Sources of vagueness in the original
Hactenus Gregoras: ad cuius verba observare haud extra propositum erit τὴν πρώτην, quam Gregoras vocat „Tartariam”, eandem esse, quam hodie vulgo „Magnam” appellamus eiusque incolarum nomina, etsi ab historicis recenseantur, tamen adscita magis, aut ab exteris indita, quam propria eisque, dum in suis sedibus morarentur, peculiaria fuisse. Ita, si quis in praefixa huic tractationi Praefatione legerit Oguzorum gentis Principes in duas stirpes fuisse divisos, „Aliothman” unam, et „Ali Dzengiz”1 alteram, ne credat sub ipsis horum generum conditoribus hanc appellationem iam apud eas gentes invaluisse. Vti enim absonum videtur, Aliothmanos Suleimano parentes ab huius nepote, qui integro post saeculo iis imperabat, nomen fuisse sortitos; ita non minus falso vulgo praedicantur Tartarorum Crimensium Principes ab ipso Dzengizchano „Alidzengiz” appellationem retinuisse.
Până aici l-am citat pe Gregoras: faţă de cuvintele lui nu va fi nepotrivit să observăm că acea Tartaria „ἡ πρώτη”, pe care o numeşte Gregoras, este chiar aceea pe care o numim îndeobşte cea „Mare”, iar numele locuitorilor ei, chiar dacă sunt înregistrate de istorici, au fost totuşi mai degrabă împrumutate sau date de străini decât proprii lor, purtate întocmai pe vremea când se aflau în sălaşurile lor. Astfel, dacă va fi citit cineva în Prefaţa pusă înaintea acestui tratat că principii neamului oguzilor au fost împărţiţi în două stirpe, una „aliothmană”, cealaltă, „alidzengiză”, să nu creadă că denumirea aceasta era de-acum valabilă pentru întemeietorii acestor neamuri. Căci, după cum pare nepotrivit ca aliothmanizii care i se supun lui Suleiman să-şi fi ales numele de la nepotul acestuia, care a domnit peste ei după un secol întreg, la fel de fals se spune îndeobşte că principii tartarilor din Crimea şi-ar fi păstrat denumirea „alidzengiz” chiar de la Dzengizchan
Annotations: more plausible • quotation • would have been … • seems unlikely • equally false
Domnul cel dintâi carele după năvălirea lui Batie, a agonisit iarăși strălucirea cea mai dinainte a Moldovei a fost: 1. Dragoș și măcar că hronografiile noastre nu arată pentru știința neamului său, dar la noi se zice necontenit, că a fost din neamul cel vechiu al crailor Moldovinești, și a avut tată pe Bogdan fiul lui Ioan, de la carele toți Domnii obișnuesc a-și pune la iscălitură numele Ioan. Și cuvântul acesta este mai ușor de a se adeveri și pentru aceasta, căci cu greu este de a se crede, că altul din neam mai prost, ar fi putut cu o tovărășie așa mare să meargă la vânat, carele a dat prilej la descoperirea Moldovei și ar fi putut ...
Der erste demnach, der nach Batia Einfall (*) der Moldau ihren vorigen Glanz wieder verschafft hat, war 1. Dragosch. Obgleich unsre Jahrbücher sein Geschlechtsregister nicht angeben, so ist es doch eine beständige Sage bey uns, daß er aus dem alten königlichen moldauischen Stamme gewesen sey, und den Bogdan zum Vater gehabt habe, welcher ein Sohn des Johannis war, von welchem alle Fürsten den Namen Johannis in ihrem Titel zu führen pflegen; dieser Meinung ist desto mehr Glauben beyzumessen, weil man schwerlich glauben kan, daß einer von gemeiner Herkunft mit einem so großen Gefolge auf die Jagd (welche die Moldau zu entdecken Gelegenheit gegeben,) habe ausgehen, ….
Was: Dragos = belongs_to Moldavian kings
Should have been: Dragosch ≈ belongs_to Moldavian kings
Example of wrong knowledge extracted without deeper linguistic annotation: German and Romanian case
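The difference between "=" and "≈" in this example can be illustrated with a small sketch. It assumes that an extraction step has already produced the triple, and it only decides whether the source sentence licenses a certain or a hedged assertion. The cue lists contain just the hearsay markers visible in the two passages ("se zice", "beständige Sage"); the names `Triple`, `qualify` and `HEARSAY_CUES` are hypothetical.

```python
from dataclasses import dataclass

# Hearsay cues taken from the two passages above. A real project would
# need a much larger, manually curated list (assumption of this sketch).
HEARSAY_CUES = {
    "de": ["beständige sage"],
    "ro": ["se zice"],
}


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str
    certain: bool  # True is rendered as "=", False as "≈"

    def __str__(self):
        operator = "=" if self.certain else "≈"
        return f"{self.subject} {operator} {self.predicate} {self.obj}"


def qualify(subject, predicate, obj, sentence, lang):
    """Mark an extracted triple as hedged if the sentence reports hearsay."""
    lowered = sentence.lower()
    hedged = any(cue in lowered for cue in HEARSAY_CUES.get(lang, []))
    return Triple(subject, predicate, obj, certain=not hedged)


german = ("so ist es doch eine beständige Sage bey uns, daß er aus dem "
          "alten königlichen moldauischen Stamme gewesen sey")
print(qualify("Dragosch", "belongs_to", "Moldavian kings", german, "de"))
```

A plain substring test is the simplest possible trigger. It cannot tell how far the hedge reaches within a long sentence, which is one reason why deeper linguistic annotation is needed.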
Example for necessary Manual Annotation of Factual uncertainty
[…] He fought two Battles with Bajazet Ildirim; in the first he was victor, and in the second he routed him with a memorable slaughter, which seven vast piles of Turkish Bodies erected after the Battle, witnessed, by the Confession of Hezarfenn himself, the faithful Turkish Historian.
Cantemir, p. 47 (Annotations)
Hezarfen (Hezarfen Hüseyin Efendi) (?-1691/92), Tenkih-i Tevarih-i Mülük: does NOT mention these facts
The Turkish historians so extoll this prince's expedition in assembling his troops, in executing his designs, and in vanquishing his enemies, that when they talk of the natural speed of the Tartars in comparison with his wonderful marches, they call the first, the creeping of a snail.
Cantemir, p. 48 (Annotations)
Described in Solakzade: ?, Hoca Saadettin: , Neşri:
Summary
To avoid
• that words/texts become facts or concepts without semantic annotations,
• that big social data become uniform database entries without some sort of reliability check,
we need indications of their vagueness.
References
• Ballmer, Thomas T. and Pinkal, Manfred, Approaching Vagueness, Amsterdam 1983.
• Geeraerts, Dirk, Vagueness's puzzles, polysemy's vagaries. In: Newman, John, Cognitive […].
• v. Hahn, Walther, Vagheit bei der Verwendung von Fachsprachen. In: Hoffmann/Kalverkämper/Wiegand: Fachsprachen. Band 1. Berlin 1998. S. 383–390.
• Pinkal, Manfred, Semantische Vagheit: Phänomene und Theorien, Teil I. In: […], S. 1–26, Wiesbaden 1980.
• Pinkal, Manfred, Semantische Vagheit: Phänomene und Theorien, Teil II. In: […], S. 1–26, Wiesbaden 1981.
• Winkler, Edeltraud, Überlegungen zu Artefaktbezeichnungen im Deutschen. In: Deutsche Sprache 37 (2009) H. 1, S. 33–47.