Web Science Introduction to Information Integration Julien Gaugaz, October 26, 2010.


Topics

•1. Information Integration

•2. Web Information Retrieval

•3. Entity Search

•4. Web Usage

•5. Collaborative Web

•6. Web Archiving

•7. Medical Social Web

Scenarios

Why Integrate Information?

Company Mergers


Travelling Agent


Booking Flights


Leveraging Wikipedia Infoboxes

[Figure labels: Query, Data Contribution]

Evolution

[Figure: number of sources over time. Stages: Beginning of Databases; Rise of Internet & Wrapping Websites; Wikipedia & Social Web]

Kinds of discrepancies

What is the Problem?


Wikipedia Infoboxes

http://en.wikipedia.org/wiki/Berlin

| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000

http://de.wikipedia.org/wiki/Berlin

| [[(...)|Reg. Bürgermeister]]: || [[Klaus Wowereit]]
| [[Höhe]] : || 34–115 m ü. NN
| [[Einwohner]] : || {{Metadaten Einwohnerzahl DE-BE|Berlin}}[...] (rendered as: 3.443.735 (31. Mai 2010))

Wikipedia Infoboxes

http://en.wikipedia.org/wiki/San_Francisco

|leader_title = [[Mayor of San Francisco|Mayor]]
|leader_name = [[Gavin Newsom]] ([[Democratic [...]|D]])
|elevation_ft = 52
|elevation_max_ft = 925
|elevation_min_ft = 0
|population_as_of = 2008
|population_total = 815358
|population_metro = 4203898
|population_urban = 3228605

http://de.wikipedia.org/wiki/Berlin

| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000

Causes of Discrepancies

•Information sources are diverse

•Different cultural background

•Different domain of activity

•Different model of information

•Typos and other kinds of errors

•Evolution over time

•Use, usage and users of a source may change over time

Places of Discrepancies

Information level where discrepancies appear:

•Semantic: meaning, sense

•Representational

• Lexical: word / term representing the meaning

• Structural: how the terms are arranged to represent the meaning

•Syntactic: how the lexical and structural representation is encoded into characters (and bits)

Discrepancies may concern:

•Schema elements (properties and structure) and values

Schema Discrepancies

Semantic:

Einstein’s full name is “Albert Einstein”

Representational:

Einstein → name → first: “Albert”, last: “Einstein”

Einstein → full_name: “Albert Einstein”

Syntactic:

<Einstein> <full_name> “Albert Einstein”.

<Einstein> <full_name>Albert Einstein</full_name></Einstein>

Schema Ambiguity

The same representational element “title” can carry different meanings:

•xyz → title: “Prof. Dr. techn.” (semantic: person title)

•xyz → title: “The Theory of Relativity” (semantic: article title)

Value Discrepancies

Semantic:

Einstein’s full name is “Albert Einstein”

Representational:

Einstein → full_name: “Albert Einstein”, “Albert Einstin”, “A. Einstein”, “Einstein, Albert”

Where discrepancies are addressed with standards

Syntactic Level


Encoding Bytes

•Basic unit

•Universal standard: Bit (binary digit)

•Ternary digit (base 3, USSR 50’s, out of use)

•Bits into bytes

•Big or little endian

•System-wide convention, easily convertible, defined in communication protocols
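A minimal sketch of the byte-order point, using only Python's built-in integer conversions. The value 0x12345678 is an arbitrary illustration, not taken from the slides.

```python
# The same 32-bit integer laid out under both byte-order conventions.
value = 0x12345678

big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Converting between the two conventions is a plain byte reversal.
assert big[::-1] == little

# Reading the bytes back with the matching convention restores the value.
assert int.from_bytes(little, byteorder="little") == value
```

Network protocols typically fix one of the two orders so that both ends agree regardless of their native convention.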

Encoding Characters

•De facto standards:

•UTF-8/16

•Many others exist: ASCII, ISO-8859’s, KOI-8, ...

•Trivial dictionary-based translation

•When the corresponding code exists in the target character map...
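The dictionary-style translation between character maps, and its failure case, can be sketched with Python's built-in codecs. The sample word "Höhe" comes from the German infobox; the helper name `can_encode` is hypothetical.

```python
text = "Höhe"

utf8_bytes = text.encode("utf-8")          # 'ö' takes two bytes in UTF-8
latin1_bytes = text.encode("iso-8859-1")   # 'ö' takes one byte in ISO-8859-1

# Transcoding: decode with the source map, then encode with the target map.
transcoded = utf8_bytes.decode("utf-8").encode("iso-8859-1")
assert transcoded == latin1_bytes


def can_encode(s: str, encoding: str) -> bool:
    """Return True if every character of s exists in the target character map."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(text, "iso-8859-1"))  # True
print(can_encode(text, "ascii"))       # False: ASCII has no code for 'ö'
```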


Encoding Lexico-Structural

•XML, XML Schema

•Structured document serialization format

•Base for:

•(X)HTML

•SVG: Scalable Vector Graphics

•DOCX: Microsoft Office Word 2007

Resource Description Framework

Encoding information

RDF

•<subject> <property> <object>

•<subject>

•URI or blank node

•<property>

•URI

•<object>

•URI or blank node or (typed) literal

source: http://www.xml.com/2003/02/05/graphics/graph1.gif
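A minimal sketch of the triple model in plain Python, without an RDF library. The namespace `http://example.org/`, the property names and the blank node label `_:elev` are hypothetical; the values are taken from the Berlin infobox.

```python
from typing import NamedTuple, Optional, Union


class Literal(NamedTuple):
    value: str
    datatype: Optional[str] = None  # datatype URI for typed literals


# A plain str stands for a URI or a blank node label such as "_:elev".
Node = Union[str, Literal]


class Triple(NamedTuple):
    subject: str    # URI or blank node
    property: str   # URI
    object: Node    # URI, blank node or (typed) literal


EX = "http://example.org/"                       # hypothetical namespace
XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

graph = [
    Triple(EX + "Berlin", EX + "leader", EX + "Klaus_Wowereit"),
    Triple(EX + "Berlin", EX + "population", Literal("3440441", XSD_INT)),
    Triple(EX + "Berlin", EX + "elevation", "_:elev"),
    Triple("_:elev", EX + "min", Literal("34", XSD_INT)),
    Triple("_:elev", EX + "max", Literal("115", XSD_INT)),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t.subject == s)
            and (p is None or t.property == p)
            and (o is None or t.object == o)]


for t in match(graph, s=EX + "Berlin"):
    print(t.property, "->", t.object)
```

The blank node groups the two elevation values without needing a global identifier of its own.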

URI

•URI: Uniform Resource Identifier

•URLs are URIs

• scheme:scheme-specific-part

•RDF encourages using URLs

•URL

• scheme://usr:passwd@domain:port/path?query_string#anchor
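The URL pattern can be taken apart with the standard library's `urllib.parse.urlsplit`. The concrete URL and the mailto address are made-up examples.

```python
from urllib.parse import urlsplit

# A URL following scheme://usr:passwd@domain:port/path?query_string#anchor
url = "http://usr:passwd@example.org:8080/wiki/Berlin?action=raw#History"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.username)  # usr
print(parts.password)  # passwd
print(parts.hostname)  # example.org
print(parts.port)      # 8080
print(parts.path)      # /wiki/Berlin
print(parts.query)     # action=raw
print(parts.fragment)  # History

# A URI that is not a URL still follows scheme:scheme-specific-part.
mail = urlsplit("mailto:user@example.org")
print(mail.scheme, mail.path)  # mailto user@example.org
```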

RDF

•Resource Description Framework

•Data model specialized in conceptual information modeling

•Supported by various serialization formats:

•XML

•Notation3 (N3)

•Turtle

•...


RDF Schema (RDF/S)

•Expressed in RDF

•Types subjects and objects with classes

•Class hierarchy (with multiple inheritance)

•Type of properties of a class

•Types properties

•Domain: type of property’s subject

•Range: type of property’s object

•OWL2 is more expressive: cardinality, etc...
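A toy sketch of how domain, range and the class hierarchy let types be inferred from a single statement. All class and property names (`ex:City`, `ex:leader`, and so on) are hypothetical, and the code covers only this small part of the RDF Schema semantics.

```python
# Hypothetical schema: a class hierarchy plus domain and range of one property.
SUBCLASS_OF = {
    "ex:Capital": {"ex:City"},
    "ex:City": {"ex:Place"},
    "ex:Mayor": {"ex:Person"},
}
DOMAIN = {"ex:leader": "ex:City"}    # type of the property's subject
RANGE = {"ex:leader": "ex:Person"}   # type of the property's object


def superclasses(cls):
    """All direct and indirect superclasses (multiple inheritance allowed)."""
    seen, stack = set(), [cls]
    while stack:
        for parent in SUBCLASS_OF.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def infer_types(triples):
    """Derive the classes of each resource from explicit types, domain and range."""
    types = {}
    for s, p, o in triples:
        if p == "rdf:type":
            types.setdefault(s, set()).add(o)
        if p in DOMAIN:
            types.setdefault(s, set()).add(DOMAIN[p])
        if p in RANGE:
            types.setdefault(o, set()).add(RANGE[p])
    # Close every type set under the subclass hierarchy.
    for classes in types.values():
        for cls in list(classes):
            classes |= superclasses(cls)
    return types


triples = [
    ("ex:Berlin", "rdf:type", "ex:Capital"),
    ("ex:Berlin", "ex:leader", "ex:Klaus_Wowereit"),
]
for resource, classes in infer_types(triples).items():
    print(resource, sorted(classes))
```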


When to use RDF?

•RDF is good at

•Modeling information

•Especially when schema is unknown or changing

•When there are multiple schemas

•RDF is not for

•Representing documents (XHTML, CSS)

• Internal data management when schema is known and fixed (Relational Databases)

Discrepancies between the representational and semantic levels in the schema

Schema Matching

•Input: Schemas to match

•Possibly data instantiating those schemas

•Output: Mappings between schema elements

•Possibly with confidence values and alternatives

•Possibly with value conversion rules (matchings)

Example schemas (figure):

Boxer: name, boxer id, weight, birthdate, total fights, residence

Taxpayer: first name, last name, age, address, street, city, tax id

Company: ...

Trainer: ...

Tax Office

Mappings or Matching?

•Schema mapping identifies correspondences between schema elements

•Schema matching actually transforms an instance of one schema into an instance of another schema
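A minimal sketch of the distinction as the slides use the two terms, on the Boxer and Taxpayer schemas. The confidence values, the `split_name` rule and the function names are illustrative assumptions, not output of a real matcher.

```python
# Mapping (in the slides' sense): correspondences between schema elements,
# here with made-up confidence values.
mappings = [
    ("name", ("first name", "last name"), 0.9),
    ("residence", ("address",), 0.7),
]


def split_name(name):
    """Hypothetical value conversion rule: 'Muhammad Ali' -> first and last name."""
    first, _, last = name.partition(" ")
    return {"first name": first, "last name": last}


# Matching (in the slides' sense): actually transform an instance of one
# schema into an instance of the other, following the correspondences.
def boxer_to_taxpayer(boxer):
    taxpayer = {}
    taxpayer.update(split_name(boxer["name"]))
    taxpayer["address"] = boxer["residence"]
    return taxpayer


boxer = {"name": "Muhammad Ali", "residence": "17, Nicestreet Louisville, KY"}
print(boxer_to_taxpayer(boxer))
```

Attributes without a correspondence, such as boxer id or tax id, are simply left out of the transformed instance.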

General architectures

How to Use Mappings?


Mediated Schemas

[Figure labels: Query, Mediated Schema, Schema1, Schema2, Schema3, Schema x]

Peer Data Management

[Figure labels: Local Source, Local Mapping, Local Schema, Peer Schema, Peer Mapping]

Why not by hand?

•Size and complexity of source schemas

•Number of schema sources

•Leveraging data instance values

•Schemas not known in advance

source: http://www.geneontology.org/images/diag-godb-er.jpg

source: http://www.atutor.ca/development/documentation/database.gif

Schema Matching Features

•Schema-only vs schema & instances

•Representational

•Lexical vs structural

•Internal vs external

More in:

•Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334-350.

•Shvaiko P, Euzenat J. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV. 2005;3730:146-171.

Schema Matching Techniques

•String-based

•Language-based

•Linguistic resources

•Constraint-based

•Alignment reuse

•Upper-level formal ontologies

•Graph-based

•Taxonomy-based

•Repository of structures

•Model-based

Leveraging lexical features

A String-Based Technique


Edit Distance

•String distance: measures distance between two strings

•Edit distance: number of operations needed to transform one string into the other

•Common basic operations:

•Insert, delete or substitute one character

•Possibly with different weights depending on the operation and characters involved

•Java libraries:

•SecondString, SimMetrics


Levenshtein Distance

•Edit operations: insert, delete, substitute

•Each has a weight of 1

         S  a  t  u  r  d  a  y
      0  1  2  3  4  5  6  7  8
   S  1  0  1  2  3  4  5  6  7
   u  2  1  1  2  2  3  4  5  6
   n  3  2  2  2  3  3  4  5  6
   d  4  3  3  3  3  4  3  4  5
   a  5  4  4  4  4  4  4  3  4
   y  6  5  5  5  5  5  5  4  3
   s  7  6  6  6  6  6  6  5  4

Legend: moving right = insert to Sundays; moving down = delete from Sundays; moving diagonally = substitute in Sundays

Intermediate strings shown on the slide: Sundays, Satundays, Saturdays
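A minimal sketch of the dynamic program behind such a matrix, with all three operations at weight 1. It keeps only two rows at a time; the function name is an assumption, and the test strings are the slide's "Sundays" and "Saturday".

```python
def levenshtein(source: str, target: str) -> int:
    """Number of unit-cost inserts, deletes and substitutions turning source into target."""
    # Row for the empty source prefix: j inserts produce the first j target characters.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # i deletes turn the first i source characters into ""
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char from source
                current[j - 1] + 1,                    # insert t_char into source
                previous[j - 1] + (s_char != t_char),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Sundays", "Saturday"))               # 4
print(levenshtein("Sunday", "Saturday"))                # 3
print(levenshtein("Albert Einstein", "Albert Einstin"))  # 1
```

The value returned corresponds to the bottom-right cell of the full matrix.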

WordNet

A Linguistic Resource


Hypernyms / Hyponyms

•Hypernyms: superordinates, isA relationships. A synset may have more than one hypernym.

•Hyponyms: subordinates

hypernym: {motor vehicle, automotive vehicle}

{car, auto, automobile, machine, motorcar}

hyponyms: {cab, hack, taxi, taxicab} {ambulance}
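A toy sketch of how such relations could be used to compare two terms. The four synsets are the ones from the slide; the synset identifiers and the `relation` function are hypothetical, and a real matcher would query WordNet itself rather than a hand-built dictionary.

```python
# Hand-built excerpt: synset id -> words, and synset id -> direct hypernyms.
SYNSETS = {
    "motor_vehicle": {"motor vehicle", "automotive vehicle"},
    "car": {"car", "auto", "automobile", "machine", "motorcar"},
    "cab": {"cab", "hack", "taxi", "taxicab"},
    "ambulance": {"ambulance"},
}
HYPERNYMS = {
    "car": {"motor_vehicle"},
    "cab": {"car"},
    "ambulance": {"car"},
}


def synsets_of(word):
    return {sid for sid, words in SYNSETS.items() if word in words}


def ancestors(sid):
    """All direct and indirect hypernyms of a synset."""
    seen, stack = set(), [sid]
    while stack:
        for parent in HYPERNYMS.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def relation(a, b):
    """Return 'synonym', 'hyponym' (a is a kind of b), 'hypernym', or None."""
    sa, sb = synsets_of(a), synsets_of(b)
    if sa & sb:
        return "synonym"
    if any(y in ancestors(x) for x in sa for y in sb):
        return "hyponym"
    if any(x in ancestors(y) for x in sa for y in sb):
        return "hypernym"
    return None


print(relation("taxi", "cab"))                 # synonym
print(relation("taxi", "automobile"))          # hyponym
print(relation("motor vehicle", "ambulance"))  # hypernym
print(relation("taxi", "ambulance"))           # None
```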

Holonym / Meronym

•Meronym: name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.

•Holonym: name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.

holonym: {car, auto, automobile, machine, motorcar}

meronym: {accelerator, accelerator pedal, gas pedal, gas, throttle, gun}

Other relationships in WN

•Antonym

•Entailment (for verbs)

•A verb X entails Y if X cannot be done unless Y is, or has been, done.

•Attribute (for adjectives)

•A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.

Leveraging structure

A Graph-Matching Technique


Similarity Flooding

•Uses structure of the data to help matching schemas

• Similarity Flooding in Melnik et al. (2002)

• First maps schema elements with lexical similarity

• Then improves matching assuming that:

• If two elements are similar, then the elements adjacent to them are more likely to be similar

Selected paper 1: Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
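A heavily simplified sketch of the propagation idea, not the algorithm from the paper: edge labels and propagation coefficients are ignored, neighbor similarities are averaged, and `difflib.SequenceMatcher` stands in for the lexical similarity. The two small graphs and all function names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from itertools import product


def lexical_similarity(a, b):
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flood(graph_a, graph_b, iterations=10):
    """graph_x maps each element to the list of its adjacent elements."""
    initial = {(a, b): lexical_similarity(a, b)
               for a, b in product(graph_a, graph_b)}
    sim = dict(initial)
    for _ in range(iterations):
        updated = {}
        for a, b in sim:
            # Similarity flows in from every pair of adjacent elements.
            neighbor_pairs = list(product(graph_a[a], graph_b[b]))
            incoming = sum(sim[pair] for pair in neighbor_pairs)
            if neighbor_pairs:
                incoming /= len(neighbor_pairs)
            updated[(a, b)] = initial[(a, b)] + sim[(a, b)] + incoming
        # Normalize so that the best pair has similarity 1.
        top = max(updated.values()) or 1.0
        sim = {pair: value / top for pair, value in updated.items()}
    return sim


graph_a = {"Boxer": ["name", "residence"], "name": ["Boxer"], "residence": ["Boxer"]}
graph_b = {"Taxpayer": ["last name", "address"], "last name": ["Taxpayer"], "address": ["Taxpayer"]}

scores = flood(graph_a, graph_b)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 3))
```

Each round adds to a pair's score the average score of its adjacent pairs, so two elements whose neighbors match well gain similarity even when their own names differ.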

Detecting duplicate entries

Deduplication


Why are there Duplicates?

Sport Authorities

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

Tax Authorities

• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17 city: Wondercity
• tax id: #7234561

Administration-wide database

•Input: 2 entities with matched attributes

•Output: M for matched or U for unmatched.

•Possibly R for reject between M and U for cases where supervised decision is necessary.

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17 city: Wondercity
• tax id: #7234561

• name: Muhammad Ali
• address:
• city: Cairo
• country: Egypt
• tax id: #8244361

[Figure labels on the record pairs: M, U, R]

Deduplication Features


Field Distance Metrics

•Value metrics

•Character-based: string-based metrics seen for schema matching

•Token-based: similar to Information Retrieval techniques (Topic 2 next week)

•Phonetic

•Numeric: few techniques other than treating the values as strings or taking their direct difference

Phonex

1.First letter as prefix

2.Encode non-prefix consonants

3.Remove duplicate adjacent codes not separated by a vowel

4.Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary

consonant                 code
b, f, p, v                1
c, g, j, k, q, s, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6
h, w                      dropped

Rupert: 1. Rupert  2. Ru1e63  3. Ru1e63  4. R163

Robert: 1. Robert  2. Ro1e63  3. Ro1e63  4. R163

Ashcraftson: 1. Ashcraftson  2. A2 26a132o5  3. A26a132o5  4. A261
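The four steps and the code table can be sketched as follows. These rules coincide with what is commonly known as American Soundex. Two details are assumptions borrowed from that convention rather than stated on the slide: the prefix letter's own code also suppresses an identical code right after it, and "y" is treated like a vowel. The function name is hypothetical.

```python
# Consonant -> code table from the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def phonetic_code(name: str) -> str:
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    prefix = letters[0].upper()            # step 1: first letter as prefix
    codes = []
    # Assumption: the prefix letter's code also counts for duplicate removal.
    previous = CODES.get(letters[0])
    for ch in letters[1:]:
        if ch in "hw":
            continue                       # dropped: does not separate duplicates
        code = CODES.get(ch)               # step 2: None for vowels (and y)
        if code is not None and code != previous:
            codes.append(code)             # step 3: skip adjacent duplicate codes
        previous = code                    # a vowel resets the duplicate check
    # Step 4: vowels were never stored; truncate to prefix + 3 codes, pad with zeros.
    return (prefix + "".join(codes))[:4].ljust(4, "0")


print(phonetic_code("Rupert"))       # R163
print(phonetic_code("Robert"))       # R163
print(phonetic_code("Ashcraftson"))  # A261
print(phonetic_code("Muhammad") == phonetic_code("Mohamed"))  # True
```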

Other Phonetic Codes

•NYSIIS

•Developed and still in use at the New York State Division of Criminal Justice Services

•Encodes vowels (mostly to A)

•Codes are letters instead of digits

•Longer codes (6 instead of 4)


Other Phonetic Codes

•Metaphone

•Codes are letters instead of digits

•No maximum code length

•More elaborate coding rules

•Double Metaphone

•Returns a secondary code to help disambiguate

Detecting Duplicates

Bayes Decision Rule

•M: match, U: unmatch

•Using Bayes rule

•Decision rule: likelihood ratio

•Using independence assumption
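One common way to write these steps out is given below as an assumption, not as the slide's own notation. The symbol $x = (x_1, \dots, x_n)$ denotes the comparison vector of a record pair, one component per matched field.

```latex
% Minimum-error decision: choose the class with the larger posterior.
\text{decide } M \text{ if } P(M \mid x) \ge P(U \mid x), \text{ otherwise } U

% Bayes rule applied to both posteriors.
P(M \mid x) = \frac{p(x \mid M)\, P(M)}{p(x)}, \qquad
P(U \mid x) = \frac{p(x \mid U)\, P(U)}{p(x)}

% Equivalent decision rule on the likelihood ratio.
\ell(x) = \frac{p(x \mid M)}{p(x \mid U)} \ge \frac{P(U)}{P(M)}
\;\Longrightarrow\; \text{decide } M

% Independence assumption across the n compared fields.
p(x \mid M) = \prod_{i=1}^{n} p(x_i \mid M), \qquad
p(x \mid U) = \prod_{i=1}^{n} p(x_i \mid U)
```

Under the independence assumption the likelihood ratio becomes a product of per-field ratios, so each field contributes its own evidence for or against a match.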

Bayes Decision Rule

•Priors can be learned on a training set

•Other methods based on Expectation-Maximisation (EM) algorithm can estimate priors without training set
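A minimal sketch of the training-set route, assuming binary agreement vectors (1 if a field agrees, 0 otherwise) and the independence assumption. The labeled pairs are invented toy data, the function names are hypothetical, and the EM-based alternative is not shown.

```python
from math import prod


def learn(pairs):
    """pairs: list of (agreement_vector, label) with label 'M' or 'U'."""
    n_fields = len(pairs[0][0])
    model = {}
    for label in ("M", "U"):
        vectors = [x for x, y in pairs if y == label]
        # Laplace smoothing (+1 / +2) keeps probabilities away from 0 and 1.
        agree = [(sum(v[i] for v in vectors) + 1) / (len(vectors) + 2)
                 for i in range(n_fields)]
        model[label] = {"prior": len(vectors) / len(pairs), "agree": agree}
    return model


def likelihood(x, agree):
    """p(x | class) under independence of the fields."""
    return prod(p if xi else 1 - p for xi, p in zip(x, agree))


def decide(x, model):
    """Likelihood-ratio decision; assumes both classes occur in the training set."""
    ratio = likelihood(x, model["M"]["agree"]) / likelihood(x, model["U"]["agree"])
    threshold = model["U"]["prior"] / model["M"]["prior"]
    return "M" if ratio >= threshold else "U"


# Toy training set: (name agrees, city agrees) -> label. Invented for illustration.
training = [
    ((1, 1), "M"), ((1, 0), "M"), ((1, 1), "M"),
    ((0, 0), "U"), ((0, 1), "U"), ((0, 0), "U"), ((1, 0), "U"), ((0, 0), "U"),
]
model = learn(training)
print(decide((1, 1), model))  # M
print(decide((0, 0), model))  # U
```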


Clustering-Based Decision

•Using clustering techniques with appropriate parameters

• X-Means

• Variant of K-Means without a fixed K

• Chaudhuri et al. observed that duplicates tend

1.to have small distances from each other (compact set property), and

2.to have only a small number of other neighbors within a small distance (sparse neighborhood property).

Selected paper 2: Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.

Dealing with O(n²)

[Figure: number of comparisons (y-axis) against number of entities in repository (x-axis)]

Canopies


•Create canopies using a cheap similarity metric

•Overlapping clusters

•Compare entities pairwise using a more expensive similarity metric
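A minimal sketch of canopy-style blocking. Token overlap stands in for the cheap metric and `difflib.SequenceMatcher` for the expensive one; the thresholds `loose` and `tight`, the function names and the sample records are illustrative assumptions.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def cheap_similarity(a, b):
    """Jaccard overlap of word tokens: fast, but crude."""
    ta = set(re.findall(r"\w+", a.lower()))
    tb = set(re.findall(r"\w+", b.lower()))
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def expensive_similarity(a, b):
    """Stand-in for a costlier metric such as an edit distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_canopies(records, loose=0.2, tight=0.6):
    """Overlapping clusters of record indexes built with the cheap metric."""
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]
        scores = [cheap_similarity(records[center], r) for r in records]
        # Everything loosely similar to the center joins its canopy.
        canopies.append([i for i, s in enumerate(scores) if s >= loose])
        # Only tightly similar records stop being candidate centers.
        remaining = [i for i in remaining if scores[i] < tight]
    return canopies


def candidate_pairs(records, loose=0.2, tight=0.6):
    """Pairs sharing at least one canopy; only these get the expensive comparison."""
    pairs = set()
    for canopy in build_canopies(records, loose, tight):
        pairs.update(combinations(canopy, 2))
    return pairs


records = ["Muhammad Ali", "Mohamed Ali", "Albert Einstein",
           "A. Einstein", "Einstein, Albert"]
pairs = candidate_pairs(records)
print(len(pairs), "of", len(records) * (len(records) - 1) // 2, "pairs compared")
for i, j in sorted(pairs):
    print(records[i], "|", records[j], round(expensive_similarity(records[i], records[j]), 2))
```

On these five records only 4 of the 10 possible pairs reach the expensive comparison, because records in different canopies are never compared.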

Pay-as-you-go Information Integration

Dataspaces

Dataspaces

•Not a data integration approach per se

•Data co-existence approach

•Pay-as-you-go data integration

•Leveraging human contributions for data integration in a non-invasive manner

Selected paper 3: Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.

Relationship between Schema Matching and Deduplication

•Are they duplicates?

•To compare field values we need schema matches

•To find schema matches we need duplicates

•etc...

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

• first name: Mohamed
• last name: Ali
• age: 68
• address: street: Nicestreet 17 city: Wondercity
• tax id: #7234561

Selected paper 4: Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.

Selected Topic Papers

1.Schema Matching

• Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.

2.Deduplication

• Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.

3.Dataspaces

• Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.

4.Interdependence between schema matching and deduplication

• Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.