Web Science: Introduction to Information Integration
Julien Gaugaz, October 26, 2010
Topics
• 1. Information Integration
• 2. Web Information Retrieval
• 3. Entity Search
• 4. Web Usage
• 5. Collaborative Web
• 6. Web Archiving
• 7. Medical Social Web
Scenarios
Why Integrate Information?
Company Mergers
Travelling Agent
Agent
Booking Flights
Agent
Leveraging Wikipedia Infoboxes
Query
Data Contribution
Evolution
[Figure: number of sources (logarithmic axis, 1E+00 to 1E+06) by year (1960 to 2010), with three phases marked: Beginning of Databases; Rise of Internet & Wrapping Websites; Wikipedia & Social Web]
Kinds of discrepancies
What is the Problem?
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
http://de.wikipedia.org/wiki/Berlin
| [[(...)|Reg. Bürgermeister]]: || [[Klaus Wowereit]]
| [[Höhe]] : || 34–115 m ü. NN
| [[Einwohner]] : || {{Metadaten Einwohnerzahl DE-BE|Berlin}}[...] (rendered as: 3.443.735 (31. Mai 2010))
Wikipedia Infoboxes
http://en.wikipedia.org/wiki/San_Francisco
| leader_title = [[Mayor of San Francisco|Mayor]]
| leader_name = [[Gavin Newsom]] ([[Democratic [...]|D]])
| elevation_ft = 52
| elevation_max_ft = 925
| elevation_min_ft = 0
| population_as_of = 2008
| population_total = 815358
| population_metro = 4203898
| population_urban = 3228605
http://de.wikipedia.org/wiki/Berlin
| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
Causes of Discrepancies
• Information sources are diverse
  • Different cultural background
  • Different domain of activity
  • Different model of information
• Typos and other kinds of errors
• Evolution over time
  • Use, usage and users of one source may change over time
Places of Discrepancies
Information levels where discrepancies appear:
• Semantic: meaning, sense
• Representational
  • Lexical: word / term representing the meaning
  • Structural: how the terms are arranged to represent the meaning
• Syntactic: how the lexical and structural levels are encoded into characters (and bits)
Discrepancies may concern:
• Schema elements (properties and structure) and values
Schema Discrepancies
• Semantic: Einstein’s full name is “Albert Einstein”
• Representational:
  • Einstein → name → (first: “Albert”, last: “Einstein”)
  • Einstein → full_name → “Albert Einstein”
• Syntactic:
  • <Einstein> <full_name> “Albert Einstein”.
  • <Einstein><full_name>Albert Einstein</full_name></Einstein>
Schema Ambiguity
The same representational element can stand for different semantics:
• xyz → title → “Prof. Dr. techn.” (semantic: Person title)
• xyz → title → “The Theory of Relativity” (semantic: Article title)
Value Discrepancies
• Semantic: Einstein’s full name is “Albert Einstein”
• Representational: Einstein → full_name → any of
  • “Albert Einstein”
  • “Albert Einstin”
  • “A. Einstein”
  • “Einstein, Albert”
Where discrepancies are addressed with standards
Syntactic Level
Encoding Bytes
• Basic unit
  • Universal standard: Bit (binary digit)
  • Ternary digit (base 3, USSR, 1950s, out of use)
• Bits into bytes
  • Big or little endian
  • System-wide convention, easily convertible, defined in communication protocols
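A minimal sketch of the byte-order point, using Python's standard `struct` module. The value 0x0A0B0C0D is an arbitrary example chosen for illustration, not taken from the slides.

```python
import struct

value = 0x0A0B0C0D  # arbitrary 32-bit example value

big = struct.pack(">I", value)     # most significant byte first
little = struct.pack("<I", value)  # least significant byte first

print(big.hex())     # 0a0b0c0d
print(little.hex())  # 0d0c0b0a

# Converting between the two conventions is a plain byte reversal.
assert big == little[::-1]

# Reading bytes with the wrong convention silently yields another number.
misread = struct.unpack("<I", big)[0]
print(hex(misread))  # 0xd0c0b0a
```

This is why a communication protocol has to state the byte order explicitly: the bytes themselves carry no hint of which convention produced them.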
Encoding Characters
• De facto standards:
  • UTF-8/16
• Many others exist: ASCII, the ISO-8859 family, KOI-8, ...
• Trivial dictionary-based translation
  • When the corresponding code exists in the target character map...
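The translation between character encodings can be sketched with Python's built-in codecs. The helper name `transcode` and the sample strings are assumptions made for this illustration.

```python
def transcode(data, source, target):
    """Re-encode bytes from one character encoding into another."""
    return data.decode(source).encode(target)

utf8_bytes = "Höhe".encode("utf-8")  # b'H\xc3\xb6he'
latin1_bytes = transcode(utf8_bytes, "utf-8", "iso-8859-1")
print(latin1_bytes)  # b'H\xf6he'

# The translation fails when the target character map lacks the character:
# ISO-8859-1 has no Cyrillic letters, KOI8-R does.
cyrillic = "Привет".encode("utf-8")
try:
    transcode(cyrillic, "utf-8", "iso-8859-1")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False

koi8_bytes = transcode(cyrillic, "utf-8", "koi8_r")
print(koi8_bytes.decode("koi8_r"))  # Привет
```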
Encoding Lexico-Structural
• XML, XML Schema
  • Structured document serialization format
  • Base for:
    • (X)HTML
    • SVG: Scalable Vector Graphics
    • DOCX: Microsoft Office Word 2007
Resource Description Framework
Encoding information
RDF
• <subject> <property> <object>
  • <subject>
    • URI or blank node
  • <property>
    • URI
  • <object>
    • URI or blank node or (typed) literal
source: http://www.xml.com/2003/02/05/graphics/graph1.gif
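The constraints on the three positions of a triple can be sketched as a small data model. The class names, the `make_triple` helper and the `http://example.org/` URIs are hypothetical and only serve this illustration; they are not the API of any RDF library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str = ""  # optional datatype URI; empty means untyped in this sketch

def make_triple(subject, prop, obj):
    """Return (subject, property, object) after checking the allowed term kinds."""
    if not isinstance(subject, (URI, BlankNode)):
        raise TypeError("subject must be a URI or a blank node")
    if not isinstance(prop, URI):
        raise TypeError("property must be a URI")
    if not isinstance(obj, (URI, BlankNode, Literal)):
        raise TypeError("object must be a URI, a blank node or a literal")
    return (subject, prop, obj)

einstein = URI("http://example.org/Einstein")
full_name = URI("http://example.org/full_name")
triple = make_triple(einstein, full_name, Literal("Albert Einstein"))
```

A literal is accepted only in the object position, so `make_triple(Literal("Albert"), full_name, einstein)` raises a `TypeError`.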
URI
• URI: Uniform Resource Identifier
  • URLs are URIs
  • scheme:scheme-specific-part
  • RDF encourages using URLs
• URL
  • scheme://usr:passwd@domain:port/path?query_string#anchor
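The URL components listed above can be inspected with `urlsplit` from Python's standard `urllib.parse` module. The URL and the URN below are made-up examples.

```python
from urllib.parse import urlsplit

url = "https://usr:passwd@example.org:8080/path?key=value#anchor"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.username)  # usr
print(parts.password)  # passwd
print(parts.hostname)  # example.org
print(parts.port)      # 8080
print(parts.path)      # /path
print(parts.query)     # key=value
print(parts.fragment)  # anchor

# A URI that is not a URL still follows scheme:scheme-specific-part.
isbn = urlsplit("urn:isbn:0451450523")
print(isbn.scheme, isbn.path)  # urn isbn:0451450523
```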
RDF
• Resource Description Framework
• Data model specialized in conceptual information modeling
• Supported by various serialization formats:
  • RDF/XML
  • Notation3 (N3)
  • Turtle
  • ...
RDF Schema (RDF/S)
• Expressed in RDF
• Types subjects and objects with classes
  • Class hierarchy (with multiple inheritance)
  • Type of properties of a class
• Types properties
  • Domain: type of property’s subject
  • Range: type of property’s object
• OWL2 is more expressive: cardinality, etc.
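How domain, range and the class hierarchy type subjects and objects can be sketched with plain tuples. This is a naive fixpoint over three RDFS-style inference rules, not a complete RDFS reasoner. The `ex:` names are hypothetical, and the prefixed strings stand in for full URIs.

```python
RDF_TYPE = "rdf:type"
DOMAIN = "rdfs:domain"
RANGE = "rdfs:range"
SUBCLASS = "rdfs:subClassOf"

def infer_types(triples):
    """Add the rdf:type triples implied by domain, range and subclass statements."""
    facts = set(triples)
    changed = True
    while changed:
        new = set()
        for s, p, o in facts:
            for s2, p2, o2 in facts:
                if p2 == DOMAIN and s2 == p:
                    new.add((s, RDF_TYPE, o2))  # subject gets the domain class
                elif p2 == RANGE and s2 == p:
                    new.add((o, RDF_TYPE, o2))  # object gets the range class
                elif p == RDF_TYPE and p2 == SUBCLASS and s2 == o:
                    new.add((s, RDF_TYPE, o2))  # instance of the superclass too
        changed = not new <= facts
        facts |= new
    return facts

schema = {
    ("ex:trainedBy", DOMAIN, "ex:Boxer"),
    ("ex:trainedBy", RANGE, "ex:Trainer"),
    ("ex:Boxer", SUBCLASS, "ex:Person"),
    ("ex:Trainer", SUBCLASS, "ex:Person"),
}
data = {("ex:ali", "ex:trainedBy", "ex:coach")}
facts = infer_types(schema | data)
```

With this input, `ex:ali` is inferred to be an `ex:Boxer` and an `ex:Person`, and `ex:coach` an `ex:Trainer` and an `ex:Person`, although the data never states any type directly.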
When to use RDF?
• RDF is good at
  • Modeling information
  • Especially when the schema is unknown or changing
  • When there are multiple schemas
• RDF is not for
  • Representing documents (XHTML, CSS)
  • Internal data management when the schema is known and fixed (Relational Databases)
Discrepancies between the representational and semantic levels in the schema
Schema Matching
• Boxer
  • name
  • boxer id
  • weight
  • birthdate
  • total fights
  • residence
• Taxpayer
  • first name
  • last name
  • age
  • address
    • street
    • city
  • tax id
• Company: ...
• Trainer: ...
• Tax Office: ...
• Input: Schemas to match
  • Possibly data instantiating those schemas
• Output: Mappings between schema elements
  • Possibly with confidence values and alternatives
  • Possibly with value conversion rules (matchings)
Mappings or Matching?
• Schema mapping identifies correspondences between schema elements
• Schema matching actually transforms an instance of one schema into an instance of another schema
General architectures
How to Use Mappings?
Mediated Schemas
[Figure: three diagrams. Two show a Query posed on a Mediated Schema placed over Schema1, Schema2 and Schema3; the third shows a Query posed on Schema x.]
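One way to picture how mappings are used with a mediated schema is the sketch below: a query on a mediated attribute is rewritten into the local attribute name of every source that maps it. The source names, the mediated attribute names and the `query_mediated` helper are assumptions for illustration; the record values reuse the boxer and taxpayer examples of this lecture.

```python
# Hypothetical mappings: mediated attribute -> attribute name in each source.
MAPPINGS = {
    "sport": {"residence": "residence"},
    "tax": {"residence": "city", "tax_id": "tax id"},
}

# Toy source data, one list of records per source.
SOURCES = {
    "sport": [{"name": "Muhammad Ali", "residence": "17, Nicestreet Louisville, KY"}],
    "tax": [{"last name": "Ali", "city": "Wondercity", "tax id": "#7234561"}],
}

def query_mediated(attribute):
    """Answer a query on one mediated attribute by rewriting it per source."""
    answers = []
    for source, rows in SOURCES.items():
        local_name = MAPPINGS[source].get(attribute)
        if local_name is None:
            continue  # this source has nothing mapped to the attribute
        answers.extend((source, row[local_name]) for row in rows)
    return answers

print(query_mediated("tax_id"))  # [('tax', '#7234561')]
```

The caller only sees the mediated attribute names; which sources contribute, and under which local names, is decided entirely by the mappings.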
Peer Data Management
[Figure: peers connected by Peer Mappings; labelled elements: Local Source, Local Schema, Local Mapping, Peer Schema, Peer Mapping]
Why not by hand?
• Size and complexity of source schemas
• Number of source schemas
• Leveraging data instance values
• Schemas not known in advance
source: http://www.geneontology.org/images/diag-godb-er.jpg
source: http://www.atutor.ca/development/documentation/database.gif
Schema Matching Features
• Schema-only vs schema & instances
• Representational
  • Lexical vs structural
• Internal vs external
More in:
• Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334-350.
• Shvaiko P, Euzenat J. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV. 2005;3730:146-171.
Schema Matching Techniques
• String-based
• Language-based
• Linguistic resources
• Constraint-based
• Alignment reuse
• Upper-level formal ontologies
• Graph-based
• Taxonomy-based
• Repository of structures
• Model-based
Leveraging lexical features
A String-Based Technique
Edit Distance
• String distance: measures the distance between two strings
• Edit distance: number of operations needed to transform one string into the other
• Common basic operations:
  • Insert, delete or substitute one character
  • Possibly with different weights depending on the operation and characters involved
• Java libraries:
  • SecondString, SimMetrics
Levenshtein Distance
• Edit operations: insert, delete, substitute
• Each has a weight of 1
• Distance matrix for transforming “Sundays” (rows) into “Saturday” (columns):

        S  a  t  u  r  d  a  y
     0  1  2  3  4  5  6  7  8
  S  1  0  1  2  3  4  5  6  7
  u  2  1  1  2  2  3  4  5  6
  n  3  2  2  2  3  3  4  5  6
  d  4  3  3  3  3  4  3  4  5
  a  5  4  4  4  4  4  4  3  4
  y  6  5  5  5  5  5  5  4  3
  s  7  6  6  6  6  6  6  5  4

• Horizontal step: insert to Sundays; vertical step: delete from Sundays; diagonal step: substitute in Sundays
• One optimal edit path: insert “a” and “t” (Satundays), substitute “n” by “r” (Saturdays), delete the final “s” (Saturday); 4 operations in total
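The matrix for “Sundays” and “Saturday” can be reproduced with the usual dynamic-programming recurrence. The sketch below keeps only two rows of the matrix at a time; the function name is an illustrative choice.

```python
def levenshtein(source, target):
    """Number of unit-cost inserts, deletes and substitutions turning source into target."""
    previous = list(range(len(target) + 1))  # row for the empty prefix of source
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting the first i characters of source
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("Sundays", "Saturday"))  # 4
```

Giving the operations different weights, as mentioned for general edit distances, would only change the three cost terms inside `min`.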
WordNet
A Linguistic Resource
WordNet
• Fundamental components: Synonym Sets (Synsets)
  • {car, auto, automobile, machine, motorcar}
    • a motor vehicle with four wheels; usually propelled by an internal combustion engine
  • {car, railcar, railway car, railroad car}
    • a wheeled vehicle adapted to the rails of railroad
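A toy sketch of how synsets can support matching: two labels are treated as synonyms when they share at least one synset. Only the two synsets shown above are hard-coded here; a real matcher would query WordNet itself, and the helper names are hypothetical.

```python
# The two synsets from the slide, one per sense of "car".
SYNSETS = [
    {"car", "auto", "automobile", "machine", "motorcar"},
    {"car", "railcar", "railway car", "railroad car"},
]

def synsets_of(word):
    """Indexes of all synsets containing the word (one per sense)."""
    return {i for i, synset in enumerate(SYNSETS) if word in synset}

def share_sense(word_a, word_b):
    """True if the two words are synonyms in at least one sense."""
    return bool(synsets_of(word_a) & synsets_of(word_b))

print(share_sense("auto", "motorcar"))       # True
print(share_sense("automobile", "railcar"))  # False
print(synsets_of("car"))                     # {0, 1}
```

The word “car” belongs to both synsets, which is exactly the kind of ambiguity a matcher has to resolve from context.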
Hypernyms / Hyponyms
• Hypernyms: superordinates, isA relationships. A synset may have more than one hypernym.
• Hyponyms: subordinates
• Example for {car, auto, automobile, machine, motorcar}:
  • hypernym: {motor vehicle, automotive vehicle}
  • hyponyms: {cab, hack, taxi, taxicab}, {ambulance}
Holonym / Meronym
• Meronym: name of a constituent part of, the substance of, or a member of something. X is a meronym of Y if X is a part of Y.
• Holonym: name of the whole of which the meronym names a part. Y is a holonym of X if X is a part of Y.
• Example:
  • holonym: {car, auto, automobile, machine, motorcar}
  • meronym: {accelerator, accelerator pedal, gas pedal, gas, throttle, gun}
Other relationships in WN
• Antonym
• Entailment (for verbs)
  • A verb X entails Y if X cannot be done unless Y is, or has been, done.
• Attribute (for adjectives)
  • A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.
Leveraging structure
A Graph-Matching Technique
Similarity Flooding
• Uses the structure of the data to help match schemas
• Similarity Flooding in Melnik et al. (2002)
  • First maps schema elements with lexical similarity
  • Then improves the matching by assuming that:
    • If two elements are similar, then the elements adjacent to them are more likely to be similar
Selected paper 1: Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
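The propagation idea can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of Melnik et al.: edge labels and propagation coefficients are ignored, all neighbours contribute equally, and the number of iterations is fixed. The graphs, the initial similarity value and the function name are assumptions.

```python
from itertools import product

def flood(neighbours_a, neighbours_b, initial, iterations=10):
    """Propagate pair similarities to adjacent pairs, normalising after each round."""
    pairs = list(product(neighbours_a, neighbours_b))
    sigma = {pair: initial.get(pair, 0.0) for pair in pairs}
    for _ in range(iterations):
        updated = {}
        for a, b in pairs:
            # Similarity flowing in from all pairs of adjacent elements.
            incoming = sum(sigma[(na, nb)]
                           for na in neighbours_a[a]
                           for nb in neighbours_b[b])
            updated[(a, b)] = initial.get((a, b), 0.0) + sigma[(a, b)] + incoming
        top = max(updated.values()) or 1.0  # avoid dividing by zero
        sigma = {pair: value / top for pair, value in updated.items()}
    return sigma

# Undirected toy schema graphs: every element lists its adjacent elements.
boxer = {"Boxer": ["name", "residence"], "name": ["Boxer"], "residence": ["Boxer"]}
taxpayer = {"Taxpayer": ["last name", "address"],
            "last name": ["Taxpayer"], "address": ["Taxpayer"]}

# Assumed lexical similarity: only "name" and "last name" look alike.
result = flood(boxer, taxpayer, {("name", "last name"): 1.0})
```

In this toy run the lexical match between “name” and “last name” first raises the similarity of Boxer and Taxpayer, which in turn gives “residence” and “address” a non-zero score although their labels share nothing. Pairs that mix an entity with an attribute, such as (“name”, “Taxpayer”), stay at zero.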
Detecting duplicate entries
Deduplication
Why are there Duplicates?
Sport Authorities:
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Tax Authorities:
• first name: Mohamed
• last name: Ali
• age: 68
• address:
  • street: Nicestreet 17
  • city: Wondercity
• tax id: #7234561
Administration-wide database
• Input: 2 entities with matched attributes
• Output: M for matched or U for unmatched.
  • Possibly R for reject, between M and U, for cases where a supervised decision is necessary.
Entity 1:
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Entity 2:
• first name: Mohamed
• last name: Ali
• age: 68
• address:
  • street: Nicestreet 17
  • city: Wondercity
• tax id: #7234561
Entity 3:
• name: Muhammad Ali
• address:
  • city: Cairo
  • country: Egypt
• tax id: #8244361
[Figure: pairs of these entities labelled with the decisions M, U and R]
Deduplication Features
Field Distance Metrics
• Value metrics
  • Character-based: string-based metrics seen for schema matching
  • Token-based: similar to Information Retrieval techniques (Topic 2 next week)
  • Phonetic
  • Numeric: few techniques other than treating the values as strings or taking their direct difference
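As one possible token-based metric, the sketch below computes the Jaccard similarity of the word sets of two field values. Jaccard is chosen here only as a simple representative; the slides do not name a specific token-based metric, and the tokenisation rule is an assumption.

```python
import re

def tokens(value):
    """Lower-cased alphanumeric tokens of a field value."""
    return set(re.findall(r"\w+", value.lower()))

def jaccard(value_a, value_b):
    """Shared tokens divided by all tokens; 1.0 means identical token sets."""
    a, b = tokens(value_a), tokens(value_b)
    if not a and not b:
        return 1.0  # two empty values are considered identical
    return len(a & b) / len(a | b)

print(jaccard("17, Nicestreet Louisville, KY", "Nicestreet 17"))  # 0.5
```

Unlike a character-based edit distance, the score is unaffected by the order of the tokens, which helps with values such as “17, Nicestreet” versus “Nicestreet 17”.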
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Ashcraftson
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Ashcraftson1.Ashcraftson
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Ashcraftson1.Ashcraftson2.A2 26a132o5
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Ashcraftson1.Ashcraftson2.A2 26a132o53.A26a132o5
Phonex1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary
51
consonant code
b, f, p, v 1
c, g, j, k, q, s, x, z 2
d, t 3
l 4
m, n 5
r 6
h, w dropped
Ashcraftson1.Ashcraftson2.A2 26a132o53.A26a132o54.A261
Phonex
1. First letter as prefix
2. Encode non-prefix consonants
3. Remove duplicate adjacent codes not separated by a vowel
4. Drop vowels, then truncate to the prefix plus at most 3 codes, or pad with zeros if shorter

consonant                 code
b, f, p, v                1
c, g, j, k, q, s, x, z    2
d, t                      3
l                         4
m, n                      5
r                         6
h, w                      dropped

Examples (result after each step):
Rupert:      1. Rupert       2. Ru1e63        3. Ru1e63      4. R163
Robert:      1. Robert       2. Ro1e63        3. Ro1e63      4. R163
Ashcraftson: 1. Ashcraftson  2. A2 26a132o5   3. A26a132o5   4. A261
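The four steps can be sketched in a few lines of Python. This is a literal reading of the rules as listed, not a reference implementation; the rules and code table coincide with what is commonly described as the classic Soundex encoding. Two assumptions go beyond the slide: `y` is treated like a vowel, and the prefix letter itself is not encoded for duplicate removal. The function name `phonetic_code` is hypothetical.

```python
# Consonant-to-digit table from the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def phonetic_code(name: str, max_codes: int = 3) -> str:
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    prefix = letters[0].upper()          # step 1: first letter as prefix
    codes = []
    prev = None                          # last code seen since the last vowel
    for ch in letters[1:]:
        if ch in "hw":
            continue                     # dropped; does not separate duplicates
        code = CODES.get(ch)             # step 2: encode consonants
        if code is None:                 # vowel (assumption: includes y)
            prev = None                  # a vowel separates equal codes
            continue
        if code != prev:                 # step 3: skip adjacent duplicates
            codes.append(code)
        prev = code
    # step 4: vowels were never stored; truncate or zero-pad to max_codes
    return prefix + "".join(codes[:max_codes]).ljust(max_codes, "0")


if __name__ == "__main__":
    for name in ("Rupert", "Robert", "Ashcraftson"):
        print(name, phonetic_code(name))
```

Rupert and Robert receive the same code, which is the point of a phonetic metric: two fields can be compared by testing their codes for equality.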
Other Phonetic Codes
• NYSIIS
• Developed and still in use at the New York State Division of Criminal Justice Services
• Encodes vowels (mostly to A)
• Codes are letters instead of digits
• Longer codes (6 instead of 4)
Other Phonetic Codes
• Metaphone
• Codes are letters instead of digits
• No maximum code length
• More elaborate coding rules
• Double Metaphone
• Returns a secondary code to help disambiguate
Detecting Duplicates
Bayes Decision Rule
• M: match, U: non-match
\[ \text{decide } M \text{ if } p(M \mid \vec{x}) \ge p(U \mid \vec{x}), \quad U \text{ otherwise} \]
• Using Bayes rule
\[
\begin{aligned}
& p(M \mid \vec{x}) \ge p(U \mid \vec{x}) \\
\Leftrightarrow\; & \frac{p(M \wedge \vec{x})}{p(\vec{x})} \ge \frac{p(U \wedge \vec{x})}{p(\vec{x})} \\
\Leftrightarrow\; & p(M)\, p(\vec{x} \mid M) \ge p(U)\, p(\vec{x} \mid U) \\
\Leftrightarrow\; & l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}
\end{aligned}
\]
• Decision rule: likelihood ratio
\[ \text{decide } M \text{ if } l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}, \quad U \text{ otherwise} \]
• Using independence assumption
\[ p(\vec{x} \mid M) = \prod_i p(x_i \mid M), \qquad p(\vec{x} \mid U) = \prod_i p(x_i \mid U) \]
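A minimal sketch of the likelihood-ratio rule under the independence assumption. It assumes, for illustration only, that each component x_i is binary (1 if field i agrees between the two records, 0 otherwise) and that all probabilities lie strictly between 0 and 1; the function and parameter names are hypothetical.

```python
import math


def likelihood_ratio(x, p_agree_m, p_agree_u):
    """l(x) = prod_i p(x_i | M) / prod_i p(x_i | U) for a binary vector x.

    p_agree_m[i] is p(x_i = 1 | M), p_agree_u[i] is p(x_i = 1 | U).
    The products are accumulated as sums of logs for numerical stability.
    """
    log_l = 0.0
    for xi, pm, pu in zip(x, p_agree_m, p_agree_u):
        num = pm if xi else 1.0 - pm
        den = pu if xi else 1.0 - pu
        log_l += math.log(num) - math.log(den)
    return math.exp(log_l)


def decide(x, p_agree_m, p_agree_u, prior_m):
    """Return 'M' if l(x) >= p(U) / p(M), else 'U'."""
    threshold = (1.0 - prior_m) / prior_m
    return "M" if likelihood_ratio(x, p_agree_m, p_agree_u) >= threshold else "U"


if __name__ == "__main__":
    p_m = [0.9, 0.8]   # illustrative values, not taken from the slides
    p_u = [0.1, 0.2]
    for x in ([1, 1], [1, 0], [0, 0]):
        print(x, decide(x, p_m, p_u, prior_m=0.1))
```

Because matches are usually rare, p(U)/p(M) is large, so a pair needs agreement on several discriminating fields before it is declared a match.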
Bayes Decision Rule
• The probabilities $p(x_i \mid M)$ and $p(x_i \mid U)$ can be learned on a training set
• Other methods based on the Expectation-Maximisation (EM) algorithm can estimate them without a training set
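Learning from a training set can be sketched as simple counting. The sketch assumes binary comparison vectors labelled "M" or "U" with at least one example of each label; the add-one (Laplace) smoothing is an addition of this example, used to avoid zero probabilities, and is not taken from the slides. All names are hypothetical.

```python
def estimate_probabilities(pairs, smoothing=1.0):
    """Estimate p(x_i = 1 | M), p(x_i = 1 | U) and p(M) from labelled pairs.

    pairs: list of (x, label) with x a binary tuple and label 'M' or 'U'.
    """
    n_fields = len(pairs[0][0])
    counts = {"M": [0] * n_fields, "U": [0] * n_fields}
    totals = {"M": 0, "U": 0}
    for x, label in pairs:
        totals[label] += 1
        for i, xi in enumerate(x):
            if xi:
                counts[label][i] += 1

    def rates(label):
        # Smoothed relative frequency of agreement per field.
        return [(c + smoothing) / (totals[label] + 2 * smoothing)
                for c in counts[label]]

    prior_m = totals["M"] / len(pairs)
    return rates("M"), rates("U"), prior_m


if __name__ == "__main__":
    training = [((1, 1), "M"), ((1, 0), "M"),
                ((0, 0), "U"), ((0, 1), "U"), ((0, 0), "U"), ((0, 0), "U")]
    print(estimate_probabilities(training))
```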
Clustering-Based Decision
• Using clustering techniques with appropriate parameters
• X-Means
• Variant of K-Means without a fixed K
• Chaudhuri et al. observed that duplicates tend
1. to have small distances from each other (compact set property), and
2. to have only a small number of other neighbors within a small distance (sparse neighborhood property).
Selected paper 2: Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE'05. 2005:865-876.
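The two properties can be illustrated with a rough sketch. This is one possible reading of the properties as stated on the slide, not the algorithm from the paper: a group is called compact when every member is closer to all other members than to any outsider, and neighborhood growth counts the other records within a multiple (here 2, an assumed factor) of the nearest-neighbor distance. Function names are hypothetical.

```python
def is_compact_set(group, records, dist):
    """True if each member is strictly closer to all members than to outsiders."""
    outsiders = [r for r in records if r not in group]
    for v in group:
        inside = [dist(v, u) for u in group if u != v]
        if not inside:
            continue
        farthest_inside = max(inside)
        if any(dist(v, o) <= farthest_inside for o in outsiders):
            return False
    return True


def neighborhood_growth(v, records, dist, factor=2.0):
    """Number of other records within factor * (nearest-neighbor distance of v)."""
    others = [u for u in records if u != v]
    nn = min(dist(v, u) for u in others)
    return sum(1 for u in others if dist(v, u) <= factor * nn)


if __name__ == "__main__":
    records = [10, 11, 50, 52, 90]          # toy 1-D "records"
    dist = lambda a, b: abs(a - b)
    print(is_compact_set([10, 11], records, dist))
    print(neighborhood_growth(10, records, dist))
    print(neighborhood_growth(90, records, dist))
```

In this toy data, 10 and 11 form a compact group with a sparse neighborhood, which is the pattern the slide associates with duplicates.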
Dealing with O(n^2)
[Figure: number of comparisons (0 to 1E+12) plotted against number of entities in repository (0 to 1'000'000)]
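As a quick check of the scale, the number of unordered pairs among n entities is n(n-1)/2. Whether the plot shows this quantity or n^2 is not recoverable from the slide text, so the sketch below only illustrates the quadratic growth; the function name is hypothetical.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n entities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, pairwise_comparisons(n))
```

Multiplying the repository size by 10 multiplies the number of comparisons by roughly 100, which is why exhaustive pairwise comparison does not scale.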
Canopies
[Figure: entities shown as points]
• Create canopies using a cheap similarity metric
• Overlapping clusters
• Compare entities pairwise using a more expensive similarity metric
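A minimal sketch of the two-stage idea. The use of a loose and a tight threshold follows the common description of canopy clustering and is an assumption here, since the slide does not give parameters; centers are picked in list order rather than at random so that the result is reproducible. All names are hypothetical.

```python
from itertools import combinations


def build_canopies(items, cheap_dist, loose, tight):
    """Group items into possibly overlapping canopies using a cheap distance.

    Every item within `loose` of a center joins that center's canopy; items
    within `tight` of the center can no longer become centers themselves.
    """
    assert tight <= loose
    remaining = list(items)
    canopies = []
    while remaining:
        center = remaining[0]
        canopies.append([x for x in items if cheap_dist(center, x) <= loose])
        remaining = [x for x in remaining[1:] if cheap_dist(center, x) > tight]
    return canopies


def candidate_pairs(canopies):
    """Pairs sharing at least one canopy; only these get the expensive metric."""
    pairs = set()
    for canopy in canopies:
        for a, b in combinations(sorted(canopy), 2):
            pairs.add((a, b))
    return pairs


if __name__ == "__main__":
    items = [1, 2, 3, 4, 10, 11]            # toy 1-D entities
    canopies = build_canopies(items, lambda a, b: abs(a - b), loose=2, tight=1)
    print(canopies)
    print(len(candidate_pairs(canopies)), "of", len(items) * (len(items) - 1) // 2)
```

In the toy data only 7 of the 15 possible pairs share a canopy, so the expensive metric runs on fewer than half of them.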
Pay-as-you-go Information Integration
Dataspaces
Dataspaces
• Not a data integration approach per se
• Data co-existence approach
• Pay-as-you-go data integration
• Leveraging human contributions for data integration in a non-invasive manner
Selected paper 3: Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS '06. New York, NY, USA; 2006:1-9.
Relationship between Schema Matching and Deduplication
First record:
• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY
Second record:
• first name: Mohamed
• last name: Ali
• age: 68
• address:
    street: Nicestreet 17
    city: Wondercity
• tax id: #7234561

• Are they duplicates?
• To compare field values we need schema matches
• To find schema matches we need duplicates
• etc...
Selected paper 4: Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.
Selected Topic Papers
1. Schema Matching
• Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
2. Deduplication
• Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE'05. 2005:865-876.
3. Dataspaces
• Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS '06. New York, NY, USA; 2006:1-9.
4. Interdependence between schema matching and deduplication
• Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.