Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information...

Web Science Introduction to Information Integration Julien Gaugaz, October 26, 2010
Web Science

Introduction to Information Integration

Julien Gaugaz, October 26, 2010

Topics

• 1. Information Integration

• 2. Web Information Retrieval

• 3. Entity Search

• 4. Web Usage

• 5. Collaborative Web

• 6. Web Archiving

• 7. Medical Social Web

Scenarios

Why Integrate Information?

Company Mergers

Travelling Agent

Booking Flights

Leveraging Wikipedia Infoboxes

[Figure: diagram with the labels Query and Data Contribution]

Evolution

[Figure: number of sources (log scale, 1E+00 to 1E+06) over the years 1960 to 2010, annotated with: Beginning of Databases; Rise of Internet & Wrapping Websites; Wikipedia & Social Web]

Kinds of discrepancies

What is the Problem?

Wikipedia Infoboxes

http://en.wikipedia.org/wiki/Berlin

| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000

http://de.wikipedia.org/wiki/Berlin

| [[(...)|Reg. Bürgermeister]]: || [[Klaus Wowereit]]
| [[Höhe]]: || 34–115 m ü. NN
| [[Einwohner]]: || {{Metadaten Einwohnerzahl DE-BE|Berlin}}[...] (rendered as: 3.443.735 (31. Mai 2010))

Wikipedia Infoboxes

http://en.wikipedia.org/wiki/San_Francisco

| leader_title = [[Mayor of San Francisco|Mayor]]
| leader_name = [[Gavin Newsom]] ([[Democratic [...]|D]])
| elevation_ft = 52
| elevation_max_ft = 925
| elevation_min_ft = 0
| population_as_of = 2008
| population_total = 815358
| population_metro = 4203898
| population_urban = 3228605

http://de.wikipedia.org/wiki/Berlin

| leader_title = [[List of mayors of Berlin|Governing Mayor]]
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
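
The mismatch between the two infoboxes can be made concrete with a small sketch. It assumes each field sits on its own `| key = value` line; the helper `parse_infobox` and the abbreviated inputs are illustrative only and are not part of any Wikipedia tooling.

```python
def parse_infobox(wikitext):
    """Parse '| key = value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in wikitext.splitlines():
        line = line.strip()
        if line.startswith("|") and "=" in line:
            # Split on the first '=' only, so values may contain '=' or '|'
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

san_francisco = parse_infobox("""
| leader_name = [[Gavin Newsom]]
| elevation_ft = 52
| population_as_of = 2008
| population_total = 815358
| population_metro = 4203898
""")

berlin = parse_infobox("""
| leader = Klaus Wowereit
| elevation = 34 - 115
| pop_date = 2010-03-31
| population = 3440441
| pop_metro = 5000000
""")

# Compare the property names used by the two pages
shared = sorted(san_francisco.keys() & berlin.keys())
only_sf = sorted(san_francisco.keys() - berlin.keys())
only_berlin = sorted(berlin.keys() - san_francisco.keys())

print("shared:", shared)
print("only San Francisco:", only_sf)
print("only Berlin:", only_berlin)
```

In this abbreviated example the two pages share no property name at all, even though every field has an obvious counterpart on the other side.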

Causes of Discrepancies

• Information sources are diverse

  • Different cultural backgrounds

  • Different domains of activity

  • Different models of information

• Typos and other kinds of errors

• Evolution over time

  • The use, usage, and users of a source may change over time

Places of Discrepancies

Information levels at which discrepancies appear:

• Semantic: meaning, sense

• Representational

  • Lexical: the word or term representing the meaning

  • Structural: how the terms are arranged to represent the meaning

• Syntactic: how the lexical and structural representation is encoded into characters (and bits)

Discrepancies may concern:

• Schema elements (properties and structure) and values

Schema Discrepancies

Semantic: Einstein’s full name is “Albert Einstein”

Representational:

• Einstein, name: first = “Albert”, last = “Einstein”

• Einstein, full_name = “Albert Einstein”

Syntactic:

• <Einstein> <full_name> “Albert Einstein”.

• <Einstein> <full_name>Albert Einstein</full_name></Einstein>
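
The two representational variants can be written as plain data structures to show that they carry the same meaning but need a conversion step. This is only a sketch: the function names are made up, and splitting on the last space is a simplifying assumption that fails for many real names.

```python
# Structured representation: name -> first / last
structured = {"name": {"first": "Albert", "last": "Einstein"}}

# Flat representation: full_name
flat = {"full_name": "Albert Einstein"}

def to_flat(record):
    """Join first and last name into a single full_name value."""
    name = record["name"]
    return {"full_name": f'{name["first"]} {name["last"]}'}

def to_structured(record):
    """Split full_name at the last space (assumption: last token is the last name)."""
    first, _, last = record["full_name"].rpartition(" ")
    return {"name": {"first": first, "last": last}}

print(to_flat(structured))
print(to_structured(flat))
```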

Schema Ambiguity

• Representational: xyz, title = “Prof. Dr. techn.” (Semantic: person title)

• Representational: xyz, title = “The Theory of Relativity” (Semantic: article title)

Value Discrepancies

Semantic: Einstein’s full name is “Albert Einstein”

Representational: Einstein, full_name = “Albert Einstein”, “Albert Einstin”, “A. Einstein”, or “Einstein, Albert”
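
One common way to cope with such value variants is to normalize the strings and then compare them with a similarity measure. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the normalization rule and the function names are assumptions chosen for this example, not a method prescribed by the slides.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case the name and turn 'Last, First' into 'first last'."""
    name = name.strip().lower()
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return name

def similarity(a, b):
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

reference = "Albert Einstein"
variants = ["Albert Einstin", "A. Einstein", "Einstein, Albert"]
scores = {variant: similarity(reference, variant) for variant in variants}

for variant, score in scores.items():
    print(f"{variant!r}: {score:.2f}")
```

Reordering is removed entirely by the normalization step, the typo still scores high, and the abbreviation scores lowest, which suggests that abbreviations need a dedicated rule rather than plain string similarity.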

Where discrepancies are addressed with standards

Syntactic Level

Encoding Bytes

• Basic unit

  • Universal standard: bit (binary digit)

  • Ternary digit (base 3, USSR, 1950s, out of use)

• Bits into bytes

  • Big or little endian

  • System-wide convention, easily convertible, defined in communication protocols
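
A short sketch of the byte-order point, using only the Python standard library. The value `0x12345678` is an arbitrary example.

```python
import struct

value = 0x12345678

# The same integer laid out in the two byte orders
big = value.to_bytes(4, byteorder="big")        # most significant byte first
little = value.to_bytes(4, byteorder="little")  # least significant byte first

# Converting is trivial as long as both sides agree on the convention
assert int.from_bytes(big, "big") == value
assert int.from_bytes(little, "little") == value

# Communication protocols fix the order: '!' in struct means network (big-endian) order
network = struct.pack("!I", value)

print(big.hex(), little.hex(), network.hex())
```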

Encoding Characters

• De facto standards:

  • UTF-8/16

• Many others exist: ASCII, the ISO-8859 family, KOI-8, ...

• Trivial dictionary-based translation

  • Only when the corresponding code exists in the target character map
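
The translation between character encodings, and its failure case, can be sketched with Python's built-in codecs. The helper `can_encode` is a made-up name for this example.

```python
text = "Höhe"

utf8 = text.encode("utf-8")          # 'ö' takes two bytes
latin1 = text.encode("iso-8859-1")   # 'ö' takes one byte

# Transcoding: decode from the source encoding, then encode into the target
transcoded = latin1.decode("iso-8859-1").encode("utf-8")

def can_encode(s, encoding):
    """True if every character of s exists in the target character map."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(len(utf8), len(latin1))
print(can_encode("Höhe", "ascii"))       # 'ö' has no ASCII code
print(can_encode("Ж", "iso-8859-1"))     # Cyrillic is missing from ISO-8859-1
print(can_encode("Ж", "koi8-r"))         # but present in KOI8-R
```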

Encoding Lexico-Structural

• XML, XML Schema

  • Structured document serialization format

  • Base for:

    • (X)HTML

    • SVG: Scalable Vector Graphics

    • DOCX: Microsoft Office Word 2007
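
As a small illustration of XML as a serialization format for structured data, the sketch below parses the Einstein fragment shown earlier and serializes the alternative first/last structure, using `xml.etree.ElementTree` from the Python standard library. The element names follow the earlier example and are otherwise arbitrary.

```python
import xml.etree.ElementTree as ET

# Parse the flat representation
doc = "<Einstein><full_name>Albert Einstein</full_name></Einstein>"
root = ET.fromstring(doc)
full_name = root.find("full_name").text

# Build and serialize the structured representation
person = ET.Element("Einstein")
name = ET.SubElement(person, "name")
ET.SubElement(name, "first").text = "Albert"
ET.SubElement(name, "last").text = "Einstein"
serialized = ET.tostring(person, encoding="unicode")

print(full_name)
print(serialized)
```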

Resource Description Framework

Encoding information

RDF

• <subject> <property> <object>

• <subject>

  • URI or blank node

• <property>

  • URI

• <object>

  • URI, blank node, or (typed) literal

source: http://www.xml.com/2003/02/05/graphics/graph1.gif
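
The constraints on the three positions of a triple can be sketched as a tiny data model. This is not an RDF library; the class names, the `http://example.org/` namespace, and the property names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str = ""  # empty string means an untyped literal

@dataclass(frozen=True)
class Triple:
    subject: Union[URI, BlankNode]
    prop: URI
    obj: Union[URI, BlankNode, Literal]

    def __post_init__(self):
        # Enforce which kinds of terms are allowed in each position
        if not isinstance(self.subject, (URI, BlankNode)):
            raise TypeError("subject must be a URI or blank node")
        if not isinstance(self.prop, URI):
            raise TypeError("property must be a URI")
        if not isinstance(self.obj, (URI, BlankNode, Literal)):
            raise TypeError("object must be a URI, blank node, or literal")

EX = "http://example.org/"  # hypothetical namespace
einstein = URI(EX + "Einstein")

# A graph is simply a set of triples
graph = {
    Triple(einstein, URI(EX + "full_name"), Literal("Albert Einstein")),
    Triple(einstein, URI(EX + "birthplace"), URI(EX + "Ulm")),
}

# All objects of a given subject and property
names = [t.obj.value for t in graph if t.subject == einstein and t.prop == URI(EX + "full_name")]
print(names)
```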

URI

• URI: Uniform Resource Identifier

  • URLs are URIs

  • scheme:scheme-specific-part

  • RDF encourages using URLs

• URL

  • scheme://usr:passwd@domain:port/path?query_string#anchor
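
The parts of the URL pattern above can be pulled apart with `urllib.parse.urlsplit` from the Python standard library. The host `example.org` and port `8080` are placeholder values.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://usr:passwd@example.org:8080/path?query_string#anchor")

print(parts.scheme)    # scheme
print(parts.username)  # usr
print(parts.password)  # passwd
print(parts.hostname)  # domain
print(parts.port)      # port
print(parts.path)      # path
print(parts.query)     # query_string
print(parts.fragment)  # anchor

# A URI that is not a URL still follows scheme:scheme-specific-part
mail = urlsplit("mailto:someone@example.org")
print(mail.scheme, mail.path)
```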

RDF

• Resource Description Framework

• Data model specialized in conceptual information modeling

• Supported by various serialization formats:

  • XML (RDF/XML)

  • Notation3 (N3)

  • Turtle

  • ...
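
To show that the data model is independent of its serialization, the sketch below writes triples in the line-based N-Triples syntax, which is also valid Turtle. The tuple encoding of terms, the function names, and the `http://example.org/` URIs are assumptions of this example; only a minimal set of literal escapes is handled.

```python
def nt_term(term):
    """Serialize one term given as a (kind, value) pair."""
    kind, value = term
    if kind == "uri":
        return f"<{value}>"
    if kind == "blank":
        return f"_:{value}"
    if kind == "literal":
        # Escape backslashes first, then quotes and newlines
        escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        return f'"{escaped}"'
    raise ValueError(f"unknown term kind: {kind}")

def to_ntriples(triples):
    """One 'subject property object .' statement per line."""
    return "\n".join(
        f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ." for s, p, o in triples
    )

triples = [
    (("uri", "http://example.org/Einstein"),
     ("uri", "http://example.org/full_name"),
     ("literal", "Albert Einstein")),
]
print(to_ntriples(triples))
```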

RDF Schema (RDF/S)

• Expressed in RDF

• Types subjects and objects with classes

  • Class hierarchy (with multiple inheritance)

  • Type of properties of a class

• Types properties

  • Domain: type of the property’s subject

  • Range: type of the property’s object

• OWL 2 is more expressive: cardinality, etc.
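
How domain, range, and the class hierarchy type the subjects and objects of a triple can be sketched in a few lines. The classes and the `teaches` property are invented for this example, and plain strings stand in for URIs.

```python
# Hypothetical schema: class hierarchy plus domain and range of one property
SUBCLASS_OF = {
    "Professor": {"Person", "Employee"},  # multiple inheritance
    "Employee": {"Person"},
}
DOMAIN = {"teaches": "Professor"}  # type of the property's subject
RANGE = {"teaches": "Course"}      # type of the property's object

def superclasses(cls):
    """All classes reachable through subclass links, including cls itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen

def infer_types(triples):
    """Derive the classes of subjects and objects from domain and range."""
    types = {}
    for s, p, o in triples:
        if p in DOMAIN:
            types.setdefault(s, set()).update(superclasses(DOMAIN[p]))
        if p in RANGE:
            types.setdefault(o, set()).update(superclasses(RANGE[p]))
    return types

types = infer_types([("einstein", "teaches", "physics101")])
print(types)
```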

When to use RDF?

• RDF is good at

  • Modeling information

  • Especially when the schema is unknown or changing

  • When there are multiple schemas

• RDF is not for

  • Representing documents (XHTML, CSS)

  • Internal data management when the schema is known and fixed (relational databases)

Discrepancies between the representational and semantic levels in the schema

Schema Matching

Boxer: name, boxer id, weight, birthdate, total fights, residence

Taxpayer: first name, last name, age, address (street, city), tax id

[Figure: the two schemas side by side; further diagram labels: Company, Trainer, Tax Office]

• Input: Schemas to match

  • Possibly data instantiating those schemas

• Output: Mappings between schema elements

  • Possibly with confidence values and alternatives

  • Possibly with value conversion rules (matchings)
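
The input and output described above can be sketched with a naive, purely name-based matcher. It is not a real schema matching algorithm: it only compares element names with `difflib.SequenceMatcher`, the threshold of 0.5 is arbitrary, and the Taxpayer address is flattened into street and city for simplicity.

```python
from difflib import SequenceMatcher

BOXER = ["name", "boxer id", "weight", "birthdate", "total fights", "residence"]
TAXPAYER = ["first name", "last name", "age", "street", "city", "tax id"]

def match(source, target, threshold=0.5):
    """Return (source element, [(target element, confidence), ...]) pairs."""
    mappings = []
    for s in source:
        scored = sorted(
            ((SequenceMatcher(None, s, t).ratio(), t) for t in target),
            reverse=True,
        )
        # Keep every alternative above the threshold, best first
        candidates = [(t, round(score, 2)) for score, t in scored if score >= threshold]
        if candidates:
            mappings.append((s, candidates))
    return mappings

mappings = match(BOXER, TAXPAYER)
for element, candidates in mappings:
    print(element, "->", candidates)
```

Name similarity alone proposes plausible pairs such as boxer id and tax id, but it also offers age as an alternative for name and finds nothing for birthdate or residence, which is why instance data and other evidence are usually needed as well.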

Mappings or Matching?

• Schema mapping identifies correspondences between schema elements

• Schema matching actually transforms an instance of one schema into an instance of another schema
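
A minimal sketch of transforming an instance of one schema into an instance of another, using the Boxer and Taxpayer schemas from the earlier slide. The correspondences (name to first and last name, birthdate to age, residence to city) and the sample record are assumptions made for this example.

```python
from datetime import date

def age_on(birthdate, today):
    """Value conversion rule: birthdate -> age in full years."""
    years = today.year - birthdate.year
    if (today.month, today.day) < (birthdate.month, birthdate.day):
        years -= 1
    return years

def split_name(name):
    """Value conversion rule: name -> (first name, last name)."""
    first, _, last = name.rpartition(" ")
    return first, last

def boxer_to_taxpayer(boxer, today):
    """Turn a Boxer instance into a (partial) Taxpayer instance."""
    first, last = split_name(boxer["name"])
    return {
        "first name": first,
        "last name": last,
        "age": age_on(boxer["birthdate"], today),
        "address": {"city": boxer["residence"]},  # assumed correspondence
    }

# Hypothetical Boxer record
boxer = {"name": "Max Mustermann", "birthdate": date(1980, 5, 17), "residence": "Hannover"}
taxpayer = boxer_to_taxpayer(boxer, today=date(2010, 10, 26))
print(taxpayer)
```

Elements without a counterpart, such as tax id or street, stay unfilled, so the result is only a partial instance of the target schema.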

General architectures

How to Use Mappings?

Mediated Schemas

31

Mediated Schema

Query

Schema1

Schema2

Schema3

Query

Mediated Schema

Schema1

Schema2

Schema3

Query

Schema x

Peer Data Management

[Figure: peer data management architecture with the labels Local Source, Local Mapping, Local Schema, Peer Mapping and Peer Schema.]

Why not by hand?

• Size and complexity of source schemas

• Number of source schemas

• Leveraging data instance values

• Schemas not known in advance

source: http://www.geneontology.org/images/diag-godb-er.jpg

source: http://www.atutor.ca/development/documentation/database.gif

Schema Matching Features

• Schema-only vs schema & instances

• Representational

• Lexical vs structural

• Internal vs external

More in:

• Rahm E, Bernstein PA. A survey of approaches to automatic schema matching. The VLDB Journal. 2001;10(4):334-350.

• Shvaiko P, Euzenat J. A Survey of Schema-Based Matching Approaches. Journal on Data Semantics IV. 2005;3730:146-171.

Schema Matching Techniques

• String-based

• Language-based

• Linguistic resources

• Constraint-based

• Alignment reuse

• Upper-level formal ontologies

• Graph-based

• Taxonomy-based

• Repository of structures

• Model-based

Leveraging lexical features

A String-Based Technique

Edit Distance

• String distance: measures the distance between two strings

• Edit distance: minimum number of operations needed to transform one string into the other

• Common basic operations:

  • Insert, delete or substitute one character

  • Possibly with different weights depending on the operation and characters involved

• Java libraries:

  • SecondString, SimMetrics
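
The idea of operation- and character-dependent weights can be sketched in Python as follows. This is a minimal sketch, not the API of SecondString or SimMetrics: the function names `weighted_edit_distance`, `unit_cost` and `case_aware_substitution` are hypothetical, and the cost of 0.5 for a case-only substitution is an arbitrary illustrative choice.

```python
def weighted_edit_distance(s, t, ins_cost, del_cost, sub_cost):
    """Minimum total cost of the edits that turn s into t (dynamic programming)."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = d[i - 1][0] + del_cost(s[i - 1])  # delete every character of s
    for j in range(1, cols):
        d[0][j] = d[0][j - 1] + ins_cost(t[j - 1])  # insert every character of t
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + del_cost(s[i - 1]),
                d[i][j - 1] + ins_cost(t[j - 1]),
                d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]),
            )
    return d[-1][-1]


def unit_cost(_char):
    """Every insertion or deletion costs 1."""
    return 1.0


def case_aware_substitution(a, b):
    """Assumed weights: free if equal, 0.5 if only the case differs, 1 otherwise."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.5
    return 1.0


print(weighted_edit_distance("ZipCode", "zipcode", unit_cost, unit_cost,
                             case_aware_substitution))
```

With such weights, attribute names that differ only in capitalisation end up closer than names that differ in actual letters, which can be useful when comparing schema element names.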

Levenshtein Distance

• Edit operations: insert, delete, substitute

• Each has a weight of 1

Distance matrix for transforming Sundays (rows) into Saturday (columns):

         S  a  t  u  r  d  a  y
      0  1  2  3  4  5  6  7  8
   S  1  0  1  2  3  4  5  6  7
   u  2  1  1  2  2  3  4  5  6
   n  3  2  2  2  3  3  4  5  6
   d  4  3  3  3  3  4  3  4  5
   a  5  4  3  4  4  4  4  3  4
   y  6  5  4  4  5  5  5  4  3
   s  7  6  5  5  5  6  6  5  4

• A step to the right: insert to Sundays

• A step down: delete from Sundays

• A diagonal step: substitute in Sundays (free if the two characters are equal)

One optimal edit sequence: Sundays, Satundays (insert a and t), Saturdays (substitute n by r), Saturday (delete s), i.e. 4 operations.
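
The dynamic-programming computation behind such a matrix can be sketched in Python as follows. The function name `levenshtein_matrix` is a hypothetical choice for this sketch; it assumes unit cost for insert, delete and substitute, as stated on the slide.

```python
def levenshtein_matrix(source, target):
    """Return the full matrix d, where d[i][j] is the Levenshtein distance
    between source[:i] and target[:j] (insert, delete, substitute cost 1 each)."""
    d = [[0] * (len(target) + 1) for _ in range(len(source) + 1)]
    for i in range(len(source) + 1):
        d[i][0] = i  # delete the first i characters of source
    for j in range(len(target) + 1):
        d[0][j] = j  # insert the first j characters of target
    for i in range(1, len(source) + 1):
        for j in range(1, len(target) + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,                 # step down: delete
                d[i][j - 1] + 1,                 # step right: insert
                d[i - 1][j - 1] + substitution,  # diagonal: substitute or keep
            )
    return d


matrix = levenshtein_matrix("Sundays", "Saturday")
for row in matrix:
    print(" ".join(str(value) for value in row))
distance = matrix[-1][-1]
print("distance:", distance)
```

The bottom-right cell holds the distance between the two complete strings; every other cell is the distance between two prefixes.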

WordNet

A Linguistic Resource

WordNet

• Fundamental components: Synonym Sets (Synsets)

Hypernyms / Hyponyms

• Hypernyms: superordinates, isA relationships. A synset may have more than one hypernym.

• Hyponyms: subordinates

Example for the synset {car, auto, automobile, machine, motorcar}:

• hypernym: {motor vehicle, automotive vehicle}

• hyponyms: {cab, hack, taxi, taxicab}, {ambulance}
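
The hypernym and hyponym links of this example can be sketched with a toy in-memory structure in Python. This is not the WordNet API: the dictionary `hypernyms` and the helpers `hyponyms_of` and `is_a` are hypothetical names, and only the four synsets from the slide are included.

```python
# Synsets from the example, modelled as immutable sets of words.
car = frozenset({"car", "auto", "automobile", "machine", "motorcar"})
motor_vehicle = frozenset({"motor vehicle", "automotive vehicle"})
cab = frozenset({"cab", "hack", "taxi", "taxicab"})
ambulance = frozenset({"ambulance"})

# synset -> list of its hypernym synsets (a synset may have more than one).
hypernyms = {
    motor_vehicle: [],
    car: [motor_vehicle],
    cab: [car],
    ambulance: [car],
}


def hyponyms_of(synset):
    """Hyponyms are the inverse of the hypernym relation."""
    return [s for s, parents in hypernyms.items() if synset in parents]


def is_a(synset, ancestor):
    """True if ancestor can be reached by following hypernym links upwards."""
    stack = list(hypernyms.get(synset, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(hypernyms.get(current, []))
    return False


print(is_a(cab, motor_vehicle))
print(len(hyponyms_of(car)))
```

For matching, two element names that fall into the same synset, or whose synsets are linked by a short hypernym chain, are candidates for a correspondence.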

Other relationships in WN

• Antonym

• Entailment (for verbs)

  • A verb X entails Y if X cannot be done unless Y is, or has been, done.

• Attribute (for adjectives)

  • A noun for which adjectives express values. The noun weight is an attribute, for which the adjectives light and heavy express values.

Leveraging structure

A Graph-Matching Technique

Similarity Flooding

• Uses the structure of the data to help match schemas

• Similarity Flooding in Melnik et al. (2002)

  • First maps schema elements using lexical similarity

  • Then improves the matching, assuming that:

    • If two elements are similar, then the elements adjacent to them are more likely to be similar

Selected paper 1: Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.
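
The propagation idea can be sketched in Python as follows. This is a heavily simplified illustration, not the algorithm of Melnik et al.: it ignores edge labels and propagation coefficients, averages over neighbour pairs, and runs a fixed number of iterations. The function names, the update rule and the two toy schemas are assumptions made for this sketch; the lexical similarity uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def lexical_similarity(a, b):
    """Initial similarity of two element names, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flood(graph_a, graph_b, iterations=10):
    """graph_a, graph_b: dict mapping each node to the list of its neighbours.
    Returns a dict (node_a, node_b) -> similarity, normalised to a maximum of 1."""
    initial = {(a, b): lexical_similarity(a, b) for a in graph_a for b in graph_b}
    sim = dict(initial)
    for _ in range(iterations):
        updated = {}
        for (a, b) in sim:
            # Similar neighbours raise the similarity of the pair (a, b).
            neighbour_pairs = [(na, nb) for na in graph_a[a] for nb in graph_b[b]]
            spread = sum(sim[pair] for pair in neighbour_pairs)
            if neighbour_pairs:
                spread /= len(neighbour_pairs)
            updated[(a, b)] = initial[(a, b)] + sim[(a, b)] + spread
        top = max(updated.values()) or 1.0  # avoid division by zero
        sim = {pair: value / top for pair, value in updated.items()}
    return sim


# Two hypothetical schemas, each a root element with two attributes.
schema_a = {"Person": ["name", "zip"], "name": ["Person"], "zip": ["Person"]}
schema_b = {"Customer": ["fullname", "postcode"],
            "fullname": ["Customer"], "postcode": ["Customer"]}

scores = flood(schema_a, schema_b)
for a in schema_a:
    best = max(schema_b, key=lambda b: scores[(a, b)])
    print(a, "->", best, round(scores[(a, best)], 2))
```

In this sketch a pair of elements with little lexical overlap can still gain similarity when its neighbours are similar, which is the assumption stated on the slide.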

Detecting duplicate entries

Deduplication

Why are there Duplicates?

Sports Authorities:

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

Tax Authorities:

• first name: Mohamed
• last name: Ali
• age: 68
• address:
  street: Nicestreet 17
  city: Wondercity
• tax id: #7234561

Administration-wide database

• Input: 2 entities with matched attributes

• Output: M for matched or U for unmatched.

• Possibly R for reject, between M and U, for cases where a supervised decision is necessary.

Example records:

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

• first name: Mohamed
• last name: Ali
• age: 68
• address:
  street: Nicestreet 17
  city: Wondercity
• tax id: #7234561

• name: Muhammad Ali
• address:
  city: Cairo
  country: Egypt
• tax id: #8244361

[Figure: the pairs of records are labelled M, U and R.]
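
One simple way to produce the three outputs is to threshold a similarity score, sketched below in Python. The two thresholds, the averaging of attribute similarities and the function names `record_similarity` and `classify_pair` are assumptions made for illustration; the slides do not prescribe a particular decision rule.

```python
from difflib import SequenceMatcher


def record_similarity(rec_a, rec_b):
    """Average string similarity over the attributes both records share."""
    shared = [key for key in rec_a if key in rec_b]
    if not shared:
        return 0.0
    scores = [
        SequenceMatcher(None, str(rec_a[k]).lower(), str(rec_b[k]).lower()).ratio()
        for k in shared
    ]
    return sum(scores) / len(scores)


def classify_pair(similarity, lower=0.4, upper=0.8):
    """Return 'M' (matched), 'U' (unmatched) or 'R' (reject, needs review).
    The thresholds are hypothetical and would have to be tuned on real data."""
    if similarity >= upper:
        return "M"
    if similarity <= lower:
        return "U"
    return "R"


# Records with already matched attribute names (values taken from the example).
boxer = {"name": "Muhammad Ali", "city": "Louisville"}
taxpayer = {"name": "Mohamed Ali", "city": "Wondercity"}

score = record_similarity(boxer, taxpayer)
print(round(score, 2), classify_pair(score))
```

Pairs that fall between the two thresholds are the ones handed to a human for a supervised decision.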

Deduplication Features

Field Distance Metrics

• Value metrics

  • Character-based: string-based metrics seen for schema matching

  • Token-based: similar to Information Retrieval techniques (Topic 2 next week)

  • Phonetic

  • Numeric: not many techniques other than considering them as strings or taking the direct difference
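
A token-based metric and a numeric metric can be sketched in Python as follows. Jaccard similarity over word tokens is used here as one common token-based choice; it is an assumption of this sketch, not a metric named on the slide, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Lower-cased word tokens; punctuation is ignored."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_similarity(a, b):
    """Shared tokens divided by all tokens; word order does not matter."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def numeric_distance(x, y):
    """Direct difference between two numeric field values."""
    return abs(x - y)


print(jaccard_similarity("17, Nicestreet Louisville, KY", "Nicestreet 17"))
print(numeric_distance(68, 61))
```

Unlike a character-based edit distance, the token-based score is not affected by the different order of street name and house number in the two address values.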

Page 174: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.

Phonex1. First letter as prefix

2. Encode non-prefix consonants

3. Remove duplicate adjacent codes not separated by a vowel

4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary

51

consonant code

b, f, p, v 1

c, g, j, k, q, s, x, z 2

d, t 3

l 4

m, n 5

r 6

h, w dropped

Page 175: Web Science - Leibniz Universität Hannover · 2015-11-23 · Topics 2 • 1. Information Integration • 2. Web Information Retrieval • 3. Entity Search • 4. Web Usage • 5.

Phonex1. First letter as prefix

2. Encode non-prefix consonants

3. Remove duplicate adjacent codes not separated by a vowel

4. Drop vowels and truncate to prefix and max 3 codes, resp. pad with zero if necessary

51

consonant code

b, f, p, v 1

c, g, j, k, q, s, x, z 2

d, t 3

l 4

m, n 5

r 6

h, w dropped

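Worked examples (the numbers refer to the steps above; the blank in step 2 marks the dropped h):

Rupert: 1. Rupert → 2. Ru1e63 → 3. Ru1e63 → 4. R163

Robert: 1. Robert → 2. Ro1e63 → 3. Ro1e63 → 4. R163

Ashcraftson: 1. Ashcraftson → 2. A2 26a132o5 → 3. A26a132o5 → 4. A261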
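The four steps and the consonant table above closely resemble the classic Soundex scheme, and they can be sketched in a few lines of Python. The function name `phonetic_code` is made up for this illustration. As an assumption, the prefix letter is not compared against the code that follows it, which some Soundex implementations do, and any letter outside the table other than h and w is treated as a vowel.

```python
# Consonant-to-digit table as given on the slide.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit


def phonetic_code(name: str) -> str:
    """Apply the four steps from the slide to a single name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    prefix = letters[0].upper()      # step 1: first letter as prefix

    digits = []
    previous = None                  # code of the last consonant seen
    for ch in letters[1:]:
        if ch in "hw":
            continue                 # dropped; does not separate duplicates
        code = CODES.get(ch)         # step 2: encode consonants
        if code is None:
            previous = None          # a vowel separates equal codes
            continue
        if code != previous:         # step 3: collapse adjacent duplicates
            digits.append(code)
        previous = code

    # Step 4: vowels were never stored; keep at most 3 codes, pad with zeros.
    return prefix + "".join(digits)[:3].ljust(3, "0")


if __name__ == "__main__":
    for example in ("Rupert", "Robert", "Ashcraftson"):
        print(example, phonetic_code(example))
```

Rupert and Robert receive the same code, which is the behaviour that makes such codes useful as a cheap blocking key for names that sound alike.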

Other Phonetic Codes

• NYSIIS

• Developed and still in use at the New York State Division of Criminal Justice Services

• Encodes vowels (mostly to A)

• Codes are letters instead of digits

• Longer codes (6 instead of 4)

• Metaphone

• Codes are letters instead of digits

• No maximum code length

• More elaborate coding rules

• Double Metaphone

• Returns a secondary code to help disambiguate


Detecting Duplicates

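Bayes Decision Rule

• M: match, U: non-match

\[
\text{decide }
\begin{cases}
M & \text{if } p(M \mid \vec{x}) \ge p(U \mid \vec{x}) \\
U & \text{otherwise}
\end{cases}
\]

• Using Bayes rule

\begin{align*}
p(M \mid \vec{x}) \ge p(U \mid \vec{x})
&\Leftrightarrow \frac{p(M \wedge \vec{x})}{p(\vec{x})} \ge \frac{p(U \wedge \vec{x})}{p(\vec{x})} \\
&\Leftrightarrow p(M)\, p(\vec{x} \mid M) \ge p(U)\, p(\vec{x} \mid U) \\
&\Leftrightarrow l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}
\end{align*}

• Decision rule: likelihood ratio

\[
l(\vec{x}) = \frac{p(\vec{x} \mid M)}{p(\vec{x} \mid U)} \ge \frac{p(U)}{p(M)}
\]

• Using independence assumption

\[
p(\vec{x} \mid M) = \prod_i p(x_i \mid M), \qquad
p(\vec{x} \mid U) = \prod_i p(x_i \mid U)
\]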

• The probabilities \(p(x_i \mid M)\) and \(p(x_i \mid U)\) can be learned on a training set

• Other methods based on the Expectation-Maximisation (EM) algorithm can estimate them without a training set
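The likelihood-ratio rule under the independence assumption can be sketched as follows. This is only an illustration: it assumes that each component x_i is a binary agreement indicator (1 if field i agrees between the two records), and the probabilities, the field names, and the prior p(M) below are hypothetical numbers, not values from the lecture. Logarithms are used so that the product over i becomes a sum.

```python
import math


def log_likelihood_ratio(x, p_m, p_u):
    """log l(x) for a binary agreement vector x, assuming independent fields.

    p_m[i] = p(x_i = 1 | M) and p_u[i] = p(x_i = 1 | U).
    """
    total = 0.0
    for agrees, m, u in zip(x, p_m, p_u):
        if agrees:
            total += math.log(m / u)
        else:
            total += math.log((1.0 - m) / (1.0 - u))
    return total


def decide(x, p_m, p_u, prior_m):
    """Return "M" if l(x) >= p(U) / p(M), otherwise "U"."""
    threshold = math.log((1.0 - prior_m) / prior_m)
    return "M" if log_likelihood_ratio(x, p_m, p_u) >= threshold else "U"


# Hypothetical values for three fields (say name, birth year, city).
P_M = [0.95, 0.90, 0.80]   # agreement probability among true matches
P_U = [0.01, 0.05, 0.20]   # agreement probability among non-matches
PRIOR_M = 0.001            # assumed share of true matches among all pairs

if __name__ == "__main__":
    print(decide([1, 1, 1], P_M, P_U, PRIOR_M))
    print(decide([1, 0, 1], P_M, P_U, PRIOR_M))
```

Because matches are rare, the threshold p(U)/p(M) is large, so a pair has to agree on several discriminating fields before it is classified as M.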

Clustering-Based Decision

• Using clustering techniques with appropriate parameters

• X-Means

• Variant of K-Means without a fixed K

• Chaudhuri et al. observed that duplicates tend

1. to have small distances from each other (compact set property), and

2. to have only a small number of other neighbors within a small distance (sparse neighborhood property).

Selected paper 2: Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.
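The two observations can be made concrete with a rough sketch. It is not the algorithm of the paper: it merely measures, for one point, the distance to its nearest neighbor and the number of other points inside a sphere whose radius is a multiple of that distance. The function names, the 2-D points, and the radius factor of 2 are assumptions chosen for illustration.

```python
import math


def nearest_neighbor_distance(points, i):
    """Distance from points[i] to its closest other point."""
    return min(math.dist(points[i], q)
               for j, q in enumerate(points) if j != i)


def neighborhood_growth(points, i, factor=2.0):
    """Number of other points within factor * nearest-neighbor distance."""
    radius = factor * nearest_neighbor_distance(points, i)
    return sum(1 for j, q in enumerate(points)
               if j != i and math.dist(points[i], q) <= radius)


# Two near-identical records (indices 0 and 1) and a dense, evenly spaced group.
POINTS = [(0.0, 0.0), (0.1, 0.0),
          (10.0, 0.0), (10.5, 0.0), (11.0, 0.0), (11.5, 0.0), (12.0, 0.0)]

if __name__ == "__main__":
    for index in (0, 4):
        print(index,
              nearest_neighbor_distance(POINTS, index),
              neighborhood_growth(POINTS, index))
```

The duplicate pair is close together and has only one neighbor in its enlarged sphere, whereas a point in the dense group has several, which suggests it is simply in a crowded region rather than part of a duplicate group.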

Dealing with O(n²)

[Figure: number of comparisons (0 to 1E+12) plotted against the number of entities in the repository (0 to 1,000,000)]
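A short calculation shows where the quadratic growth comes from. The helper names are made up for this sketch. Comparing every entity with every entity gives n² combinations; if each unordered pair of different entities is compared only once, the count is n(n-1)/2, which is still quadratic.

```python
def ordered_pairs(n: int) -> int:
    """All (a, b) combinations of n entities: n^2."""
    return n * n


def distinct_pairs(n: int) -> int:
    """Unordered pairs of different entities: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        print(n, ordered_pairs(n), distinct_pairs(n))
```

For one million entities this is on the order of 10^12 comparisons, which motivates restricting the expensive comparisons to a small set of candidate pairs.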

Canopies

• Create canopies using a cheap similarity metric

• Overlapping clusters

• Compare entities pairwise using a more expensive similarity metric, but only within the same canopy
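A minimal sketch of the idea, assuming the common two-threshold formulation: every item within a loose threshold of a chosen center joins its canopy, and every item within a tight threshold is no longer eligible as a center. The function names, the thresholds, and the use of plain numbers with absolute difference as the cheap metric are assumptions made for illustration.

```python
from itertools import combinations


def build_canopies(items, cheap_distance, loose, tight):
    """Group item indices into possibly overlapping canopies (tight <= loose)."""
    remaining = list(range(len(items)))      # indices still eligible as centers
    canopies = []
    while remaining:
        center = remaining[0]
        canopy = [i for i in range(len(items))
                  if cheap_distance(items[center], items[i]) <= loose]
        canopies.append(canopy)
        # Drop the center and everything tightly bound to it.
        remaining = [i for i in remaining
                     if i != center
                     and cheap_distance(items[center], items[i]) > tight]
    return canopies


def candidate_pairs(canopies):
    """Pairs that share at least one canopy; only these get the costly check."""
    pairs = set()
    for canopy in canopies:
        for a, b in combinations(sorted(canopy), 2):
            pairs.add((a, b))
    return pairs


VALUES = [0, 8, 16, 50, 51]   # stand-ins for entities

if __name__ == "__main__":
    canopies = build_canopies(VALUES, lambda a, b: abs(a - b), loose=10, tight=5)
    print(canopies)
    print(sorted(candidate_pairs(canopies)))
```

In this toy input only 4 of the 10 possible pairs share a canopy, so the expensive metric would be evaluated 4 times instead of 10.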

Pay-as-you-go Information Integration

Dataspaces

Dataspaces

• Not a data integration approach per se

• Data co-existence approach

• Pay-as-you-go data integration

• Leveraging human contributions for data integration in a non-invasive manner

Selected paper 3: Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.

Relationship between Schema Matching and Deduplication

First record:

• name: Muhammad Ali
• boxer id: 1234567
• weight: 200 lb
• total fights: 61
• residence: 17, Nicestreet Louisville, KY

Second record:

• first name: Mohamed
• last name: Ali
• age: 68
• address:
    street: Nicestreet 17
    city: Wondercity
• tax id: #7234561

• Are they duplicates?

• To compare field values we need schema matches

• To find schema matches we need duplicates

• etc...

Selected paper 4: Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.
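The first half of this circular dependency can be illustrated with a small sketch: once schema matches are given, the corresponding field values of the two records can be compared. The mapping `SCHEMA_MATCHES` is a hypothetical, hand-written input, the nested address is flattened into one string for simplicity, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string similarity one would actually use.

```python
from difflib import SequenceMatcher

RECORD_A = {
    "name": "Muhammad Ali",
    "boxer id": "1234567",
    "weight": "200 lb",
    "total fights": "61",
    "residence": "17, Nicestreet Louisville, KY",
}
RECORD_B = {
    "first name": "Mohamed",
    "last name": "Ali",
    "age": "68",
    "address": "Nicestreet 17 Wondercity",   # street and city flattened
    "tax id": "#7234561",
}

# Hypothetical schema matches: field of record A -> fields of record B.
SCHEMA_MATCHES = {
    "name": ["first name", "last name"],
    "residence": ["address"],
}


def field_similarities(a, b, matches):
    """Similarity in [0, 1] for every matched field of record a."""
    scores = {}
    for field_a, fields_b in matches.items():
        value_b = " ".join(b[f] for f in fields_b)
        matcher = SequenceMatcher(None, a[field_a].lower(), value_b.lower())
        scores[field_a] = matcher.ratio()
    return scores


if __name__ == "__main__":
    for field, score in field_similarities(RECORD_A, RECORD_B, SCHEMA_MATCHES).items():
        print(f"{field}: {score:.2f}")
```

Without the mapping there is no way to know that "name" should be compared with "first name" plus "last name", and a common way to discover such a mapping is to look at records already known to be duplicates, which is the dependency the slide points out.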

Selected Topic Papers

1. Schema Matching

• Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching. IEEE Comput. Soc; 2002:117-128.

2. Deduplication

• Chaudhuri S, Ganti V, Motwani R. Robust Identification of Fuzzy Duplicates. ICDE’05. 2005:865-876.

3. Dataspaces

• Halevy AY, Franklin M, Maier D. Principles of dataspace systems. In: PODS ’06. New York, NY, USA; 2006:1-9.

4. Interdependence between schema matching and deduplication

• Zhou X, Gaugaz J, Balke W-T, Nejdl W. Query relaxation using malleable schemas. SIGMOD 2007. Beijing, China; 2007:545-556.

