
Search and Access Strategies for Web Archives Sangchul Song and Joseph JaJa 3. Existing Access...

Date post: 19-Dec-2015

Search and Access Strategies for Web Archives
Sangchul Song and Joseph JaJa

3. Existing Access Methods

1. Background

o The Web has become the main publication medium worldwide, covering almost every facet of human activity. However, the Web is an ephemeral medium.

o Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.

o We need effective and scalable access strategies for web archives covering significant temporal spans.

4. Problems With Existing Methods

o Inefficient handling of time-constrained search.

o Ineffective delivery of search results:

Inadequate relevancy scoring.
• Scoring is performed over the entire history.

Ungrouped search results.
• A URL is not unique in a web archive; its content is time dependent.
• Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be “polluted” by multiple versions of the same URL.
• Users may want to focus on a specific time period within the results.

Lack of a group-scoring methodology.
• Without a group-scoring methodology, it is not clear which group to show at the top.

2. Our Goals: Development of

o An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.

o A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.

o Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally-contiguous versions of a single URL, or of web objects archived within some time span.

o A framework that allows effective search using keywords and time spans for large scale web archives.
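The two kinds of group named in the goals (versions of a single URL, and objects archived within a time span) can be sketched as follows. This is only an illustration: the `Hit` record, its fields, and both function names are hypothetical, and archive dates are reduced to integer years for brevity.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical search hit: one archived version of a URL."""
    url: str
    archived: int   # archive date, simplified to a year
    score: float


def group_by_url(hits):
    """Group hits by URL; each group lists its versions in archive order."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.url].append(hit)
    return {url: sorted(vs, key=lambda h: h.archived) for url, vs in groups.items()}


def group_by_time(hits, span):
    """Group hits into fixed time spans, keyed by the first year of each span."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.archived // span * span].append(hit)
    return dict(groups)


hits = [
    Hit("a.org", 2001, 0.9),
    Hit("a.org", 2000, 0.8),
    Hit("b.org", 2001, 0.5),
]
by_url = group_by_url(hits)
by_time = group_by_time(hits, span=1)
```

Grouping by URL collapses the versions of one page into a single entry, so repeated versions no longer crowd the first result page.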

5. Overview of our Approach:

o Efficient time-constrained search by maintaining separate inverted lists for a given time window (see Block 6).

o Scoring within a temporal context by computing term weights as a function of time (see Block 7).

o Grouping similar search results, while scoring search results as a group (see Blocks 9 and 10).

Search all, and then filter: very inefficient!!

September 11 attacks - Wikipedia, the free encyclopedia
The September 11 attacks (often referred to as nine-eleven, written 9/11) were a series of coordinated suicide attacks by al-Qaeda upon the United States on …
en.wikipedia.org/wiki/September_11,_2001_attacks

September 11 Digital Archive
Uses electronic media to collect, preserve, and present the history of the September 11, 2001 attacks in New York, Virginia, and Pennsylvania and the public …
911digitalarchive.org/

9/11 Tributes, September 11 Tributes and Memorials to the Victims …
Tributes of 9/11 - September 11th 9/11 memorials. For the Victims their Families and the many Heroes of September 11th. 2001. 9/11 World Trade Center, ...
www.jontzen.com/tribute.htm - 132k

National Commission on Terrorist Attacks Upon the United States
Commission chartered to prepare a full and complete account of the circumstances surrounding the September 11, 2001 terrorist attacks, ...
www.9-11commission.gov/ - 8k

… and 4 million other pages pertaining to the September 11th Attack …


Ethiopian calendar - Wikipedia, the free encyclopedia
Thus the first day of the Ethiopian year, 1 Mäskäräm, for years between 1901 and 2099 (inclusive), is usually September 11 (Gregorian), ...
en.wikipedia.org/wiki/Ethiopian_calendar - 43k

APOD: September 11, 1997 - Mars Global Surveyor: Aerobraking
September 11, 1997 See Explanation. Clicking on the picture will download the highest resolution version available. Mars Global Surveyor: Aerobraking …
apod.nasa.gov/apod/ap970911.html - 5k

… and only 560 other pages that are irrelevant to the September 11th Attack


“Find web pages that contain ‘September 11th’ before 2001”

Chronological Listing

Directory

Hybrid

Text-Search

6. Basic Techniques

• Determine a snapshot of web contents covering a time window:
SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }

• Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
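A minimal sketch of the per-window inverted lists described above, assuming integer timestamps and a simple linear scan over windows. The class and method names are hypothetical, and the time hierarchy and multi-version tree mentioned in the text are not modeled here.

```python
from collections import defaultdict


class WindowedIndex:
    """One set of inverted lists per snapshot SC_k covering [t_k, t_{k+1})."""

    def __init__(self, boundaries):
        # boundaries = [t_0, t_1, ..., t_K], sorted ascending
        self.boundaries = list(boundaries)
        self.postings = [defaultdict(set) for _ in self.boundaries[:-1]]
        self.validity = {}  # doc_id -> (valid_from, valid_to)

    def _windows(self, start, end):
        """Indices k whose window [t_k, t_{k+1}) overlaps [start, end)."""
        b = self.boundaries
        return [k for k in range(len(b) - 1) if b[k] < end and start < b[k + 1]]

    def add(self, doc_id, terms, valid_from, valid_to):
        """Post the object into every snapshot in which it is valid."""
        self.validity[doc_id] = (valid_from, valid_to)
        for k in self._windows(valid_from, valid_to):
            for term in terms:
                self.postings[k][term].add(doc_id)

    def search(self, term, start, end):
        """Read only the windows overlapping [start, end), then trim precisely."""
        hits = set()
        for k in self._windows(start, end):
            hits |= self.postings[k].get(term, set())
        return {d for d in hits
                if self.validity[d][0] < end and start < self.validity[d][1]}


index = WindowedIndex([1996, 2001, 2006, 2011])
index.add("calendar@1999", ["september", "11"], 1999, 2000)
index.add("attacks@2002", ["september", "11", "attack"], 2002, 2004)
before_2001 = index.search("september", 1996, 2001)
```

A query restricted to the period before 2001 touches only the first window's lists, instead of searching the whole archive and filtering afterwards.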

8. Search User Interface

9. Grouping Search Results

10. Group-wide Scoring

Grouping is good, but which group should be placed first on the result page?

• Simple method: use the average or highest score among members.
• More effective method: compute a relevancy score for the group as a whole.
• Instead of tf(t), we use df(t), the document frequency of t in the group.
• Instead of idf(t), we use igf(t), the inverse group frequency.

• We extend some of the best-known IR technologies for group ranking.
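The substitution of df(t) for tf(t) and igf(t) for idf(t) might look like the sketch below. The poster does not give the exact formula, so the product df(t) * igf(t) and the smoothed form igf(t) = log(1 + G / gf(t)), with G the number of groups and gf(t) the number of groups containing t, are assumptions made by analogy with tf-idf. The function name is hypothetical.

```python
import math


def group_scores(groups, query_terms):
    """Score each group as a unit.

    groups: {group_id: [set_of_terms_for_each_member_document, ...]}
    """
    n_groups = len(groups)
    # gf(t): number of groups in which at least one member contains t
    gf = {
        t: sum(1 for docs in groups.values() if any(t in d for d in docs))
        for t in query_terms
    }
    scores = {}
    for gid, docs in groups.items():
        total = 0.0
        for t in query_terms:
            df = sum(1 for d in docs if t in d)  # document frequency in group
            if df and gf[t]:
                igf = math.log(1 + n_groups / gf[t])  # assumed smoothing
                total += df * igf
        scores[gid] = total
    return scores


groups = {
    "a.org": [{"september", "attack"}, {"september", "attack"}],
    "b.org": [{"september", "calendar"}],
}
scores = group_scores(groups, ["attack"])
```

Here the group for `a.org` ranks first because two of its versions contain the query term and no other group does.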

7. Scoring within a Temporal Context

• Relevancy scoring is based on the time that a web page was archived.
• The same contents will have different relevancy scores when the temporal contexts are different (e.g., one was archived several months before the other).
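One way to see why identical contents can score differently is to compute term weights only from what the archive held at a given time. The sketch below is an assumption, not the poster's formula: it takes the temporal context to be every object archived on or before the time of interest and uses a smoothed idf, log(1 + N / df). The function names and sample data are hypothetical.

```python
import math


def idf_at(term, when, archive):
    """idf of a term over the objects archived on or before `when`.

    archive: list of (archived_time, set_of_terms) pairs.
    """
    visible = [terms for ts, terms in archive if ts <= when]
    df = sum(1 for terms in visible if term in terms)
    return math.log(1 + len(visible) / df) if df else 0.0


def score_at(doc_terms, query_terms, when, archive):
    """Sum the time-dependent weights of the query terms the page contains."""
    return sum(idf_at(t, when, archive) for t in query_terms if t in doc_terms)


archive = [
    (1999, {"september", "calendar"}),
    (2000, {"mars"}),
    (2002, {"september", "attack"}),
    (2002, {"september", "memorial"}),
]
page = {"september", "calendar"}
early = score_at(page, ["september"], 2000, archive)
late = score_at(page, ["september"], 2002, archive)
```

The same page scores higher in the 2000 context than in the 2002 context, because the term is rarer in the earlier state of the archive.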

First page polluted by the same URL

Grouped by URL (collapsed)

Grouped by Time

Grouped by URL (expanded)

Same contents, different archive dates: different scores!!

[Figure: index structure diagrams for Block 6. Labels recovered: snapshots SC1, SC2, ..., SCK; terms w1, w2, w3, ..., wN; posting lists PL(SCk-wi), e.g. PL(SC1-w1), PL(SC1-w2), PL(SC2-w1), PL(SC4-w1), PL(SCK-w1), PL(SC1-wN); B-Tree; Multi-version Tree.]
