
PRODUCT INFORMATION EXTRACTION

By

Santosh Raju Vysyaraju

200707023

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science (by Research) in

Computer Science & Engineering

Search and Information Extraction Lab, Language Technologies Research Center

International Institute of Information Technology

Hyderabad, India

June 2010

Copyright © 2010 Santosh Raju Vysyaraju

All Rights Reserved

Dedicated to all those people, living and dead, who are directly or indirectly responsible for the wonderful life that I am living now.

INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY

Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Product Information Extraction”, by Santosh Raju Vysyaraju (200707023), submitted in partial fulfillment of the requirements for the award of the degree of Master of Science (by Research) in Computer Science & Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.

Date Advisor :

Dr. Vasudeva Varma, Associate Professor

IIIT, Hyderabad

Acknowledgements

I would like to first thank my advisor Dr. Vasudeva Varma for believing in me and giving me the freedom to work on problems of my interest. I thank Dr. Prasad Pingali for his valuable guidance and help at various junctures throughout the duration of my thesis. I thank Dr. Kamal Karlapalem for his motivation and for generating my research interest during my bachelors. I thank Babji for all the help with the systems. I thank Praneeth for all the innumerable discussions (technical and non-technical), help, motivational talks, suggestions and everything else that has happened ever since that fateful first meeting near NBH. Without him, life would have been tough and it would have been impossible to finish my thesis with ease. I thank Rahul for all the help in writing the papers, enlightening discussions and all the fun. I thank Sowmya for all those innumerable walks and talks, and for pushing me to work at decisive moments. I thank Swathi for all the fun and help. I thank Saras and Shivudu for the light moments during lunch times and dinner outings. Last but not least, I specially thank all my B.Tech batchmates and Krishna Kiran for all those dinner outings and fun throughout my master's, which were very important in relieving the stress.

Abstract

Online shopping has become a very popular web application in recent times and has received a lot of attention from both consumers and retail merchants. With this growing popularity, a lot of information is generated about many products to assist users across the globe. This has resulted in an information explosion, with far more content than consumers can digest, creating a need for techniques that identify useful and relevant content for consumers from the huge amounts available. In this thesis, we have explored the application of information extraction techniques for extracting relevant product information. We have focused on the following sub-problems in the context of products: Attribute Extraction and Attribute Ranking.

We have explored algorithms to automatically extract product attributes from text descriptions. We have come up with unsupervised methods which do not require any domain-specific information for extraction. Our algorithms are based on the hypothesis that attributes should repeat across descriptions. We have explored two clustering methods in this work for extraction: Noun Phrase (NP) Clustering and Word Clustering. In the first method, clusters of noun phrases are computed so that all the phrases describing an attribute are grouped together and a representative attribute is extracted from each cluster. In the second method, we cluster the words appearing in the descriptions such that all the words related to an attribute are grouped together in one cluster. We construct a graph from the word occurrences in descriptions and compute word clusters using a graph clustering algorithm. Attributes are extracted from word clusters using word associations inside a cluster. We have thoroughly evaluated our methods by conducting various experiments. Our experiments show that the methods are robust and extract product attributes accurately. We have also compared the two algorithms and presented an analysis of the trade-offs in the two approaches.

We have defined attribute ranking in the context of product comparison and come up with new features for ranking the attributes. We have also designed a new kind of summary, called a Comparative Summary, that delivers the majority of the comparable content found in an input set of documents. We have presented how comparative summaries can be generated for products using our attribute extraction and ranking algorithms. We have carried out experiments to evaluate the performance of our ranking algorithm and shown that it is effective in identifying useful attributes.

Contents

Table of Contents
List of Tables
List of Figures

1 Introduction
  1.1 Information Extraction
    1.1.1 Applications
    1.1.2 Types of Sources
    1.1.3 Granularity
    1.1.4 Extraction Methods
  1.2 Product Information Extraction
    1.2.1 Product Data
    1.2.2 Research in Product Information Extraction
  1.3 Product Attribute Extraction
  1.4 Problem Statement
  1.5 Contributions
  1.6 Organization of Thesis

2 Related Work
  2.1 Key Phrase Extraction
  2.2 Customer Review Mining
  2.3 Attribute Information Extraction
    2.3.1 Attribute Extraction from Web Pages
    2.3.2 Attribute Normalization
  2.4 Comparative Information Extraction
    2.4.1 Extraction of sentence level comparative information
    2.4.2 Extraction of topic level comparisons

3 Attribute Extraction
  3.1 Challenges
  3.2 General Approach
  3.3 Noun Phrase Clustering Approach
    3.3.1 Pre-processing
    3.3.2 Clustering
    3.3.3 Attribute Identification
  3.4 Summary

4 Attribute Extraction using Word Clusters
  4.1 Small World Property
  4.2 Graph Construction
  4.3 Chinese Whispers Algorithm
  4.4 Attribute Extraction
  4.5 Summary

5 Comparative Summary Generation
  5.1 Introduction
  5.2 Comparative Summary Generation Framework
  5.3 Attribute Ranking
    5.3.1 Features
    5.3.2 Ranking Function
  5.4 Creating the Summary
  5.5 Summary

6 Experiments and Results
  6.1 Data
  6.2 Evaluation Measures
    6.2.1 Precision
    6.2.2 Recall
  6.3 Extraction Experiments
    6.3.1 Clustering Algorithms
    6.3.2 Dataset Size
    6.3.3 Baseline
    6.3.4 Noun Phrase Clustering vs Word Clustering
  6.4 Ranking Experiments
    6.4.1 User Study
    6.4.2 Ranking
    6.4.3 Comparison of Ranking Features

7 Conclusions
  7.1 Contributions
  7.2 Future Work

Bibliography

List of Tables

3.1 Sample Noun Phrase Clusters for Product Class: Acoustic Guitar
4.1 Sample Word Clusters for Product Class: Acoustic Guitar
5.1 Comparative Summary of iPods
5.2 Comparative Summary of Acoustic Guitars
6.1 Precision and Recall for Hierarchical Clustering Algorithms
6.2 Precision and Recall for Chinese Whispers Algorithm with Varying Dataset Size
6.3 Precision and Recall for HAC Algorithm with Varying Dataset Size
6.4 Comparison with Baseline
6.5 Attributes Extracted by NP Clustering and Word Clustering Methods for Acoustic Guitars
6.6 No. of Attributes Picked in Experiment
6.7 Ranked Attributes

List of Figures

3.1 Sample iPod Description 1
3.2 Sample iPod Description 2
3.3 Sample iPod Description 3
4.1 Number of Clusters with Varying Iterations
4.2 Sample Sub-graphs
6.1 Complementary Cumulative Distribution of Useful Attributes in Relation to Number of User Selections
6.2 Distribution of Useful Attributes at Various Ranking Positions
6.3 Distribution of Useful Attributes at Various Ranking Positions for Different Ranking Features

Chapter 1

Introduction

The World Wide Web has emerged as a great source of knowledge, accumulating a lot of new information each day. With the development of information and communication infrastructure, more and more users across the globe are able to access the web. The Internet, with 1.5 billion users now, has seen a three-fold increase in its user base during 2000-2008 [1]. This rapid increase in Internet usage has given rise to a wide variety of web applications for both enterprises and the common man. Millions of users access Internet applications such as e-mail, instant messaging, e-commerce portals, online banking, online ticket booking systems etc. for their day-to-day activities.

With the advent of the World Wide Web, e-commerce portals have become a favorite means of purchase for consumers, and numerous shopping sites have been launched. Websites like Amazon [2] and eBay [3] witness millions of transactions every day on a wide range of products. The new brands and products that keep coming up burden consumers with too much information. In order to get relevant information on a product of his/her interest, a consumer has to go through all the text available on that product, which is a tedious and time-consuming task. In this thesis, we study the problem of extracting relevant information for consumers from the textual data available in e-commerce portals. We propose and evaluate information extraction techniques to solve the problem.

[1] Statistics obtained from www.internetworldstats.com
[2] www.amazon.com
[3] www.ebay.com

In the following sections of this chapter, we introduce the information extraction problem and information extraction in the context of e-commerce. Section 1.1 discusses the information extraction problem in detail. Section 1.2 explains information extraction in the context of e-commerce. Section 1.3 explains the product attribute extraction problem. Section 1.4 defines the problem statement of this thesis. The contributions of this thesis are briefly explained in Section 1.5. We conclude the chapter with the organization of the thesis in Section 1.6.

1.1 Information Extraction

Information Extraction (IE) refers to the process of locating relevant information in natural language text to serve a pre-defined information need. [49] defines IE as the task of filling template information from previously unseen text which belongs to a pre-defined domain. Its goal is to automatically extract structured information such as entities, relationships between entities, and attributes describing entities from natural language text. This information is usually stored automatically in a database, which enables richer queries on the text than are possible with keyword search alone.
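As a toy illustration of template filling, the single hand-written rule below extracts a "has-acquired" record from a sentence. The pattern, relation name and field names are our own illustration, not taken from any system discussed in this thesis.

```python
import re

# One hand-written template-filling rule: '<Company> acquired <Company>'
# with capitalized names. Pattern and field names are illustrative only.
RULE = re.compile(r"([A-Z]\w*(?: [A-Z]\w*)*) acquired ([A-Z]\w*(?: [A-Z]\w*)*)")

def extract_acquisition(sentence):
    """Return a filled 'has-acquired' template, or None if the rule misses."""
    m = RULE.search(sentence)
    if not m:
        return None
    return {"relation": "has-acquired",
            "acquirer": m.group(1),
            "acquired": m.group(2)}

record = extract_acquisition("Oracle acquired Sun Microsystems in 2010.")
```

Once such records are stored in a database, they support queries like "all companies acquired by X", which keyword search alone cannot answer.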

1.1.1 Applications

Information extraction is useful in a diverse set of applications ranging from enterprise applications and personal applications to web applications. Some of the early information extraction tasks include the extraction of named entities and event information from news articles. Competitions like the Message Understanding Conference (MUC) [65, 23, 12] and Automatic Content Extraction (ACE) [44] are based on the extraction of structured entities like people and company names, and relations such as “has-acquired” between them. Other popular tasks are tracking disease outbreaks [22] and terrorist events from news sources. This has resulted in various other research works [21, 69] like the extraction of named entities and their relationships from news articles.

Later, the advent of the Internet and other information-rich sources like Wikipedia resulted in a diverse set of IE applications. Effective techniques have been developed for extraction that can scale to the diversity and size of the web, including open domain question answering [36, 45, 25, 33] and open domain fact extraction [9, 50, 18, 73]. With more than a million articles, Wikipedia has become a rich source of knowledge. The structure of Wikipedia provides an easy way to extract information, which gave rise to many works on information extraction from Wikipedia [64, 77].

1.1.2 Types of Sources

IE systems can be grouped based on the type of text they process: Structured, Semi-Structured and Unstructured.

Structured: Structured data is primarily relational data stored in databases. A relational database stores information about entities, their attributes and the relationships among them in a formal structure, which gives meaning to the stored data. This makes extraction of relevant information from databases an easy task.

Semi-structured: This is a form of structured data that does not conform to the formal structure of tables and data models associated with databases, but contains tags or other markers to separate semantic elements and hierarchies of records and fields within the text. Examples are XML-coded pages or highly structured HTML pages, advertisements in newspapers and job postings.

Unstructured: Unstructured text is a free flow of natural language text containing a set of sentences. It does not provide any semantic information about the text. Examples are newspaper articles.

One of the important factors that influence the accuracy of an information extraction system is the format and style of the source text. Information extraction from unstructured text is a difficult task and poses many challenges. Structured and semi-structured text provide semantic information about the content they represent and are thus relatively easy to process compared to unstructured text.
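The difference in difficulty can be seen in a small sketch: with semi-structured (XML-style) markup, the tags themselves identify the semantic role of each field, so extraction reduces to walking the markup. The snippet and field names below are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Semi-structured data: tags mark the semantic role of each field,
# so extraction is just a traversal of the markup (snippet invented).
snippet = """
<product>
  <name>Nokia N72</name>
  <attribute key="color">black</attribute>
  <attribute key="weight">124 g</attribute>
</product>
"""

root = ET.fromstring(snippet)
record = {"name": root.findtext("name")}
for attr in root.findall("attribute"):
    record[attr.get("key")] = attr.text
```

With unstructured prose such as a newspaper article, no such tags exist, and the system must infer the role of each span of text on its own.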

1.1.3 Granularity

The granularity of a text may vary depending on the type of source. The text could be small snippets containing unstructured records like addresses and classified ads [2, 6], or sentences extracted from a natural language paragraph [67, 23, 7]. In the case of unstructured records, the data can be treated as a set of structured fields concatenated together, possibly with a limited reordering of the fields. Thus, each word is part of such a structured field, and during extraction we just need to segment the text at the entity boundaries. In sentences, there are many words that do not form part of any entity of interest.
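A minimal sketch of this segmentation view, using an invented classified-ad record and hand-written per-field rules (the ad text and rules are illustrative only):

```python
import re

# An unstructured record: structured fields concatenated together.
# Extraction amounts to segmenting at field boundaries and labeling.
ad = "2BHK flat, Jubilee Hills, Hyderabad, Rs. 25000 per month, phone 9876543210"

segments = [s.strip() for s in ad.split(",")]

# Label each segment with a simple per-field rule (rules invented).
record = {}
for seg in segments:
    if re.fullmatch(r"phone \d{10}", seg):
        record["phone"] = seg.split()[1]
    elif seg.startswith("Rs."):
        record["rent"] = seg
    elif "flat" in seg:
        record["type"] = seg
    else:
        record.setdefault("location", []).append(seg)
```

Every word of the record ends up inside some field; in free-flowing sentences, by contrast, most words belong to no entity at all.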

Some extraction tasks require multiple sentences or an entire document for extraction. Popular examples include event extraction from news articles [23], extraction of part numbers and problem descriptions from emails in help centers, structured information extraction from text resumes, extraction of the title, location and timing of a talk from talk announcements [60], and the extraction of paper headers and citations from scientific publications [48].

1.1.4 Extraction Methods

The extraction of structured information from noisy, unstructured text poses many challenges, which have attracted the attention of different research communities including Natural Language Processing, Machine Learning and Information Retrieval. Numerous information extraction techniques have been designed over the last two decades to cater to the requirements of various applications. These techniques can be primarily categorized into two groups: (1) Rule-based Approaches and (2) Machine Learning.

The rule-based approach relies on a system expert who is familiar with both the application domain and the required function of the IE system. Early information extraction systems were all rule-based [26, 16, 41, 57], and they continue to be researched and improved [14, 30, 43] to meet the challenges of real-world extraction systems. Rules are particularly useful when the task is controlled and clear, like the extraction of phone numbers and zip codes from emails, or when creating wrappers for machine-generated web pages. Also, rule-based systems are faster and more easily amenable to optimizations [56, 61]. Rules are typically hand-coded by a domain expert. However, they can also be automatically learned from training examples created from unstructured text. Several algorithms have been studied for inducing rules from labeled examples, of which bottom-up [10, 11] and top-down [62, 53] rule formulation are well known. In the bottom-up approach, a specific rule is generalized; in the top-down approach, a general rule is specialized.
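A hand-coded rule of this kind can be as simple as a pair of regular expressions, e.g. for the phone-number and zip-code task mentioned above (US-style formats assumed purely for illustration):

```python
import re

# Two hand-coded rules for a controlled task: US-style phone numbers
# and ZIP codes in email text (formats assumed for illustration).
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")
ZIP = re.compile(r"\b\d{5}(?:-\d{4})?\b")

email = "Reach me at (415) 555-0123; ship to 94103-1234, San Francisco."
phones = PHONE.findall(email)
zips = ZIP.findall(email)
```

When the input is this regular, such rules are accurate, fast and easy to audit, which is why rule-based systems remain attractive for controlled extraction tasks.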

Machine learning methods formulate the information extraction problem as a labeling task where the unstructured text is segmented and the individual parts are labeled. These techniques can be broadly classified into two classes: generative models based on Hidden Markov Models [2, 60] and conditional models based on maximum entropy [42, 55, 37]. Both were superseded by global conditional models, popularly called Conditional Random Fields [34]. Rule-based methods and statistical methods are used in parallel, depending on the nature of the extraction application. A few hybrid models [13, 19] have also been proposed that attempt to benefit from both machine learning and rule-based methods.

1.2 Product Information Extraction

The rapid expansion of e-commerce has resulted in a sharp rise in the number of products sold on the web, along with the number of people buying on the web. A lot of textual data is generated in this process to assist customers with the information necessary to select the right product among the many available. Product Information Extraction (PIE) refers to the application of information extraction and text mining techniques to extract relevant information about products. The text useful for PIE comes from a variety of sources and is available in different forms: product manufacturers provide descriptions, consumers write reviews, online merchants prepare feature tables, etc.

1.2.1 Product Data

Most of the data on products is unstructured text, which we have classified into the following categories based on the nature of the text:

Customer Reviews: In order to enhance customer satisfaction and their shopping experiences, it has become a common practice for online merchants to enable their customers to review or express opinions on the products that they buy. With more and more common users becoming comfortable with the Internet, an increasing number of people are writing reviews. As a consequence, the number of reviews that a product receives grows rapidly; some products get hundreds of reviews at popular merchant sites.

Product Descriptions: Product manufacturers usually provide raw text descriptions or technical specifications along with the product to help customers understand the functionality and usability of products. These documents, referred to as Product Descriptions, contain unstructured text which explains the product features in greater detail.

Structured Tables: Another source of information on products is structured tables. Review websites and price comparison websites like Epinions [4] and Google Product Search provide tables along with product ratings and reviews. These tables list the features of a product with their corresponding values to help customers get a quick overview of the product.

[4] www.epinions.com


1.2.2 Research in Product Information Extraction

Customer reviews are a valuable source of information as they are written by users who have had hands-on experience with the product. Reviews provide critical opinions and comments on products and their respective features. It is common practice for a new customer who wishes to buy a product to read the available reviews on that product. However, there can be a large number of reviews for a single product, and it is very painful for a customer to go through all of them. Also, a customer may be interested in a particular feature and looking for opinions on that feature; the remaining text in the reviews becomes irrelevant for him/her.

All of the above issues have encouraged research on customer reviews, and many interesting text mining applications have been developed under the name “Sentiment Analysis”. Some of the interesting applications include Polarity Detection, Subjectivity/Objectivity Identification and Feature-based Opinion Mining. Polarity detection [47, 71] is the classification of a given opinion text as positive, negative or neutral. Subjectivity/Objectivity Identification [46] refers to the classification of a given text, usually a sentence, as objective or subjective. Feature-based sentiment analysis [27] refers to the study of determining the opinions or sentiments expressed on different features of entities, e.g., a cell phone or a digital camera. In simple terms, a feature or attribute is a property or a component of a product, e.g., the screen of a cell phone, or the picture quality of a camera. More details on the techniques used in the above mentioned applications are discussed in Chapter 2.

Product descriptions are the most primitive form of information available on a product, as they are provided by the manufacturer. Thus product descriptions are likely to be available for most products, whereas reviews and tables are written only for selected popular products. One problem that has recently been studied [20] is the automatic extraction of attributes and values from product descriptions. The goal of the task is to automatically generate structured tables from text descriptions.

Tables offer a simple, structured layout for presenting the attributes and values of a product. Many online stores and review websites provide product attribute information in tables. If the product attribute information is extracted from multiple websites, another desirable task is automatic normalization of the product attributes [75], preferably obtaining the semantic meaning of the normalized attributes. This can improve the indexing of product web pages and support intelligent tasks such as attribute search or product matching.

In this thesis, we study unsupervised techniques to extract attributes from product descriptions. We also define a novel form of summaries, referred to as Comparative Summaries, which present the attributes and values of different products together in a compact form.

1.3 Product Attribute Extraction

Every product has a set of attributes which best describe its characteristics or functionality. An attribute can be a tangible or intangible property of that product. Each attribute is also associated with a value. Values can be binary, discrete from a given set, or numeric from a range. For instance, FM-Radio is a binary-valued attribute with just a yes or no value, whereas color takes a discrete value from the set of colors and weight takes a numeric value. To avoid confusion, we use the phrase product class to refer to a product type like Cell Phone, and product to refer to an individual model such as the Nokia N72, throughout this thesis.
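This attribute/value distinction can be captured in a small sketch; the class name and sample values below are our own illustration, not a data model used later in the thesis.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative only: one record per attribute, whose value is binary,
# discrete from a set, or numeric from a range.
@dataclass
class Attribute:
    name: str
    value: Union[bool, str, float]

cell_phone = [
    Attribute("FM-Radio", True),   # binary: yes/no
    Attribute("color", "black"),   # discrete, from the set of colors
    Attribute("weight", 124.0),    # numeric, in grams
]
```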

We define Product Attribute Extraction (PAE) as the task of automatically extracting attribute information about a product from its text descriptions. By attribute information we mean the attributes and corresponding values of products.
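As a deliberately naive sketch of PAE, and of the hypothesis stated in the abstract that attributes repeat across descriptions, one can simply count terms that recur in every description of a product class. The clustering methods of Chapters 3 and 4 are far more robust than this; the descriptions below are invented.

```python
import re
from collections import Counter

# Invented digital-camera descriptions for illustration.
descriptions = [
    "10x optical zoom with fast auto focus and 12 MP resolution",
    "High resolution sensor, smooth auto focus, 5x zoom lens",
    "Compact body, 3x zoom, quick auto focus",
]

# Count in how many descriptions each term appears.
counts = Counter()
for text in descriptions:
    counts.update(set(re.findall(r"[a-z]{2,}", text.lower())))

# Naive candidates: terms that appear in every description.
candidates = sorted(w for w, c in counts.items() if c == len(descriptions))
```

Even this crude count surfaces "zoom" and "auto focus"-related terms, because true attributes are mentioned again and again while incidental wording is not.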

An online shopper willing to buy a product has to go through its description on the website to know its features. Often there are many varieties, and it is painful for a consumer to manually read all the descriptions to select a product. Manually preparing attribute lists is a difficult and time-consuming task for e-commerce websites and search engines. It is also difficult to keep manual lists up to date with the new products and new attributes that come up every day. We therefore propose that an automatic PAE system is an ideal information access tool for users to get a quick overview. For example, a future 4G iPhone may have a new feature that is not available in the current 3G iPhones. An automatic PAE system can easily capture such features, making it possible to generate comparative summaries dynamically at the time of a user's request. The traditional way of doing this is by manually keying attribute values into structured databases, which are very difficult to manage and keep up to date with changing attributes.

In this thesis, we deal with two specific problems of PAE: Attribute Extraction and

Comparative Summary Generation.

Attribute Extraction: Given a set of text descriptions of multiple products which belong to the same product class, the task is to identify the list of attributes specific to that product class; e.g., for a set of Digital Camera descriptions, the output consists of attributes like “zoom”, “auto focus”, “resolution”, etc.

Comparative Summary Generation: We define a novel form of summaries, referred to as comparative summaries, which provide a comparative study of multiple items belonging to the same category. The purpose of a comparative summary is to provide a quick, concise comparison of multiple items of a category. An assumption in such a task is that the documents in the collection talk about comparable items such as entities or events. In our case, these entities are products belonging to the same product class, and the input documents are text descriptions of these products. A comparative summary provides the attributes common to these products and their corresponding values with respect to each product. It presents the attributes in decreasing order of importance in product comparison. The attributes and values of the products are presented in an Attribute vs Product matrix, with the elements of the matrix representing the value of the corresponding attribute for each product. The comparative summary generation task deals with the extraction of comparative summaries for a product class from its descriptions. It involves two important steps: (1) Attribute/Value Extraction from descriptions (described in the above task) and (2) Attribute Ranking based on the importance of the attributes in product comparison. We describe this task in more detail in Chapter 5.
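The Attribute vs Product matrix described above can be sketched as follows; the products, attributes and values are invented for illustration, and a fixed attribute order stands in for the importance ranking.

```python
# Per-product attribute/value dicts (data invented for illustration).
products = {
    "iPod Nano":    {"capacity": "8 GB",   "screen": "2.0 in", "FM-Radio": "no"},
    "iPod Classic": {"capacity": "160 GB", "screen": "2.5 in"},
}

# Rows: attributes in (assumed) decreasing order of importance.
# Columns: products. Cells: the attribute's value, or '-' if absent.
attributes = ["capacity", "screen", "FM-Radio"]
matrix = [[attr] + [products[p].get(attr, "-") for p in products]
          for attr in attributes]
```

Each row compares one attribute across all products, which is exactly the compact, at-a-glance layout a comparative summary is meant to deliver.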

1.4 Problem Statement

Information extraction from product descriptions is not a well-studied problem. Though there have been many research works on information extraction and sentiment analysis from customer reviews, few attempts have been made at information extraction from product descriptions. A few semi-supervised approaches have been proposed recently.

Our goal in this thesis is to study the problem of Product Attribute Extraction and develop efficient and scalable extraction methods. These methods can be applied to any kind of product and are thus domain independent. In essence, this work provides solutions for easily obtaining attribute information about products without manual labour. Our thesis problem can be divided into the following steps.

• Studying the problem of Product Information Extraction.

• Developing methods for attribute extraction and attribute ranking.

• Defining comparative summaries and coming up with solutions for generating comparative summaries for products.

• Evaluating our methods using existing evaluation measures.

1.5 Contributions

The contributions of this thesis towards the area of product information extraction and comparative text mining are given below:

10

CHAPTER 1. INTRODUCTION

• Development of novel unsupervised algorithms for the product attribute extraction problem. Our algorithms extract attributes specific to a product class from the text descriptions of its different varieties.

• Proposal of methods for ranking the attributes of a product class according to their importance in providing comparisons among different products. We have defined the notion of Ranking for Comparison and computed a ranking function.

• Design of a novel form of summaries, referred to as Comparative Summaries, which provide a comparative study of multiple items belonging to the same category. We have defined a phrase-based form of summary using the attribute-value paradigm.

• Presentation of techniques to compute Comparative Summaries for products from text descriptions using the attribute extraction and ranking methods.

1.6 Organization of Thesis

The main focus of this thesis is the development of unsupervised techniques for product attribute extraction and attribute ranking. This chapter introduces the area of information extraction and its relevant problems in the context of products. We have described the specific sub-problems of product information extraction which are dealt with in this thesis. We have also provided an overview of the various solutions proposed, which are explained in detail in the subsequent chapters.

Chapter 2 presents some of the related literature in the area. We have identified four problems which are relevant to the problems studied in this thesis. The chapter explains the related work on these problems in four sections. Each section explains a problem and surveys the various solutions proposed. We have also compared the different works and discussed the advantages and disadvantages of the solutions.

Chapter 3 explains the noun phrase clustering approach to attribute extraction. The chapter begins by explaining the various challenges involved in the attribute extraction task and then proceeds with the details of the extraction method.

Chapter 4 explains the word clustering method for attribute extraction, which overcomes the shortcomings of the noun phrase clustering method.

Chapter 5 introduces comparative summaries and presents a solution for generating comparative summaries for products. The chapter begins with the introduction of comparative summaries and then describes the steps involved in generating them for products.

Chapter 6 explains the experiments conducted in detail. The chapter begins with details on the datasets used in the experiments and then describes the evaluation measures. The chapter explains each experiment in detail and presents an analysis of the results obtained. Experiments are conducted to evaluate the techniques presented in Chapters 3 and 5.

Chapter 7 concludes this thesis by summarizing the work done and the contributions of this thesis. This chapter also provides an outline of future work.

Chapter 2

Related Work

In this chapter, we survey the literature related to the problems addressed in this thesis: Product Attribute Extraction and Comparative Summarization. There are very few studies on attribute extraction from product descriptions and no prior work on comparative summarization, so we present previous studies which are broadly related to attribute extraction and comparative summarization. We have identified four such problems:

1. Key Phrase Extraction

2. Customer Review Mining

3. Product Attribute Extraction

4. Comparative Information Extraction

The following sections in this chapter explain the previous studies on the above four problems. We give a brief introduction to each problem and survey the different approaches that were previously used to solve it.

2.1 Key Phrase Extraction

Key Phrase Extraction aims to identify the most relevant words or phrases in a set of documents. It is the process of identifying a phrase in the input documents and extracting it as a keyphrase. Keyphrases provide a high-level overview of a document's content and help readers decide whether the document is relevant to them. There has been an increase in the amount of information available to both lay users and professional users such as journalists and analysts, who are required to deal with large collections of documents from unfamiliar domains. Keyphrases provide a powerful means for sifting through large numbers of documents and gaining an understanding of the topics and events particular to a domain. This has called for methods to condense information and make the most important content stand out. Keyword extraction is one of the first and most prominent such methods, later followed by automatic text summarization and related techniques.

Keyphrases are usually selected manually. Authors of technical articles provide keywords for their documents so that the reader gets a quick, concise overview of the topic being discussed. Professional indexers often choose phrases from a predefined controlled vocabulary relevant to the domain at hand. However, only a small fraction of documents come with keyphrases, and attaching them manually is a very laborious task that requires knowledge of the subject matter. Thus automatic keyphrase extraction techniques provide great benefits and have become popular.

The problem of product attribute extraction is closely related to keyphrase extraction. Both tasks involve the extraction of phrases from a single document or a set of documents. However, product attribute extraction is a more specific problem where the input documents describe products. It has the additional constraint that the extracted phrases should define a feature or property of the product described in the input documents. This makes product attribute extraction a special and difficult problem.

There has been a lot of work on automatic keyphrase extraction, and several methods have been proposed. Some of the works treated keyphrase extraction as a classification problem and presented supervised learning approaches. In these approaches, a document is treated as a set of phrases, and every phrase encountered in the text is classified as positive or negative, distinguishing keyphrases from the other phrases. [70] presents the GenEx keyphrase extraction system, which is based on a set of parametrized heuristic rules that are fine-tuned using a genetic algorithm. Kea [74] uses naive Bayes learning to train on keyphrases and is shown to produce significant results when both the training and testing data are limited to the same domain. [63] presented KPSpotter, a web-based keyphrase extraction system capable of processing various types of data such as XML, HTML or unstructured text. The system uses an information gain measure and natural language processing techniques to extract the relevant phrases. All the above-mentioned methods use statistical features like frequency to learn the characteristics of keyphrases. [28] showed that the performance could be improved by using linguistic features like POS tags and NP chunks along with statistical features.

Most of the unsupervised approaches to keyphrase extraction exploit the statistical information associated with the words in the documents to identify the keyphrases. [40] proposed a keyword extraction algorithm for a single document that doesn't utilize any reference corpus. They compute the co-occurrence distribution of the terms with the most frequent terms in the document and use the degree of bias in the distribution to estimate the importance of the terms. [68] present a language modeling approach to extract keyphrases. They construct multiple language models from the input documents and a reference corpus. Pointwise KL-divergence between these language models is used to score the candidate phrases from the input documents.
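The pointwise KL-divergence scoring idea of [68] can be sketched as follows. This is a toy illustration with invented phrase counts and additive smoothing, not the authors' implementation: each candidate phrase is scored by P(p) log(P(p)/Q(p)), where P is estimated from the input documents and Q from a reference corpus.

```python
import math
from collections import Counter

def pointwise_kl_scores(doc_phrases, background_phrases, smoothing=1e-9):
    """Score candidate phrases by pointwise KL divergence between a
    foreground model (input documents) and a background model (reference
    corpus): score(p) = P(p) * log(P(p) / Q(p))."""
    fg, bg = Counter(doc_phrases), Counter(background_phrases)
    fg_total, bg_total = sum(fg.values()), sum(bg.values())
    scores = {}
    for phrase, count in fg.items():
        p = count / fg_total
        q = bg.get(phrase, 0) / bg_total + smoothing  # smoothed background prob
        scores[phrase] = p * math.log(p / q)
    return scores

# Invented counts: "battery life" is frequent in the documents but absent
# from the reference corpus, so it receives the highest score.
doc = ["battery life", "battery life", "screen", "the"]
ref = ["the", "the", "the", "screen"]
ranked = sorted(pointwise_kl_scores(doc, ref).items(), key=lambda kv: -kv[1])
print(ranked[0][0])  # battery life
```

Phrases common in the background (like "the") score low or negative, which is what makes the measure useful for separating keyphrases from ordinary language.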

2.2 Customer Review Mining

The Web contains a wealth of opinions about products, written by consumers on e-commerce portals. Reviews are a very important source of information for consumers who want to buy a product. However, it is difficult for consumers to go through all the reviews available for a product. This has resulted in the development of a new area of text mining called "Customer Review Mining" (CRM). CRM focuses on automatically identifying opinions on a product and its features from customer reviews.

Polarity detection deals with the classification of opinion text as positive or negative. [47, 71] present various solutions for polarity detection from customer reviews, which are a huge resource of opinion content. [27, 51, 59] present techniques to extract product features from customer reviews. Product feature extraction from descriptions poses different challenges: numerous reviews are available for each product, whereas product descriptions are few in number and the text is sparse. [27] mine the words that frequently co-occur in phrases to find the product features using association rule mining. [51] present techniques based on frequently occurring patterns in reviews to extract product features from customer reviews. Patterns of this kind are rare in product descriptions, which makes the task challenging.

However, feature extraction from consumer reviews is not exhaustive, for the following reasons: 1. Review mining methods primarily focus on mining opinions and determining polarity. Customers usually discuss only the important features and rarely mention the less important features of a product. Thus feature extraction from customer reviews is limited to only the important features. 2. Only a select set of popular products get enough consumer reviews for mining, and many products have no reviews at all. Thus product attribute extraction from consumer reviews is possible only for those products which have a sufficient number of reviews.

2.3 Attribute Information Extraction

Attribute information extraction from product descriptions is not a well-studied problem, with only a few works having come up recently. [20] proposed a method to extract attributes and values of products from text descriptions. They extract attribute-value pairs from a large set of product descriptions belonging to a particular domain, employing a semi-supervised algorithm which identifies attributes and values in the sentences of text descriptions of multiple products of a domain. In this thesis, we focus on unsupervised methods for extracting attributes specific to a product. Moreover, our techniques can extract attributes from small datasets containing fewer than 50 descriptions, whereas their approach requires a relatively large dataset belonging to a domain.

2.3.1 Attribute Extraction from Web Pages

Retail websites and online merchants often display attribute-value information in product tables, as described in Section 1.2.1. These websites come up with their own layouts and templates for displaying product tables, which makes the identification of attributes and values from product tables a difficult task. Wrapper learning methods [79] have been developed to extract information from web pages, but these methods are supervised and require manual work to obtain the training data. Also, the learned wrapper can be used for extraction only from the websites which were used to prepare the training samples. [80] uses the layout of the web page and employs an integrated approach of Hierarchical Conditional Random Fields (HCRF) which can segment and label elements of web pages from different websites. But it is a supervised learning method that requires training examples for each attribute in advance, and is thus not capable of discovering previously unseen attributes. [76] and [75] propose semi-supervised and unsupervised learning methods, respectively, to extract attributes accurately; these methods are template-independent and can discover unseen attributes.

2.3.2 Attribute Normalization

Another problem that is closely related to attribute information extraction is Attribute Normalization. Attributes extracted from different sources are not normalized, and human effort is required to judge whether two fields refer to the same attribute. For example, one may not know that the extracted text fragments "fireworks" and "portrait" refer to two different values of the same attribute "shooting mode" of a digital camera. Attribute normalization is defined as grouping attribute values with similar semantic meaning. It has many useful applications, such as storing attribute values of products in a product database, retrieving and matching products, etc. [75] propose a framework for the normalization of attributes found on multiple websites. They designed a probabilistic graphical model that captures page-independent content information and page-dependent layout information of text fragments in web pages, simultaneously identifying attribute-value pairs and normalizing them.

2.4 Comparative Information Extraction

Many works in the past have dealt with the problem of extracting comparative information from text. By comparative information extraction, we mean the extraction of elements from text which compare two or more entities or concepts. As part of this thesis, we have presented techniques for the extraction of comparative information on products, which we refer to as "Comparative Summaries", as defined in Chapter 1. Previous works on comparative information extraction can be classified into two categories: extraction of sentence-level comparative information and extraction of topic-level comparisons.

2.4.1 Extraction of sentence level comparative information

There is a class of works focused on the extraction of text which explicitly compares two or more entities or concepts. [31] proposed a method for the extraction of comparative sentences from customer reviews which make subjective or objective comparisons between products. They proposed a supervised learning approach based on pattern discovery to identify comparative sentences. There are a few works that summarize the comparable content in a set of related documents, which again is a sentence extraction task. [38] describes a summarization problem that identifies similarities and dissimilarities in information content among a set of related documents. Initially, they extract text units like phrases/words and identify relationships between them, which are finally used to generate the summary. They align related sentences across the documents, highlight important phrases and present them as a summary. More recently, the idea of contrastive summarization [32] was proposed for products. A pair of products is compared using their customer review collections, and two summaries are generated highlighting the differences between the two products.

Our work differs from the above works in the following respects. All the methods discussed above concentrate on the extraction of sentences with comparable content. We do not extract comparative sentences, but work at a more granular level: we identify and extract attributes specific to a product class and relate them to the various products of that class in a comparative summary. Also, the majority of the above works provide comparisons on a pair of products, whereas comparative summaries provide attribute-level comparisons on multiple products simultaneously. Another important difference is that the sentences extracted by them can contain comparisons that are subjective or objective in nature. Comparisons drawn from comparative summaries are objective in nature, as they focus on differences in product attributes.

2.4.2 Extraction of topic level comparisons

Another class of works focuses on the extraction of more granular topics from a set of related documents. [17, 39] present methods for identifying corresponding topics or themes across several corpora that are focused on related, but distinct, domains. [17] achieve this by cross-dataset clustering. Their method simultaneously clusters multiple datasets such that each cluster includes elements from several datasets, capturing a common theme which is shared across the sets. [39] proposed an unsupervised algorithmic framework based on distributional data clustering for the task. [78] define Comparative Text Mining over a set of comparable text collections as the task of discovering any latent common themes across all collections, as well as summarizing the similarities and differences of these collections along each common theme. Though these approaches extract common themes that are comparable across sub-corpora, the granularity of a theme is vaguely defined and application specific. Also, they do not have the notion of explicit attributes or properties, whereas the granularity of the attributes in a comparative summary is more precisely defined.


Chapter 3

Attribute Extraction

This chapter presents a solution to the attribute extraction problem. The goal is to extract the attributes of a product class given a set of sample product descriptions, where each input description describes a product belonging to that class. In this chapter, we present an unsupervised solution to the problem which doesn't require any domain-specific knowledge. The chapter begins with a discussion of the challenges involved in the attribute extraction task in Section 3.1. Section 3.2 explains the motivation behind our methods, and Section 3.3 presents the actual method, which we refer to as the "Noun Phrase Clustering Method", in detail.

3.1 Challenges

The attribute extraction task is not a trivial problem and has many challenges. Often, only a small number of descriptions are available for a product class. Also, product descriptions contain unstructured text. This poses interesting challenges and makes the task difficult. In order to better understand the complexity of the problem, we first describe a sample product description.

Each input description is a text document describing a product belonging to a product class. Figs. 3.1, 3.2 and 3.3 show sample descriptions of three iPod models. A description typically contains 6 to 10 incomplete sentences, whose length varies from very short to long. Sometimes, a sentence could be just a single noun phrase. These incomplete sentences describe different features of the product.

We have identified the following challenges in the attribute extraction task.

• Data Sparseness: The primary challenge is to learn the characteristics of a product class from a sparse dataset. The system takes a few descriptions as input and has no other information about the product class. Thus it has limited evidence from which to identify the attributes.

• Unstructured text: The system takes unstructured text as input, in the form of incomplete sentences. Incomplete sentences hinder the use of existing natural language techniques that gain knowledge from sentence structure. Though tools like syntactic parsers and named entity recognizers are widely available, they require the input text to be grammatically correct to work accurately.

• Domain Knowledge: The system has no prior information about the product class apart from the text in the descriptions. Also, it is expensive to create domain-specific data, as there is a wide range of products. So we intend to make our methods domain independent. Thus our system doesn't use any external resources except for a generic list of units of measurement.

• No Supervision: Attribute extraction is a completely unsupervised task since there

are no labeled samples to learn from. By unsupervised, we mean that there is no

human intervention in the whole extraction process. We want our methods to readily

scale to any product class. Thus the methods should be developed in an unsupervised

manner.

• Noise: Product descriptions contain noisy patterns which describe the features but are not features themselves. For example, in Fig. 3.2, "popular music player" and "even more beautiful" do not define any attribute. These patterns repeat across the descriptions and make the extraction task difficult.

Figure 3.1 Sample iPod description 1

Figure 3.2 Sample iPod description 2

Figure 3.3 Sample iPod description 3

3.2 General Approach

The goal of attribute extraction is to identify the attributes of a product class given a collection of sample descriptions, where each description is a text document explaining the features of a product belonging to that class.

Figures 3.1, 3.2 and 3.3 show three sample descriptions. As explained in Section 3.1, product descriptions consist of incomplete sentences and long noun phrases. From the sample descriptions, it is evident that attributes like "display" and "storage" repeat across descriptions. So, in a description collection, attribute terms are more likely to occur than other terms. Thus the simplest way to select attributes is to take the most frequent terms in the collection. However, this method has a few drawbacks.

• First, this list will tend to contain only frequent attributes and is likely to miss rare attributes appearing in only a few product descriptions.

• This list may also contain frequent values, and the method cannot differentiate attributes from values.

• Common noisy terms may be selected due to their high frequency in the description collection. These noisy terms could be general English stop words like "not" and "that", or other modifiers like "great", "style" and "beautiful", which are very common in descriptions.
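For concreteness, the frequency baseline described above can be sketched as follows. The stop-word list and descriptions are illustrative only; note how values like "rosewood" and "steel" surface alongside true attributes like "fretboard" and "strings", illustrating the second drawback.

```python
from collections import Counter

# Illustrative stop-word and modifier list; a real system would use a
# fuller list.
STOP = {"a", "an", "the", "with", "and", "not", "that", "great", "beautiful"}

def frequent_terms(descriptions, k=3):
    """Baseline: return the k most frequent non-stop-word terms in the
    description collection as candidate attributes."""
    counts = Counter(
        word
        for text in descriptions
        for word in text.lower().split()
        if word not in STOP
    )
    return [term for term, _ in counts.most_common(k)]

descriptions = [
    "rosewood fretboard with steel strings",
    "bound rosewood fretboard and a gig bag",
    "steel strings and a great gig bag",
]
print(frequent_terms(descriptions, 5))
```

The output mixes attributes with values and cannot tell them apart, which motivates the clustering-based approach proposed next.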

We propose a text clustering based approach to overcome the above problems. A general observation is that multiple products of a class have attributes in common, so an attribute is likely to occur multiple times in the descriptions. We try to capture the attributes using these repetitions. For example, the attribute "Fretboard" appears in different contexts as "Rosewood Fretboard", "Bound Rosewood Fretboard", "Javanese Rosewood Fretboard", etc.

In this work, we explore clustering at two levels of the text: Noun Phrase Clustering and Word Clustering. In the first approach, noun phrases from the description collection are grouped; in the second, clusters of words are computed. The Noun Phrase Clustering (NP Clustering) method is described in this chapter and the Word Clustering method in Chapter 4.

Our solution operates in two stages. In the first stage, we cluster the text in the descriptions such that each resulting cluster represents a single attribute and captures its context in different occurrences. This results in clusters of varying size, with bigger clusters containing frequent attributes and smaller clusters representing rare attributes. In the second stage, we extract an attribute from each cluster.

3.3 Noun Phrase Clustering Approach

As stated earlier, this method consists of two stages: Clustering and Attribute Identification. In the first stage, explained in Section 3.3.2, we cluster the noun phrases from all the descriptions, and in the second stage, explained in Section 3.3.3, we extract attributes from the clusters.

As can be seen from the product descriptions, attributes occur in noun phrases. For example, in Fig. 3.1, "backlight" and "LCD" occur in the phrases "LED backlight" and "2.5 inch color LCD" respectively. The context of an attribute varies from one occurrence to another: an attribute is accompanied by a general modifier or a value. Also, a noun phrase usually contains no attribute or a single attribute, and rarely contains more than one attribute. This motivated us to use noun phrase clustering, so that noun phrases related to an attribute are grouped together in a single cluster.

Before the clustering and attribute identification stages, the text in the descriptions undergoes a preprocessing step to obtain the noun phrases.

3.3.1 Pre-processing

The input descriptions contain raw text, as shown in Figs. 3.1, 3.2 and 3.3. The goal of this step is to identify the noun phrases in the product descriptions; these noun phrases are given as input to the clustering. Pre-processing is a simple three-stage process: Sentence Splitting, POS Tagging and Stemming. First, the text in the descriptions is split into (incomplete) sentences using a rule-based sentence splitter. A set of patterns is devised to split the text into sentences. We split the text when any of the following patterns occurs:


• A newline is encountered.

• A semicolon followed by whitespace is encountered.

• A full stop '.' followed by whitespace and preceded by a lowercase letter is encountered. A full stop '.' is a potential sentence delimiter; however, it doesn't always imply sentence termination, since numerical fractions and abbreviations also contain full stops.
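The three splitting rules above can be expressed as a single regular expression; the sketch below is one possible implementation, not the thesis's exact splitter.

```python
import re

# Split at: a newline; a semicolon followed by whitespace; or a full stop
# preceded by a lowercase letter and followed by whitespace. The lookbehind
# keeps fractions like "2.5" intact, since the '.' there follows a digit.
SPLIT_PATTERN = re.compile(
    r"\n"               # a newline
    r"|;\s"             # a semicolon followed by whitespace
    r"|(?<=[a-z])\.\s"  # a full stop after a lowercase letter, before whitespace
)

def split_sentences(text):
    return [chunk.strip() for chunk in SPLIT_PATTERN.split(text) if chunk.strip()]

text = "2.5 inch color LCD; LED backlight.\nHolds up to 20,000 songs. Even thinner design"
print(split_sentences(text))
```

On this made-up description fragment, the splitter yields four (incomplete) sentences and leaves "2.5" and "20,000" untouched.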

Once the sentences are obtained, they are tagged with parts-of-speech (POS) tags using Brill's tagger [8]. Noun phrases are extracted from these sentences using the POS tags. Then we stem the noun phrases using the Porter stemmer [52].
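A simplified sketch of the noun-phrase extraction and stemming steps follows. Instead of running Brill's tagger, it assumes the sentence is already POS-tagged, takes maximal runs of adjective/noun/number tags as noun phrases, and uses a crude stand-in for the Porter stemmer; the tag set and heuristics are illustrative assumptions.

```python
# Tags that may appear inside a noun phrase in this simplified grammar.
NP_TAGS = {"JJ", "NN", "NNS", "NNP", "CD"}

def extract_noun_phrases(tagged_sentence):
    """Collect maximal contiguous runs of NP_TAGS words as noun phrases."""
    phrases, current = [], []
    for word, tag in tagged_sentence:
        if tag in NP_TAGS:
            current.append(word)
        elif current:
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

def crude_stem(phrase):
    """Toy normalizer: lowercase and strip a trailing 's' from longer words.
    The thesis uses the Porter stemmer instead."""
    return " ".join(w.lower().rstrip("s") if len(w) > 3 else w.lower()
                    for w in phrase.split())

tagged = [("2.5", "CD"), ("inch", "NN"), ("color", "NN"), ("LCD", "NNP"),
          ("with", "IN"), ("LED", "NNP"), ("backlight", "NN")]
print([crude_stem(np) for np in extract_noun_phrases(tagged)])
# ['2.5 inch color lcd', 'led backlight']
```

The two extracted phrases match the examples discussed in Section 3.3 ("2.5 inch color LCD" and "LED backlight").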

3.3.2 Clustering

The goal of this step is to cluster the noun phrases found in the descriptions so that all the

noun phrases containing a particular attribute are grouped together.

Let X = {x1, x2, x3, ...} be the set of noun phrases extracted from the descriptions D, on which a clustering function CF : X → X is performed. The output of clustering is a partitioning of the noun phrases into disjoint sets: X = X1 ∪ X2 ∪ ... ∪ Xn, where each Xi is a cluster of phrases.

We use the Hierarchical Agglomerative Clustering (HAC) algorithm for the clustering function CF. The HAC algorithm requires a similarity measure between a pair of data points, noun phrases in our case. So, we have come up with a similarity function to find the similarity between a pair of noun phrases. This similarity measure must be chosen in such a way that instances of the same attribute are brought together. We use a unigram-overlap based score for this purpose. However, words at different positions in a phrase do not carry the same weight. Phrases containing attribute-value pairs tend to have the attribute at the head noun of the phrase. Since we are looking for attribute overlap between the phrases, we define a positional feature f for a word w in a noun phrase xi whose value decreases with its distance from the head noun:

f_{w,i} = 1 / (1 + D_{wh})    (3.1)

where D_{wh} is the distance of the word w from the head noun.

Let S1 and S2 be the unigram sets of two noun phrases xi and xj. We now define the similarity function using the positional feature as

Sim(xi, xj) = Σ_w (f_{w,i} + f_{w,j}) / (Σ_u f_{u,i} + Σ_v f_{v,j}),  where w ∈ S1 ∩ S2, u ∈ S1, v ∈ S2    (3.2)
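Equations 3.1 and 3.2 can be sketched as follows, taking the head noun to be the last word of the phrase. This is an illustrative implementation (it keeps one weight per distinct word), not the thesis's exact code.

```python
def positional_weights(phrase):
    """Eq. 3.1: f_w = 1 / (1 + distance from the head noun), where the head
    noun is assumed to be the last word of the phrase."""
    words = phrase.split()
    head = len(words) - 1
    weights = {}
    for i, w in enumerate(words):
        f = 1.0 / (1 + (head - i))
        weights[w] = max(weights.get(w, 0.0), f)  # one weight per distinct word
    return weights

def similarity(phrase_i, phrase_j):
    """Eq. 3.2: sum of positional weights over shared unigrams, normalized
    by the total weight mass of both phrases."""
    fi, fj = positional_weights(phrase_i), positional_weights(phrase_j)
    common = set(fi) & set(fj)
    overlap = sum(fi[w] + fj[w] for w in common)
    total = sum(fi.values()) + sum(fj.values())
    return overlap / total

print(similarity("rosewood fretboard", "bound rosewood fretboard"))  # ~0.9
print(similarity("rosewood fretboard", "gig bag"))                   # 0.0
```

Overlap at the head noun ("fretboard") dominates the score, so phrases that share an attribute score high even when their modifiers differ.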

We conduct experiments with all three variants of the Hierarchical Agglomerative Clustering algorithm: single linkage, average linkage and complete linkage. We use the similarity function of Equation 3.2 to measure the similarity between two noun phrases. The Hierarchical Agglomerative Clustering algorithm begins with all data points in separate clusters and then iteratively merges the most similar cluster pair in each step. Merging stops when the similarity between the clusters being merged is smaller than α times the maximal similarity between them, where α < 1.

Each output cluster thus obtained contains noun phrases describing an attribute. The size of a cluster varies with the frequency of the attribute, because a frequent attribute appears in many noun phrases and a rare attribute appears in few. The chance of finding an attribute in a cluster increases with the size of the cluster. In our experiments, we consider only clusters with a minimum size θ. We obtained the best results for a θ value of 3, because most of the noise phrases are eliminated in clusters of size 1 and 2; by considering only clusters of size at least 3, we get the attribute instances which occur multiple times in different documents. Increasing the minimum cluster size results in extracting only highly frequent attributes.
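A compact average-linkage HAC sketch of the procedure above follows. For brevity it uses unigram Jaccard overlap in place of the positional measure of Eq. 3.2, and it stops when the best available merge falls below alpha times the best similarity seen so far, which is one reading of the stopping rule; clusters smaller than theta are then discarded, as in the experiments.

```python
from itertools import combinations

def jaccard(a, b):
    """Simple stand-in phrase similarity: unigram Jaccard overlap."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def hac(phrases, alpha=0.5, theta=3):
    clusters = [[p] for p in phrases]
    best_seen = 0.0

    def link(ci, cj):  # average linkage between two clusters
        return sum(jaccard(a, b) for a in ci for b in cj) / (len(ci) * len(cj))

    while len(clusters) > 1:
        (i, j), s = max(
            (((i, j), link(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda t: t[1])
        best_seen = max(best_seen, s)
        if s == 0.0 or s < alpha * best_seen:  # stopping rule
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [c for c in clusters if len(c) >= theta]  # min-size filter

phrases = ["rosewood fretboard", "bound rosewood fretboard",
           "javanese rosewood fretboard", "gig bag", "nylon gig bag"]
print(hac(phrases, alpha=0.5, theta=3))
```

On this toy collection the fretboard phrases form the only cluster that survives the θ = 3 filter; the two "gig bag" phrases are dropped as a size-2 cluster.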


Strings              Gig Bag            Spruce Top               Fingerboard
Strings (5)          Gig Bag (13)       Solid Spruce Top (4)     Rosewood Fingerboard (2)
Steel Strings (11)   A Gig Bag (2)      Spruce Top (5)           Ebonite Fingerboard (1)
Extra Strings (2)    Nylon Gig Bag (1)  Select Spruce Top (1)    Fingerboard Rosewood (1)
Strings (1)                             Natural Spruce Top (1)

Fretboard                        Tuning Machines                       Scale Length
Rosewood Fretboard (2)           die cast tuning machines (2)          Scale Length (5)
Bound Rosewood Fretboard (2)     chrome die cast tuning machines (1)
Javanese Rosewood Fretboard (2)  chrome plated tuning machines (1)
Fretboard (1)

Table 3.1 Sample Noun Phrase Clusters for Product Class: Acoustic Guitar

After obtaining the noun phrase clusters, we remove the generic units of measure from the noun phrases. A list containing 40 units of measure (cm, kg, etc.) is prepared for this purpose and given as input to the system.

3.3.3 Attribute Identification

This section explains our algorithm for extracting an attribute from each of the clusters obtained in the previous step. By definition, each cluster represents instances of the same attribute. Now the task is to identify the n-gram with the highest possibility of being an attribute. Attribute identification involves two steps: forming a set of candidate attributes and then scoring them according to their chance of being the attribute. In our experiments, all n-grams with n < 4 are considered as candidate attributes.

In English, concepts are often expressed not in single words but in long noun compounds. This behavior is also noticeable in product descriptions. Moreover, attribute-value pairs tend to occur together in a single noun compound, with the value occurring first, followed by the attribute. Consider the phrases "CMOS sensor" and "DIGIC III image processor": the attributes "sensor" and "image processor" follow the values "CMOS" and "DIGIC III". So phrases containing attribute-value pairs tend to have the attribute at the head noun of the phrase.

Now, a scoring function AttrScore is defined for a candidate attribute a of cluster Xi using the following principles:

1. An attribute is more likely to occur in the cluster Xi and less likely to occur in other clusters.

2. The chance of finding an attribute decreases with its distance from the head noun.

Following principle 1 above, we measure a candidate attribute's belongingness to its own cluster rather than to the other clusters using the pointwise KL divergence metric, previously used in [68]. So

AttrScore(a) ∝ P(a) log (P(a) / Q(a))    (3.3)

where P and Q are the distributions of all n-grams in the current cluster and in the rest of the clusters, respectively.

Following principle 2, we define AD_h(a) as the average distance D_h of an n-gram a from the head noun over its instances. For example, in the noun phrase "CMOS sensor", the D_h of the n-grams "CMOS", "sensor" and "CMOS sensor" are 1, 0 and 0 respectively. In order to be an attribute, the average head noun distance of an n-gram should be small. So AttrScore is defined as

AttrScore(a) = P(a) log (P(a) / Q(a)) / AD_h(a)    (3.4)
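The AttrScore computation of Equations 3.3 and 3.4 can be sketched as follows. This is an illustrative implementation with two assumptions the thesis leaves open: the background distribution Q is smoothed additively, and the head-noun distance divisor is offset by 1 so that head n-grams (AD_h = 0) do not divide by zero.

```python
import math
from collections import Counter

def ngrams(words, max_n=3):
    """Yield (n-gram, head-noun distance) pairs for n < 4, where the
    distance is that of the n-gram's last word from the phrase's last word."""
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), len(words) - (i + n)

def attr_score(cluster, other_clusters, smoothing=1e-9):
    """Return the n-gram with the highest AttrScore (Eqs. 3.3-3.4)."""
    inside, dists = Counter(), {}
    for phrase in cluster:
        for g, d in ngrams(phrase.split()):
            inside[g] += 1
            dists.setdefault(g, []).append(d)
    outside = Counter(g for c in other_clusters for p in c
                      for g, _ in ngrams(p.split()))
    n_in, n_out = sum(inside.values()), max(sum(outside.values()), 1)
    best, best_score = None, float("-inf")
    for g, count in inside.items():
        p = count / n_in                          # P(a) in the cluster
        q = outside.get(g, 0) / n_out + smoothing # Q(a) elsewhere, smoothed
        ad_h = sum(dists[g]) / len(dists[g])      # average head-noun distance
        score = p * math.log(p / q) / (1 + ad_h)
        if score > best_score:
            best, best_score = g, score
    return best

cluster = ["rosewood fretboard", "bound rosewood fretboard", "fretboard"]
others = [["gig bag", "nylon gig bag"], ["steel strings", "strings"]]
print(attr_score(cluster, others))  # fretboard
```

"fretboard" wins over the value "rosewood" because it is both more frequent in the cluster and sits at the head-noun position.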

The n-gram a with the highest AttrScore is selected as the attribute represented by that cluster. Since noun phrases are stemmed before clustering, the extracted attributes contain normalized tokens. Each of these tokens is substituted with its most frequent morphological variant in the cluster. Table 3.1 shows sample noun phrase clusters for the product class Acoustic Guitars computed by our noun phrase clustering method. It shows noun phrase clusters along with the attribute (in bold) each represents. The number next to each noun phrase gives the number of times the noun phrase appeared in the description collection.

3.4 Summary

In this chapter, we have explained the challenges in the attribute extraction problem and

presented an unsupervised method to overcome the challenges. The key idea in our method

is to cluster noun phrases from all the descriptions and extract attributes from the noun

phrase clusters. We have defined a custom lexical similarity measure which uses the posi-

tional information of the individual terms in the phrases. We have used HAC algorithms to

cluster the noun phrases. A representative attribute is extracted from each cluster using its noun phrases. We have defined an AttrScore metric to evaluate each candidate n-gram as an attribute. This metric scores a candidate high if it is frequent in its cluster and infrequent in other clusters. The candidate with the highest score is selected as the attribute for that cluster. To the best of our knowledge, this is the first unsupervised method for the attribute extraction problem in the literature. The method is also effective in extracting attributes accurately from a small set of descriptions.


Chapter 4

Attribute Extraction using Word

Clusters

This chapter presents an alternative method for the attribute extraction problem. In the previous chapter, we explored noun phrase clustering in order to find the attributes. Here we cluster the words in the descriptions to find groups of related words.

Usually, language objects (words or noun phrases in our case) are represented as feature vectors in a multidimensional space. A distance metric computed on these vectors gives the similarity between the objects, and clustering algorithms use these similarity values to group the objects into clusters. We used such a representation in the method presented in the previous chapter. In this chapter, we instead use a graph representation for clustering, which does not place the objects in a vector space. In this representation, a graph is constructed from the input text where language objects are mapped to nodes connected by edges. Then graph clustering algorithms are used to find groups of similar nodes.

A co-occurrence graph is constructed from the descriptions with each distinct word

representing a node and edges representing word co-occurrences. This co-occurrence graph

exhibits the small world property. We give more details about the small world property

in Section 4.1. A graph clustering algorithm can now be used to cluster all the words

such that each resulting cluster consists of words related to an attribute. We compute the


word clusters using the Chinese Whispers algorithm, which has been used to cluster graphs exhibiting the small world property. We explain the Chinese Whispers algorithm in Section 4.3. Then we extract an attribute from each of these clusters. The graph clustering and attribute extraction steps are detailed in Sections 4.3 and 4.4.

4.1 Small World Property

A graph which is characterized by the presence of densely connected sub-graphs and where

there exists a path between most pairs of nodes is said to possess the small world property.

Most of the nodes need not be neighbors of one another, but each can be reached from every other node in a small number of hops. Densely connected nodes share a common property; when mapped to a social network, they represent the communities formed by people. In social networks, two people may not know each other directly, but it is possible that both are connected through common acquaintances [72].

Many other graphs are also found to exhibit the small world property. Examples include road maps, food chains, electric power grids, neural networks, voter networks, telephone call graphs, and social influence networks. We refer the reader to [72] for more details on the dynamics and structural properties of small world graphs. According to Ferrer and Sole [29], word co-occurrence graphs also possess the small world property. The graph built from the product descriptions is a co-occurrence graph and hence also possesses the small world property. We now describe how the text is modeled as a graph in Section 4.2.

4.2 Graph Construction

Let D be a set of descriptions describing different varieties of a product. We follow the

same preprocessing step explained in 3.3.1 for getting the noun phrases. We represent

these phrases in a weighted, undirected graph G=(V,E) where each vertex vi ∈ V represents


a distinct word in the document collection D and each edge (vi,vj ,wi,j) ∈ E represents

co-occurrences between a pair of words. Since a noun phrase typically describes a single attribute, we limit the context to the boundaries of the noun phrase. Using complete noun phrases this way helps us capture the context better than a fixed-window approach. So we say that two words co-occur if they occur within the same noun phrase. The weight

of an edge wi,j is the number of co-occurrences between the pair of words represented by

vertices vi and vj . The neighborhood N(vi) of a vertex vi is defined as the set of all nodes

vj ∈ V , connected to vi i.e. (vi,vj ,wij) or (vj ,vi,wij) ∈ E. We build an adjacency matrix

A from the graph G and identify the densely connected nodes in the graph using the Chinese Whispers algorithm.
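Under the definitions above, the graph construction might look as follows. This is an illustrative sketch: the adjacency-dict representation (in place of an explicit adjacency matrix) and the function name are choices made here, not taken from the thesis.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(noun_phrases):
    """Build the weighted, undirected co-occurrence graph G = (V, E).

    Two words co-occur iff they appear within the same noun phrase;
    edge weights count these co-occurrences. Returned as an adjacency
    dict {word: {neighbor: weight}}.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for phrase in noun_phrases:
        words = sorted(set(phrase.lower().split()))
        for wi, wj in combinations(words, 2):
            # Undirected: store the weight in both directions.
            graph[wi][wj] += 1
            graph[wj][wi] += 1
    return graph
```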

4.3 Chinese Whispers Algorithm

Chinese Whispers (CW) [4] is an algorithm for partitioning the nodes of a weighted, undirected graph. The algorithm is motivated by a children's game in which children whisper words to each other. Though the goal of the game is to arrive at a funny distortion of the original message, CW uses the analogy to find groups of nodes that share a common property: all the nodes that end up broadcasting the same message fall into a single cluster.

Chinese Whispers is an iterative algorithm which works in a bottom-up fashion. It starts

by assigning a separate class to each node. In each iteration, every node is assigned the strongest class in its neighborhood, i.e. the class whose members have the highest sum of edge weights to the current node. This process continues until no further reassignments are possible for any node in the graph.
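The iteration described above can be sketched as follows. This is a minimal reading of CW, assuming ties are broken by insertion order; the fixed random seed is added here for reproducibility and is not part of the original algorithm.

```python
import random
from collections import defaultdict

def chinese_whispers(graph, iterations=80, seed=0):
    """Hard-partition a weighted, undirected graph (a sketch of CW).

    graph: {node: {neighbor: weight}}. Every node starts in its own
    class; each pass visits nodes in random order and moves each node
    to the class with the highest total edge weight among neighbors.
    """
    rng = random.Random(seed)
    labels = {node: node for node in graph}
    nodes = list(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)
        changed = False
        for node in nodes:
            weights = defaultdict(float)
            for neighbor, w in graph[node].items():
                weights[labels[neighbor]] += w
            if weights:
                best = max(weights, key=weights.get)
                if best != labels[node]:
                    labels[node] = best
                    changed = True
        if not changed:  # converged: no reassignment possible
            break
    return labels
```

Running it on two disconnected dense subgraphs collapses each into a single class, which is the behavior the word clustering step relies on.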

Generally, the CW algorithm can result in either a soft or a hard partition. We set the parameters of CW so that it always results in a hard partitioning of the graph, i.e. each node is assigned exactly one class. After obtaining the clusters, we proceed to the next step, where we extract the attributes represented by the clusters.

We conducted an experiment to see if the number of iterations affects the clusters formed by the CW algorithm. We chose four product classes, namely iPods, Violins, Dome Cameras and Digital SLRs, with 50 descriptions each. We ran the CW algorithm for iteration counts varying from 1 to 100 and noted the number of clusters formed. Fig. 4.1 plots the number of clusters against the number of iterations. From the graph we observe that after the first iteration the number of clusters is very high, equal to the number of unique tokens in the product descriptions (as the CW algorithm starts by assigning a different class to each token). Then, as the number of iterations increases (up to 5), we see an exponential decrease in the number of clusters formed. For higher numbers of iterations, the clusters formed were stable. For all subsequent experiments, we fixed the number of iterations at 80.

Figure 4.1 Number of Clusters with Varying Iterations


Neck         Finish       Warranty   Bag        Strap Button   Frets
Reinforced   Satin        Limited    Black      Strap          Frets
Hardwood     Carry        Pocket     Nylon      Handle         15
Nato         Gloss        One        Gig        Button         18
Neck         High         Year       Carrying   Shoulder
Rosette      Protective   Warranty   Bag
             Finish

Table 4.1 Sample Word Clusters for Product Class: Acoustic Guitar

4.4 Attribute Extraction

An attribute can be a single-word attribute (monitor, zoom) or a multi-word attribute (water resistant, shutter speed). A preliminary observation of the descriptions showed that attributes are usually composed of at most three words. So, we consider only n-grams up to length 3 as candidate attributes. In the English language, concepts are often expressed not in single words but in longer noun compounds. This behavior is also noticeable in product descriptions. Moreover, attribute-value pairs tend to occur together in a single noun compound, with the value occurring first, followed by the attribute at the head noun. For example, in the phrases "LCD display" and "CMOS sensor", the attributes occur at the head noun (display, sensor) and are immediately preceded by the values (LCD, CMOS). So the chance of finding an attribute decreases with its distance from the head noun.

In order to capture these patterns, we construct a directed graph Gd : (Vd, Ed) from all

the noun phrases found in the descriptions. Each distinct token ti found in these phrases

constitutes a node i ∈ Vd in the graph. For each token ti preceding tj in a noun phrase, we draw an edge (i, j) ∈ Ed from i to j, i.e. an outlink from i and an inlink to j. Since a head noun is not followed by any other token, as shown in Fig. 4.2, an attribute node should have many inlinks and few outlinks. From each word cluster C, we pick the node a with the maximum difference between inlinks and outlinks (Equation 4.1). The token ta represented by this node a is selected as the attribute if it has a minimum support Sa of 0.5, where support is defined in Equation 4.2. We do not pick any attribute from cluster C if Sa < 0.5.

a = argmax_{i ∈ C} ( inlinks(i) − outlinks(i) )    (4.1)

Sa = ( inlinks(a) − outlinks(a) ) / inlinks(a)    (4.2)

Figure 4.2 Sample Sub-graphs

If all the inlinks to the node a are from a single node b, then we take the bigram tbta as the attribute instead of ta, and similarly we take a trigram as the attribute if tbta receives all its inlinks from a single node. This helps us extract multi-word attributes like wood construction, pitch pipe etc. Table 4.1 shows sample word clusters for the product class Acoustic Guitars computed using the word clustering method. It shows the word clusters along with the attribute (in bold) each represents.
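The link-based selection of Equations 4.1 and 4.2 can be sketched as follows. This covers the single-token case only; the bigram/trigram extension is omitted, and the function and argument names are illustrative, not from the thesis.

```python
from collections import defaultdict

def extract_attribute(cluster, noun_phrases, min_support=0.5):
    """Pick a cluster's attribute by link counts (Eqs. 4.1 and 4.2).

    A directed edge runs from each token to the token following it
    inside a noun phrase, so head nouns accumulate inlinks. The token
    maximizing inlinks - outlinks is kept iff its support >= 0.5.
    """
    inlinks = defaultdict(int)
    outlinks = defaultdict(int)
    for phrase in noun_phrases:
        tokens = phrase.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            outlinks[prev] += 1
            inlinks[nxt] += 1
    cluster = [w.lower() for w in cluster]
    best = max(cluster, key=lambda w: inlinks[w] - outlinks[w])
    if inlinks[best] == 0:
        return None  # guard against dividing by zero in Eq. 4.2
    support = (inlinks[best] - outlinks[best]) / inlinks[best]
    return best if support >= min_support else None
```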

4.5 Summary

This chapter presented a new method for the attribute extraction problem. In this method, we cluster the words appearing in the descriptions instead of the noun phrases. We follow a graph clustering approach to identify the attributes and construct a word co-occurrence graph from the descriptions. The idea is that words related to an attribute co-occur more often than words of different attributes, and so form densely connected clusters in the co-occurrence graph. We therefore use a two-stage method to identify the attributes. In the first stage, we partition the graph using the Chinese Whispers clustering algorithm, which results in word clusters each containing words related to a particular attribute. In the second stage, we extract a representative attribute from each cluster. As we explain in Chapter 6, this overcomes the problems associated with the noun phrase clustering method presented earlier and performs significantly better than the NP clustering method in forming the clusters, improving the quality of the extracted attributes.


Chapter 5

Comparative Summary Generation

In the last two chapters, we have presented algorithms to automatically extract attributes of

a product class. In this chapter, we explain how these attributes can be effectively used to

compare multiple products of a class simultaneously. We have come up with a new design for this purpose, which we refer to as "Comparative Summaries". We also describe a method

to automatically rank attributes and generate comparative summaries for a product class.

5.1 Introduction

A summary is a condensed form of text serving a purpose. In a multi-document summarization task, a summary aims to deliver the majority of the content from a set of documents that share a common topic. In this work, we define a novel form of summaries, referred to as comparative summaries, which provide a comparative study of multiple items belonging to a category. For example, a category of "Tennis Players" will have the players "Roger Federer", "Rafael Nadal" etc. as its items. Here, the purpose of a summary is to provide a quick, concise comparison among the items of the category. An assumption in such a task is that the documents in the collection describe comparable items, which are primarily entities, concepts or events.

A comparative summary provides the properties or facts common to these items and


Table 5.1 Comparative Summary of iPods: an Attributes vs Products matrix over 15 iPod descriptions (columns numbered 1-15), with the top-ranked attributes LCD, Songs, Playback, SP, Video, Formats, Photos, WAV, AAC and Led Backlight as rows. A yes entry marks the presence of an attribute in the corresponding description; absent attributes are left blank.

their corresponding values with respect to each item. It presents these properties ranked according to their usefulness in comparing the items. For the "Tennis Players" category, common properties would be "Country Represented", "Grand Slam Titles Won", "ATP Ranking", etc. Usually, a summary is a free flow of natural language text containing a set of sentences. Here, we design our summary based on the properties and values of the items, which are single words or phrases. We present a comparative summary as a Property vs Item matrix, with the elements of the matrix representing the values of the corresponding property for each item. This paradigm of word- or phrase-based summaries has been previously studied for multi-document summarization by [66, 54, 58, 35, 24]. We believe that finding these terms (properties and their values), ranking them according to importance and relating them to each document is a concise way of presenting the comparable content in the documents.

We define comparative summary generation as the task of automatically extracting

comparative summary from a collection of documents where each document describes a

particular item in a category. In this thesis, we focus on comparative summary genera-

tion for product domain. Here, a product class is assumed as a category and its products

are treated as the category’s items. The attributes of the product class are the common


properties that should be presented in the comparative summary.

One of the main tasks users perform before buying products is to choose one from the many available varieties. For this purpose, users compare products. Comparisons can be either subjective or objective. For example, "The sound quality in MP3 player A is better than the sound quality in MP3 player B" is a subjective comparison. An objective comparison is "Camera X has double the resolution of Camera Y". A comparative summary aids in making objective comparisons like feature comparison or attribute/value comparison among the products. In this thesis, we restrict ourselves to the occurrence or non-occurrence of an attribute (binary values), since this is a good simplification of the problem to begin with.

We present a simple form of comparative summary for binary-valued attributes; a more complex form would have the attributes taking discrete values. Table 5.1 shows a comparative summary for iPod descriptions. In simple terms, a comparative summary is an Attributes vs Products table with the attributes ranked according to importance in the rows. The rightmost column of the table contains the attributes (LCD, Songs, Playback, etc.); the top row contains the document ids of the descriptions. The values in the table are binary: a yes entry implies that the attribute in that row is present in the description of the corresponding column, and otherwise the cell is left blank, meaning the description does not mention the attribute. Table 5.1 corresponds to the initial screen of the summarizer, which displays the top 10 attributes of a product class. A user can access more attributes of the iPod by clicking the More Results link available at the bottom of the page.

5.2 Comparative Summary Generation Framework

The goal of the comparative summary generation task is to generate a summary for a set of products belonging to the same class P. The input to the system is a collection of product descriptions D = {d1, d2, d3, ..., dn}, with each description di describing a product Pi. The product class P has a set of attributes A = {a1, a2, ..., am}, but a product may not contain all the attributes in A. Let Ai ⊂ A be the set of attributes found in product Pi.


The output of the system is a comparative summary for P which gives the occurrence information of the attributes in each of the products belonging to P. These attributes are ranked in the summary according to their importance in comparing products.

We define a comparative summary S for the product class P as an m x n matrix, with each column corresponding to a product and each row corresponding to an attribute:

Si,j = 1 if ai ∈ Aj, and Si,j = 0 otherwise    (5.1)
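Equation 5.1 amounts to a simple membership test per attribute and description. In this sketch, plain substring matching stands in for however attribute occurrence is actually detected; the function name is illustrative.

```python
def comparative_summary(ranked_attributes, descriptions):
    """Build the m x n binary Attributes-vs-Products matrix S (Eq. 5.1).

    S[i][j] = 1 iff ranked attribute i occurs in description j; in the
    rendered table a 1 shows as "yes" and a 0 as a blank cell.
    """
    lowered = [d.lower() for d in descriptions]
    return [[1 if attr.lower() in d else 0 for d in lowered]
            for attr in ranked_attributes]
```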

A comparative summary generation system for products typically has three stages.

1. Attribute Extraction: The first stage deals with the attribute extraction task. Given a set of product descriptions of a class, the attributes of the product class are extracted from the input product descriptions.

2. Attribute Ranking: A summary presents attributes of various products of a class.

A user who is willing to choose a product would look for the attributes present in

the summary. However, all the attributes need not be equally important. So the

comparative summary lists the attributes in the order of importance. Thus a ranking

algorithm is used to find the importance of attributes.

3. Summary Generation: In the final stage, the summary is constructed from the ranked attributes. The occurrence information of the attributes in each product is determined and presented in the summary.

We presented two unsupervised algorithms for the attribute extraction task in chapters

3 and 4. Section 5.3 presents an algorithm for ranking the attributes. Summary generation

is explained in Section 5.4.


5.3 Attribute Ranking

To facilitate a user in comparing different products using the summary, we present the

attributes along with their corresponding values (occurrence or non-occurrence) for all the

products. However, all the attributes need not be equally important. A user would be more

interested in learning about attributes that assist him/her in comparing products, and such attributes should be displayed at the top of the summary. So we propose the notion of

Ranking For Comparison which states that the attributes should be ranked based on their

usefulness in comparing products.

The most intuitive way of ranking the attributes is to sort them by their frequency of occurrence in the various products. However, this method has drawbacks. A frequent attribute that has the same value in all the products can be less informative, despite being a vital feature of that product class, because it cannot help in making comparisons. For example, as shown in the iPod description (Fig. 3.1), many iPod descriptions mention that the product "Comes with earbud headphones and USB cable". Though Earbud Headphones is a very frequent attribute of an iPod, it is a trivial fact that every iPod comes with earbud headphones, and it does not contribute much to discriminating between different models. So frequency alone is not sufficient to determine the usefulness of attributes. We address this problem by identifying additional features, described below.

5.3.1 Features

We expect the following characteristics in a useful attribute:

1. The attribute should be frequent and occur in many products. So we take the product frequency of an attribute as a feature.

2. However, attributes that have the same value for many products are inefficient. Attributes having more variety in their values draw more comparisons and thus help in selecting a particular product. Attributes are likely to occur along with their values in noun phrases, so attributes having informative noun phrases are more useful for comparison. Since entropy is a measure of information content, we take the entropy of the attribute's context as a feature.

3. Domain specific attributes are of more interest than generic attributes. For example, a user willing to buy an iPod would be more interested in its memory than in its weight. We compute the domain specificity of an attribute by calculating the KL divergence of the attribute's language model with respect to a background corpus.

We define our feature functions using the above cues.

Product Frequency: We define the Product Frequency (pf) of an attribute as the fraction of products in which it appears. Since each document d in the description collection D represents a single product, product frequency is simply the document frequency of the attribute ai:

pf(ai) = |{d : ai ∈ d}| / |D|    (5.2)

Context Entropy: The Context Entropy (ce) of an attribute is the unigram entropy of the text surrounding the attribute in its instances. Since a noun phrase typically describes a single attribute, we limit the context to the boundaries of the noun phrase. For an attribute ai, all the noun phrases in which it appears are considered as its context. We construct a unigram language model Mi from this context. The context entropy is then

ce(ai) = − Σ_w p(w|Mi) log p(w|Mi)    (5.3)

Specialty: Specialty (sp) is computed as the KL divergence [15] of an attribute a with respect to a generic background corpus. The assumption here is that common attributes such as "length", "weight" etc. are more likely to appear in a generic corpus than domain specific attributes such as the "pitch pipe" of an acoustic guitar; the specialty value should therefore be higher for "pitch pipe" than for "length". We use a random sample from the TREC collection [1] for this purpose. The Text Retrieval Conference (TREC) is an annual competition focused on different information retrieval research problems, for which large document collections are created to evaluate the participants. We have used the document collection of the ad-hoc retrieval task as our background corpus, since the collection is open domain and thus serves as a generic English corpus. We construct unigram language models Ma for the attribute a and Mg for the background corpus:

sp(a) = Σ_w p(w|Ma) log ( p(w|Ma) / p(w|Mg) )    (5.4)
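The three feature functions might be sketched as follows. The add-one smoothing of the background model is an assumption made here to keep the KL divergence finite; the thesis does not specify its smoothing scheme.

```python
import math
from collections import Counter

def product_frequency(attr, descriptions):
    """pf: fraction of descriptions (products) containing the attribute (Eq. 5.2)."""
    return sum(attr in d for d in descriptions) / len(descriptions)

def context_entropy(context_phrases):
    """ce: unigram entropy of the attribute's noun-phrase contexts (Eq. 5.3)."""
    counts = Counter(w for p in context_phrases for w in p.split())
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def specialty(attr_counts, background_counts):
    """sp: KL divergence of the attribute's model from a background model (Eq. 5.4)."""
    ta = sum(attr_counts.values())
    tb = sum(background_counts.values())
    score = 0.0
    for w, c in attr_counts.items():
        pa = c / ta
        # Add-one smoothing so words absent from the background stay finite.
        pb = (background_counts[w] + 1) / (tb + len(attr_counts))
        score += pa * math.log(pa / pb)
    return score
```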

5.3.2 Ranking Function

Once we have computed the features, we could use a simple formula to combine them into a salience score for each attribute. But this would be heuristic and may not be the best way to combine them. Instead, we learn a regression model from training data. Regression is a classic statistical problem which tries to determine the relationship between random variables X = (x1, x2, ..., xp) and Y. In our case, the independent variable X is the vector of the three features described above, X = (pf, ce, sp), and the dependent variable Y is the ranking function R(a), which takes any real-valued score. We use R(a) to sort the attributes in descending order:

R(a) = w1 · pf(a) + w2 · ce(a) + w3 · sp(a) + w0    (5.5)

where w1, w2 and w3 are the weights of the features pf, ce and sp respectively, and w0 is the bias term.

A linear regression model is used to learn the weights. We train a binary regression model to learn our ranking function, using the manually created attribute lists (explained in Chapter 6) as positive samples and randomly picked non-attribute words from the descriptions as negative samples. The optimized weights are used in our experiments, which are explained in detail in Section 6.4.
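Applying Equation 5.5 then reduces to a weighted sum and a sort. In this sketch the weights and bias are placeholders, whereas in the thesis they are learned by regression.

```python
def rank_attributes(features, weights=(1.0, 0.5, 0.8), bias=0.0):
    """Rank attributes by the linear model R(a) of Eq. 5.5.

    features: {attr: (pf, ce, sp)}. The weights (w1, w2, w3) and bias
    w0 used here are illustrative defaults, not the learned values.
    """
    w1, w2, w3 = weights

    def score(item):
        pf, ce, sp = item[1]
        return w1 * pf + w2 * ce + w3 * sp + bias

    # Highest R(a) first, as in the comparative summary.
    return [a for a, _ in sorted(features.items(), key=score, reverse=True)]
```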

5.4 Creating the Summary

As we have described in Section 5.1, a comparative summary is an Attributes vs Products matrix. The elements of this matrix are filled with the values of the attributes in the corresponding products. Once the ranked attributes are obtained, the next step is to create the summary of the products, which involves two sub-tasks: first, presenting the attributes in ranked order in the summary; second, filling in the corresponding value of each attribute in each product's column. Since we are concerned only with the occurrence or non-occurrence of attributes in this work, an attribute takes a value of either yes or blank, as shown in Table 5.2. For each attribute, we look for its occurrence in all the descriptions. If it is present in a description, we fill in the value yes in the corresponding product's column. Similarly, if the attribute is not present in the description, we leave the cell in the corresponding product column blank. Thus a comparative summary is created from the ranked list of attributes.

Table 5.2 Comparative Summary of Acoustic Guitars: an Attributes vs Products matrix over 15 guitar descriptions (columns numbered 1-15), with the attributes Tuning, Chord Chart, Deluxe Semirigid, Rosewood Fretboard, Pitch Pipe, Finish, Back, Strings, Wood Construction and Bag as rows. A yes entry marks the presence of an attribute in the corresponding description; absent attributes are left blank.

5.5 Summary

In this chapter, we have presented a new form of summaries, referred to as comparative summaries, which present the comparable content of multiple items belonging to the same category. A comparative summary gives a quick, concise comparison of the multiple items in the category. In this work, we presented techniques to generate comparative summaries for


products. We designed the comparative summary based on the attributes of the products. A comparative summary is an Attributes vs Products matrix, with each element corresponding to an attribute of a product and the attributes ranked according to their importance in product comparison. We have come up with features for identifying attributes useful in comparison and have presented a method for ranking the attributes using these features. We have evaluated the ranking method by conducting a user study, which is explained in Chapter 6.

The study has shown that our ranking algorithm is effective in identifying attributes useful

in product comparison.


Chapter 6

Experiments and Results

This chapter discusses the empirical evaluation of the various methods proposed in this thesis. We start by describing the datasets used in our experiments: the product description dataset and the gold standard attributes. We then explain the evaluation measures used in our experiments in Section 6.2, where precision and recall are defined in the context of attributes. We have conducted several experiments to evaluate the effectiveness of our algorithms and to find the optimal parameters. These experiments evaluate the attribute extraction and ranking algorithms presented in Chapters 3, 4 & 5. Sections 6.3 and 6.4 describe these experiments in detail, along with an analysis of the results obtained.

6.1 Data

To carry out our experiments, we primarily require two resources, product descriptions for

different product classes and a corresponding gold standard attributes list for each of these

product classes.

• Product Descriptions: We crawled product descriptions from the online shopping portal Amazon (www.amazon.com). Amazon organizes products in a category tree. Each product belongs to at least one bottom category (leaf node) in the tree. The top categories in


the tree are broad and consist of diverse products, whereas bottom categories contain a narrow set of products. Some of the top categories include "Electronics", "Home Appliances" and "Musical Instruments". They stand on top of more specific categories at the bottom of the tree, like "Digital Cameras", "Microwave Ovens" and "Electric Guitars". We selected 40 categories from the bottom and penultimate levels for evaluation. These categories are selected from different top categories to maintain diversity. Each selected category represents a product class. A set of 50 descriptions was downloaded for each product class. As explained earlier, each description corresponds to one product and contains incomplete sentences describing the product.

• Gold Standard attributes: In order to evaluate the performance of our system, a

list of reference attributes is prepared for each of the 40 product classes. These are

manually created by two annotators for each product class in the dataset after reading

the descriptions of that class. Any attribute which appears at least in one description

of that product class is added to the reference list. We call these reference attributes

as Gold Standard Attributes. In our experiments, we compute the accuracy of our

methods by comparing the attributes extracted by our methods with the gold standard

attributes.

6.2 Evaluation Measures

This section explains the metrics we have used for evaluating the correctness of the attributes extracted by our methods. Traditional IE systems like named entity recognizers, relation extractors etc. are evaluated using precision and recall measures. We use a variant of precision and recall previously used by [20] for evaluating attribute-value extraction.

Precision and recall values of our methods are computed by comparing the extracted attributes with the gold standard attributes. However, this is not a straightforward job. Consider the phrase: 3x optical zoom. Here, both zoom and optical zoom could be considered as attributes, and people often do not agree on which is the correct one. Also, in cases where attribute-value pairs occur together in a phrase, a part or the whole of the value often adheres to the attribute, and the combination is legitimately used as an attribute. For example, both monitor and LCD monitor could be called attributes. In an extreme scenario, every possible attribute-value pair can be treated as a binary attribute, with its value being the occurrence or non-occurrence of the pair in the product. [20] presented the paradigm of full match and partial match for an extracted attribute, depending on whether it contains a part of a value or not. We use the same paradigm to compute the precision and recall values in our experiments. Full match and partial match of an attribute are defined as follows.

• Full Match: A full match occurs if the attribute completely overlaps with any of the

gold standard attributes. For example, if the system extracts “monitor” and the gold

standard attributes list contains an entry “monitor”, then it is a full match.

• Partial Match: A partial match occurs if the extracted attribute completely contains

any of the gold standard attributes. Thus, full match is a special case of partial match.

For example, if the gold standard list contains “monitor” and the system extracts

either “monitor” or “LCD monitor”, then it is treated as a partial match.

6.2.1 Precision

Precision is the fraction of the extracted attributes that have a match in the gold standard list. We define two versions of precision to handle full matches and partial matches. If we count only full matches, we refer to the precision obtained as full precision; if we count partial matches, it is referred to as partial precision. In our results, we present both full precision and partial precision.

Full Precision = |{Full Matches}| / |{Extracted Attributes}|        (6.1)

Partial Precision = |{Partial Matches}| / |{Extracted Attributes}|  (6.2)

6.2.2 Recall

Recall is the fraction of the gold standard attributes that have a match with the attributes extracted by the system. While computing recall, we consider both full matches and partial matches.

Recall = |{Partial Matches}| / |{Gold Standard Attributes}|         (6.3)

6.3 Extraction Experiments

This section explains the experiments we conducted to evaluate the performance of our clustering algorithms. We also explain the experiments carried out to study the effect of the dataset size on extraction accuracy, and we compare the performance of the NP clustering and Word clustering methods. We then provide a detailed analysis and present the trade-offs of the two approaches.

6.3.1 Clustering Algorithms

We have presented two solutions to the attribute extraction problem, explained in Chapters 3 and 4. Both approaches are based on clustering: one on Noun Phrase clustering and the other on Word clustering. The Noun Phrase clustering method uses Hierarchical Agglomerative Clustering (HAC) to find the noun phrase groups; we have experimented with all three variants of HAC, namely single linkage, complete linkage and average linkage. The Word clustering method uses the Chinese Whispers algorithm.
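The core of the Chinese Whispers idea can be sketched as follows. The toy word graph and edge weights below are hypothetical; the actual method of Chapter 4 operates on the full word association graph:

```python
import random
from collections import defaultdict

def chinese_whispers(edges, iterations=20, seed=0):
    """Label propagation over a weighted word graph: each node repeatedly
    adopts the label carrying the highest total edge weight among its
    neighbours, so densely connected words converge to one cluster."""
    rng = random.Random(seed)
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    labels = {node: node for node in graph}   # every node starts in its own cluster
    nodes = sorted(graph)
    for _ in range(iterations):
        rng.shuffle(nodes)                    # randomized update order
        for node in nodes:
            weight = defaultdict(float)
            for neighbour, w in graph[node].items():
                weight[labels[neighbour]] += w
            # ties broken deterministically by label name
            labels[node] = max(weight, key=lambda l: (weight[l], l))
    return labels

# Two tightly connected word groups joined by a single weak edge.
edges = [("zoom", "optical", 3), ("optical", "lens", 3), ("zoom", "lens", 3),
         ("neck", "maple", 3), ("maple", "fretboard", 3), ("neck", "fretboard", 3),
         ("lens", "neck", 0.1)]
labels = chinese_whispers(edges)
```

After a few iterations the two dense groups settle on distinct labels, since the weak cross edge can never outweigh the within-group edges.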

We have conducted four experiments to compare the performance of these clustering algorithms. In the first three experiments, we used the three variants of the HAC algorithm and extracted the attributes as described in Chapter 3. In the fourth experiment, we used the

Chinese Whispers algorithm as described in Chapter 4. All the experiments are run on the 40 product classes in the dataset, and all 50 descriptions available for each product class are utilized for extraction. After extracting the attributes, precision (both full and partial) and recall values are computed using the gold standard attributes.

                    Single Linkage   Complete Linkage   Avg. Linkage   Chinese Whispers
Full Precision      35.81            37.36              36.87          48.62
Partial Precision   79.56            83.09              85.17          80.73
Recall              50.76            46.68              47.53          45.0

Table 6.1 Precision and Recall for Hierarchical Clustering Algorithms

Results

Table 6.1 shows the precision and recall values for the different clustering algorithms. All three variants of the HAC algorithm produced similar results. Among them, average linkage performed best in partial precision with 85.17%. It also achieves a full precision close to that of complete linkage, the highest among the three. Though single linkage gets the highest recall among the three, it does not perform as well as the other two in partial precision and full precision. The word clustering approach using the Chinese Whispers algorithm gave significantly higher full precision than the HAC algorithms: it achieves a full precision of 48.62%, whereas the HAC algorithms managed at best 37.36%. It also gives a recall comparable to that of the HAC algorithms.

All four algorithms achieved recall values between 45% and 50%, and none of them were able to identify more than 50% of the gold standard attributes. This is because most of the attributes that the system could not extract appeared only a few times in the dataset. Since we prune all clusters with size less than θ in NP clustering, many of these rare attributes are not extracted. Similarly, in the Word clustering method, the cutoff on the support of a candidate attribute, set in Equation 4.2, avoids picking infrequent attributes.

We discuss the precision values in more detail in the subsections below.

                    10      20      30      40      50
Full Precision      43.83   47.58   44.85   47.73   48.62
Partial Precision   69.70   71.35   77.04   75.60   80.73
Recall              9.68    20.38   30.15   35.22   45.0

Table 6.2 Precision and Recall for Chinese Whispers Algorithm with Varying Dataset Size

6.3.2 Dataset Size

We have conducted experiments to analyse the effect of dataset size on the performance of the extraction methods. The size of the input description set is increased from 10 to 50 in steps of 10. Attributes are extracted using the average linkage HAC algorithm and the Chinese Whispers word clustering algorithm, and precision and recall values are computed. Gold standard attribute lists are separately prepared for each dataset size from 10 to 50: while preparing the gold standard attribute list for size 10, only attributes that appear in the first 10 descriptions are used as reference attributes for evaluation; similarly, attributes that appear in the first 20 descriptions are used for dataset size 20, and so on.

Tables 6.2 and 6.3 give the precision and recall values for different dataset sizes. Both algorithms show steady improvement in precision and recall as the size is increased. Partial precision gradually increased from 69.70% to 80.73% for the CW algorithm and from 73.64% to 85.17% for NP clustering, suggesting that performance increases with input document set size. The same trend can be observed for full precision and recall for both algorithms. This clearly indicates that the extraction algorithms are able to learn the characteristics of the product class better as they get more evidence from an increasing number of documents.

Both the precision and recall columns indicate consistent improvement in performance with increase in dataset size. Though these values do not converge within these tables, they suggest a possible convergence point at a size well beyond 50. We did not perform experiments to find this convergence point, since it is expensive to obtain gold standard attribute lists for large description collections. However, this experiment shows that the performance improves as the dataset size increases.

                    10      20      30      40      50
Full Precision      29.20   31.56   34.53   34.12   36.87
Partial Precision   73.64   75.35   81.60   82.38   85.17
Recall              10.76   23.37   33.45   38.65   47.53

Table 6.3 Precision and Recall for HAC Algorithm with Varying Dataset Size

6.3.3 Baseline

We have created a baseline to compare against our extraction algorithms. The baseline is a frequent noun phrase selection algorithm: it picks the most frequent noun phrases in the descriptions as the attributes. The baseline resulted in low recall and low full precision for the following reasons:

1. Noun phrases contain attribute-value pairs, which results in few full matches.

2. Recall is low because the frequency of an attribute is distributed among the different phrases containing the attribute.
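The baseline can be sketched as a simple frequency count over noun phrases. The phrases below are illustrative, not from the dataset:

```python
from collections import Counter

def frequent_np_baseline(descriptions_nps, top_k):
    """Rank noun phrases by raw frequency across descriptions and return
    the top_k as attributes. Because values stay attached to attributes
    ("maple neck" vs "nato neck"), the counts for a single underlying
    attribute are split across distinct phrases, which is why this
    baseline yields low recall and few full matches."""
    counts = Counter(np for nps in descriptions_nps for np in nps)
    return [np for np, _ in counts.most_common(top_k)]

# Hypothetical noun phrases from three guitar descriptions.
descs = [["maple neck", "spruce top"],
         ["nato neck", "spruce top"],
         ["spruce top", "gig bag"]]
top = frequent_np_baseline(descs, top_k=2)
```

Here the attribute Neck loses to "spruce top" because its frequency is split between "maple neck" and "nato neck", illustrating reason 2 above.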

                    NP Clustering   Word Clustering   FrequentNP
Partial Precision   85.17           80.73             67.3
Full Precision      36.87           48.62             15.6
Recall              47.53           45.0              26.2

Table 6.4 Comparison With Baseline

6.3.4 Noun Phrase Clustering vs Word Clustering

Though both the approaches are based on clustering, they are different in many respects.

• One approach clusters noun phrases, while the other aims at finding word clusters.

• NP clustering groups occurrences of attributes whereas Word clustering groups words,

not their occurrences.

• NP clustering computes similarity among the samples using distance measures in Euclidean space, whereas Word clustering uses proximity between the words in the text to group them.

• NP clustering uses traditional hierarchical clustering, whereas word clusters are obtained by partitioning the word association graph.

Precision and Recall

Table 6.1 shows the precision and recall values for the different clustering algorithms. Recall does not vary significantly among the algorithms. The Word clustering method results in slightly lower recall compared to the HAC algorithms; only single linkage HAC gets about 5% higher recall than the WC method, but it does not give good precision values.

It is evident from the results presented in Table 6.1 that NP clustering gets better partial precision but produces significantly fewer full matches compared to Word clustering. Table 6.1 shows that NP clustering produces around 37% full matches, which means that only about 43% of its partial matches are full matches. On the other hand, Word clustering produces around 49% full matches, which means that about 62% of its partial matches are actually full matches. Thus Word clustering outperforms NP clustering in finding full matches by a margin of 19%. A more detailed analysis of the attributes extracted by the NP clustering method gave more insight into the reasons for this.

• One of the reasons for the reduction in full matches with NP clustering is the repetition of phrases across descriptions. For some attributes, the value remains the same for most of the products in the class, which results in the same phrase occurring in multiple descriptions. For example, most of the “Electric Guitar” varieties in the dataset have a Maple Neck. Here Neck is an attribute of Electric Guitar and its value is Maple. This results in our extraction algorithm extracting Maple Neck instead of Neck.

• Ideally, NP clustering should group all the occurrences of an attribute in a single cluster. But in some cases, it produces multiple small clusters instead of one big cluster. Each such small cluster contains occurrences of the attribute for a subset of values, and the extractor finds an output attribute from its phrases. This might result in part of the value being appended to the attribute. For example, Table 6.5 shows the attributes extracted by the NP clustering and Word clustering methods for Acoustic Guitars. NP clustering has produced two different clusters for Top, resulting in two partial matches: Spruce Top and Betula Top. Similarly, it has generated two clusters for Neck and extracted Mahogany Neck and Nato Neck. Multiple partial attributes for a single gold standard attribute of this kind not only decrease the full precision, but also degrade the quality of the attributes.

• NP clustering does not produce good clusters if an attribute appears in long phrases and the attribute itself is small (a unigram or bigram). This is a special case of the multiple-clusters problem mentioned above, which leads to partial matches instead of full matches.

Excessive partial matches also degrade the quality of the output attributes. All the above reasons result in few full matches with the NP clustering method. However, the Word clustering method does not suffer from these problems and extracts more fully matching attributes. This can be easily understood from the way it works: unlike the NP clustering method, it groups words instead of their occurrences. So, a word can appear in only one cluster and hence in only one attribute. Thus, a gold standard attribute can appear in only one output attribute, either as a full match or a partial match, but it cannot appear in multiple attributes as partial matches. Table 6.5 shows the attributes produced by the NP clustering and Word clustering methods for the product class Acoustic Guitar.

Synonyms

It is common for attributes to have synonyms, and thus different words could be used in different product descriptions to refer to the same attribute. For example, “screen” and “display” are both used while describing Digital Cameras. The methods proposed in this work focus only on the extraction of attributes and are not adapted to handle synonymous attributes. However, since synonyms form an interesting subset of attributes, we have performed a subjective analysis to understand how our extraction methods handle them.

We have analysed the attributes generated by both extraction methods separately. The NP clustering approach produced separate clusters for synonyms in most cases and thus treated them as independent attributes. The word clustering approach, in contrast, identified only one of the synonyms. This can be easily understood from the way the method works. Since the method groups the words in the graph, all the synonymous words of an attribute are strongly connected and fall into the same group. But the attribute identification method explained in Section 4.4 extracts only the top scoring n-gram from a word cluster, so only one variant among the synonyms of the attribute is extracted from the cluster and the other synonyms in the group are ignored.


NP Clustering: Betula Top*, Bridge, Deluxe Semi Rigid, Fingerboard, Fretboard, Frets, Geared Tuning*, Gig Bag, Linden Binding, Mahogany Neck*, Nato Neck, Nut Width, Package, Perfect Toy, Picks, Pitch Pipe, Scale Length, Sides, Size Chord Chart*, Spruce Top*, Steel Strings*, Strap Button, Strings, Tuning Machines*, Zipper Closure

Word Clustering: Back, Bag, Binding, Body, Chord Chart, Deluxe Semirigid, Design, DVD, Fingerboard, Finish, Frets, Guitar, Includes, Length, Model, Neck, Nut Width, Package, Pitch Pipe, Rosewood Fretboard*, Strap Button, Strings, Style, Tone, Top, Tuning, Wood Construction*, Zipper Closure

* Partial Matches

Table 6.5 Attributes extracted by NP Clustering and Word Clustering Methods for Acoustic Guitars

                      Extracted Attributes   Useful Attributes
iPods                 32                     18
Camcorders            32                     19.2
Digital SLR Camera    31                     18.4
Avg.                  31.7                   18.5

Table 6.6 No. of Attributes picked in Experiment


6.4 Ranking Experiments

In this section, we explain the experiments conducted to evaluate our ranking algorithm. We conducted a user study to collect user preferences on the attributes important for product comparison, and used these preferences to evaluate our ranking algorithm. We also conducted experiments to compare the performance of the individual ranking features.

6.4.1 User Study

The rationale behind the design of the comparative summary is that it should assist consumers in selecting a product while purchasing. A consumer looking for a product can use the comparative summary to make quick comparisons between the different models available and choose the one that best fits his requirements. Thus, the goal of our attribute ranking algorithm is to find the attributes of interest to a user while comparing products. To verify whether our algorithm actually finds attributes of interest to users, we have conducted a small scale experiment: user preferences on attributes are collected and compared to the results of our algorithm.

Users

Ten users participated in this experiment. All the users are either undergraduate or postgraduate students of computer science.


Rank   Digital SLRs          Acoustic Guitars      Kettles
1      accessories           tuning machines       limited warranty*
2      sensor                chord chart           spout
3      resolution            deluxe semirigid      switch
4      improved autofocus*   rosewood fretboard*   design
5      style settings        pitch pipe            interior
6      lcd monitor*          finish                housing
7      screen protectors     back                  shutoff
8      display               strings               gauge
9      image stabilization   wood construction*    quarts
10     image retouching      bag                   lid

* indicates a partial match

Table 6.7 Ranked Attributes

Task

Three product classes are selected for this experiment: {“Digital SLR”, “iPods”, “Camcorders”}. These three product classes are carefully selected from our dataset since they are commonly used and thus familiar to everyone. This ensures that there is no bias in users' choices because of lack of knowledge about the products. Each user participating in the experiment was given the list of attributes produced by our Word clustering algorithm for each product class. They were asked to pick the attributes which, according to them, are important while choosing a product for purchase. Users were asked to ignore errors in the attribute lists. Since the importance of an attribute is subjective, the number of important attributes can vary from product to product and user to user, so no restriction was put on the maximum or minimum number of attributes they could pick. We refer to the attributes selected by the users as Useful Attributes.


Figure 6.1 Complementary Cumulative Distribution of Useful Attributes in Relation to Number of User Selections

Results

Our extraction system extracted 120 attributes for the three products, of which 95 are partial matches and the rest are errors. Each user picked 18.5 attributes per product on average, which amounts to 46.4% of the extracted attributes. Table 6.6 gives the number of attributes extracted by our method and the number of attributes the users picked from them for each product.

The plot in Fig. 6.1 shows the complementary cumulative distribution of the random variable X representing the number of user selections, which varies between 0 and 10. It gives the percentage of useful attributes selected by at least x users, and depicts the agreement among users in more detail. The plot shows the trends for each of the products as well as the average over all products. 6.7% of the attributes are selected as useful by all users, and 35% of the attributes are selected by at least 70% of the users. This conforms with the hypothesis behind our comparative summary design that there exists a set of attributes which many users commonly consider important while selecting a product.
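The curve in Fig. 6.1 can be computed as follows. The vote counts below are illustrative, not the actual study data:

```python
def ccdf(votes, n_users=10):
    """For each threshold x in 0..n_users, the fraction of attributes
    selected as useful by at least x users (the curve in Fig. 6.1)."""
    n = len(votes)
    return [sum(v >= x for v in votes) / n for x in range(n_users + 1)]

# votes[i] = number of users (out of 10) who marked attribute i as useful
votes = [10, 9, 7, 7, 5, 3, 1, 0]
curve = ccdf(votes)
```

A slowly decaying curve means many attributes are agreed upon by most users, which is exactly the agreement the plot is meant to expose.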

6.4.2 Ranking

We evaluate the effectiveness of our ranking algorithm presented in Chapter 5 by analysing the ranking positions obtained by the Useful Attributes. Since the goal of ranking is to order the attributes according to their usefulness in product comparisons, Useful Attributes should be ranked higher than other attributes. To verify this, we plot the distribution of Useful Attributes at various ranking positions, shown in Fig. 6.2. A good ranking algorithm should place most Useful Attributes in the top positions and few of them in the last ranks. Ideally, all the Useful Attributes should be ranked above all other attributes and thus be concentrated in the first 19 positions. From the plot, we observe that the first 19 positions contain more Useful Attributes than the remaining positions: 48% of the Useful Attributes appear in the first 19 positions, whereas 26% appear in the next 19 positions. This clearly shows that the ranking algorithm is able to rank more useful attributes in higher positions.
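This evaluation can be sketched as follows. The ranked list and the Useful Attribute set are hypothetical:

```python
def bucket_fractions(ranked, useful, bucket_size=19):
    """Fraction of the Useful Attributes falling into each block of
    ranking positions (1-19, 20-38, ...), as plotted in Fig. 6.2."""
    n_buckets = (len(ranked) + bucket_size - 1) // bucket_size
    counts = [0] * n_buckets
    for position, attr in enumerate(ranked):
        if attr in useful:
            counts[position // bucket_size] += 1
    return [c / len(useful) for c in counts]

# 38 ranked attributes; 4 hypothetical user-selected Useful Attributes.
ranked = ["a%d" % i for i in range(38)]
useful = {"a0", "a3", "a10", "a25"}
fractions = bucket_fractions(ranked, useful)
```

A ranking is better the more mass the first bucket carries; in this toy example three of the four useful attributes land in the first 19 positions.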

Figure 6.2 Distribution of Useful Attributes at Various Ranking Positions


6.4.3 Comparison of Ranking Features

The attribute ranking function explained in Chapter 5 uses three features: pf, ce and sp. To understand the effect of each individual feature on the ranking function, we ranked the attributes using only one feature at a time as the ranking function. Let R be the ranking function with optimized weights computed by least squares in Section 5.3.2, and let R1, R2 and R3 be the ranking functions using only the individual features pf, ce and sp respectively:

R1 = pf(x),    R2 = ce(x),    R3 = sp(x)    (6.4)
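The four ranking functions can be sketched as a single weighted scorer. The feature values and weights below are placeholders, not the least squares estimates of Section 5.3.2:

```python
def rank(features, weights=(1.0, 1.0, 1.0)):
    """Score each attribute as a weighted sum of its (pf, ce, sp) features
    and sort by decreasing score. Weights of the form (1, 0, 0) recover
    the single-feature rankers R1, R2 and R3; the weights of R itself
    were fit by least squares (placeholder unit weights here)."""
    w_pf, w_ce, w_sp = weights
    score = {a: w_pf * pf + w_ce * ce + w_sp * sp
             for a, (pf, ce, sp) in features.items()}
    return sorted(score, key=score.get, reverse=True)

# Hypothetical normalized feature values for three attributes.
feats = {"sensor": (0.9, 0.2, 0.8),
         "bag":    (0.1, 0.9, 0.1),
         "zoom":   (0.6, 0.6, 0.6)}
combined = rank(feats)                       # R with unit weights
only_ce = rank(feats, (0.0, 1.0, 0.0))       # R2
```

Comparing the combined ordering with the single-feature orderings is exactly the experiment whose results Fig. 6.3 summarizes.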

Figure 6.3 Distribution of Useful Attributes at Various Ranking Positions for Different Ranking Features

The plot in Fig. 6.3 gives the percentage distribution of user selected attributes at the top 10 positions for the four ranking functions R, R1, R2 and R3. It can be observed from the graph that R ranked more user-picked attributes in the first 5 positions than R1, R2 and R3: the ranking functions R, R1, R2 and R3 place 14%, 7%, 8% and 7% of the user selected attributes respectively in the first 5 positions. The combined ranking function R thus clearly outperforms the individual features by a margin of 100%. The difference decreases from the 6th to 10th positions. This shows that the combined feature ranking function R performs better than the individual feature ranking functions in identifying the most useful attributes.

6.5 Summary

This chapter explained the various experiments conducted to evaluate the algorithms presented in this thesis. We have crawled descriptions of 40 products from the Amazon e-commerce portal and manually prepared a set of reference attributes for each of these products, which are used for evaluation. The experiments can be divided into two categories: 1. extraction experiments and 2. ranking experiments. We have extracted attributes using both the Noun phrase clustering and Word clustering methods and compared their effectiveness in extracting the attributes. We found that both algorithms are effective in extracting attributes; however, the Word clustering method yields significant improvements in full precision compared to the Noun phrase clustering method. We have explained why Word clustering performs better than the NP clustering method in identifying the attributes.

We have conducted a user study to evaluate the usefulness of the attributes in product comparison. The attributes extracted by our algorithms were given to users, who selected the attributes they found useful while purchasing that product. We then compared the top ranking attributes from our algorithm with the useful attributes selected by the users. The experiments showed that our algorithm is effective in identifying useful attributes.


Chapter 7

Conclusions

In this thesis, we have studied problems in product information extraction. We have presented different solutions for the problems and evaluated them through experiments. We conclude the thesis in this chapter with details about the contributions of the thesis and directions for future work.

7.1 Contributions

In this thesis, we have proposed techniques for product information extraction and performed a thorough evaluation of our methods by conducting various experiments. We can broadly classify our work into the following tasks:

• Attribute Extraction

• Attribute Ranking

• Comparative Summary Generation

We have presented effective methods for each of the above tasks. We have also created datasets for carrying out the experiments and evaluating the methods. Our evaluation measures are based on standard IE evaluation measures which are widely used in evaluating many related applications [5, 3].

Most of the existing methods for attribute extraction are either domain specific or supervised and require training data to build classifiers. We have presented two novel unsupervised methods which do not require any training data or domain specific information. Moreover, our methods do not require many input descriptions and are effective in extracting attributes even from small collections of 50 descriptions. Both approaches are based on clustering and achieve around 80% precision on partial matches, as shown in Section 6.3.1. We have also compared the two methods and presented a detailed analysis of the advantages and disadvantages of one method over the other. We have concluded that the word clustering method achieves significantly higher full precision than the NP clustering method.

In this work, we have defined attribute ranking in the context of product comparisons. The goal of our ranking algorithm is to identify attributes useful in product comparison. With this rationale in mind, we have come up with novel features for ranking the attributes. We have conducted a user study to find useful attributes for comparison. The results of this study showed that users actually consider a subset of attributes to be important in product selection. This conforms with the rationale behind our ranking features, which assume that there exist some attributes which help in comparison. We have evaluated the performance of our ranking algorithm in identifying Useful Attributes and showed that our algorithm ranks useful attributes in higher positions, as explained in Section 6.4.

Most of the existing summarization techniques focus on providing informative content by compressing the text from a set of documents. To our knowledge, there are no efforts in the summarization literature which provide comparisons of entities or events. We have defined a novel form of summaries, referred to as comparative summaries, which provide a comparative study of multiple items belonging to the same category. The purpose of a comparative summary is to provide a quick, concise comparison of multiple items of a category. We have presented a method for generating binary comparative summaries for products using the attribute extraction and ranking methods discussed above. This is explained in greater detail in Chapter 5.

7.2 Future Work

In this thesis, we have focused on extracting product attributes from text descriptions. A very important and obvious extension of this work is value extraction. The methods presented here aim at extracting attributes for a product class; the unsupervised techniques we have proposed for this task could be extended to extract values along with the attributes. This would give attribute-value pairs for different products. Appending values to attributes would also enable the generation of valued comparative summaries, which will be more informative than binary comparative summaries.

An interesting direction of work is the identification of relationships between attributes. Attributes of some products can be further categorized and represented in a hierarchy. For instance, “Cell Phone” has many attributes which can be grouped into: Dimensions (Length, Width, Height), Display (Resolution, Colors, Type), Connectivity (Bluetooth, Infrared, USB), Battery (Capacity, Talk Time, Standby Time), etc. The graph clustering approach presented in this thesis is a good starting point for work in this direction: the word graph provides a natural way of identifying relationships among the clusters and their corresponding representative attributes.

This work has focused on attribute extraction from product descriptions which are short, with just a few lines of text and incomplete sentences. However, there are other genres of text in which descriptions could be available. Descriptions also exist in longer formats which contain grammatically correct sentences and describe products in greater detail. These descriptions pose different challenges for extraction: the documents are noisier, but they have the advantage of being grammatically correct, which allows the use of existing Natural Language Processing techniques. Extraction of attributes in this scenario requires the development of specialized techniques.

In this work, we have defined the comparative summary and proposed techniques for automatically generating it. Comparative summaries could be a very useful information access tool in any domain which has comparable entities or events with structured properties and values. The solutions presented in this thesis are focused on generation for the product domain; this should encourage the development of efficient methods for generating comparative summaries for other domains, such as “Sports” and “People Search”. The techniques presented here may not work directly in other domains, but some of these ideas could be used in similar applications.


Publications

• “An Unsupervised Approach to Product Attribute Extraction.” Santosh Raju, Prasad Pingali and Vasudeva Varma. Appeared at the 31st European Conference on Information Retrieval (ECIR), 2009.

• “A Graph Clustering Approach to Product Attribute Extraction.” Santosh Raju, Praneeth Shistla and Vasudeva Varma. Appeared at the 4th Indian International Conference on Artificial Intelligence (IICAI), 2009.


Bibliography

[1] http://trec.nist.gov/data/docs_eng.html.

[2] Eugene Agichtein and Venkatesh Ganti. Mining reference tables for automatic text segmentation. In KDD ’04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 20–29, New York, NY, USA, 2004. ACM.

[3] Eugene Agichtein and Luis Gravano. Snowball: extracting relations from large plain-text collections. In DL ’00: Proceedings of the fifth ACM conference on Digital libraries, pages 85–94, New York, NY, USA, 2000. ACM.

[4] Chris Biemann. Chinese whispers - an efficient graph clustering algorithm and its application to natural language processing problems. In Proceedings of TextGraphs: the Second Workshop on Graph Based Methods for Natural Language Processing, pages 73–80, New York City, 2006. Association for Computational Linguistics.

[5] Daniel M. Bikel, Scott Miller, Richard Schwartz, and Ralph Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the fifth conference on Applied natural language processing, pages 194–201, Morristown, NJ, USA, 1997. Association for Computational Linguistics.

[6] Vinayak Borkar, Kaustubh Deshmukh, and Sunita Sarawagi. Automatic segmentation of text into structured records. SIGMOD Rec., 30(2):175–186, 2001.

[7] Andrew Borthwick, John Sterling, Eugene Agichtein, and Ralph Grishman. Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 152–160, 1998.

[8] Eric Brill. Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging. Comput. Linguist., 21(4):543–565, 1995.

[9] Michael J. Cafarella, Doug Downey, Stephen Soderland, and Oren Etzioni. KnowItNow: fast, scalable information extraction from the web. In HLT ’05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 563–570, Morristown, NJ, USA, 2005. Association for Computational Linguistics.


[10] Mary Elaine Califf and Raymond J. Mooney. Relational learning of pattern-matchrules for information extraction. In AAAI ’99/IAAI ’99: Proceedings of the sixteenthnational conference on Artificial intelligence and the eleventh Innovative applicationsof artificial intelligence conference innovative applications of artificial intelligence,pages 328–334, Menlo Park, CA, USA, 1999. American Association for ArtificialIntelligence.

[11] Mary Elaine Califf and Raymond J. Mooney. Bottom-up relational learning of patternmatching rules for information extraction, 2002.

[12] N. A. Chinchor. Overview of muc-7/met-2.

[13] Yejin Choi, Claire Cardie, Ellen Riloff, and Siddharth Patwardhan. Identifyingsources of opinions with conditional random fields and extraction patterns. In HLT’05: Proceedings of the conference on Human Language Technology and EmpiricalMethods in Natural Language Processing, pages 355–362, Morristown, NJ, USA,2005. Association for Computational Linguistics.

[14] Fabio Ciravegna. Adaptive information extraction from text by rule induction andgeneralisation.

[15] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Seriesin Telecommunications and Signal Processing). Wiley-Interscience, 2006.

[16] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Gate: A frameworkand graphical development environment for robust nlp tools and applications. InProceedings of the 40th Annual Meeting of the ACL, 2002.

[17] Ido Dagan, Zvika Marx, and Eli Shamir. Cross-dataset clustering: revealing corre-sponding themes across multiple corpora. In COLING-02: proceedings of the 6thconference on Natural language learning, pages 1–7, Morristown, NJ, USA, 2002.Association for Computational Linguistics.

[18] Oren Etzioni, Michele Banko, Stephen Soderland, and Daniel S. Weld. Open information extraction from the web. Commun. ACM, 51(12):68–74, 2008.

[19] Ronen Feldman, Benjamin Rosenfeld, and Moshe Fresko. TEG: a hybrid approach to information extraction. Knowl. Inf. Syst., 9(1):1–18, 2006.

[20] Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema, and Andrew Fano. Text mining for product attribute extraction. SIGKDD Explor. Newsl., 8(1):41–48, 2006.

[21] Ralph Grishman. Information extraction: Techniques and challenges. In SCIE '97: International Summer School on Information Extraction, pages 10–27, London, UK, 1997. Springer-Verlag.

[22] Ralph Grishman, Silja Huttunen, and Roman Yangarber. Information extraction for enhanced access to disease outbreak reports. J. of Biomedical Informatics, 35(4):236–246, 2002.

[23] Ralph Grishman and Beth Sundheim. Message understanding conference-6: a brief history. In Proceedings of the 16th conference on Computational linguistics, pages 466–471, Morristown, NJ, USA, 1996. Association for Computational Linguistics.

[24] Sanda Harabagiu and Finley Lacatusu. Topic themes for multi-document summarization. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 202–209, New York, NY, USA, 2005. ACM.

[25] Sanda M. Harabagiu, Marius A. Pasca, and Steven J. Maiorano. Experiments with open-domain textual question answering. In Proceedings of the 18th conference on Computational linguistics, pages 292–298, Morristown, NJ, USA, 2000. Association for Computational Linguistics.

[26] Jerry R. Hobbs, John Bear, David Israel, and Mabry Tyson. FASTUS: A finite-state processor for information extraction from real-world text. pages 1172–1178, 1993.

[27] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 168–177, New York, NY, USA, 2004. ACM.

[28] Anette Hulth. Improved automatic keyword extraction given more linguistic knowledge. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 216–223, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[29] Ramon Ferrer i Cancho and Ricard V. Solé. The small world of human language. Proceedings of The Royal Society of London. Series B, Biological Sciences, 268:2261–2266, 2001.

[30] T. S. Jayram, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and Huaiyu Zhu. Avatar information extraction system. IEEE Data Eng. Bull., 29(1):40–48, 2006.

[31] Nitin Jindal and Bing Liu. Identifying comparative sentences in text documents. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 244–251, New York, NY, USA, 2006. ACM.

[32] Kevin Lerman and Ryan McDonald. Contrastive summarization: An experiment with consumer reviews. In North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT) 2009: Proceedings of the Main Conference, Boulder, USA, 2009. Association for Computational Linguistics.

[33] Cody Kwok, Oren Etzioni, and Daniel S. Weld. Scaling question answering to the web. ACM Trans. Inf. Syst., 19(3):242–262, 2001.

[34] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML '01: Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[35] Dawn Lawrie, W. Bruce Croft, and Arnold Rosenberg. Finding topic words for hierarchical summarization. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 349–357, New York, NY, USA, 2001. ACM.

[36] Jimmy Lin. An exploration of the principles underlying redundancy-based factoid question answering. ACM Trans. Inf. Syst., 25(2):6, 2007.

[37] Robert Malouf. Markov models for language-independent named entity recognition. In COLING-02: proceedings of the 6th conference on Natural language learning, pages 1–4, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

[38] Inderjeet Mani and Eric Bloedorn. Summarizing similarities and differences among related documents. Inf. Retr., 1(1-2):35–67, 1999.

[39] Zvika Marx, Ido Dagan, and Eli Shamir. A generalized framework for revealing analogous themes across related topics. In HLT '05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 979–986, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[40] Y. Matsuo and M. Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools, 13:157–170, 2004.

[41] Diana Maynard, Valentin Tablan, Cristian Ursu, Hamish Cunningham, and Yorick Wilks. Named entity recognition from diverse text types. In Recent Advances in Natural Language Processing 2001 Conference, Tzigov Chark, 2001.

[42] Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. Maximum entropy Markov models for information extraction and segmentation. In ICML '00: Proceedings of the Seventeenth International Conference on Machine Learning, pages 591–598, San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc.

[43] Ion Muslea. Extraction patterns for information extraction tasks: A survey. In AAAI-99 Workshop on Machine Learning for Information Extraction, pages 1–6, 1999.

[44] NIST. Automatic content extraction (ACE) program, 1998-present.

[45] Marius Pasca. Lightweight web-based fact repositories for textual question answering. In CIKM '07: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 87–96, New York, NY, USA, 2007. ACM.

[46] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends Inf. Retr., 2(1-2):1–135, 2008.

[47] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment classification using machine learning techniques. In EMNLP '02: Proceedings of the ACL-02 conference on Empirical methods in natural language processing, pages 79–86, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

[48] Fuchun Peng and Andrew McCallum. Accurate information extraction from research papers using conditional random fields. In HLT-NAACL04, pages 329–336, 2004.

[49] Leonid Peshkin and Avi Pfeffer. Bayesian information extraction network. In IJCAI'03: Proceedings of the 18th international joint conference on Artificial intelligence, pages 421–426, San Francisco, CA, USA, 2003. Morgan Kaufmann Publishers Inc.

[50] Ana-Maria Popescu. Information extraction from unstructured web text. PhD thesis, 2007. Adviser: Oren Etzioni.

[51] Ana-Maria Popescu and Oren Etzioni. Extracting product features and opinions from reviews. In HLT '05: Proceedings of the conference on HLT and EMNLP. ACL, 2005.

[52] M. F. Porter. An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316, 1997.

[53] J. R. Quinlan. Learning logical definitions from relations. Mach. Learn., 5(3):239–266, 1990.

[54] Dragomir R. Radev and Kathleen R. McKeown. Generating natural language summaries from multiple on-line sources. Comput. Linguist., 24(3):470–500, 1998.

[55] Adwait Ratnaparkhi. Learning to parse natural language with maximum entropy models. Mach. Learn., 34(1-3):151–175, 1999.

[56] Frederick Reiss, Sriram Raghavan, Rajasekar Krishnamurthy, Huaiyu Zhu, and Shivakumar Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE '08: Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 933–942, Washington, DC, USA, 2008. IEEE Computer Society.

[57] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence, pages 811–816. MIT Press, 1993.

[58] Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text. In SIGIR '99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 206–213, New York, NY, USA, 1999. ACM.

[59] Christopher Scaffidi, Kevin Bierhoff, Eric Chang, Mikhael Felker, Herman Ng, and Chun Jin. Red Opal: product-feature scoring from reviews. In EC '07: Proceedings of the 8th ACM conference on Electronic commerce, pages 182–191, New York, NY, USA, 2007. ACM.

[60] Kristie Seymore, Andrew McCallum, and Ronald Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI-99 Workshop on Machine Learning for Information Extraction, pages 37–42, 1999.

[61] Warren Shen, AnHai Doan, Jeffrey F. Naughton, and Raghu Ramakrishnan. Declarative information extraction using datalog with embedded extraction predicates. In VLDB '07: Proceedings of the 33rd international conference on Very large data bases, pages 1033–1044. VLDB Endowment, 2007.

[62] Stephen Soderland. Learning information extraction rules for semi-structured and free text. Mach. Learn., 34(1-3):233–272, 1999.

[63] Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM international workshop on Web information and data management, pages 50–53, New York, NY, USA, 2003. ACM.

[64] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: A large ontology from Wikipedia and WordNet. Web Semant., 6(3):203–217, 2008.

[65] B. M. Sundheim. Overview of the third message understanding evaluation and conference. In Proc. of the Third Message Understanding Conference (MUC-3), pages 3–16, San Diego, CA, 1991.

[66] J. Tait. Automatic summarization of English texts. Ph.D. Dissertation, 1983.

[67] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, pages 142–147, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[68] Takashi Tomokiyo and Matthew Hurst. A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on Multiword expressions, pages 33–40, Morristown, NJ, USA, 2003. Association for Computational Linguistics.

[69] Jordi Turmo, Alicia Ageno, and Neus Catala. Adaptive information extraction. ACM Comput. Surv., 38(2):4, 2006.

[70] Peter D. Turney. Learning algorithms for keyphrase extraction. Inf. Retr., 2(4):303–336, 2000.

[71] Peter D. Turney. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In ACL '02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 417–424, Morristown, NJ, USA, 2002. Association for Computational Linguistics.

[72] D. J. Watts. Small worlds: the dynamics of networks between order and randomness. Princeton University Press, 1999.

[73] Daniel S. Weld, Raphael Hoffmann, and Fei Wu. Using Wikipedia to bootstrap open information extraction. SIGMOD Rec., 37(4):62–68, 2008.

[74] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig G. Nevill-Manning. KEA: practical automatic keyphrase extraction. In DL '99: Proceedings of the fourth ACM conference on Digital libraries, pages 254–255, New York, NY, USA, 1999. ACM.

[75] Tak-Lam Wong, Wai Lam, and Tik-Shun Wong. An unsupervised framework for extracting and normalizing product attributes from multiple web sites. In SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 35–42, New York, NY, USA, 2008. ACM.

[76] Bo Wu, Xueqi Cheng, Yu Wang, Yan Guo, and Linhai Song. Simultaneous product attribute name and value extraction from web pages. In WI-IAT '09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pages 295–298, Washington, DC, USA, 2009. IEEE Computer Society.

[77] Fei Wu and Daniel S. Weld. Automatically refining the Wikipedia infobox ontology. In WWW '08: Proceedings of the 17th international conference on World Wide Web, pages 635–644, New York, NY, USA, 2008. ACM.

[78] ChengXiang Zhai, Atulya Velivelli, and Bei Yu. A cross-collection mixture model for comparative text mining. In KDD '04: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 743–748, New York, NY, USA, 2004. ACM.

[79] Hongkun Zhao, Weiyi Meng, and Clement Yu. Mining templates from search result records of search engines. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 884–893, New York, NY, USA, 2007. ACM.

[80] Jun Zhu, Bo Zhang, Zaiqing Nie, Ji-Rong Wen, and Hsiao-Wuen Hon. Webpage understanding: an integrated approach. In KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 903–912, New York, NY, USA, 2007. ACM.
