Ontology induction for mining experiential knowledge from customer reviews
Thomas Lee
Department of Operations and Information Management
University of Pennsylvania, The Wharton School Philadelphia, PA 19104
Abstract
The rapid rate of product innovation and the increasing degree of product complexity in categories ranging from digital cameras to toaster ovens fuel the need for resources to help consumers navigate the product space. For a given product category, rather than relying upon domain experts, we turn to the collective experience of the user community as captured in online customer reviews. In order to learn from users, we must first understand the concepts, the relationships, and the vocabulary with which the community discusses the product category. We induce the corresponding ontology from all reviews of all products in the category.
In this paper, we represent the set of customer reviews for a product category as a graph. Vocabulary tokens are represented as vertices and edges capture the co-occurrence of tokens within and between reviews. We learn hyponym (is-a) relationships between concepts by identifying the connected components of the graph. Meronym/holonym (part-of/has-a) relationships are defined by heuristically searching for a maximal clique. We discuss preliminary results from approximately 8000 on-line digital camera reviews and specify a set of quantitative metrics that is currently being evaluated.
Working Paper, Winter Information Systems Conference, Eccles School of Business
Introduction
At the intersection of current manufacturing trends such as mass customization, personalization,
and short product life cycles, the modern consumer is faced with a dizzying selection of makes
and models in a myriad of product categories. Customers formerly relied upon knowledgeable
sales clerks and informed colleagues. Unfortunately, today's product variety challenges the
ability of even professional reviewers like the Consumers Union to keep pace. Consumer
Reports might review 10 or 20 digital cameras in a single study. Amazon.com recently listed
more than 600 available digital cameras (and camera packages). Fortunately, the World Wide
Web has enabled a new knowledgebase for navigating the product space. On-line customer
reviews represent the community of users. The challenge is to harness the knowledge contained
within the virtual community of customer reviews to complement input from the physical
community of colleagues, neighbors, and clerks.
We previously proposed use-centric mining as one approach to leveraging reviews [18].
From the user community, as represented by customer reviews, we learn those features (product
attributes) that align with particular product uses and applications. For example, what features
does the user community make note of when purchasing a digital camera for vacation travel?
Which features are noteworthy if 'taking eBay auction photos' is the primary reason for
purchasing? A customer's prospective interests are matched to an associated set of product
features. Products are recommended based upon their support for the desired mix of features.
In this position paper, we present a graph-based approach to semi-automatically learn a
domain-specific ontology of product features for the purpose of enabling use-centric mining. We
define an ontology as a controlled vocabulary for discussing both concepts and relationships
between concepts [13, 21]. We learn a category-specific ontology of product features from the
customer reviews of products in that category. For example, the 'battery types' used for motor
vehicles are not the same 'battery types' used in digital cameras. Using a common vocabulary of
product features (concepts) enables use-centric mining to discover associations between user
interests and product features from customer reviews.
We model the space of online customer reviews in a product category as a weighted
graph where vocabulary terms are vertices in the graph and edges are weighted by the frequency
with which two terms are used together to describe a product feature. Three contributions that
follow from our graph-based approach are presented in this position paper. First, we use co-
occurrences of terms within and between review documents to heuristically prune the graph.
Second, the principle of connected components is used to automatically define subclass or
hyponym (is-a) relationships between concepts. Third, we use cliques to automatically discover
meronym/holonym (part-of/has-a) relationships between concepts.
In the remainder of this paper, we begin with background to both motivate
our work and clarify the problem statement. We then provide a formal model and include
initial results from processing a set of 8695 digital camera reviews from Epinions.com [1]. The
preliminary status of this research leads us to describe a more formal set of evaluation methods
that we are in the process of conducting but for which we do not yet have results. We close with
a review of related work on ontology learning, a discussion, and future work.
Background
This section begins with a brief overview of use-centric mining to motivate our work. We then
focus on the specific problem that is addressed in this paper: Inducing an ontology of product
features. We describe the ontology and the nature of the learning problem in the context of use-
centric mining.
Use-centric mining
Prior research involving customer reviews tends to fall into one of two categories: product
focused, or customer focused. Product-focused researchers cluster reviews by manufacturer and
model. One objective is to help designers understand positive and negative aspects of a design
both for incremental improvements and new product development [17, 26, 30]. Moreover,
prospective buyers use product-focused reviews to evaluate a specific item [17]. By contrast,
customer-focused research on product reviews attempts to model individual consumers.
Researchers combine data from all reviews written by a single reviewer; the objective
is to create a user profile and predict future purchases [5].
The goal of use-centric mining is to model a particular product category based upon
feedback from that category's user community. For use-centric mining, we develop two
category-specific ontologies: one of product features and one of customer uses. Using ontology-
based extraction [8, 25], each review is mapped into a Boolean vector representation (see Figure
1). Finally, we mine the review vectors for those frequent item sets that associate particular
usages with particular features [15].
<!-- Ontology entries -->
<Concept>
  <CName>Ease of use</CName> …
  <ClassifyExtract>
    <RegExp>'ease[ -]of[ -]use'</RegExp>
    <RegExp>'easy[ -]to[ -]use'</RegExp> …
</Concept>
<Concept>
  <CName>Compactness</CName> …
  <ClassifyExtract>
    <RegExp>'handy'</RegExp>
    <RegExp>'pocket-size'</RegExp> …
</Concept> …
Review vector for ProdID-ReviewerID: Ease of use = 1, Compactness = 1, Zoom = 0, Lens Cover = 0, …
Figure 1. Ontology support for mapping a review into a Boolean vector
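To make the mapping concrete, the extraction step sketched in Figure 1 can be approximated in a few lines of Python. This is a minimal sketch: the concept names and regular expressions echo Figure 1, while the dictionary layout, function name, and sample review text are our own illustrative assumptions.

```python
import re

# Hypothetical excerpt of the ontology's extraction rules (after Figure 1).
ONTOLOGY = {
    "Ease of use": [r"ease[ -]of[ -]use", r"easy[ -]to[ -]use"],
    "Compactness": [r"handy", r"pocket-size"],
    "Zoom": [r"zoom"],
    "Lens Cover": [r"lens cover"],
}

def review_to_vector(text):
    """Map raw review text onto a Boolean vector, one slot per concept."""
    text = text.lower()
    return {concept: int(any(re.search(p, text) for p in patterns))
            for concept, patterns in ONTOLOGY.items()}

# Illustrative review text; yields the 1 1 0 0 row shown in Figure 1.
vec = review_to_vector("Very easy to use and a pocket-size body")
```

In the full system, one such vector is produced per (product, reviewer) pair and the vectors are then mined for frequent item sets.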
Product feature ontologies
We define an ontology as a controlled vocabulary for discussing both concepts and relationships
between concepts [13, 21]. Each concept in our product ontology corresponds to a particular
product feature. Abstractly, each on-line review is a list of those features or concepts that a
customer likes or dislikes. Because there are many different ways that different customers might
refer to a particular feature (e.g. '2 AA NiMH' and 'rechargeable lithium ion' are different string
literals for describing a battery type), the ontology normalizes the language for distinguishing
between different features. Normalizing the terminology enables us to mine the feature-use
patterns for use-centric recommendations. Having identified the concepts, we now describe
relationships between concepts.
There are two types of relationships between concepts that we would like to learn.
Given that each review is a list of camera features, the first relationship is the hyponym
relationship that subdivides a concept like camera feature into its sub-classes. Battery type is a
feature of digital cameras as are resolution, and compression. We can then distinguish review
phrases like '2 AA NiMH,' which are instances of battery type, from other literals like 'MPEG-1'.
Moreover, a given subclass might be hierarchically decomposed into a more precise level of
specificity. For example, MPEG, JPEG, and TIFF are hyponyms of compression. The string
'MPEG-1' is an instance of the concept MPEG while the string 'JPEG' is an instance of the
concept JPEG which is-a compression scheme which is-a camera feature. By representing
concept features at different levels of precision, we aggregate reviews in different ways and can
isolate whether the concept of compression is important or whether customers care about a
specific compression scheme because it is easier to import into their software for making holiday
greeting cards.
The second type of concept relationship that we would like to learn is the 'part-of' or 'has-
a' relationship (meronym/holonym) that describes the 'attributes' of a particular product feature.
For example, a battery type might be described by the size (instances of size include 'kcrv3' and
'AA'), by the chemistry (instances of battery chemistry are 'LiMH' and 'Alkaline') or by the
voltage. Each attribute constitutes a distinct concept. For example, digital camera users who
travel worldwide are concerned about battery size. AA and AAA sized batteries are easily
available world-wide while kcrv3 cells may be more difficult to purchase overseas. In this case,
use-centric mining would identify the holonym size rather than the more general concept battery
type for travelers.
Approach
In this paper, we focus on an unsupervised approach to inducing the ontology of features for a
particular product category from on-line customer reviews. Although the most common
approach to ontology development involves expert intervention, products within a single
category evolve over time and the breadth of product variety makes finding experts for multiple
categories implausible.
Abstractly, each customer review is a list of features. To minimize the complexities of
natural language processing and information extraction, we model each review solely by its list
of Pros and Cons (see Figure 2). In addition, we draw data on ontology concepts from
the 'product detail' page for each item in the product category. From each 'product detail' page,
we extract a list of features (battery type, compression type, etc.) and the corresponding
relationship data. For example, in Figure 3, 'MPEG,' 'TIFF,' and 'JPEG' are listed as instances of
Compression Types.
Every review and every line of text from the product detail page is parsed into a list of
phrases using naïve, lexical processing techniques. Pro and Con comments are inter-mixed
because, for ontology construction, we need only identify the concepts that
customers discuss. Whether the feature is a strength or a weakness of a particular
product is not important. Phrases that emerge from lexical processing of the reviews in Figure 2
include 'small size,' 'movie mode,' 'great auto-focus,' '3.2 megapixels,' 'weak flash,' and 'very
slow response when using redeye reduction'. Corresponding phrases for the concepts in the
Figure 3 Product Description include '2 Lithium Batteries(LB-01),' '4 AA Alkaline,' '4 AA
NIMH Rechargeable,' '4 AA Lithium,' and '4 AA NiCd Rechargeable Batteries (Optional).'
Figure 2. Epinions Customer Review Page
Figure 3. Epinions Product Description Page
Each phrase is tokenized into its constituent words, and the resulting token list is pruned
using shallow linguistic processing techniques including stop-words, capitalization, Porter
stemming, and a co-occurrence heuristic described below. For example, the phrase 'very slow
response when using redeye reduction' is reduced to redey reduct.
Each phrase is a graph where each token is a vertex and every token pair in the phrase is
an edge. Edges are weighted by their frequency. The graphs for each phrase in a review and
each review in the set of all reviews are combined. A portion of the graph constructed on an
initial experiment of more than 8500 digital camera reviews from Epinions.com is depicted in
Figure 4. Edge weights are omitted for readability.
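The graph construction just described can be sketched as follows. This is a minimal illustration; the function name, data layout, and sample tokens are ours, and phrases are assumed to be already tokenized and pruned.

```python
from collections import Counter
from itertools import combinations

def build_graph(phrases):
    """Combine per-phrase cliques into one weighted co-occurrence graph.

    Each phrase contributes a vertex per token and an edge for every
    token pair; edge weights count how often a pair co-occurs.
    """
    vertices = set()
    edges = Counter()
    for tokens in phrases:
        vertices.update(tokens)
        # Sort so each undirected edge has a canonical key.
        for u, v in combinations(sorted(set(tokens)), 2):
            edges[(u, v)] += 1
    return vertices, edges

V, E = build_graph([["redey", "reduct"], ["redey", "indoor"], ["reduct", "redey"]])
```

Graphs built this way for individual phrases, lines, and reviews are merged by summing the edge counters.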
Hyponyms are defined in terms of a
Breadth First Search (BFS) of the token
graph. All phrases involving only tokens in
the BFS traversal of the graph from a given
node to a parameterized depth (we used a
depth of 2) are clustered in the same hyponym
(see Tables 1 and 2).
[Figure 4 rendering omitted: the depicted subgraph includes vertices such as redey, indoor, reduct, ineffect, warranti, yellow, pentax, servic, plastic, and flimsi.]
Figure 4. Graph of tokens from Epinions digital camera reviews
Meronyms and holonyms are
discovered by heuristically searching for a maximal clique within the connected component that
defines a single concept. (see Table 3).
Model
Our basic model builds on the familiar definitions of a graph [2]. A graph G = (V, E) is a pair of
the set of vertices V and the set of edges E. An edge in E is a connection between two vertices
and may be represented as a pair (vi, vj) with vi, vj ∈ V. A path is a list of vertices in V where
there is an edge in E from each vertex to the next vertex in the list.
A graph is connected if there exists a path between any two vertices in the graph. S =
(V′, E′) is a subgraph of G if V′ ⊆ V and every edge (vi, vj) ∈ E′ satisfies (vi, vj) ∈ E with
vi, vj ∈ V′. Gi is a maximally connected subgraph of G if Gi is a connected subgraph S = (V′, E′)
and, for all vj ∈ V \ V′, there is no vk ∈ V′ where (vj, vk) ∈ E. The connected components of G
are the set G1 … Gn of maximally connected subgraphs of G. By definition, for any two maximally
connected subgraphs Gi and Gj, Vi ∩ Vj is empty, and G1 ∪ … ∪ Gn = G.
A complete graph G = (V, E) is a graph for which there is an edge between every pair of
vertices: ∀ vi, vj ∈ V, (vi, vj) ∈ E. A clique S is a complete subgraph of G. S′ = (V′, E′) is a
maximal clique of G if S′ is a clique and there is no vj ∈ V \ V′ such that ∀ vk ∈ V′, (vj, vk) ∈ E.
Having defined our notation, we can now construct a graph as described in the Approach
section. One graph is created from the set of all customer reviews (see Figure 2), and a separate
graph (with distinct connected components for each product feature) is created from the set of all
product descriptions (see Figure 3). Every customer review is reduced to a list of phrases
represented as string literals that are instances of the generic concept product feature. Phrases
are tokenized into individual words and then transformed into a graph. Each token represents a
vertex in the graph and edges are defined by the token pairs (vi, vj), i ≠ j, in the phrase. A single
graph is constructed from all phrases of all lists of all customer reviews.
Likewise, every product description (one for each product instance in the category e.g.
the Canon S400 or the Olympus Camedia C 4000) is reduced to a set of phrase lists. Each phrase
list contains instances of a specific product feature hyponym. The product description graph
contains a connected component for each hyponym; and each such subgraph represents the
corresponding lists from all product description pages.
Pruning the graph (co-occurrences)
The graph constructed by the basic model is comprised of tokens taken from product reviews
that are directly input by customers. As a consequence, the set of vertices includes misspellings
and other inconsistencies that can lead to spurious edges between unrelated concepts or dilute the
weight of existing edges. Therefore, we actively prune tokens (vertices) as we go through the
process of parsing a line into a list of phrases, construct the graph for each phrase, aggregate
phrase-graphs into the graph for a line, and combine graphs for different lines.
Our first step in pruning involves a series of lexical processing steps. Beginning with a
line, we naively parse on list separators including commas, slashes, or semicolons. We apply a
simple heuristic to count list separators and assume that multiple instances of a separator
correspond to a list (e.g. two or more commas) while single instances are ignored.
For each resulting phrase in the line, we apply the following filters:
• Discard stop words to reduce noise in the graph
• Convert all tokens to lower case to address capitalization and proper nouns (e.g. 'Microsoft
Windows' is a required operating system for some digital camera software)
• Apply Porter stemming to address plural forms as well as simple misspellings. Note that
introducing a dictionary, thesaurus, or Wordnet-type variants might also be of use. However,
in this initial work, we wanted to limit the use of external sources that might introduce
domain dependence. In this way, we focus on the robustness of domain-independent lexical
and graph-based processing.
• Discard any phrase that contains a non-alphanumeric character or punctuation mark not eliminated
by parsing phrases. Special exceptions were generated by hand for decimal numbers and
hyphenated terms. Examples of discard phrases taken from the digital camera product
feature Computer Requirements include: 'iboook(tm),' 'windows®98 second edition (se),'
and 'windows 98*.' The intuition is that our data set ranges from several hundred to several
thousand phrases depending upon the starting feature concept. Therefore, we can safely
discard outlier phrases. Discarding does raise the question of an optimal sample size of
phrases; we revisit this issue in the Evaluation and Discussion sections below.
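The lexical filters above can be sketched as a small pipeline. This is only an illustration: the stop-word list is abbreviated, and the naive suffix-stripper merely stands in for the Porter stemmer, so it will not reproduce Porter's output exactly.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "when", "using", "very", "and"}  # illustrative subset

def crude_stem(token):
    """Stand-in for Porter stemming: strip a few common suffixes."""
    for suffix in ("tion", "ing", "es", "s", "e"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def prune_phrase(phrase):
    """Lower-case, drop stop words, stem; return None to discard the phrase."""
    tokens = [t for t in phrase.lower().split() if t not in STOP_WORDS]
    # Discard phrases with residual non-alphanumeric characters, excepting
    # decimal numbers and hyphenated terms.
    for t in tokens:
        if not re.fullmatch(r"[a-z0-9]+([.-][a-z0-9]+)*", t):
            return None
    return [crude_stem(t) for t in tokens]

result = prune_phrase("very slow response when using redeye reduction")
```

The co-occurrence heuristic described below then removes further tokens when phrase graphs are combined line by line.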
The final pruning step is based on the co-occurrence of terms within a line. We assume
that each phrase in a line addresses a distinct product feature or concept. By extension, we
conclude that a single token that appears in multiple phrases of the same line is therefore not
representative of any of the phrases in which it appears. For example, a customer review line
might include the phrases 'great battery life' and 'great price.' In our graph-based representation,
the token 'great' would connect subgraphs representing two distinct concepts: battery life and
price. As a consequence, we prune tokens that appear in multiple phrases of the same line.
More formally, we apply the co-occurrence heuristic when combining the graphs of the
constituent phrases into the graph for the corresponding line. A line containing n phrases should
produce a graph of n connected components. Any vertex that appears in more than one clique of
the graph is discarded. Note that we can also test for a violation by checking whether the
number of connected components is fewer than the number of phrases.
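A minimal sketch of the heuristic (function and variable names are ours):

```python
from collections import Counter

def cooccurrence_prune(line_phrases):
    """Drop tokens that appear in more than one phrase of the same line.

    A token shared across phrases (e.g. 'great' in 'great battery life'
    and 'great price') would spuriously connect distinct concepts.
    """
    counts = Counter(t for tokens in line_phrases for t in set(tokens))
    return [[t for t in tokens if counts[t] == 1] for tokens in line_phrases]

pruned = cooccurrence_prune([["great", "batteri", "life"], ["great", "price"]])
```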
Our application of co-occurrence runs counter to traditional research in information
retrieval and text mining. Co-occurrence is traditionally interpreted as an indicator of
interestingness [15, 29]. We believe that the traditional notions of intra-document frequency and
inter-document frequency have a different interpretation for text corpora like customer reviews.
We use co-occurrence to identify tokens that introduce noise into the graph and acknowledge
that the assumption may be unique to data sets that are domain specific, short in length, and
dense in token significance.
Learning hyponyms
We equate the task of finding hyponyms (is-a relationships) with finding the connected
components of the graph. We find the connected components by arranging the vertices of the
graph in descending order by degree and repeatedly executing a breadth first search (BFS) to
find the distinct connected components. For example, in the case of a graph representing the list
for phrases describing the product description concept battery life, we can distinguish the
subclasses of battery life by hour and battery life by number of images. The phrases in Table 1
are excerpted from the 201 Epinions.com digital camera product descriptions in our data set of
nearly 600 digital cameras that provide phrases for the feature concept battery life.
Phrases: 2.5 hr; 1.66 hr; 2.5 hour; 1.67 hour; 250 shot; 650 shot; shot accord; pentax shot; 250 imag; 180 imag; 135 imag; 390 imag
Connected component by time: 1.66 hr, 2.5 hr, 2.5 hour, 1.67 hour
Connected component by images: 250 shot, 650 shot, shot accord, pentax shot, 250 imag, 180 imag, 135 imag, 390 imag
Table 1. Connected components for representative phrases of Battery Life taken from Epinions Product Descriptions
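The connected-component computation itself is standard; the following sketch runs it over a hand-built fragment of the battery-life graph (the token sets and names are our own illustration):

```python
def connected_components(vertices, edges):
    """Partition the token graph into maximally connected subgraphs."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    components, seen = [], set()
    for start in vertices:
        if start in seen:
            continue
        comp, stack = set(), [start]     # depth-first traversal of one component
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

comps = connected_components(
    {"2.5", "hr", "hour", "250", "shot"},
    [("2.5", "hr"), ("2.5", "hour"), ("250", "shot")])
```

On this fragment, the time-denominated and image-denominated phrases separate into two components, mirroring Table 1.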
The hypothesis that connected components isolate concepts breaks down in the general
graph generated from customer review phrases because of the tremendous expressiveness of the
English language. Figure 4 illustrates how the token redey, associated with the concept of
redeye reduction, is connected to the concept of warranty by the path redey, reduct, ineffect,
warranti. We address this problem in two ways. First, the co-occurrence heuristic decreases
noise by deleting spurious tokens such as 'great' or 'good' that might otherwise link distinct
concepts. Second, instead of searching for connected components, we cap the BFS at a depth of
two (which corresponds to the average length of our pruned phrases). Table 2 identifies the
phrases that define the concept of redeye reduction based upon a breadth first search of depth
two over the pruned graph generated from 8695 Epinions.com digital camera customer reviews.
path length | pruned phrases from the graph of customer reviews
0 | redey
1 | redey indoor; reduct redey
2 | indoor wb; indoor optim; reduct ineffect; unaccept reduct; indoor unus; indoor tint yellow; indoor focuss troubl
Table 2. BFS of depth 2 in the customer review graph for the concept of redeye reduction
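The depth-capped traversal can be sketched as follows, using the redeye-to-warranty path from Figure 4; the adjacency construction and names are ours:

```python
from collections import defaultdict, deque

def bfs_concept(seed, edges, max_depth=2):
    """Collect all tokens within max_depth hops of seed in the token graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue                      # cap the traversal here
        for nbr in adj[node] - seen:
            seen.add(nbr)
            frontier.append((nbr, depth + 1))
    return seen

concept = bfs_concept("redey", [("redey", "indoor"), ("redey", "reduct"),
                                ("reduct", "ineffect"), ("ineffect", "warranti")])
```

Capping the search at depth two keeps warranti out of the redeye-reduction concept even though a longer path connects them.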
Learning meronyms/holonyms
Finding meronyms and holonyms is addressed by heuristically searching for a maximal clique in
the graph of a single hyponym. The (strong) hypothesis is that if the data set is large enough, we
can discover at least one instance of the Cartesian product of all feature attributes that define the
concept; and that we can find it in a maximal clique. Colloquially, there may not be a single
phrase that includes an instance of all feature attributes, but transitive relationships (edges
between different phrases) could complete the relation. Because the general problem of finding
a maximum clique is NP-complete, we adopt a simple heuristic for finding a maximal clique that
is presented as Figure 5 [4, 31]. We iterate, selecting the largest maximal clique found.
Application to the hyponyms of battery life in Table 1 is trivial. For connected
components by time, because 'hr' and 'hour' never appear in the same phrase, the maximal clique
(which happens to be the maximum in this case) is represented by a token (integer) for duration
and the token for units ('hr' or 'hour'). The resulting meronym/holonym relationships are
included as Table 3. Note that the technique does not assign names to the concepts, although in
many cases, concept names naturally emerge. Note also a limitation of the strong assumption
regarding the maximal clique. For connected components by images, the tokens 'accord' and
'pentax' are necessarily forced, by default, into the meronym/holonym concept quantity. We take
up this issue again in the Discussion below.
Table 4 shows excerpted results from applying the maximal clique heuristic to the graph
for battery type. Beginning from 575 product descriptions containing phrases for battery type,
we pruned the list to 21 tokens that combined to form 63 distinct phrases. As seen in Table 4,
phrases need not have a token for each meronym/holonym concept. Several phrases in Table 4
do not include the token 'recharg' and one phrase does not have a quantity. Logical constraints
can assist in assigning the tokens of a given phrase to their constituent meronym/holonym
concepts. However, there are times that such logical constraints are insufficient. Additional
information is required to assign the token 'kcrv3' to size. Notice also that battery voltage, as
captured by the phrase '9v,' is omitted from the discovered meronym/holonym concepts. By first
applying the connectedness criterion for hyponyms, we would remove the phrase '9v' altogether,
but it is not clear that voltages are a hyponym of battery type rather than a meronym/holonym in
the same sense as battery chemistry or size. We discuss relaxing the strong clique-assumption
below.
# input g: the graph as a list of vertex-edge pairs
#       v1: the first vertex in the clique
gen_maximal_clique(v1, g):
    v_list = edge_list(v1, g)   # vertices pairwise adjacent to v1, ordered by degree
    c_list = edge_list(v1, g)   # candidate vertices for addition to the clique
    clique = [v1]               # initialize the clique
    return iterate_clique(clique, c_list, v_list, g)

iterate_clique(clique, c_list, v_list, g):
    # if there are no candidates to add and no vertices to check, return the clique
    if v_list is empty: return clique
    v_car = the first vertex in v_list
    v_cdr = the remainder of v_list
    e_car = the edge list of v_car
    if v_car is in c_list:      # v_car is adjacent to everything in the clique so far
        add(v_car, clique)
        c_list = intersect(c_list, e_car)   # update candidates to adjacent nodes of the new clique
        recurse with v_cdr
    else:
        recurse with v_cdr
Figure 5. Algorithm to find a maximal clique [31]
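A runnable rendering of the Figure 5 heuristic might look like the following. The greedy structure follows the pseudocode, while the adjacency-map representation and the hand-built battery-type graph are our own illustrative assumptions.

```python
def maximal_clique(v1, adj):
    """Greedy maximal-clique heuristic in the spirit of Figure 5.

    adj maps each vertex to its set of neighbors. Starting from v1, visit
    v1's neighbors in descending degree order; a neighbor joins the clique
    only while it remains adjacent to every member added so far.
    """
    clique = [v1]
    candidates = set(adj[v1])           # vertices adjacent to the whole clique
    for v in sorted(adj[v1], key=lambda u: len(adj[u]), reverse=True):
        if v in candidates:
            clique.append(v)
            candidates &= adj[v]        # shrink candidates to shared neighbors
    return clique

# Hand-built fragment in which recharg, nimh, aa, and 4 are pairwise adjacent.
adj = {
    "recharg": {"nimh", "aa", "4"},
    "nimh": {"recharg", "aa", "4"},
    "aa": {"recharg", "nimh", "4"},
    "4": {"recharg", "nimh", "aa"},
}
clique = maximal_clique("recharg", adj)
```

On this fragment the heuristic recovers the four-token clique corresponding to the phrase 'recharg nimh aa 4'.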
                    Connected components by time   Connected components by images
Maximal clique      1.66 hr                        250 shot
Concept names       quantity, units                quantity, units
Token assignments   1.66 hr                        250 shot
                    2.5 hr                         650 shot
                    2.5 hour                       accord shot
                    1.67 hour                      pentax shot
                                                   250 imag
                                                   135 imag
                                                   390 imag
                                                   180 imag
Table 3. Meronym/holonym relationships for hyponyms of the feature Battery Life
                 CC by rechargeable or not          CC by voltage
Maximal clique   recharg nimh aa 4                  9v
Concept names    recharg | chemistry | size | qty   voltage
Token assignments (initial phrase → recharg | chemistry | size | qty):
  recharg nimh aa 4     → recharg | nimh    | aa    | 4
  aa recharg alkalin 4  → recharg | alkalin | aa    | 4
  nimh recharg 2        → recharg | nimh    |       | 2
  2 aaa alkalin         →         | alkalin | aaa   | 2
  aa lithium 4          →         | lithium | aa    | 4
  recharg lithium built → recharg | lithium | built |
  2 kcrv3               →         |         | kcrv3 | 2
Token assignment (voltage): 9v → 9v
Table 4. Meronym/holonym relationships for Battery Type
Evaluation
In this section, we outline on-going and future work to develop evaluation metrics. Historically,
evaluation of research in ontology induction was quite limited: criteria consisted of the
subjective assessments of the researchers and/or domain experts [19, 24, 27]. Assessment might
simply entail publishing representative results as we provided in the previous section. Only
more recently has research in ontology induction begun to develop more objective metrics.
Modica, Gal and Jamil [25] use precision and recall to systematically document the effectiveness
of different strategies at correctly incorporating new concepts from a representative data source
into an existing ontology hierarchy. Popescu, Yates and Etzioni [28] develop multiple
techniques to extract and construct sub-class structures from text. They define their reference
standard for correctness as the union of all sub-class structures from all techniques and use
precision and recall as an internal consistency check.
We describe three different techniques for evaluating the validity of our co-occurrence
heuristic, the accuracy of ontology relationships, and the robustness of the induced ontologies
respectively. Moreover, while all three evaluation approaches are applicable to the existing
category of digital cameras, we have also collected reviews and product description information
for additional product categories including DVD players, camcorders, toaster ovens, and vacuum
cleaners. The intention is to evaluate the assumptions and heuristics across multiple product
categories that reflect the same inter-document and intra-document characteristics.
Validity of the co-occurrence heuristic
In our approach, we described a co-occurrence heuristic for discarding irrelevant terms in a pre-
processing step to the connectedness algorithm for categorizing phrases into distinct subclasses.
While we currently capture the tokens that fail the co-occurrence heuristic, we need to analyze
the discarded tokens to determine the Type I and Type II errors. In this context, Type I errors
discard terms that are in fact significant, while Type II errors retain terms that link distinct
concepts within the graph and should have been discarded but were missed by this
heuristic. A more sophisticated, probabilistic model that considers not only co-occurrence within
a single review but co-occurrences between reviews might provide a better filter for terms to
discard. Because of the size of the data set and possible bias as a function of particular reviews
or classes of subsets of products, we also would like to check the heuristic by repeatedly
sampling from smaller data sets for validation.
Accuracy of the ontology relationships
The traditional approach to validating ontologies involves asking domain experts to vet the
results. For an automated system that is intended to capture knowledge from multiple,
heterogeneous product domains, soliciting input from domain experts is not scalable. Instead,
we propose to compare our induced ontologies to publicly available expert-generated ontologies:
professional reviews and on-line buying guides. We can calculate metrics with respect to both
the product features as well as the categories or levels into which individual product features are
subdivided. Precision asks, "how many of the induced features and levels are actually used in
professional reviews and on-line buying guides." Recall asks "how many of the features and
levels used in practice are automatically induced?"
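Both metrics reduce to simple set ratios over feature inventories; a sketch (the function name and sample features are ours):

```python
def precision_recall(induced, reference):
    """Precision: fraction of induced features found in the expert reference.
    Recall: fraction of reference features that were induced."""
    induced, reference = set(induced), set(reference)
    hits = len(induced & reference)
    return hits / len(induced), hits / len(reference)

p, r = precision_recall({"zoom", "battery type", "redeye"},
                        {"zoom", "battery type", "flash"})
```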
We have collected 22 different buying guides from print and on-line sources including
those by the Consumers Union (which publishes Consumer Reports), Consumer Electronics HQ,
and Forbes [6, 7, 23]. Collectively, the guides document 56 different product features, some of
which are considered subclasses by some recommenders and distinct features by others. For
example, some experts identify optical zoom and digital zoom as unique product attributes while
others classify optical and digital zoom as subclasses of zoom.
In addition, we have identified 11 different on-line shopping and recommendation sites
for digital cameras and a comparable number for the other product categories for which we have
data. As depicted in Figure 3, on-line shopping sites and recommendation sites, such as
Epinions.com [1], provide browsing interfaces that decompose product attributes into categories
so that customers can easily narrow their search. As with professional reviews, we can compare our induced
product features and categories to those used in on-line interfaces.
Figure 3. Epinions.com Digital Camera Selection Wizard
Robustness of the induced ontology
While we can validate the accuracy of automatically induced ontologies with respect to one or
more external sources, our ultimate objective is to support the automated extraction of features
from on-line reviews for the purpose of tailoring usage-based recommendations [18]. As a
consequence, whether our automated ontologies align with external sources or not, the true test is
our ability to facilitate usage-based recommendations.
To that end, from our initial set of customer reviews, we have created a hand-labeled set
of 594 digital camera customer reviews from Epinions.com. The data set was developed by two
independent student coders and includes inter-coder overlap on 203 of the reviews. The data set
includes 21 different digital camera products ranging in price from $45 to $1000 and covers
resolutions of 1 MP through 6+ MP. The coders manually extracted and categorized both
product features and customer uses. Categories were assigned based upon product features and
levels (subclasses) as defined by the expert reviews used to externally validate ontology
accuracy.
The manual data set not only constitutes a reference for assessing the accuracy of
ontology-based extraction but also provides us with a means for incrementally evaluating our
usage-based recommendations [18]. Because use-extraction involves an approach independent
of ontology-based feature extraction, we can combine manually extracted usage data with
automatically extracted feature data to set a lower-bound on the salience of our resulting
recommendations.
Related work
There is a large body of work on the topic of ontology induction that appears in different
disciplines under different names. Linguists learn specific taxonomic relationships such as
hyponyms and meronyms. Others use the term 'concept hierarchy.' 'Clustering categorical data'
defines a third body of similar research. Although the terminology may differ, the underlying
objective is the same: to group terms into concepts and to identify different types of
relationships between concepts. We can loosely separate the literature based upon whether the
source data being categorized is relational or not. The different approaches will also vary in the
degree to which they are supervised or unsupervised.
Non-relational data
Most of the approaches to ontology induction that begin with non-relational data identify two
distinct processes: extract data and then learn relationships. Supervision may be introduced in
either of the two processes.
One approach is to begin with a single seed ontology and then to grow that ontology by
using supervised learning techniques to extract data from relevant Web pages. Missikoff and
Navigli [24] begin with a generic reference ontology and then extract from domain-specific web
pages to tailor the resulting ontology. By contrast, Modica, Gal, and Jamil [25] seed their
ontology with a single, representative Web source. They use the source HTML to define concept
names and the underlying Document Object Model (DOM) to define the initial relationship
between concepts. Incremental refinement is managed using techniques borrowed from the
database schema matching research literature. In particular, they use unsupervised techniques
such as term similarity, concept (instance) matching, and external sources such as thesauruses.
Rather than explicitly separating extraction from concept learning, it is possible to
combine the two steps to define extraction patterns that are explicitly indicative of a particular
conceptual relation. Hearst did early work applying this approach to learning hyponym
relationships [16]. Maedche and Staab [20, 22] use lexical analysis and syntactic patterns to
extract sets of candidate hyponym concept pairs. They then apply frequent item-set analysis [3].
The confidence and support measures from association rule mining are used to estimate the
validity of a particular concept pair. Instead of association rule mining, KnowItAll [9, 28] draws
upon external sources such as WordNet and labeled training data to assess the validity of
relationships. Specifically, the frequency of a term pair is used to calculate the probability that
the pair is an instance of a particular sub-class relationship.
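The support and confidence measures used by Maedche and Staab can be sketched over candidate (hyponym, hypernym) pairs. The terms and counts below are invented for illustration:

```python
from collections import Counter

def support_confidence(pair_counts, term_counts, n_docs, pair):
    """Association-rule measures over a candidate concept pair.
    support   = fraction of documents containing the pair together
    confidence = P(hypernym | hyponym) estimated from counts"""
    x, y = pair
    support = pair_counts[(x, y)] / n_docs
    confidence = pair_counts[(x, y)] / term_counts[x] if term_counts[x] else 0.0
    return support, confidence

# Hypothetical counts: the pair co-occurs in 40 of 100 documents,
# and 'optical zoom' appears in 50 documents overall.
pairs = Counter({("optical zoom", "zoom"): 40})
terms = Counter({"optical zoom": 50, "zoom": 70})
s, c = support_confidence(pairs, terms, 100, ("optical zoom", "zoom"))
print(s, c)  # 0.4 0.8
```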
While Maedche and Staab use generic extraction rules that are indicative of class
relationships in many different domains (e.g. X such as Y), Finkelstein-Landau and Morin [10]
explicitly draw on supervised techniques to learn domain-specific patterns of hyponyms. They
identify a training set of specific sentences containing relationship instances from which to build
their extraction patterns. Instead of an automated concept assessment technique, Finkelstein-
Landau and Morin rely upon iteration with experts to evaluate the resulting concept pairs.
Finkelstein-Landau and Morin contrast their supervised approach with an unsupervised
technique that returns to the earlier separation of extraction from relationship learning.
Extraction is performed using shallow natural language processing (NLP) techniques such as
tokenizers, Parts-of-Speech (POS) labeling, and chunkers. Relationships are defined based upon
statistical measures of co-occurrence both within a single document and between different
documents.
Like Finkelstein-Landau and Morin's unsupervised approach, we separate "information
extraction" from relationship learning. While we do rely upon lexical analysis to clean the data,
by focusing exclusively on customer review data and using a very narrow representation, we
greatly simplify the extraction step. In this way, we do not need a seed ontology like Missikoff
and Navigli and we do not explicitly need to leverage the source HTML or underlying DOM as
Modica, Gal, and Jamil do. At the same time, we believe that the heterogeneity of products
reviewed in on-line forums presents a reasonable test for the robustness of our approach.
More significantly, many of the lexico-syntactic rules used for extracting hyponym
concept pairs are less applicable in the domain of customer reviews, where grammatical
conventions may break down and text, as in the case of lists of Pros or Cons, is simply a list of
phrases with no associated syntactic or semantic context. A customer might simply write that
"NiMH rechargeables are great" without reference to the concept's hypernym. As a
consequence, techniques to learn syntactic patterns that are representative of hyponym
relationships like Hearst or Maedche and Staab might prove less applicable.
By contrast, phrases in customer reviews often encode 'part-of' meronym relationships.
We leverage co-occurrences within a single review as a heuristic for discarding certain tokens
and discuss the use of co-occurrences between reviews as a weighting factor to relax some of our
stronger assumptions about cliques.
Relational data
Processing relational data greatly facilitates the use of unsupervised techniques by leveraging
knowledge of the structure to simplify the task.
Han and Fu [14] and Lu [19] develop routines for structuring both numerical and nominal
data. For numerical data, they use binning strategies to induce categories within a single
attribute domain. For example, different levels of digital camera resolution, ranging from less
than 1 megapixel to more than 8 megapixels, might be grouped into categories. One binning
criterion is to distribute the domain evenly into a pre-defined number of categories. For
nominal data, they do not categorize the values within a particular domain. Instead, they attempt
to learn the hyponym relationship between attribute domains. Based upon the cardinality and the
frequency histogram of each attribute domain, they define a partial order on the attribute
domains that defines the hierarchical relationship between domains.
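The binning strategy can be sketched as follows; equal-width binning is one of several possible criteria, and the resolution values are illustrative:

```python
def equal_width_bins(values, k):
    """Partition a numeric attribute domain into k equal-width categories."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def label(v):
        # Values at the upper boundary fall into the last bin.
        return min(int((v - lo) / width), k - 1)
    return [label(v) for v in values]

# Illustrative camera resolutions in megapixels.
resolutions = [0.8, 1.2, 2.0, 3.1, 4.0, 5.0, 6.3, 8.1]
print(equal_width_bins(resolutions, 3))  # [0, 0, 0, 0, 1, 1, 2, 2]
```

An equal-frequency variant would instead sort the values and cut the sorted list into k runs of equal length.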
Suryanto and Compton [32] begin with a database in the form of a rule-based classifier.
Their objective is to define a taxonomy on the classes that are identified by the classifier. They
distinguish between subsumption (the hyponym relationship), mutual exclusivity (disjoint
subclasses), and similarity. The intuition is that concepts with similar classification rules are
similar and the derivation to subclasses follows from transitivity of the rules.
Ganti, Gehrke, and Ramakrishnan [11] as well as Gibson, Kleinberg, and Raghavan [12]
assume a relational table and attempt to find subcategories within a particular attribute domain
Di. This differs from the work by Han and Fu and Lu, which define orderings between attribute
domains. Ganti et al. define the 'inter-attribute' summary as a weighted graph where all values of
all domains in the relation are vertices in the graph. The weight of the edge between a_i in D_i and
a_j in D_j, where i ≠ j, is the number of tuples in which a_i and a_j appear together. In addition, they assume
a priori knowledge of the relation and use that knowledge to calculate the 'intra-attribute'
summary between two attribute values in the same domain. Finally, their relations have no nulls
and the mapping of every value to its corresponding attribute domain is known. Gibson et al. use
the same relational assumptions but instantiate multiple instances of a vertex-weighted graph
which they call basins. They iterate over the set of basins until achieving a fixed-point where
vertex weights are separated into distinct positive and negative subsets defining the domain
subclasses.
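An inter-attribute summary in the spirit of Ganti et al. can be sketched as below, assuming a small null-free relation with known attribute positions; the rows are invented:

```python
from collections import Counter
from itertools import combinations

def inter_attribute_summary(tuples):
    """Weighted graph over attribute values: the weight of an edge between
    values from different attribute domains is the number of tuples in
    which the two values co-occur."""
    weights = Counter()
    for row in tuples:
        # enumerate() records each value's attribute position, so values
        # from the same domain never share an edge.
        for (i, a), (j, b) in combinations(enumerate(row), 2):
            weights[((i, a), (j, b))] += 1
    return weights

# Hypothetical relation over (camera type, lens), with no nulls.
rows = [("slr", "zoom"), ("compact", "fixed focus"), ("slr", "zoom")]
g = inter_attribute_summary(rows)
print(g[((0, "slr"), (1, "zoom"))])  # 2
```

The contrast with our setting is precisely that our "tuples" (phrases) are sparse and their attribute positions are unknown, so neither the null-free assumption nor the position index is available.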
Contrasting the work that assumes a priori knowledge of the relations, our approach
bears similarities to schema induction. Using the metaphor of learning relational schemas, each
concept is a relation; the corresponding phrases, extracted from the text of customer reviews or
product descriptions, are tuples of the relation. Finding the largest maximal clique defines the
relational schema. However, our tuples have many nulls. Thus, though we begin with a graph
representation that is similar to Ganti's, our graph is incomplete. Furthermore, we exploit the
graph in a different way. First, we use the principle of connectedness to cluster the tuples of a
relation into hyponyms rather than categorizing the values of a single attribute domain. Second,
we cannot leverage intra-attribute knowledge because the schema is unknown.
To the degree that we can induce a relational structure, Gibson et al. and Ganti et al. offer
a complementary strategy to our connectedness approach for learning hyponym relations within
a domain. Where Han and Fu and Lu define a partial order over the attribute domains in the
schema, we treat the attribute domains as meronyms. In general, neither case is always true.
Discussion and future work
Earlier, we outlined a number of quantitative evaluation measures that are the subject of ongoing
work. In this section, we reflect on the preliminary results as represented by the features and
levels induced on a data set of more than 8000 digital camera product reviews from
Epinions.com covering more than 500 digital camera products, each with a product description
page also from Epinions.com. Even without more quantitative evaluation metrics, a number of
interesting conceptual and pragmatic limitations are revealed by the exploratory work on digital
camera reviews and descriptions. We conclude by presenting some natural extensions to the
current work beyond use-centric mining.
Conceptual issues
Conceptually, we can separate our observations into comments on hyponyms and comments on
meronym/holonyms. For hyponyms, we rely upon a depth-constrained BFS of the token graph.
As a consequence, certain phrases can appear in the subgraphs of more than one hyponym
concept. This overlap exists because we resort to a bound on the BFS rather than relying upon
connected components to distinguish hyponyms. As seen in Table 2, certain phrases, such as
'indoor focuss troubl' are most likely not relevant to the concept redeye reduction. Two solutions
are to either refine the strategy for parsing phrases and pruning the tree (see below) or to
integrate edge weights into the construction of hyponym subgraphs.
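The overlap problem can be illustrated with a minimal depth-bounded BFS over a toy token graph; the adjacency lists and depth bound are illustrative, not our actual parameters:

```python
from collections import deque

def bounded_bfs(adj, seed, depth):
    """Collect tokens reachable from a seed concept within a depth bound.
    adj maps each token to its set of co-occurring neighbors."""
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if seen[u] == depth:
            continue  # bound reached: do not expand further
        for v in adj.get(u, ()):
            if v not in seen:
                seen[v] = seen[u] + 1
                queue.append(v)
    return set(seen)

# Toy stemmed-token graph echoing the Table 2 example.
adj = {
    "redeye": {"reduction", "flash"},
    "reduction": {"redeye"},
    "flash": {"redeye", "indoor"},
    "indoor": {"flash", "focuss"},
}
print(bounded_bfs(adj, "redeye", 2))  # depth 2 already pulls in 'indoor'
```

A tighter bound of 1 would keep the subgraph to {'redeye', 'reduction', 'flash'}, at the cost of dropping legitimate distant tokens; weighting edges is the alternative the text suggests.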
Other issues emerge regarding our approach to inducing meronyms and holonyms. First,
we make the very strong assumption that a heuristic search for a maximal clique over the graph
of phrases that instantiate a concept will capture the complete set of meronyms and holonyms for
that concept. Although the graph of a concept is comparatively small, there is no guarantee that
our heuristic process does not miss a larger maximal clique that more completely represents the
appropriate concept attributes.
In actuality, our search for a maximal clique makes the implicit assumption that the set of
all meronyms and holonyms is captured by a maximum clique. In practice, this assumption may
not prove reasonable. As demonstrated in Table 4 with the phrase '9v,' there is no reason to
expect that a particular feature attribute, even in a single instance, will co-occur pairwise with
every other attribute. Instead, we suggest using edge weights, in concert with co-occurrence, to
search for dense subgraphs that may prove more suitable.
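One candidate dense-subgraph heuristic is greedy peeling (in the style of Charikar's densest-subgraph approximation): repeatedly remove the vertex of least weighted degree and keep the densest vertex set seen. This is a sketch under invented edge weights, not our implemented search:

```python
def densest_subgraph(weights):
    """Greedy peeling heuristic for a dense subgraph.
    'weights' maps frozenset({u, v}) -> co-occurrence weight."""
    nodes = set().union(*weights)
    def density(s):
        # total edge weight inside s, per vertex
        w = sum(wt for e, wt in weights.items() if e <= s)
        return w / len(s) if s else 0.0
    best, cur = set(nodes), set(nodes)
    while cur:
        deg = {u: sum(wt for e, wt in weights.items() if u in e and e <= cur)
               for u in cur}
        cur = cur - {min(deg, key=deg.get)}  # peel weakest vertex
        if density(cur) > density(best):
            best = set(cur)
    return best

# Hypothetical token co-occurrence weights around a battery concept.
w = {frozenset({"aa", "nimh"}): 5, frozenset({"aa", "rechargeable"}): 4,
     frozenset({"nimh", "rechargeable"}): 6, frozenset({"nimh", "9v"}): 1}
print(densest_subgraph(w))  # drops '9v', keeping the other three tokens
```

Unlike a maximal-clique search, the weakly connected '9v' is retained or dropped by weight rather than by the all-pairs co-occurrence requirement.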
The limitations of our assumption about maximal cliques are exacerbated by the
incompleteness of phrases in reviews and product descriptions. Unlike the relational tuples in
related work by Ganti et al. and by Gibson et al. that do not contain nulls, phrases are quite
sparse. For a sufficiently complex concept, it is unusual for a single phrase to comprise tokens
that instantiate every meronym or holonym of the complex concept.
Discovering meronyms and holonyms is not the only difficulty. For sparse phrases, assigning
tokens to the appropriate meronym or holonym is a separate challenge. First, the maximal clique
might miss the relevant concept. As seen in Table 4, the maximal clique does not include the
voltage concept. Moreover, there is sometimes little contextual information to guide token
assignment. For example, absent additional information, the token 'kcrv3' could refer to the
concept of chemistry or size or even rechargeable. Finally, the data set may simply include
erroneous data. Our list of phrases for battery type includes the phrase 'neck strap.'
We are currently experimenting with two approaches to the challenges related to inducing
meronym/holonyms. First, we are attempting to identify additional sources of relation
information to assign the tokens in a phrase to the appropriate meronym or holonym. Inter-
document edge frequency is one source of additional information. A second source of
information involves foreign key relationships between one concept and other concepts. For
example, a review has an associated product description. That description contains additional
data and may in turn link to product descriptions (and reviews) for similar items (e.g. made by
the same manufacturer). A second approach to inducing meronym and holonym relationships
involves traditional clustering. If we can identify appropriate similarity measures, it is possible
to cluster tokens to maximize within-attribute-domain similarity while minimizing between-
attribute-domain similarity.
Pragmatic issues
In addition to conceptual issues, our early efforts reveal a number of pragmatic considerations.
First, our technique is strongly dependent upon good parsing. The robustness of our approach in
the face of parsing errors is an open question. For example, we parse the line '2 aa lithium ion, nimh,
alkaline batteries' into three phrases: '2 aa lithium ion,' 'nimh', and 'alkaline batteries.' The
grammatically correct decomposition would distribute the prefix '2 aa' and the suffix 'batteries'
across all three chemistry types (lithium ion, nimh, and alkaline). Moreover, the parse reveals
the limitations of our graph approach to hyphenated terms. 'lithium-ion' is a single vertex in the
graph whereas 'lithium' and 'ion' are distinct vertices. As suggested earlier, additional text
processing to address spelling mistakes such as 'focuss' in Figure 4 and again in Table 2 will also
help. Alternatively, one might also suggest that our technique is complementary to other
approaches that develop more sophisticated parsing techniques.
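The grammatically correct decomposition could be sketched as follows, assuming the shared prefix and suffix have already been identified; detecting them is the hard part and is elided here:

```python
def distribute_affixes(line, prefix, suffix):
    """Distribute a shared prefix and suffix across a comma-separated
    list of alternatives. Prefix/suffix detection is assumed as given."""
    core = line
    if core.startswith(prefix):
        core = core[len(prefix):]
    if core.endswith(suffix):
        core = core[:-len(suffix)]
    # Reattach the affixes to every alternative in the list.
    return [f"{prefix}{part.strip()} {suffix}".strip()
            for part in core.split(",")]

phrases = distribute_affixes("2 aa lithium ion, nimh, alkaline batteries",
                             "2 aa ", "batteries")
print(phrases)
# ['2 aa lithium ion batteries', '2 aa nimh batteries', '2 aa alkaline batteries']
```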
Second, the technique may prove sensitive to data sample size. We rely upon a large data
set to yield the phrases from which we identify a maximal clique. The large data set is also a
boon because we discard phrases quite liberally to minimize the effects of naïve parsing (see
above). To measure the sensitivity to sample size, we propose to perform cross-validation on
smaller samples of reviews and product descriptions. We can plot the trade-off between sample
size and ontology performance measures to identify diminishing returns and attempt to estimate
a minimal number of required reviews. Care needs to be taken in ensuring sufficient
heterogeneity in the sample selection with respect to different product features and the
corresponding feature attributes.
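The proposed sensitivity analysis can be sketched as a learning curve over random samples; here `induce` and `score` are placeholders for our induction pipeline and evaluation metrics, and the toy stand-ins merely count vocabulary:

```python
import random

def learning_curve(reviews, fractions, induce, score, seed=0):
    """Induce the ontology on progressively larger random samples and
    score each result, to expose diminishing returns in sample size."""
    rng = random.Random(seed)  # fixed seed for reproducible samples
    curve = []
    for f in fractions:
        sample = rng.sample(reviews, int(f * len(reviews)))
        curve.append((f, score(induce(sample))))
    return curve

# Toy stand-ins: 'induce' returns the vocabulary, 'score' its size.
reviews = [f"review {i}" for i in range(100)]
curve = learning_curve(reviews, [0.1, 0.5, 1.0],
                       induce=lambda s: set(s), score=len)
print(curve)  # [(0.1, 10), (0.5, 50), (1.0, 100)]
```

Stratifying the sample by product feature, as the text cautions, would replace the uniform `rng.sample` with per-stratum sampling.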
Finally, the generalizability of our approach may appear limited at first glance. However,
our approach, with its attendant limitations, is quite general with respect to domain knowledge.
The process is designed to apply across heterogeneous product domains, and we are currently
testing the approach on other product categories. Moreover, while the strong assumptions that
we make about co-occurrence may appear limiting, there are other domains where similar
phrase-like text strings prevail as opposed to prose. Progress notes in medical records are one
such example. Finally, the technique is most susceptible to parsing (see above). To the degree
that complementary approaches can provide phrases as input, a graph-based approach may still
prove tractable.
Future work
In addition to a more rigorous, quantitative analysis of our current approach as well as steps to
address some of the issues that emerged from early trials, there are a number of ways in which
we might enrich the relationships that we are learning.
For example, we currently learn hyponym relationships between classes of features.
However, with respect to a single instance, some subclasses are mutually exclusive while others
are not. For example, the feature camera type might have the following subclasses: slr,
standard, and compact. Digital cameras also have the feature lens, of which fixed focus and zoom
are two subclasses. A single instance of a digital camera would be categorized in only one
camera-type subclass but could support multiple lens subclasses. From a marketing and
recommendation perspective, it would be particularly useful to extend our induced ontology to
distinguish between mutually exclusive subclasses and those that are not.
The two features of camera type and lens also exhibit a second dimension of the
hyponym relationship. Some subclasses are actually ordinal in nature. slr, standard, and
compact, are decreasing in size. fixed focus and zoom are not similarly ordered. Learning
ordinal subcategories is particularly useful because of the closely related task of aligning
different orderings. From a recommendation standpoint, aligning orderings is critical because
different customers may address a concept using parallel ordinal categories. For example, one
customer might seek a small, medium, or large digital camera whereas another might seek a
compact, standard, or professional. The problem is most commonly manifested in digital
cameras where resolution is mapped to digital photo enlargement. Some customers seek a 3.0
MP camera whereas others seek a resolution appropriate for printing high quality 8 x 10
enlargements.
From aligning parallel orderings we might extend to aligning or merging entire
ontologies. There are many different sources of on-line customer reviews, and one of our end
objectives is to support use-centric mining across reviews gathered from heterogeneous sources.
Doing so, however, may also require learning ontologies unique to distinct sites and then aligning
those ontologies [25]. Ontology alignment can help ensure that the review
vectors derived from distinct sources remain comparable.
While there are many buying guides that provide recommendations for specific products
or services, most guides tend to rely upon domain-specific experts. Unfortunately, reliance upon
experts is not scalable. Automated support for product-specific recommendations is necessitated
by the heterogeneity of product categories and the increasing complexity within a category. A
critical step in providing automated support lies in simply understanding the language used to
describe a particular product category. In this paper, we discussed a generic, graph-based
approach to learning the ontology for specific product categories for the purpose of creating use-
centric product recommendations.
Acknowledgements: The author would like to thank Alan Abrahams, Sean McGrew, Ellen Ngai, Sojeong Song, and the OPIM-IDT group for their invaluable help. This work was supported in part by the University of Pennsylvania Research Foundation.
References
[1] "Epinions.com Customer Reviews," accessed: 5 July 2004, http://www.epinions.com/Digital_Cameras.
[2] P. Black, "Dictionary of Algorithms and Data Structures," 3 Jan 2005, accessed: February 2005, www.nist.gov/dads.
[3] C. Borgelt and R. Kruse, "Induction of Association Rules: Apriori Implementation," presented at 15th Conf. on Computational Statistics (Compstat), Berlin, Germany, 2002.
[4] D. Brelaz, "New Methods to Color the Vertices of a Graph," Communications of the ACM, vol. 22, pp. 251-256, 1979.
[5] K.-W. Cheung, J. T. Kwok, M. H. Law, and K.-C. Tsui, "Mining customer product ratings for personalized marketing," Decision Support Systems, vol. 35, pp. 231-243, 2003.
[6] Consumer Electronics HQ, "Buying Your First Digital Camera: The Basics," accessed: 19 November 2004, http://www.digitalcamera-hq.com/hqguides/firsttime-buyer.html.
[7] Consumers Union, "Digital Cameras," Consumer Reports Buying Guide 2005, pp. 33-36; 237-240.
[8] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, Y.-K. Ng, D. Quass, and R. D. Smith, "Conceptual-model-based data extraction," Data & Knowledge Engineering, vol. 33, pp. 227-251, 1999.
[9] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Web-scale information extraction in KnowItAll (preliminary results)," presented at WWW 2004, New York, NY, 2004.
[10] M. Finkelstein-Landau and E. Morin, "Extracting Semantic Relationships between Terms: Supervised vs. Unsupervised Methods," presented at International Workshop on Ontological Engineering on the Global Information Infrastructure, Dagstuhl Castle, Germany, 1999.
[11] V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS - Clustering categorical data using summaries," presented at ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, 1999.
[12] D. Gibson, J. Kleinberg, and P. Raghavan, "Clustering categorical data: an approach based on dynamical systems," presented at 24th International Conference on Very Large Databases (VLDB), 1998.
[13] T. Gruber, "Toward principles for the design of ontologies used for knowledge sharing," Stanford KSL-93-04, presented at International Workshop on Formal Ontology, 1993.
[14] J. Han and Y. Fu, "Dynamic generation and refinement of concept hierarchies for knowledge discovery in databases," presented at AAAI 94 Workshop on Knowledge Discovery in Databases (KDD94), 1994.
[15] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers, 2001.
[16] M. Hearst, "Automatic acquisition of hyponyms from large text corpora," presented at Fourteenth International Conference on Computational Linguistics (COLING), Nantes, France, 1992.
[17] M. Hu and B. Liu, "Mining and Summarizing Customer Reviews," presented at KDD04, Seattle, WA, 2004.
[18] T. Lee, "Use-centric mining of customer reviews," presented at Workshop on Information Technology and Systems (WITS), Washington, D.C., 2004.
[19] Y. Lu, "Concept hierarchy in data mining: specification, generation and implementation," M.Sc. thesis, School of Computing Science, Simon Fraser University, Vancouver, BC, p. 106.
[20] A. Maedche and S. Staab, "Semi-automatic Engineering of Ontologies from Text," presented at Twelfth International Conference on Software Engineering and Knowledge Engineering (SEKE'2000), Chicago, 2000.
[21] A. Maedche and S. Staab, "Ontology learning for the Semantic Web," IEEE Intelligent Systems, pp. 72-79, 2001.
[22] A. Maedche, G. Neumann, and S. Staab, "Bootstrapping an Ontology-based Information Extraction System," in Intelligent Exploration of the Web, P. Szczepaniak, J. Segovia, J. Kacprzyk, and L. Zadeh, Eds. Heidelberg: Springer / Physica Verlag, 2002.
[23] S. Manes, "Digital camera guide," 11 June 2004, accessed: 19 November 2004, http://msnbc.msn.com/id/4758940.
[24] M. Missikoff and R. Navigli, "Integrated approach to Web ontology learning and engineering," IEEE Computer, pp. 54-57, 2002.
[25] G. Modica, A. Gal, and H. Jamil, "The Use of Machine-Generated Ontologies in Dynamic Information Seeking," presented at CoopIS 2001, Trento, Italy, 2001.
[26] T. Nasukawa and J. Yi, "Sentiment Analysis: Capturing Favorability Using Natural Language Processing," presented at K-CAP'03, Sanibel Island, FL, 2003.
[27] B. Omelayenko, "Learning of ontologies for the Web: the analysis of existent approaches," presented at ICDT 01 International Workshop on Web Dynamics, London, UK, 2001.
[28] A.-M. Popescu, A. Yates, and O. Etzioni, "Class extraction from the World Wide Web," presented at AAAI 2004 Workshop on Adaptive Text Extraction and Mining (ATEM), 2004.
[29] G. Salton and M. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
[30] E. Schwartz, "Edmunds.com deploys text mining tool for user forums," InfoWorld, 6 August 2004.
[31] S. Skiena, The Algorithm Design Manual. New York: TELOS, Springer-Verlag, 1998.
[32] H. Suryanto and P. Compton, "Learning classification taxonomies from a classification knowledge based system," presented at ECAI 2000 Workshop on Ontology Learning, Berlin, 2000.