
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 969–979, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics

Web-Scale Language-Independent Cataloging of Noisy Product Listings for E-Commerce

Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta
Rakuten Institute of Technology, Boston, MA, 02110 - USA

{pradipto.das, ts-yandi.xia, aaron.levine}@rakuten.com
{giuseppe.difabbrizio, ankur.datta}@rakuten.com

Abstract

The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. However, manual and rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for categorizing listings in both English and Japanese product catalogs. We show empirically that a combination of words from product titles, navigational breadcrumbs, and list prices, when available, improves results significantly. We outline a novel method using correspondence topic models and a lightweight manual process to reduce noise from mis-labeled data in the training set. We contrast linear models, gradient boosted trees (GBTs) and convolutional neural networks (CNNs), and show that GBTs and CNNs yield the highest gains in error reduction. Finally, we show GBTs applied in a language-agnostic way on a large-scale Japanese e-commerce dataset have improved taxonomy categorization performance over current state-of-the-art based on deep belief network models.

1 Introduction

Web-scale e-commerce catalogs are typically exposed to potential buyers using a taxonomy categorization approach where each product is categorized by a label from the taxonomy tree. Most e-commerce search engines use taxonomy labels to optimize query results and match relevant listings to users' preferences (Ganti et al., 2010). To illustrate the general concept, consider Fig. 1. A merchant pushes new men's clothing listings to an online catalog infrastructure, which then organizes the listings into a taxonomy tree. When a user searches for a denim brand, "DSquared2", the search engine first has to understand that the user is searching for items in the "Jeans" category. Then, if the specific items cannot be found in the inventory, other relevant items in the "Jeans" category are returned in the search results to encourage the user to browse further. However, achieving good product categorization for e-commerce marketplaces is challenging.

Commercial product taxonomies are organized in tree structures three to ten levels deep, with thousands of leaf nodes (Sun et al., 2014; Shen et al., 2012b; Pyo et al., 2016; McAuley et al., 2015). Unavoidable human errors creep in while uploading data using such large taxonomies, contributing to mis-labeled listing noise in the data set. Even eBay, where merchants have a unified taxonomy, reported a 15% error rate in categorization (Shen et al., 2012b). Furthermore, most e-commerce companies receive millions of new listings per month from hundreds of merchants composed of wildly different formats, descriptions, prices and meta-data for the same products. For instance, the two listings, "University of Alabama all-cotton non iron dress shirt" and "U of Alabama 100% cotton no-iron regular fit shirt" by two merchants refer to the same product.

E-commerce systems trade off between classifying a listing directly into one of thousands of leaf node categories (Sun et al., 2014; ?) and splitting the taxonomy at predefined depths (Shen et al., 2011; ?) with smaller subtree models. In the latter case, there is another trade-off between the number of hierarchical subtrees and the propagation of error in the prediction cascade. Similar to (Shen et al., 2012b; Cevahir and Murakami, 2016), we classify product listings in two or three steps, depending on the taxonomy size. First, we predict the top-level category and then classify the listings using another one or two levels of subtree models selected by the previous predictions.

Figure 1: E-commerce platform using taxonomy categorization to understand query intent, match merchant listings to potential buyers as well as to prevent buyers from navigating away on search misses.

For our large-scale taxonomy categorization experiments on product listings, we use two in-house datasets,1 a publicly available Amazon product dataset (McAuley et al., 2015), and a publicly available Japanese product dataset.2

Our paper makes several contributions: 1) We perform large-scale comparisons with several robust classification methods and show that Gradient Boosted Trees (GBTs) (Friedman, 2000; ?) and Convolutional Neural Networks (CNNs) (LeCun and Bengio, 1995; ?) perform substantially better than state-of-the-art linear models (Section 5). We further provide analysis of their performance with regards to imbalance in our datasets. 2) We demonstrate that using both listing price and navigational breadcrumbs – the branches that merchants assign to the listings in web pages for navigational purposes – boosts categorization performance (Section 5.3). 3) We effectively apply correspondence topic models to detect and remove mis-labeled instances in training data with minimal human intervention (Section 5.4). 4) We empirically demonstrate the effectiveness of GBTs on a large-scale Japanese product dataset over a recently published state-of-the-art method (Cevahir and Murakami, 2016), and in turn the otherwise language-agnostic capabilities of our system given a language-dependent word tokenization method.

2 Related Work

The nature of our problem is similar to those reported in (Bekkerman and Gavish, 2011; Shen et al., 2011; Shen et al., 2012b; Yu et al., 2013b; Sun et al., 2014; Kozareva, 2015; ?), but with more pronounced data quality issues.

1 The in-house datasets are from Rakuten USA, managed by Rakuten Ichiba, Japan's largest e-commerce company.

2 This dataset is from Rakuten Ichiba and is released under the Rakuten Data Release program.

However, the existing methods for noisy product classification have only been applied to English. Their efficacy for moraic and agglutinative languages such as Japanese remains unknown.

The work in Sun et al. (2014) emphasizes the use of simple classifiers in combination with large-scale manual efforts to reduce noise and imperfections from categorization outputs. While human intervention is important, we show how unsupervised topic models can substantially reduce such expensive efforts for product listings crawled in the wild. Further, unlike Sun et al. (2014), we adopt stronger baseline systems based on regularized linear models (Hastie et al., 2003; Zhang, 2004; Zou and Hastie, 2005).

A recent work from Pyo et al. (2016) emphasizes the use of recurrent neural networks for taxonomy categorization purposes. Although they mention that RNNs render unlabeled pre-training of word vectors (Mikolov et al., 2013) unnecessary, we show, in contrast, that training word embeddings on the whole set of three product title corpora improves performance for CNN models and opens up the possibility of leveraging other product corpora when available.

Shen et al. (2012b) advocate the use of algorithmic splitting of the taxonomy using graph-theoretic latent group discovery to mitigate data imbalance problems at the leaf nodes. They use a combination of k-NN classifiers at the coarser level and SVM (Cortes and Vapnik, 1995) classifiers at the leaf levels. Their SVMs solve much easier k-way multi-class categorization problems where k ∈ {3, 4, 5} with much less data imbalance. We, however, have found that SVMs do not work well in scenarios where k is large and the data is imbalanced. Due to our high-dimensional feature spaces, we avoided k-NN classifiers, which can cause prohibitively long prediction times under arbitrary feature transformations (Manning et al., 2008; Cevahir and Murakami, 2016).

The use of a bi-level classification using k-NN and hierarchical clustering is incorporated in Cevahir and Murakami (2016)'s work, where they use nearest neighbor methods in addition to Deep Belief Networks (DBN) and Deep Auto Encoders (DAE) over both titles and descriptions of the Japanese product listing dataset. We show in Section 5.6 that, using a tri-level cascade of GBT classifiers over titles, we significantly outperform the k-NN+DBN classifier on average.

3 Dataset Characteristics

We use two in-house datasets, named BU1 and BU2, one publicly available Amazon dataset (AMZ) (McAuley et al., 2015), and a Japanese product listing dataset named RAI (Cevahir and Murakami, 2016) (short for Rakuten Ichiba) for the experiments in this paper.

BU1 is categorized using human annotation efforts and rule-based automated systems. This leads to a high precision training set at the expense of coverage. On the other hand, for BU2, noisy taxonomy labels from external data vendors have been automatically mapped to an in-house taxonomy without any human error correction, resulting in a larger dataset at the cost of precision. BU2 also suffers from inconsistencies in regards to incomplete or malformed product titles and meta-data arising out of errors in the web crawlers that vendors use to aggregate new listings. However, for BU2, the noise is distributed identically in the training and test sets, thus evaluation of the classifiers is not impeded by it.

The Japanese RAI dataset consists of 172,480,000 records split across 26,223 leaf nodes. The distribution of product listings in the leaf nodes is based on the popularity of certain product categories and is thus highly imbalanced. For instance, the top-level "Sports & Outdoor" category has 2,565 leaf nodes, while the "Travel / Tours / Tickets" category has only 38. The RAI dataset has 35 categories at depth one (level-one categories) and 400 categories at depth two of the full taxonomy tree. The total depth of the tree varies from three to five levels.

The severity of data imbalance for BU2 is shown in Figure 2. The top-level "Home, Furniture and Patio" subtree accounts for almost half of the BU2 dataset.

Figure 2: Top-level category distribution of 40 million deduplicated listings from an earlier Dec 2015 snapshot of BU2. Each category subtree is also imbalanced, as seen in the exploded view of the "Home, Furniture, and Patio" category.

Table 1 shows dataset characteristics for the four different kinds of product datasets we use in our analyses. It lists the number of branches for the top-level taxonomy subtrees, the total number of branches ending at leaf nodes for which there are a non-zero number of listings, and two important summary statistics that help quantify the nature of imbalance. We first calculate the Pearson correlation coefficient (PCC) between the number of listings and branches in each of the top-level subtrees for each of the four datasets.

A perfectly balanced tree will have a PCC of 1.0. BU1 shows the most benign kind of imbalance with a PCC of 0.643. This confirms that the number of branches in the subtrees correlates well with the volume of listings. Both the AMZ and RAI datasets show the highest branching factors in their taxonomies.

Dataset   Subtrees   Branches   Listings   PCC     KL
BU1       16         1,146      12.1M      0.643   0.872
BU2       15         571        60M        0.209   0.715
AMZ       25         18,188     7.46M      0.269   1.654
RAI       35         26,223     172.5M     0.474   7.887

Table 1: Dataset properties: total number of top-level category subtrees, branches and listings, with the PCC and KL imbalance statistics.

For the AMZ dataset, this could be due to the fact that the crawled taxonomy is different from Amazon's internal catalog. The Rakuten Ichiba taxonomy has been incrementally adjusted to grow in size over several years by creating new branches to support newer and popular products. We observe that for RAI, AMZ and BU2 in particular, the number of branches in the subtrees does not correlate well with the volume of listings. This indicates a much higher level of imbalance.

We also compute the average Kullback-Leibler (KL) divergence, $KL(p(x) \,\|\, q(x))$ (Cover and Thomas, 1991), between the empirical distribution over listings in branches for each subtree rooted in the nodes at depth one, $p(x)$, and a uniform distribution, $q(x)$. Here, the KL divergence acts as a measure of imbalance of the listing distribution and is indicative of the categorization performance that one may obtain on a dataset; high KL divergence leads to poorer categorization and vice-versa (see Section 5).
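Both statistics are straightforward to compute from per-subtree counts. The following sketch, with illustrative counts and helper names of our own rather than the paper's pipeline, computes the PCC between branch counts and listing volumes across top-level subtrees, and the KL divergence of one subtree's listing distribution from a uniform distribution.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical per-subtree statistics: number of branches and number of
# listings under each top-level category subtree (illustrative values only).
branches = np.array([120, 45, 300, 80])
listings = np.array([1.2e6, 0.3e6, 4.0e6, 0.5e6])

# Pearson correlation between branch counts and listing volumes;
# a perfectly balanced taxonomy would give a PCC of 1.0.
pcc, _ = pearsonr(branches, listings)

def kl_from_uniform(counts):
    """KL(p || q) where p is the empirical listing distribution over the
    branches of one subtree and q is uniform over those branches."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    q = np.full_like(p, 1.0 / len(p))
    mask = p > 0  # 0 * log(0) is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Listing counts over the branches of one subtree (again illustrative).
subtree_branch_counts = [50000, 1200, 300, 90, 10]
print(pcc, kl_from_uniform(subtree_branch_counts))
```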

4 Gradient Boosted Trees and Convolutional Neural Networks

GBTs (Friedman, 2000) optimize a loss functional $\mathcal{L} = E_y[L(y, F(\mathbf{x})) \,|\, \mathbf{X}]$, where $F(\mathbf{x})$ can be a mathematically difficult to characterize function, such as a decision tree $f(\mathbf{x})$ over $\mathbf{X}$. The optimal value of the function is expressed as $F^{*}(\mathbf{x}) = \sum_{m=0}^{M} f_m(\mathbf{x}, \mathbf{a}, \mathbf{w})$, where $f_0(\mathbf{x}, \mathbf{a}, \mathbf{w})$ is the initial guess and $\{f_m(\mathbf{x}, \mathbf{a}, \mathbf{w})\}_{m=1}^{M}$ are additive boosts on $\mathbf{x}$ defined by the optimization method. The parameter $\mathbf{a}_m$ of $f_m(\mathbf{x}, \mathbf{a}, \mathbf{w})$ denotes split points of predictor variables and $\mathbf{w}_m$ denotes the boosting weights on the leaf nodes of the decision trees corresponding to the partitioned training set $\mathbf{X}_j$ for region $j$. To compute $F^{*}(\mathbf{x})$, we need to calculate, for each boosting round $m$,

$$\{\mathbf{a}_m, \mathbf{w}_m\} = \arg\min_{\mathbf{a}, \mathbf{w}} \sum_{i=1}^{N} L(y_i, F_m(\mathbf{x}_i)) \tag{1}$$

with $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + f_m(\mathbf{x}, \mathbf{a}_m, \mathbf{w}_m)$. This expression is indicative of a gradient descent step:

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m \left(-g_m(\mathbf{x}_i)\right) \tag{2}$$

where $\rho_m$ is the step length and $g_m(\mathbf{x}_i) = \left[\frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})}\right]_{F(\mathbf{x}_i) = F_{m-1}(\mathbf{x}_i)}$ is the search direction. To solve for $\mathbf{a}_m$ and $\mathbf{w}_m$, we make the basis functions $f_m(\mathbf{x}_i; \mathbf{a}, \mathbf{w})$ correlate most to $-g_m(\mathbf{x}_i)$, where the gradients are defined over the training data distribution. In particular, using a Taylor series expansion, we can get closed-form solutions for $\mathbf{a}_m$ and $\mathbf{w}_m$ – see Chen and Guestrin (2016) for details. It can be shown that $\mathbf{a}_m = \arg\min_{\mathbf{a}} \sum_{i=1}^{N} \left(-g_m(\mathbf{x}_i) - \rho_m f_m(\mathbf{x}_i, \mathbf{a}, \mathbf{w}_m)\right)^2$ and $\rho_m = \arg\min_{\rho} \sum_{i=1}^{N} L\left(y_i, F_{m-1}(\mathbf{x}_i) + \rho f_m(\mathbf{x}_i; \mathbf{a}_m, \mathbf{w}_m)\right)$, which yields

$$F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m f_m(\mathbf{x}, \mathbf{a}_m, \mathbf{w}_m) \tag{3}$$

Each boosting round $m$ updates the weights $\mathbf{w}_{m,j}$ on the leaves and helps create a new tree in the next iteration. The optimal selection of decision tree parameters is based on optimizing $f_m(\mathbf{x}, \mathbf{a}, \mathbf{w})$ using a logistic loss. For GBTs, each decision tree is resistant to imbalance and outliers (Hastie et al., 2003), and $F(\mathbf{x})$ can approximate arbitrarily complex decision boundaries.
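These boosting updates are what the XGBoost library (Chen and Guestrin, 2016) implements. A minimal sketch of a multi-class GBT over bag-of-words title features might look as follows; the example titles, labels, and hyperparameter values are placeholders rather than the settings reported in Section A.

```python
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

# Illustrative titles and encoded category ids (three classes).
titles = ["120 gb hdd 5400rpm sata fdb 2 5 mobile",
          "acer aspire touchscreen ultrabook 8gb ram",
          "university of alabama all-cotton non iron dress shirt"]
labels = [0, 1, 2]

# Word-unigram count features, mirroring the bag-of-words setup of Section 5.2.
vectorizer = CountVectorizer(analyzer="word")
X = vectorizer.fit_transform(titles)

# Multi-class GBT with a softmax objective; the depth, number of boosting
# rounds, learning rate and L2 constant are placeholders, not the tuned
# values from Section A.
gbt = XGBClassifier(objective="multi:softprob", n_estimators=50,
                    max_depth=6, learning_rate=0.2, reg_lambda=0.5)
gbt.fit(X, labels)
print(gbt.predict(vectorizer.transform(["1tb sata hdd portable drive"])))
```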

The convolutional neural network we use is based on the CNN architecture described in LeCun and Bengio (1995) and Kim (2014), using the TensorFlow framework (Abadi et al., 2015). As in Kim (2014), we enhance the performance of "vanilla" CNNs (Fig. 3, right) using word embedding vectors (Mikolov et al., 2013) trained on the product titles from all datasets, without taxonomy labels. Context windows of width n, corresponding to n-grams and embedded in a 300-dimensional word embedding space, are convolved with L filters followed by rectified linear unit activation and a max-pooling operation over the set of all windows W. This operation results in an L×1 vector, which is then connected to a softmax output layer of dimension K×1, where K is the number of classes. Section A lists more details on parameters.

The CNN model tries to allocate as few filters as possible to the context windows while balancing the constraints on the back-propagation of error residuals with regards to the cross-entropy loss $\mathcal{L} = -\sum_{k=1}^{K} q_k \log p_k$, where $p_k$ is the probability of a product title $\mathbf{x}$ belonging to class $k$ predicted by our model, and $q \in \{0, 1\}^K$ is a one-hot vector that represents the true label of title $\mathbf{x}$. This results in a higher predictive power for the CNNs, while still matching complex decision boundaries in a smoother fashion than GBTs. We note here that for all models, the predicted probabilities are not calibrated (Zadrozny and Elkan, 2002).
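A minimal version of this architecture can be sketched with the TensorFlow Keras API. The window widths, 300-dimensional embeddings, max-pooling and softmax output follow the description above, but the vocabulary size, title length, class count and filter count below are illustrative assumptions, and the embedding is randomly initialized instead of using pre-trained word2vec vectors.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 50000            # assumed vocabulary size after filtering
MAX_LEN = 30                  # assumed maximum title length in tokens
EMBED_DIM = 300               # embedding dimension used in the paper
NUM_CLASSES = 16              # e.g., number of top-level category subtrees
WINDOW_WIDTHS = (1, 3, 4, 5)  # context window widths listed in Section A
FILTERS = 128                 # filters per window width (illustrative)

tokens = layers.Input(shape=(MAX_LEN,), dtype="int32")
# The embedding layer could be initialized from word2vec vectors pre-trained
# on product titles; here it is randomly initialized for brevity.
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)

pooled = []
for width in WINDOW_WIDTHS:
    conv = layers.Conv1D(FILTERS, width, activation="relu")(embedded)
    pooled.append(layers.GlobalMaxPooling1D()(conv))  # max over all windows

features = layers.Concatenate()(pooled)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(features)

model = tf.keras.Model(tokens, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```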

5 Experimental Setup and Results

We use Naïve Bayes (NB) (Ng and Jordan, 2001), similar to the approach described in Shen et al. (2012a) and Sun et al. (2014), and Logistic Regression (LogReg) classifiers with L1 (Fan et al., 2008) and Elastic Net regularization, as robust baselines. Parameter setups for the various models and algorithms are mentioned in Section A.

5.1 Data Preprocessing

Product listing datasets in English – BU1 is exclusively comprised of product titles; hence, our features are primarily extracted from these titles.


Figure 3: Classifier performance on the BU1 test set. The CNN classifier has only one configuration and thus shows constant curves in all plots. The left figure shows prediction on a 10% test set using word unigram count features; the middle figure shows prediction on a 10% test set using word bigram bi-positional count features; and the right figure shows mean micro-precision over different feature setups except CNNs. In all figures, "OvO" means "One vs. One" and "OvA" means "One vs. All".

For AMZ and BU2, we additionally extract the list price whenever available. For BU2, we also use the leaf node of any available navigational breadcrumbs. In order to decrease training and categorization run times, we employ a number of vocabulary filtering methods. Further, English stop words and rare tokens that appear in 10 listings or fewer are filtered out. This reduces vocabulary sizes by up to 50%, without a significant reduction in categorization performance. For CNNs, we replace numbers by the nominal form [NUM] and remove rare tokens. We also remove punctuation and then lowercase the resulting text. Part-of-speech (POS) tagging using a generic tagger from Manning et al. (2014) trained on English text produced very noisy features, as is expected for out-of-domain tagging. Consequently, we do not use POS features, due to the absence of a suitable training set for listings unlike that in Putthividhya and Hu (2011). For GBTs, we also experiment with title word expansion using nearest neighbors from a Word2Vec model (Mikolov et al., 2013), for instance, to group words like "t-shirts", "tshirt", and "t-shirt" into their respective equivalence classes; however, the overall results have not been better.
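The vocabulary filtering steps above can be approximated as follows; the thresholds mirror the text (stop words, tokens appearing in 10 or fewer listings, numbers mapped to [NUM]), while the abbreviated stop-word list, helper names and example titles are our own.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "for", "and"}  # abbreviated stop list
MIN_DOC_FREQ = 11  # drop tokens appearing in 10 listings or fewer

def normalize_title(title):
    """Lowercase, strip punctuation, and map numbers to a nominal token."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", " ", title)      # remove punctuation
    title = re.sub(r"\b\d+\b", "[NUM]", title)  # numbers -> [NUM]
    return title.split()

def build_vocabulary(titles):
    """Keep tokens that are not stop words and occur in enough listings."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(normalize_title(title)))
    return {tok for tok, df in doc_freq.items()
            if df >= MIN_DOC_FREQ and tok not in STOP_WORDS}

titles = ["University of Alabama all-cotton non iron dress shirt"] * 20
print(sorted(build_vocabulary(titles)))
```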

Product listing datasets in Japanese – CJK languages like Japanese lack white space between words. Hence, the first pre-processing step requires a specific Japanese tokenization tool to properly segment the words in the product titles.

For our experiments, we used the MeCab3 tokenizer, trained using features that are augmented with in-house product keyword dictionaries.

3 https://sourceforge.net/projects/mecab/

Romaji words written using Latin characters are separated from Kanji and Kana words. All brackets are normalized to square brackets, and punctuation is removed from non-numeric tokens. We also use canonical normalization to change the code points of the resulting Japanese text into an NFKC normalized4 form, then remove anything outside of standard Japanese UTF-8 character ranges. Finally, the resulting text is lowercased.
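A rough version of this Japanese pre-processing chain, assuming the mecab-python3 bindings and an installed MeCab dictionary, is sketched below; the bracket-normalization regular expressions are simplified approximations of the steps described, not the paper's exact rules.

```python
import re
import unicodedata
import MeCab  # mecab-python3 bindings; assumes a MeCab dictionary is installed

tagger = MeCab.Tagger("-Owakati")  # whitespace-separated tokenization

def normalize_japanese_title(title):
    # NFKC canonical normalization of code points.
    text = unicodedata.normalize("NFKC", title)
    # Normalize bracket variants to square brackets (approximate rule).
    text = re.sub(r"[（(【{「『]", "[", text)
    text = re.sub(r"[）)】}」』]", "]", text)
    # Lowercase any Latin (romaji) characters.
    return text.lower()

def tokenize_japanese_title(title):
    """Segment a product title into words with MeCab after normalization."""
    return tagger.parse(normalize_japanese_title(title)).split()

print(tokenize_japanese_title("【送料無料】メンズ Tシャツ Lサイズ"))
```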

Due to the size of the RAI dataset taxonomy tree, three groups of models are trained: to classify new listings into one of 35 level-one categories, then into one of 400 level-two categories, and, finally, into the leaf node of the taxonomy tree. We have found that this scheme works better for the RAI dataset than the bi-level scheme we adopted for the other English datasets.

Applying GBTs on the Japanese dataset involved a bit more feature engineering. At the tokenized word level, we use counts of word unigrams and word bigrams. For character features, the product title is first normalized as discussed above. Then, character 2-, 3-, and 4-grams are extracted with their counts, where extractions include single spaces appearing at the end of word boundaries. Identification of the best set of feature combinations in this case has been performed during cross-validation.
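The combined word and character n-gram counts can be assembled with scikit-learn vectorizers roughly as below; treating the MeCab-segmented title as whitespace-separated text and using a char_wb analyzer (which pads n-grams at word boundaries with spaces) approximates the description above, though the paper's own extraction code is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Assumes titles have already been segmented by MeCab into space-separated tokens.
tokenized_titles = ["メンズ t シャツ l サイズ", "レディース スニーカー 23.5 cm"]

features = FeatureUnion([
    # Word unigrams and bigrams over the MeCab tokens.
    ("word_ngrams", CountVectorizer(analyzer="word",
                                    token_pattern=r"\S+",
                                    ngram_range=(1, 2))),
    # Character 2-4 grams; the char_wb analyzer keeps single spaces at
    # word boundaries inside the n-grams, roughly as described in the text.
    ("char_ngrams", CountVectorizer(analyzer="char_wb",
                                    ngram_range=(2, 4))),
])

X = features.fit_transform(tokenized_titles)
print(X.shape)
```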

5.2 Initial Experiments on BU1 Dataset

Our initial experiments use unigram counts and three other features: word bigram counts, bi-positional unigram counts, and bi-positional bigram counts.

4 http://unicode.org/reports/tr15/


Consider a title text "120 gb hdd 5400rpm sata fdb 2 5 mobile" from the "Data storage" leaf node of the Electronics taxonomy subtree and another title text "acer aspire v7582pg 6421 touchscreen ultrabook 15 6 full hd intel i5 4200u 8gb ram 120 gb hdd ssd nvidia geforce gt 720m" from the "Laptops and notebooks" leaf node. In such cases, we observe that merchants tend to place terms pertaining to storage device specifics at the front of product titles for "Data storage" and similar terms towards the end of the titles for "Laptops". As such, we split the title length in half and augment word uni/bigrams with a left/right-half position.

This makes sense from a Naïve Bayes point of view, since terms like "120 gb"[Left Half], "gb hdd"[Left Half], "120 gb"[Right Half] and "gb hdd"[Right Half] de-correlate the feature space better, which is suitable for the naïve assumption in NB classification. This also helps in slightly better explanation of the class posteriors. These assumptions for NB are validated in the three figures: Fig. 3 left, Fig. 3 middle and Fig. 3 right. Word unigram count features perform strongly for all classifiers except NB, whereas bi-positional word bigram features helped only NB significantly.
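The bi-positional features can be produced by splitting the token sequence in half and tagging each word n-gram with the half it came from; the feature-string format below is our own illustration.

```python
def bipositional_ngrams(title, n=1):
    """Tag word n-grams with the half of the title they occur in."""
    tokens = title.lower().split()
    mid = len(tokens) / 2.0
    feats = []
    for i in range(len(tokens) - n + 1):
        gram = "_".join(tokens[i:i + n])
        half = "LEFT_HALF" if i < mid else "RIGHT_HALF"
        feats.append(f"{gram}|{half}")
    return feats

title = "120 gb hdd 5400rpm sata fdb 2 5 mobile"
print(bipositional_ngrams(title, n=2))
# e.g. ['120_gb|LEFT_HALF', 'gb_hdd|LEFT_HALF', ..., '2_5|RIGHT_HALF', ...]
```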

Additionally, the micro-precision and F1 scores for CNNs and GBTs are significantly higher compared to other algorithms on word unigrams, using a paired t-test with a p-value < 0.0001. The performances of GBTs and LogReg L1 classifiers deteriorate over the other feature sets as well. The bi-positional and bigram feature sets also do not produce any improvements for the AMZ dataset. Based on these initial results, we focus on word unigrams in all of our subsequent experiments.

5.3 Categorization Improvements with Navigational Breadcrumbs and List Prices on BU2 Dataset

BU2 is a challenging dataset in terms of class imbalance and noise, and we sought to improve categorization performance using available meta-data. To start, we experiment with a smaller dataset consisting of ≈500,000 deduplicated listings under the "Women's Clothing" taxonomy subtree, extracted from our Dec 2015 snapshot of 40 million records. Then we train and test against ≈2.85 million deduplicated "Women's Clothing" listings from the Feb 2016 snapshot of BU2. In all experiments, 10% of the data is used as the test set.

Figure 4: Improvements in micro-precision and F1 for GBTs on the BU2 dataset for the "Women's Clothing" subtree.

The "Women's Clothing" category had been chosen due to the importance of the category from a business standpoint, which provided early access to listings in this category. Further, data distributions remain the same in the two snapshots, and the Feb 2016 snapshot consists of listings in addition to those in the Dec 2015 snapshot.

The first noteworthy fact in Fig. 4 is that the micro-precision and F1 of the GBTs substantially improve after increasing the size of the dataset. Further, stop word and rare word filtering decrease precision and F1 by less than 1%, despite halving the feature space. The addition of navigational leaf nodes and list prices proves advantageous, with both features independently boosting performance and raising micro-precision and F1 to over 90%. Despite finding similar gains in categorization performance for other top-level subtrees by using these meta features, we needed a system to filter mis-categorized listings from our training data as well.

5.4 Noise Analysis of BU2 Dataset using Correspondence LDA for Text

The BU2 dataset has the noisiest ground-truth labels, as incorrect labels have been assigned to product listings. However, since the manual verification of millions of listings is infeasible, using some proxy for ground truth is a viable alternative that has previously produced encouraging results (Shen et al., 2012b). We next describe how resorting to unsupervised topic models helped to detect and remove incorrect listings.

As shown in Fig. 8, categorization performance for the "Shoes" taxonomy subtree is over 25 points below the "Women's Clothing" category. Such a large difference could be caused by incorrect assignments of listings to categories.


Figure 5: Selection of the most probable words under the latent "noise" topics over listings in the "Shoes" subtree. Human annotators inspect whether such sets of words belong to a Shoes topic or not.

Figure 6: Interpretation of latent topics using predictions from a GBT classifier. The topics here do not include those in Fig. 5, but are all from the Feb 2016 snapshot of the BU2 dataset.

However, unlike Sun et al. (2014), as there are over 3.4 million "Shoes" listings in the BU2 dataset, a manual analysis to detect noisy labels is infeasible. To address this problem, we compute $p(\mathbf{x})$ over latent topics $z_k$, and automatically annotate the most probable words over each topic.

We choose our CorrMMLDA model (Das et al., 2011) to discover the latent topical structure of the listings in the "Shoes" category for two reasons. Firstly, the model is a natural choice for our scenario since it is intuitive to assume that store and brand names are distributions over words in titles. This is illustrated in the graphical model in Fig. 7, where the store "Saks Fifth Avenue" and the brand "Joie" are represented as words $w_{d,m}$ in the $M$ plate of listing $d$ and are distributions over the words in the product title "Joie kidmore Embossed slipon sneakers", represented as words $w_{d,n}$ in the $N$ plate of the same listing $d$. The title words are in turn distributions over the latent topics $z_d$ for listing $d \in \{1..D\}$.

Secondly, the CorrMMLDA model has been shown to exhibit lower held-out perplexities, indicative of improved topic quality. The reason behind the lower perplexity stems from the following observations. Using the notation in Das et al. (2011), we denote the free parameters of the variational distributions over words in brand and store names, say $\lambda_{d,m,n}$, as multinomials over words in the titles, and those over words in the title, say $\phi_{d,n,k}$, as multinomials over latent topics $z_d$. It is easy to see that the posterior over the topic $z_{d,k}$ for each $w_{d,m}$ of brand and store names is dependent on $\lambda$ and $\phi$ through $\sum_{n=1}^{N_d} \lambda_{d,m,n} \times \phi_{d,n,k}$. This means that if a certain topic $z_d = j$ generates all words in the title, i.e., $\phi_{d,n,j} > 0$, then only that topic also generates the brand and store names, thereby increasing the likelihood of fit and reducing perplexity. The other topics $z_d \neq j$ do not contribute towards explaining the topical structure of the listing $d$.

Figure 7: Correspondence MMLDA model. (Plate diagram with title words $w_{d,N}$ = "Joie Kidmore Embossed Slipon sneakers" and store/brand words $w_{d,M}$ = "Saks Fifth Avenue", "Joie".)


Figure 8: Micro-precision and F1 across fifteen top-level categories on 10% (4 million listings) of the Dec 2015 BU2 snapshot.

We train the CorrMMLDA model with K=100 latent topics. A sample of nine latent topics and their most probable words, shown in Fig. 5, demonstrates that topics outside of the "Shoes" domain can be manually identified, while reducing human annotation efforts from 3.4 million records to one hundred. We choose K=100 since it is roughly twice the number of branches for the Shoes subtree. This choice provides modeling flexibility while respecting the number of ground-truth classes.
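The CorrMMLDA model itself is not available off the shelf; as a rough stand-in for the workflow, the sketch below fits a plain LDA topic model over title words with scikit-learn and prints the six most probable words per topic for manual inspection. This captures the annotation procedure but not the title/metadata correspondence structure of CorrMMLDA, and the toy titles are illustrative only.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

K = 100        # number of latent topics, as in the paper
TOP_WORDS = 6  # average title length of a "Shoes" listing

shoe_titles = ["oxford burgundy plain espresso wingtip",
               "hardcover guide design handbook",   # noise: clearly not a shoe
               "mens work boots composite toe"]     # (illustrative titles)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(shoe_titles)
vocab = vectorizer.get_feature_names_out()

lda = LatentDirichletAllocation(n_components=K, random_state=0)
lda.fit(X)

# Print the most probable words of each topic; annotators only need to judge
# whether each short word list looks like a "Shoes" topic.
for k, topic in enumerate(lda.components_):
    top = [vocab[i] for i in topic.argsort()[::-1][:TOP_WORDS]]
    print(k, " ".join(top))
```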


Figure 9: Micro-precision on 10% of BU2 across categories (see Sect. 5.4).

Figure 10: Micro-precision on 10% of AMZ across categories.

Dataset   NB      LogReg ElasticNet   LogReg L1   GBT      CNN w/ pretraining   Mean KL   log(N/B)
BU1       81.45   86.30               86.75       89.03*   89.12*               0.872     9.27
BU2       68.21   84.29               85.01       90.63*   88.67                0.715     11.54
AMZ       49.01   69.39               66.65       67.17    72.66*               1.654     6.02

Table 2: Mean micro-precision on the 10% test set from the BU1, BU2 and AMZ English datasets.

We next run a list of the six most probable words (six being the average length of a "Shoes" listing's title) from each latent topic through our GBT classifier trained on the full, noisy data, but without considering any metadata, due to the bag-of-words nature of the topic descriptions. As shown in the bottom two rows in Fig. 6, categories mismatching their topics are manually labeled as ambiguous. As a final validation, we uniformly sampled a hundred listings from each ambiguous topic detected by the model. Manual inspections revealed that numerous listings from merchants not selling shoes are wrongly cataloged in the "Shoes" subtree due to vendor error. To this end, we remove listings corresponding to such "out-of-category" merchants from all top-level categories.

Thus, by manually inspecting the K×6 most probable words from the K=100 topics and J×100 listings, where J ≪ K, instead of 3.4 million listings, a few annotators accomplished in hours what would have taken hundreds of annotators several months, according to the estimates in Sun et al. (2014).
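Automatically labeling each topic with a previously trained classifier can be sketched as follows; gbt, vectorizer and topic_top_words are assumed to be the title-only GBT model, its feature extractor and the per-topic word lists from the previous steps, and the category strings are illustrative.

```python
# Assumes: `gbt` is a GBT classifier trained on noisy titles without metadata,
# `vectorizer` its bag-of-words feature extractor, and `topic_top_words` a
# list of the six most probable words for each latent topic.
SHOE_CATEGORIES = {"Oxfords>Men's Shoes>Shoes", "Boots>Men's Shoes>Shoes"}

def flag_ambiguous_topics(topic_top_words, gbt, vectorizer):
    """Return indices of topics whose predicted category falls outside Shoes."""
    pseudo_docs = [" ".join(words) for words in topic_top_words]
    predictions = gbt.predict(vectorizer.transform(pseudo_docs))
    return [k for k, label in enumerate(predictions)
            if label not in SHOE_CATEGORIES]

# Listings sampled from the flagged topics are then manually inspected and
# out-of-category merchants removed from the training data.
```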

5.5 Results on BU2 and AMZ Datasets

In Section 5.2, we have shown the efficacy of word unigram features on the BU1 dataset. Figure 8 shows that LogReg with L1 regularization (Yu et al., 2013b; Yu et al., 2013a) initially achieves 83% mean micro-precision and F1 on the initial BU2 dataset. This falls short of our expectation of achieving an overall 90% precision (red line in Fig. 8), but forms a robust baseline for our subsequent experiments with the AMZ and the cleaned BU2 datasets. We additionally use the list price and the navigational breadcrumb leaf nodes for the BU2 dataset and, when available, the list price for the AMZ dataset.

Overall, Naïve Bayes, being an overly simplified generative model, generalizes very poorly on all datasets (see Figs. 3, 9 and 10). A possible option to improve NB's performance is to use sub-sampling techniques as described in Chawla et al. (2002); however, sub-sampling can have its own problems when dealing with product datasets (Sun et al., 2014).

From Table 2, we observe that most classifiers tend to perform well when log(N/B) is relatively high, where N is the total number of listings and B is the total number of categories. Figures with a * are statistically better than the non-starred ones in the same row, excluding the last two columns. From Fig. 9 and Table 2, it is clear that GBTs are better on BU2.

We also experiment with CNNs augmented to use meta-data while respecting the convolutional constraints on title text; however, the performance improved only marginally. It is not immediately clear why all the classifiers suffer on the "CDs and Vinyl" category, which has more than 500 branches – see Fig. 10. The AMZ dataset also suffers from novel cases of data imbalance.


Figure 11: Comparison of GBTs versus the method from Cevahir and Murakami (2016) on a 10% test set from the Rakuten Ichiba Japanese product listing dataset. Important statistics for the Rakuten Ichiba dataset and classifier performance: average PCC at level one – 0.47; average KL at level one – 7.88; average KL at level two – 3.79; average of the micro-F1 scores across the level-one categories shown: KNN+DBN – 73.85, GBT – 76.89*. GBT performs significantly better than KNN+DBN for 28 out of 35 level-one categories; there is one other category, Smartphones and Tablets, where GBT performs only slightly better than KNN+DBN.

For instance, most of the listings in "Books" and "Grocery" are in one branch, with most other branches containing fewer than 10 listings. In summary, from both Figs. 9 and 10, we observe that GBTs and CNNs with pre-training perform best even under extreme data imbalance. It is possible that GBTs need finer parameter tuning per top-level subtree for datasets resembling AMZ.

5.6 Results on Rakuten Ichiba Dataset

In this section, we report our findings on the efficacy of GBTs vis-à-vis another hybrid nearest neighbor and deep learning based method from Cevahir and Murakami (2016). Our decision to employ a tri-level classifier cascade, instead of the bi-level one used for the other datasets, stems from our observations of the KL divergence values (see Section 3 and Table 1) at the first and second level depths of the RAI taxonomy tree. Moving from the first level down to the second decreases the KL divergence by more than 50%. We thus expect GBTs to perform better due to this reduced imbalance. We also cross-validated this assumption on some popular categories, such as "Clothing".
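One way to organize such a cascade is a dictionary of per-node models, as in the sketch below; this is our own illustration of the scheme described here and in Section 5.1, not the paper's implementation.

```python
# Assumes each entry is an already-trained classifier exposing .predict(),
# e.g. GBTs over the Japanese title features described in Section 5.1.
#   level1_model                      -> one of 35 level-one categories
#   level2_models[l1_label]           -> one of the level-two categories
#   leaf_models[(l1_label, l2_label)] -> a leaf node of the taxonomy

def predict_leaf(features, level1_model, level2_models, leaf_models):
    """Cascade a single feature vector through the three model levels."""
    l1 = level1_model.predict(features)[0]
    l2 = level2_models[l1].predict(features)[0]
    leaf = leaf_models[(l1, l2)].predict(features)[0]
    return l1, l2, leaf
```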

From Fig. 11 and the statistics noted therein, we observe that, on average, GBTs outperform the KNN+DBN model from Cevahir and Murakami (2016) by 3 percentage points across all top-level categories, which is statistically significant under a paired t-test with p < 0.0001. As with previous experiments, only a common best parameter configuration has been set for GBTs, without resorting to time-consuming cross-validation across all categories. For the 29 categories on which GBTs do better, the mean of the absolute percentage improvement is 11.78, with a standard deviation of 5.07. It has also been observed that GBTs significantly outperform KNN+DBN in 28 of those categories.

The comparison in Fig. 11 is more holistic. Unlike the top-level categorization scores obtained in Figs. 3, 9 and 10, the scores in Fig. 11 have been obtained by categorizing each test example through the entire cascade of hierarchical models for the two classifiers. Even in this setting, the performance of GBTs is significantly better.

6 Conclusion

Large-scale taxonomy categorization with noisy and imbalanced data is a challenging task. We demonstrate deep learning and gradient tree boosting models with operational robustness in real industrial settings for e-commerce catalogs with several million items. We summarize our contributions as follows: 1) We conclude that GBTs and CNNs can be used as new state-of-the-art baselines for product taxonomy categorization problems, regardless of the language used; 2) We quantify the nature of imbalance for different product datasets in terms of distributional divergence and correlate that with prediction performance; 3) We also show evidence to suggest that words from product titles, together with leaf nodes from navigational breadcrumbs and list prices, when available, can boost categorization performance significantly on all the product datasets. Finally, 4) we showcase a novel use of topic models with minimal human intervention to clean large amounts of noise, particularly when the source of noise cannot be controlled. This is unlike any experiment reported in previous publications on product categorization. Automatic topic labeling for a given category with a pre-trained classifier from another dataset can help create an initial taxonomy over listings for which none exists. A major benefit of this approach is that it reduces manual effort in initial taxonomy creation.


References

Martín Abadi et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous distributed systems.

Ron Bekkerman and Matan Gavish. 2011. High-precision phrase-based document classification on a modern scale. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, pages 231–239, New York, NY, USA. ACM.

Ali Cevahir and Koji Murakami. 2016. Large-scale multi-class and hierarchical product categorization for an e-commerce giant. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 525–535, Osaka, Japan, December. The COLING 2016 Organizing Committee.

Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artif. Int. Res., 16(1):321–357, June.

Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 785–794.

Corinna Cortes and Vladimir Vapnik. 1995. Support-vector networks. Mach. Learn., 20(3):273–297, September.

Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. Wiley-Interscience, New York, NY, USA.

Pradipto Das, Rohini Srihari, and Yun Fu. 2011. Simultaneous joint and conditional modeling of documents tagged from two perspectives. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1353–1362, New York, NY, USA. ACM.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, June.

Jerome H. Friedman. 2000. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232.

Venkatesh Ganti, Arnd Christian König, and Xiao Li. 2010. Precomputing search features for fast and accurate query classification. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10.

Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2010). Society for Artificial Intelligence and Statistics.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2003. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, August.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October. Association for Computational Linguistics.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Zornitsa Kozareva. 2015. Everyone likes shopping! Multi-class product categorization for e-commerce. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015, pages 1329–1333.

Y. LeCun and Y. Bengio. 1995. Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60.

Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 785–794, New York, NY, USA. ACM.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Andrew Ng and Michael Jordan. 2001. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (NIPS), volume 14.

Duangmanee (Pew) Putthividhya and Junling Hu. 2011. Bootstrapped named entity recognition for product attribute extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1557–1567, Stroudsburg, PA, USA. Association for Computational Linguistics.

Hyuna Pyo, Jung-Woo Ha, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, New York, NY, USA. ACM.

Dan Shen, Jean-David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, pages 1921–1924, New York, NY, USA. ACM.

Dan Shen, Jean-David Ruvini, Rajyashree Mukherjee, and Neel Sundaresan. 2012a. A study of smoothing algorithms for item categorization on e-commerce sites. Neurocomputing, 92:54–60, September.

Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012b. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, pages 595–604, New York, NY, USA. ACM.

Chong Sun, Narasimhan Rampalli, Frank Yang, and AnHai Doan. 2014. Chimera: Large-scale classification using machine learning, rules, and crowdsourcing. Proc. VLDB Endow., 7(13), August.

Hsiang-Fu Yu, Chia-Hua Ho, Yu-Chin Juan, and Chih-Jen Lin. 2013a. LibShortText: A library for short-text classification and analysis. Technical report, Department of Computer Science, National Taiwan University, Taipei 106, Taiwan.

Hsiang-Fu Yu, Chia-Hua Ho, Prakash Arunachalam, Manas Somaiya, and Chih-Jen Lin. 2013b. Product title classification versus text classification. Technical report, UTexas, Austin; NTU; eBay.

Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '02, pages 694–699, New York, NY, USA. ACM.

Tong Zhang. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML 2004: Proceedings of the 21st International Conference on Machine Learning, pages 919–926.

Hui Zou and Trevor Hastie. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320.

A Supplemental Materials: Model Parameters

In this paper, the baseline classifiers comprise Naïve Bayes (NB) (Ng and Jordan, 2001), similar to the approach described in Shen et al. (2012a) and Sun et al. (2014), and Logistic Regression (LogReg) classifiers with L1 (Fan et al., 2008) and Elastic Net regularization. The objective functions of both GBTs and CNNs involve L2 regularizers over the set of parameters. Our development set for parameter tuning is generated by randomly selecting 10% of the listings under the "apparel / clothing" categories. The optimized parameters obtained from this scaled-down configuration are then extended to all other classifiers to reduce experimentation time.

For parameter tuning, we set a linear combination of 15% L1 regularization and 85% L2 regularization for Elastic Net. For GBTs (Chen and Guestrin, 2016) on both English and Japanese data, we limit each decision tree's growth to a maximum depth of 500 and set the number of boosting rounds to 50. Additionally, for leaf node weights, we use L2 regularization with a regularization constant of 0.5. For GBTs on English data, the initial learning rate is 0.2. For GBTs on Japanese data, the initial learning rate is assigned a value of 0.05.

For CNNs, we use context window widths of sizes 1, 3, 4, and 5 for four convolution filters, a batch size of 1024 and an embedding dimension of 300. The parameters for the embeddings are non-static. The convolutional filters are initialized with Xavier initialization (Glorot and Bengio, 2010). We use mini-batch stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2014) to perform parameter optimization.

LogReg classifiers and CNNs need data to be normalized along each dimension, which is not needed for NB and GBTs.
