Date post: | 13-Jun-2015 |
Category: | Technology |
Upload: | shion-deysarkar |
Finding Products on the Internet using Neural Networks
http://www.datafiniti.net
● Goals
○ Collect vast amounts of data through web crawling
○ Normalize and deduplicate data
○ Make it searchable and meaningful
Motivation
Challenges
48 billion pages on the Internet
○ Crawled 6 billion+ pages in the past year
Mostly unstructured data
Limitations of customized crawls
○ Non-scalable
○ Less robust
Solution: Intelligent Classifiers
Advantages
○ Generic code: scalability
○ More robust
Challenges
○ More difficult to parse data of interest
Problem
[Figure: example web pages labeled Product Page, Product Category, Product Page, and Some Other Page]
Solution
Minimize dependency on HTML
Supervised learning for page classification
○ Neural networks
Heuristic algorithms for data parsing
Our Approach
Page Features → Input Layer → Hidden Layer → Output Layer
Outputs: AV(Product), AV(Product Category), AV(Other)
AV: Activation Value, between 0 and 1
Neural Network
Classification_Type = Type with max. AV
Page Features
○ Buy Widget
○ Price
○ Image
○ Num. Clickable Images with Price
○ Shipping Info
○ Weight
○ Product Code
○ Keywords
○ Num. Words on Page
Trained offline: Dataset → Feature Vector → Normalization → Neural Network
○ Input Layer Parameter Set (P)
○ Hidden Layer Parameter Set (Q)
Training
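The slides do not say which normalization is applied before training. As a minimal sketch, assuming min-max scaling into [0, 1] (an assumption, as are the feature names and sample values), the bounds could be fitted on the training dataset and reused for every later page:

```python
def fit_min_max(rows):
    """Return per-feature (min, max) pairs computed over the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def normalize(row, bounds):
    """Scale each feature into [0, 1] using bounds fitted on training data."""
    scaled = []
    for value, (lo, hi) in zip(row, bounds):
        if hi == lo:
            # A constant feature carries no information; map it to 0.0.
            scaled.append(0.0)
        else:
            # Clip so unseen pages with out-of-range values stay in [0, 1].
            scaled.append(min(1.0, max(0.0, (value - lo) / (hi - lo))))
    return scaled


# Hypothetical raw feature vectors: [num_prices, num_images, num_words]
training_rows = [[1, 4, 300], [12, 40, 900], [0, 2, 1500]]
bounds = fit_min_max(training_rows)
normalized = [normalize(row, bounds) for row in training_rows]

# A page seen after training is scaled with the same bounds.
unseen = normalize([24, 1, 600], bounds)
```

Reusing the training bounds keeps the inputs seen at deployment on the same scale as the inputs the parameter sets P and Q were trained on.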
Web page → Feature Vector → Normalized Feature Vector (x) → Neural Network
○ Input Layer Parameter Set (P)
○ Hidden Layer Parameter Set (Q)
Outputs: AV(Prod), AV(ProdCat), AV(Other)
Page_Type = type with max{ AV(Prod), AV(ProdCat), AV(Other) }
Output of hidden layer: $L_1 = \mathrm{sigmoid}(P^T x)$
Final output: $L_2 = \mathrm{sigmoid}(Q^T L_1)$
$L_2 = \{ \mathrm{AV(Prod)}, \mathrm{AV(ProdCat)}, \mathrm{AV(Other)} \}$
$\mathrm{sigmoid}(s) = 1 / (1 + e^{-s})$
Deployment
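The deployment equations above can be sketched in a few lines of plain Python. The weight values, layer sizes, and input vector below are hypothetical and chosen only to make the example run; the trained P and Q are not given in the slides, and no bias terms are used because the slides do not mention any.

```python
import math

PAGE_TYPES = ["Product", "Product Category", "Other"]


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def layer(weights, inputs):
    """Compute sigmoid(W^T v), where weights[i][j] links input i to unit j."""
    n_units = len(weights[0])
    return [
        sigmoid(sum(weights[i][j] * inputs[i] for i in range(len(inputs))))
        for j in range(n_units)
    ]


def classify(x, P, Q):
    """Return the page type with the largest activation value, plus all AVs."""
    l1 = layer(P, x)   # output of hidden layer: sigmoid(P^T x)
    l2 = layer(Q, l1)  # final output: sigmoid(Q^T L1)
    best = max(range(len(l2)), key=lambda k: l2[k])
    return PAGE_TYPES[best], l2


# Hypothetical parameters: 3 input features, 2 hidden units, 3 outputs.
P = [[2.0, -1.0], [1.5, -0.5], [-1.0, 2.0]]
Q = [[3.0, -1.0, -2.0], [-2.0, 1.0, 3.0]]
x = [1.0, 1.0, 0.0]  # normalized feature vector

page_type, avs = classify(x, P, Q)
```

Because the sigmoid is applied at the output layer, each activation value lies strictly between 0 and 1, and the three values need not sum to 1.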
Notation
○ True Positive (TP)
○ False Positive (FP)
○ False Negative (FN)
Precision : TP / (TP + FP)
Recall : TP / (TP + FN)
F-score: 2PR / (P + R)
Known Dataset
○ Precision = 1.0
○ Recall = 0.985
○ F-score = 0.9925
Live System/Unknown Data
○ Precision = 0.854
○ Recall = Difficult to calculate
Evaluation
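As a worked example of the precision, recall, and F-score definitions above, the following sketch computes all three from raw counts. The counts are hypothetical and are not taken from the evaluation reported in the slides.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-score) from raw classification counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical counts: 90 product pages found correctly, 10 pages wrongly
# labeled as product pages, 30 product pages missed.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
```

Recall needs the number of false negatives, which requires knowing every product page in the data. That is why it is hard to measure on a live crawl, where the missed pages are by definition unlabeled.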
Problem
Data Extraction
Product Name
Product Price
Product Code
○ UPC, EAN, ISBN, ASIN
Fields to Collect
Getting Product Name
Product Page
Potential Names
<title>Pebble Smart Watch for Select Apple and Android Devices 301RD - Best Buy</title>
Match Found
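The slide shows potential names being matched against the page's `<title>` tag. One way this could work, as a rough sketch: collect candidate strings from the page and keep the longest one that also appears in the title. The candidate list, the longest-match rule, and the function names are assumptions for illustration, not the heuristic actually used.

```python
import re


def extract_title(html):
    """Return the text of the first <title> tag with whitespace collapsed."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return " ".join(match.group(1).split()) if match else ""


def pick_product_name(html, potential_names):
    """Pick the longest potential name that also occurs in the page title."""
    title = extract_title(html).lower()
    matches = [
        name.strip() for name in potential_names
        if name.strip() and name.strip().lower() in title
    ]
    return max(matches, key=len) if matches else None


html = ("<html><head><title>Pebble Smart Watch for Select Apple and Android "
        "Devices 301RD - Best Buy</title></head><body></body></html>")

# Hypothetical candidates, e.g. taken from headings or prominent text.
potential_names = [
    "Best Buy",
    "Pebble Smart Watch for Select Apple and Android Devices",
    "Customer Reviews",
]

name = pick_product_name(html, potential_names)
```

Preferring the longest match keeps short strings such as the retailer name, which also appears in the title, from winning over the full product name.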
Getting Product Price
Product Page
○ Price values with text - discard
○ Old Price - discard
○ Current Price - Accept
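The three rules above can be sketched as a filter over price candidates. The slides do not say how an old price is recognized, so this example assumes each candidate arrives with a flag saying whether it was struck through on the page; that flag, the dollar-only pattern, and the function name are all hypothetical.

```python
import re

# Matches a string that is nothing but a dollar price, e.g. "$149.99".
PRICE_ONLY = re.compile(r"^\s*\$\s?(\d{1,3}(?:,\d{3})*|\d+)(\.\d{2})?\s*$")


def pick_current_price(candidates):
    """candidates: list of (text, is_struck_through) pairs in page order."""
    for text, is_struck_through in candidates:
        match = PRICE_ONLY.match(text)
        if match is None:
            continue  # price value with surrounding text: discard
        if is_struck_through:
            continue  # old price: discard
        digits = match.group(1) + (match.group(2) or "")
        return float(digits.replace(",", ""))  # current price: accept
    return None


# Hypothetical candidates extracted from a product page.
candidates = [
    ("Save $20.00 today", False),
    ("$169.99", True),
    ("$149.99", False),
]

price = pick_current_price(candidates)
```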
Improve classification accuracy
Increase/improve collection of data fields
Future Work
Questions?
https://www.datafiniti.net
http://blog.datafiniti.net
@datafiniti
Price
Image(s)
# clickable images adjacent to price values
"Add to cart", "Buy" widget
# words in page text
Keywords
○ Product detail, specifications, features, size, color, weight, shipping, availability, SKU, UPC, ISBN, ASIN
Page Features
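To make the feature list above concrete, here is a simplified sketch that turns raw HTML into a small feature dictionary using regular expressions. The feature names and the sample page are hypothetical, and the clickable-image count is simplified: it counts images wrapped directly in links rather than checking adjacency to price values as the slide describes.

```python
import re

KEYWORDS = [
    "product detail", "specifications", "features", "size", "color", "weight",
    "shipping", "availability", "sku", "upc", "isbn", "asin",
]
PRICE = re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?")
TAG = re.compile(r"<[^>]+>")


def page_features(html):
    """Extract a few illustrative page features from raw HTML."""
    text = TAG.sub(" ", html)  # crude tag stripping, enough for a sketch
    lowered = text.lower()
    has_buy = "add to cart" in lowered or re.search(r"\bbuy\b", lowered)
    return {
        "num_prices": len(PRICE.findall(text)),
        "num_images": len(re.findall(r"<img\b", html, re.IGNORECASE)),
        "num_clickable_images": len(
            re.findall(r"<a\b[^>]*>\s*<img\b", html, re.IGNORECASE)),
        "has_buy_widget": int(bool(has_buy)),
        "num_words": len(text.split()),
        "num_keywords": sum(1 for word in KEYWORDS if word in lowered),
    }


# Hypothetical product page fragment.
sample = (
    '<h1>Pebble Smart Watch</h1>'
    '<a href="/zoom"><img src="watch.jpg"></a>'
    '<span>$149.99</span>'
    '<button>Add to Cart</button>'
    '<p>Specifications: color black, SKU 301RD. Free shipping.</p>'
)

features = page_features(sample)
```

The resulting values would still need to be ordered into a fixed-length vector and normalized before being fed to the network.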
Some Other Page
Product Images
Related Products
Price
Widget to buy
Shipping Info
Classification Intuition