Post on 23-Apr-2018
Girish Nathan, Misha Bilenko
Microsoft Azure Machine Learning
How to Work with Large Datasets to Build Predictive Models
Agenda
1. How to Work with Large Datasets
• Sample Dataset: NYC Taxi
• HDInsight (Hadoop on Azure)
• iPython notebook and HDInsight
2. Building Predictive Models
• Azure ML Studio
• Learning with Counts
3. Putting it all together: Learning with Counts and HDInsight
Sample Data: NYC Taxi
• One year log of NYC taxi rides
• 60GB, publicly available at http://www.andresmh.com/nyctaxitrips/
• Trip (driver id, times, locations) and fare (fare, tip, tolls)
• Rest of tutorial: data wrangling and tip prediction
• Tools: AzCopy, HDInsight, iPython, Azure ML Studio
• 100% Apache Hadoop as an Azure service
• Can deploy on Windows or Linux
• Provides Map-Reduce capability over big data in Azure blobs
• Head node: job and cluster monitoring
• Hive: SQL-like queries as an alternative to writing code
    SELECT Col1, COUNT(*) AS Count_Col1
    FROM Your_Table
    GROUP BY Col1
    ORDER BY Count_Col1 DESC
    LIMIT 10;
HDInsight: Hadoop on Azure
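For readers less familiar with Hive, the sample query above computes the ten most frequent values of a column. A minimal in-memory sketch of the same computation in Python follows; the `rows` list and the `top_values` helper are hypothetical stand-ins for `Your_Table`, not part of HDInsight.

```python
from collections import Counter


def top_values(rows, column, limit=10):
    """Return (value, count) pairs for `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common(limit)


# Hypothetical rows standing in for Your_Table.
rows = [
    {"Col1": "a"}, {"Col1": "b"}, {"Col1": "a"},
    {"Col1": "c"}, {"Col1": "a"}, {"Col1": "b"},
]
top = top_values(rows, "Col1")
print(top)
```

On a 60GB dataset the point of Hive is that the same grouping runs as a Map-Reduce job instead of in one process.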
• Web-based Python REPL environment
• Combines authoring, execution, visualization
• Can author and execute HDInsight Hive queries
• Sample query (Python code snippet)
    # Python 2 snippet (methods of a helper class); needs: import json, urllib2
    def submit_hive_query(self):
        # POST the Hive parameters to the cluster and keep the returned job id
        response = urllib2.urlopen(self.url, self.hiveParams)
        data = json.load(response)
        self.hiveJobID = data['id']

    def query(self, queryString):
        self.submit_hive_query()
Example query string: SELECT * FROM sample_table LIMIT 10;
iPython Notebook
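The snippet above is written for Python 2 (`urllib2`). A rough Python 3 sketch of the same submission step is given below. It only builds the request and does not send it. The URL is hypothetical, and the form fields `user.name`, `execute` and `statusdir` follow the WebHCat (Templeton) convention for Hive jobs; whether the talk's `self.url` and `self.hiveParams` were set up this way is an assumption. Authentication is omitted.

```python
import json
import urllib.parse
import urllib.request


def build_hive_request(url, query, user, status_dir):
    """Build (but do not send) a POST request that submits a Hive query."""
    params = urllib.parse.urlencode({
        "user.name": user,        # assumed WebHCat-style field names
        "execute": query,
        "statusdir": status_dir,
    }).encode("utf-8")
    # Supplying `data` makes urllib issue a POST.
    return urllib.request.Request(url, data=params)


def parse_job_id(response_body):
    """Extract the job id from the JSON body returned on submission."""
    return json.loads(response_body)["id"]


req = build_hive_request(
    "https://example-cluster.invalid/templeton/v1/hive",  # hypothetical URL
    "SELECT * FROM sample_table LIMIT 10;",
    user="admin",
    status_dir="/hive-output",
)
print(req.get_method(), req.full_url)
```

Sending the request would be `urllib.request.urlopen(req)`, after which `parse_job_id` plays the role of `data['id']` in the original snippet.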
• Fully managed cloud service
• Browser based authoring of dataflow
• Best in class machine learning algorithms
• Support for R/Python/SQL
• Collaborative data science
• Quickly deploy models as web services/REST APIs
• Publish to a gallery for collaboration with community
What is Azure ML Studio
(Distributed Robust Algorithm for CoUnt-based LeArning)
Misha Bilenko
Microsoft Azure Machine Learning, Microsoft Research
Learning with Counts a.k.a. Dracula
adid = 1010054353
adText = K2 ski sale!
adURL = www.k2.com/sale
Userid = 0xb49129827048dd9b
IP = 131.107.65.14
Query = powder skis
QCategories = {skiing, outdoor gear}
#users ~ 10^9, #queries ~ 10^9+, #ads ~ 10^7, #(ad × query) ~ 10^10+
• Information retrieval
• Advertising, recommending, search: item, page/query, user
• Transaction classification
• Payment fraud: transaction, product, user
• Email spam: message, sender, recipient
• Intrusion detection: session, system, user
• IoT: device, location
Large-scale learning in multi-entity domains
adid: 1010054353
adText: Fall ski sale!
adURL: www.k2.com/sale
userid: 0xb49129827048dd9b
IP: 131.107.65.14
query: powder skis
qCategories: {skiing, outdoor gear}
• Problem: representing high-cardinality attributes as features
• Scalable: to billions of attribute values
• Efficient: predictions/sec
• Flexible: for a variety of downstream learners
• Adaptive: to distribution change
• Standard approaches: binary features, hashing, projections
• What everyone uses in industry: learning with counts
• This talk: formalization and generalization
Large-scale learning in multi-entity domains
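For contrast with count-based features, here is a minimal sketch of the hashing approach listed among the standard approaches: each attribute=value pair is hashed into one of a fixed number of binary feature slots. The bucket count, the use of `zlib.crc32` and the helper names are illustrative assumptions, not taken from the talk.

```python
import zlib


def hashed_index(attribute, value, num_buckets=16):
    """Map one attribute=value pair to a bucket index via a stable hash."""
    key = f"{attribute}={value}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


def hash_features(record, num_buckets=16):
    """Encode a record as a fixed-length vector; colliding pairs add up."""
    vec = [0.0] * num_buckets
    for attribute, value in record.items():
        vec[hashed_index(attribute, value, num_buckets)] += 1.0
    return vec


record = {"IP": "131.107.65.14", "query": "powder skis", "adid": "1010054353"}
vec = hash_features(record)
print(vec)
```

The vector length is fixed regardless of how many distinct values exist, but the learner has to fit one weight per bucket, which is what the count-based representation avoids.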
• Features are transforms of conditional statistics (per-label counts): [N+ N- log(N+)-log(N-) IsBackoff]
• log(N+)-log(N-): log-odds/Naïve Bayes estimate
• N+, N-: indicators of confidence of the naïve estimate
• IsBackoff (a.k.a. IsFromRest): indicator of back-off vs. “real count”
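[Figure: count features are looked up for each attribute of the example: IP (131.107.65.14), ad URL (k2.com), query (powder skis), and the query × ad URL pair (powder skis, k2.com).]
IP               N+       N-
173.194.33.9     46964    993424
87.250.251.11    31       843
131.107.65.14    12       430
…                …        …
REST             745623   13964931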
Learning with Counts
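A minimal sketch of the feature lookup described on this slide, using the IP table shown. It assumes the two numeric columns are N+ followed by N- (matching the order in the feature vector) and that every stored count is positive, so no smoothing is applied; `count_features` is a hypothetical helper name.

```python
import math

# (N+, N-) per IP, copied from the slide's table; column order is an assumption.
# "REST" is the back-off bin for values without their own row.
ip_counts = {
    "173.194.33.9": (46964, 993424),
    "87.250.251.11": (31, 843),
    "131.107.65.14": (12, 430),
    "REST": (745623, 13964931),
}


def count_features(table, value):
    """Return [N+, N-, log(N+) - log(N-), IsBackoff] for one attribute value."""
    is_backoff = value not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[value]
    log_odds = math.log(n_pos) - math.log(n_neg)
    return [n_pos, n_neg, log_odds, 1.0 if is_backoff else 0.0]


seen = count_features(ip_counts, "131.107.65.14")
unseen = count_features(ip_counts, "10.0.0.1")  # falls back to REST
print(seen)
print(unseen)
```

Whatever the cardinality of the attribute, the learner sees only four numbers per lookup.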
• Features are transforms of conditional counts: [N+ N- log(N+)-log(N-) IsBackoff]
• Scalable: “head” in memory + tail in backoff; or: count-min sketch
• Efficient: low cost, low dimensionality
• Flexible: low dimensionality works well with non-linear learners
• Adaptive: new values easily added, back-off for infrequent values, temporal counts
Learning with Counts
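The slide names the count-min sketch as an alternative to keeping a "head" table plus back-off. A compact sketch of that data structure follows; the width, depth and SHA-256-based row hashing are illustrative choices, and the talk does not say how the sketch was implemented in practice.

```python
import hashlib


class CountMinSketch:
    """Fixed-size approximate counter; estimates never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, amount=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += amount

    def estimate(self, key):
        # Collisions only inflate cells, so the minimum is the tightest bound.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


# One sketch per label gives approximate N+ and N- for any value.
positives = CountMinSketch()
negatives = CountMinSketch()
positives.add("131.107.65.14", 12)
negatives.add("131.107.65.14", 430)
print(positives.estimate("131.107.65.14"), negatives.estimate("131.107.65.14"))
```

Memory is bounded by width × depth per label, independent of how many distinct values are seen.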
Aggregate per-label counts for different bin functions
• Standard MapReduce
• Bin function: any projection
• Backoff options: “tail bin”, hashing, hierarchical (shrinkage)
IP               N+        N-
173.194.33.9     46964     993424
87.250.251.11    31        843
131.253.13.32    12        430
…                …         …
REST             745623    13964931

query            N+        N-
facebook         281912    7957321
dozen roses      32791     640964
…                …         …
REST             6321789   43477252

Query × AdId       N+        N-
facebook, ad1      54546     978964
facebook, ad2      232343    8431467
dozen roses, ad3   12973     430982
…                  …         …
REST               4419312   52754683

IP[2]            N+        N-
173.194.*.*      46964     993424
87.250.*.*       6341      91356
131.253.*.*      75126     430826
…                …         …

[Figure labels: Counting; time axis up to T_now]
Learning with Counts : aggregation
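The aggregation step can be sketched as a map (record to bin and label) followed by a reduce (sum per bin), with a "tail bin" back-off that merges rare bins into REST. The record layout, the `min_total` threshold and the helper names below are assumptions for illustration; a real job would run the same logic as Map-Reduce over the full log.

```python
from collections import defaultdict


def ip_prefix(record):
    """Example bin function: project an IPv4 address onto its first two octets."""
    a, b, _, _ = record["ip"].split(".")
    return f"{a}.{b}.*.*"


def count_by_bin(records, bin_fn):
    """Map each record to (bin, label), then reduce to [N+, N-] per bin."""
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        counts[bin_fn(rec)][0 if rec["label"] else 1] += 1
    return dict(counts)


def fold_tail(counts, min_total):
    """Tail-bin back-off: merge bins with few observations into REST."""
    table = {"REST": [0, 0]}
    for key, (n_pos, n_neg) in counts.items():
        if n_pos + n_neg >= min_total:
            table[key] = [n_pos, n_neg]
        else:
            table["REST"][0] += n_pos
            table["REST"][1] += n_neg
    return table


records = [
    {"ip": "173.194.33.9", "label": 1},
    {"ip": "173.194.33.9", "label": 0},
    {"ip": "173.194.40.2", "label": 0},
    {"ip": "87.250.251.11", "label": 0},
]
by_ip = count_by_bin(records, lambda rec: rec["ip"])
by_prefix = count_by_bin(records, ip_prefix)
ip_table = fold_tail(by_ip, min_total=2)
print(by_prefix)
print(ip_table)
```

Swapping the bin function is all it takes to produce another table, for example the IP[2] prefix table or a query × ad pair table.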
[Figure: count tables for IP, query, and Query × AdId are accumulated over time up to T_now (Counting). Each lookup yields aggregated features N+, N-, ln N+ - ln N-, IsBackoff, …, which are combined with the original numeric features to train the predictor.]
Train non-linear model on count-based features
• Counts, transforms, lookup properties
• Additional features can be injected
Learning with Counts : combiner training
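To make the combiner's input concrete, the sketch below assembles one training row by concatenating the count features from each table with an original numeric feature. The table values are taken from the slides (column order N+, N- assumed); the `hour` field, the example record and the helper names are hypothetical. The resulting rows would be passed to whichever non-linear learner is chosen, which is not shown here.

```python
import math


def lookup(table, key):
    """Return [N+, N-, ln N+ - ln N-, IsBackoff], using REST as back-off."""
    is_backoff = key not in table
    n_pos, n_neg = table["REST"] if is_backoff else table[key]
    return [n_pos, n_neg, math.log(n_pos) - math.log(n_neg), float(is_backoff)]


def featurize(example, count_tables, numeric_fields):
    """Concatenate count features for every table with the numeric features."""
    row = []
    for key_fn, table in count_tables:
        row.extend(lookup(table, key_fn(example)))
    row.extend(example[field] for field in numeric_fields)
    return row


ip_table = {"87.250.251.11": (31, 843), "REST": (745623, 13964931)}
query_table = {"dozen roses": (32791, 640964), "REST": (6321789, 43477252)}
pair_table = {("facebook", "ad1"): (54546, 978964), "REST": (4419312, 52754683)}

count_tables = [
    (lambda e: e["ip"], ip_table),
    (lambda e: e["query"], query_table),
    (lambda e: (e["query"], e["adid"]), pair_table),
]

example = {"ip": "87.250.251.11", "query": "dozen roses", "adid": "ad9", "hour": 14}
row = featurize(example, count_tables, numeric_fields=["hour"])
print(row)
```

Each attribute contributes four columns, so three tables plus one numeric feature give a 13-dimensional row however many distinct IPs or queries exist.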
[Figure: count tables for IP, query, and URL × Country are accumulated over time (Counting); the time axis marks T_train and T_now. Each lookup yields aggregated features N+, N-, ln N+ - ln N-, IsBackoff, …, which are combined with the original numeric features.]
URL × Country    N+        N-
url1, US         54546     978964
url2, CA         232343    8431467
url3, FR         12973     430982
…                …         …
REST             4419312   52754683
• Counts are updated continuously
• Combiner re-training infrequent
Prediction with counts
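The split between continuously updated counts and an infrequently retrained combiner can be illustrated with a toy example. Here the combiner is a fixed logistic function standing in for a model trained at T_train, and the add-one prior in `log_odds` is an assumption added to avoid taking the log of zero; neither is claimed to be what the Azure ML implementation does.

```python
import math


class CountTable:
    """Per-value [N+, N-] counts that can be updated as new events arrive."""

    def __init__(self):
        self.counts = {}

    def update(self, key, label):
        n = self.counts.setdefault(key, [0, 0])
        n[0 if label else 1] += 1

    def log_odds(self, key, prior=1.0):
        # `prior` is an assumed smoothing constant, not from the talk.
        n_pos, n_neg = self.counts.get(key, (0, 0))
        return math.log(n_pos + prior) - math.log(n_neg + prior)


def frozen_combiner(log_odds):
    """Stand-in for a model trained at T_train; it is never refit below."""
    return 1.0 / (1.0 + math.exp(-log_odds))


table = CountTable()
for label in [0, 0, 0, 1]:
    table.update("url1, US", label)
before = frozen_combiner(table.log_odds("url1, US"))

# New positive events arrive after T_train; only the counts change.
for label in [1, 1, 1]:
    table.update("url1, US", label)
after = frozen_combiner(table.log_odds("url1, US"))
print(before, after)
```

The prediction moves with the data even though the combiner itself is untouched, which is why re-training can be infrequent.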
• State-of-the-art accuracy
• Good fit for map-reduce
• Modular (vs. monolithic)
• Learner can be tuned/monitored/replaced in isolation
• Monitorable, debuggable (this is HUGE in practice!)
• Temporal changes easy to monitor
• Easy emergency recovery (remove bot attacks, etc.)
• Decomposable predictions
• Error debugging (which feature can we blame…)
What is great about Learning with Counts?
Learning with Counts : in Azure ML
• HDInsight: large data storage and map-reduce processing
• Azure ML: cloud ML and analytics accessible anywhere
• Learning with Counts: intuitive, flexible large-scale ML solution
Putting it all together
Thanks for your time
Useful Links:
http://azure.microsoft.com/ml - Sign up for your free Azure ML Trial
http://bit.ly/datasc_ebook - Free tutorial on how to use Azure ML
Need Azure ML for teaching in classroom? - Contact the speakers
Other Questions? - Contact the speakers
Speakers:
Misha Bilenko: mbilenko@Microsoft.com
Girish Nathan: ginathan@Microsoft.com