Meetup - August 25, 2015

Post on 22-Jan-2017


Case Studies in Text-Mining and Beyond

Larkin Liu - Paytm Labs

About me

- Visiting Scientist at Paytm Labs
- MASc Student in Operations Research, Industrial Engineering, University of Toronto
- BASc Thesis: “Application of Multivariate and Univariate Time Series Analysis Methods to Stochastic Processes”
- Areas of research and development,
  - Supervised learning algorithms - Reinforcement Learning, Clustering
  - Stochastic models - Hidden Markov Models, Time Series
  - Optimization - Decision Theory
- Working with the following languages and frameworks, such as,

Only been horseback riding once actually…

What is Paytm?

Paytm Labs
• Fraud Detection
• Anomaly Detection
• Classification
• Consumer Analytics
• Recommender Systems
• Inventory Optimization

Paytm
• India’s fastest growing eCommerce platform
• Online shopping and mobile wallet
• 100 Million users
• 10+ million transactions per day

Principles of Data Science

• Occam’s Razor
  • “Assume the solution of lowest complexity is the correct solution.” - Most often it is the optimal assumption.
• Data science is statistics
  • Hypothesis Testing
  • Regression
  • Bayesian Inference
• Data science is software engineering
  • Big Data - Optimization
  • Data Structures & Algorithms
  • Artificial Intelligence - Machine Learning
• Data science is holistic in nature
  • Biostatistics
  • Particle physicists
  • Mechatronics

Text Mining - Gibberish Emails vs. Intelligent Emails

Examples of Intelligent Emails:
● larkin.liu123@gmail.com
● ashwin_55@hotmail.com
● adam223@yahoo.com

Examples of Gibberish Emails:
● asasdf1234@gmail.com
● asdlkvnaev@hotmail.com
● qwlefvnas@yahoo.com

Objective: Distinguish intelligent emails from spammy gibberish emails, and feed the result into our decision model.

Step 1: “larkin.liu123@gmail.com”

Step 2: N-Grams:
“l a r k i n . l i u” 1-Gram
“la ar rk ki in n. .l li iu” 2-Gram

Step 3: Build 1st Order Markov Chain

Step 4: Build probability thresholds based on sampling of intelligent and gibberish names.

Step 5: Optimize model, experiment with parameters on ROC curve.

Step 6: Build robust classifier.
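The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used at Paytm Labs: the class `CharMarkovChain`, the helpers `local_part` and `is_gibberish`, the add-alpha smoothing, the digit stripping, the toy training list, and the threshold value are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def local_part(email):
    """Keep the text before '@', lowercased, with digits removed (an assumption)."""
    return re.sub(r"\d+", "", email.split("@")[0].lower())


class CharMarkovChain:
    """First-order Markov chain over characters, trained on 2-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                  # add-alpha smoothing (assumed)
        self.counts = defaultdict(Counter)  # counts[a][b] = times b follows a
        self.vocab = set()

    def fit(self, names):
        for name in names:
            self.vocab.update(name)
            for a, b in zip(name, name[1:]):  # consecutive pairs = 2-grams
                self.counts[a][b] += 1
        return self

    def avg_log_prob(self, name):
        """Mean log P(next char | current char) over the name's 2-grams."""
        pairs = list(zip(name, name[1:]))
        if not pairs:
            return float("-inf")
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for a, b in pairs:
            row = self.counts.get(a, Counter())
            p = (row[b] + self.alpha) / (sum(row.values()) + self.alpha * v)
            total += math.log(p)
        return total / len(pairs)


def is_gibberish(model, email, threshold):
    """Flag an email whose local part scores below the chosen threshold."""
    return model.avg_log_prob(local_part(email)) < threshold


# Toy training sample of "intelligent" names; a real sample would be far larger.
model = CharMarkovChain().fit(
    ["larkin", "liu", "ashwin", "adam", "martin", "linda", "karin"]
)
```

Averaging the log-probability over the 2-grams keeps long and short names comparable. The threshold passed to `is_gibberish` is the parameter that Steps 4 and 5 would tune on labelled samples.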

Text Mining - Gibberish Emails vs. Intelligent Emails (Cont.)

Distribution of P(X). ROC curve on 200 names, varying P(X) threshold.
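Varying the threshold and recording the resulting error rates is what traces out a ROC curve. The sketch below shows one way to do that in plain Python. The scores and labels are made-up illustration data (not the 200 names from the slide), the function names are hypothetical, and choosing the threshold by Youden's J (true positive rate minus false positive rate) is an assumption; the talk does not say how the operating point was chosen.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) for every candidate threshold.

    labels: 1 = gibberish (positive class), 0 = intelligent.
    A name is predicted gibberish when its score is <= threshold.
    Assumes both classes are present in labels.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s <= t and y == 0)
        points.append((t, fp / neg, tp / pos))
    return points


def best_threshold(scores, labels):
    """Pick the threshold maximising TPR - FPR (Youden's J, an assumed criterion)."""
    return max(roc_points(scores, labels), key=lambda p: p[2] - p[1])[0]


# Made-up scores: lower means more gibberish-like.
scores = [-3.0, -2.8, -2.5, -1.9, -1.7]
labels = [1, 1, 0, 0, 0]
points = roc_points(scores, labels)
chosen = best_threshold(scores, labels)
```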

Text Mining - Address Fingerprinting

The following address expressions are actually the same address:

● sco-74, F.f, swastik vihar, mdc, sec-5 sco-74, F.f, swastik vihar, mdc, sec-5
● SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana
● SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana PANCHKULA

The fact that someone would purposefully write the same address in multiple ways is an indicator that they are trying to deceive the system.

Objective: Build an algorithm to map all addresses of identical semantic meaning to one hash. Subsequently, all marketplace activity can be associated with one identity to detect fraudulent behaviour.
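One simple way to approach this objective is to normalise each address into a canonical token set and hash the result. The sketch below is an assumption about how such a fingerprint could work, not the algorithm from the talk. The function name `address_fingerprint` and the `drop_tokens` parameter are hypothetical, and it assumes city and state are known from separate fields so they can be removed from the free text.

```python
import hashlib
import re


def address_fingerprint(address, drop_tokens=frozenset()):
    """Map an address string to a hash of its normalised token set.

    Normalisation (all assumed): lowercase, split on non-alphanumerics,
    drop duplicate tokens and any token in drop_tokens, then sort.
    """
    tokens = set(re.findall(r"[a-z0-9]+", address.lower()))
    canonical = " ".join(sorted(tokens - set(drop_tokens)))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


# City and state tokens assumed to be available from structured fields.
LOCALITY = {"panchkula", "haryana"}

variants = [
    "sco-74, F.f, swastik vihar, mdc, sec-5 sco-74, F.f, swastik vihar, mdc, sec-5",
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana "
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana",
    "SCO-74, F.F, Swastik Vihar, MDC, Sec-5, Panchkula, Haryana PANCHKULA",
]
fingerprints = {address_fingerprint(v, LOCALITY) for v in variants}
```

Because the token set ignores order, two addresses that use the same tokens in different positions would collide. A production fingerprint would likely need spelling correction and abbreviation handling as well.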

Text Mining - Address Fingerprinting (Cont.)

Number of distinct customers per address fingerprint (figure annotation: 62 customers on one address fingerprint).

Measure of transaction velocity per address fingerprint.
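Both per-fingerprint measures can be computed with a single pass over the transaction log. The sketch below is illustrative only: the record layout `(fingerprint, customer_id, day)`, the function name `fingerprint_stats`, and the definition of velocity as the maximum number of transactions in one day are assumptions, since the slides do not define the measure.

```python
from collections import defaultdict
from datetime import date


def fingerprint_stats(transactions):
    """Summarise activity per address fingerprint.

    transactions: iterable of (fingerprint, customer_id, day) tuples.
    """
    customers = defaultdict(set)
    per_day = defaultdict(lambda: defaultdict(int))
    for fp, customer, day in transactions:
        customers[fp].add(customer)
        per_day[fp][day] += 1
    return {
        fp: {
            "distinct_customers": len(customers[fp]),
            # Velocity taken here as the busiest single day (assumed definition).
            "max_tx_per_day": max(per_day[fp].values()),
        }
        for fp in customers
    }


# Made-up log for illustration.
log = [
    ("fpA", "c1", date(2015, 8, 1)),
    ("fpA", "c2", date(2015, 8, 1)),
    ("fpA", "c3", date(2015, 8, 2)),
    ("fpB", "c4", date(2015, 8, 1)),
]
stats = fingerprint_stats(log)
```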

Community Detection

Based on associations of address fingerprints and remote IP addresses, we can generate an associative organization network of related address fingerprints.
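A basic version of this grouping treats fingerprints and IP addresses as nodes of a graph and takes its connected components. The sketch below uses a union-find structure for that. Connected components are a simple stand-in chosen for this example (the talk does not name its community detection method), and the function names and sample observations are hypothetical.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def fingerprint_communities(observations):
    """Group fingerprints that are linked through shared IP addresses.

    observations: iterable of (address_fingerprint, remote_ip) pairs.
    Returns a list of sorted fingerprint lists, one per community.
    """
    parent = {}
    for fp, ip in observations:
        a, b = ("fp", fp), ("ip", ip)  # tag nodes so names cannot clash
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # merge the two components
    groups = defaultdict(set)
    for node in parent:
        if node[0] == "fp":
            groups[find(parent, node)].add(node[1])
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up observations for illustration.
observations = [
    ("fp1", "10.0.0.1"),
    ("fp2", "10.0.0.1"),
    ("fp2", "10.0.0.2"),
    ("fp3", "10.0.0.2"),
    ("fp4", "10.0.0.9"),
]
communities = fingerprint_communities(observations)
```

Here `fp1`, `fp2` and `fp3` end up in one community because they are chained together through two shared IP addresses, while `fp4` stays on its own.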

Q & A

Now is the time to ask questions...