Copyright 2011 Trend Micro Inc.Classification 8/2/2013 1
Mathematics, Algorithms, Technologies and Products
Liwei Ren, Ph.D
Data Security Research, Trend Micro
March, 2013, NUS, Singapore
Agenda
• Introduction
• Mathematical Problems → Products
• Amazing Stories of Mathematical Success
• My Practices
• Conclusions
• Q&A
Introduction
• Liwei Ren
– Education
• MS/BS in mathematics, Tsinghua University, Beijing, China
• MS in information science, University of Pittsburgh, USA
• Ph.D in mathematics, University of Pittsburgh, USA
– Research Interests
• Data security, differential compression, storage optimization, and fast data transfer protocols.
– Major Works
• N academic papers, M patents and K startup company, where N ≥ 10, M ≥ 15 and K = 1
• Trend Micro™
– Global security software company headquartered in Tokyo, with R&D centers in Silicon Valley, Nanjing, and Taipei.
– One of the top cloud security vendors
– One of the top 3 anti-malware vendors
Introduction
• Why am I visiting NUS?
– To share how mathematics can make a difference in industry through successful technologies and products
– To share how mathematicians in industry can make significant contributions to society
• What do you think about the role of mathematics in the world?
– A philosophy
– An academic discipline
– An art of abstract elegance and supreme beauty
– A language for describing sciences
– A tool for our life
– All of the above
Introduction
• Why did you become a mathematician?
– To pursue truth and virtue
– I enjoy it and it is a part of my life
– I can make the greatest contribution to the world with my talent
– To teach mathematics with joy
– To publish many papers and have great achievements
– It is a way of making a living
– To solve practical problems as an applied mathematician
Introduction
• In the era of the Internet and cloud computing,
– There are many challenging mathematical problems.
– Applied mathematicians and software engineers invent advanced technologies by solving real mathematical problems.
– Some of them go further and found start-up companies with their new inventions.
Mathematical Problems → Products
• Mathematical problems → Products?
• How?
• Two general approaches:
– Top-down
– Bottom-up
Mathematical Problems → Products
Top-Down Approach
Mathematical Problems → Products
Bottom-Up Approach
Amazing Stories of Mathematical Success
• Some mathematicians and computer scientists from universities have built successful high-tech companies on algorithmic technologies.
• Three excellent examples.
Amazing Stories of Mathematical Success
• RSA Security, Inc.
– Founded by three applied mathematicians, Ron Rivest, Adi Shamir and Len Adleman, in 1982
– Key technology: RSA public key cryptography algorithm
– Industry Sector: Data Security Software
– Affiliation: MIT
– Excellence in Mathematics Award at RSA Conference
– Acquired by EMC for $2.1B in 2006
Amazing Stories of Mathematical Success
• Akamai Technologies, Inc.
– Founded by mathematicians Prof. Tom Leighton and his student Daniel M. Lewin in 1998
– Key technology: dynamic content routing algorithms
– Industry Sector: content delivery network (CDN)
– Affiliation: MIT
– A Mathematical Success Story, SIAM News: Vol 32, Num 10, 1999
– Akamai is a public company (Nasdaq: AKAM) with revenue $1.27B in 2012.
Amazing Stories of Mathematical Success
• Data Domain, Inc.
– Founded by computer scientist Prof. Kai Li and others in 2001
– Key technology: data de-duplication algorithms
– Industry Sector: computer storage
– Affiliation: Princeton University
– Also acquired by EMC, for $2.5B in 2009
My Practices
• Let me share my mathematical practices in industry.
• My relevant experience in the software industry
– Sr. software engineer, 2 IT companies, 1996–2002
– Principal research engineer, InnoPath Software, 2002–2005
– Chief scientist & co-founder, Provilla Technologies, 2005–2007
– Sr. architect and research director, Trend Micro, 2007–present
• Two technical domains with mathematical practices:
– Data Loss Prevention (DLP)
– Firmware Over The Air (FOTA)
My Practices
• Two simple yet valuable problems to share with you:
1. Near duplicate document identification (NDDI)
2. Differential compression for executable files (DCE)
                   NDDI                              DCE
Math Model         Textual Fixed Points              Secondary Code Change
Algorithm          DataDNA                           Secondary Change Removal
Technology         Document Fingerprinting           Differential Compression of Executables
Product            LeakProof™                        DeltaUpgrade™
Technical domain   Data Loss Prevention (DLP)        Firmware Over the Air (FOTA)
Company            Provilla / Trend Micro            InnoPath
Contribution       Created a company with many jobs  Better FOTA for 30 million phones
My Practices
• NDDI (near duplicate document identification) is a fundamental problem that must be solved for a DLP system.
• Problem Definition:
– Let S = {T1, T2, …, Tn} be a set of known texts.
– Given a query text T, one needs to identify one or more documents t ∈ S such that T and t share a significant amount of common textual content.
• A technology that solves this problem is called document fingerprinting.
My Practices
• Alternate Problem Definition:
– Let S = {T1, T2, …, Tn} be a set of known texts.
– Given a query text T and a threshold X%, one needs to identify one or more texts t ∈ S such that SIM(T, t) ≥ X%, where SIM(x, y) is a similarity function that needs to be defined mathematically.
My Practices
• Mathematical Modeling
– Observation:
• Across multiple versions of a text, many characters remain unchanged with respect to their neighborhood.
• For example:
– … The research required to solve mathematical problems can take years or even centuries of sustained inquiry…
– … The research required to solve mathematical problems can take many years of sustained inquiry…
– Textual Fixed Points:
• If a character and its neighborhood, taken as a textual string, exist in two texts, this character is a fixed point of the two texts.
• Two near duplicate texts have many fixed points.
– We only need a subset, for efficiency.
• One needs to extract a subset of fixed points from a given text T and generate hash values from their neighborhoods.
• Let us denote the extracted subset of fixed points as FS(T).
My Practices
• Mathematical Modeling
– The concept of near duplicate texts can be presented as:
• FS(T1) ∩ FS(T2) ≠ ∅
– The NDDI problem can be described as:
• Given a query text T, one needs to identify one or more documents t ∈ S such that |FS(T) ∩ FS(t)| ≥ M, where M is a pre-defined integer.
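The model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented DataDNA algorithm: the neighborhood is taken to be a fixed-width character window (WINDOW), FS(T) is represented by the hashes of the selected neighborhoods, and the subset is chosen by keeping windows whose hash is divisible by a constant (SAMPLE). The names fixed_point_set and near_duplicates are hypothetical.

```python
import hashlib
from typing import Dict, List, Set

WINDOW = 16   # assumed neighborhood width, in characters
SAMPLE = 4    # assumed sampling rate: keep roughly 1 in SAMPLE windows


def _hash(window: str) -> int:
    """Stable 64-bit hash of one character neighborhood."""
    digest = hashlib.sha1(window.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fixed_point_set(text: str) -> Set[int]:
    """FS(T): hashes of a content-defined subset of neighborhoods.

    Selection depends only on the window content, so the same window
    is selected (or skipped) in every text that contains it.
    """
    normalized = " ".join(text.lower().split())
    hashes = (_hash(normalized[i:i + WINDOW])
              for i in range(len(normalized) - WINDOW + 1))
    return {h for h in hashes if h % SAMPLE == 0}


def near_duplicates(query: str, known: Dict[str, str], m: int) -> List[str]:
    """Names of known texts t with |FS(query) ∩ FS(t)| >= m."""
    fs_query = fixed_point_set(query)
    return [name for name, text in known.items()
            if len(fs_query & fixed_point_set(text)) >= m]
```

Here the argument m plays the role of the pre-defined integer M. A practical system would precompute and index FS(t) for every t in S rather than recompute it per query.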
My Practices
• A solution is an algorithm that solves the problem described by this mathematical model.
– In industry, the corresponding technology is called document fingerprinting.
– We designed two different fingerprinting technologies over the years.
• DataDNA 1.0
– Liwei Ren et al., US patent 7516130, Matching engine with signature generation.
– Liwei Ren et al., US patent 7747642, Matching engine for querying relevant documents.
– Liwei Ren et al., US patent 7860853, Document matching engine using asymmetric signature generation.
• DataDNA 2.0
– Liwei Ren et al., US patent 8359472, Document fingerprinting with asymmetric selection of anchor points.
My Practices
• Summary:
– A document fingerprinting technology was developed based on the DataDNA 1.0 algorithm.
– Provilla raised funding from investors with this technology in early 2005.
– We developed a DLP product, LeakProof™, with document fingerprinting as its core technology.
– Provilla™ was acquired by Trend Micro™ in late 2007.
• DataDNA technology played an important role when Trend Micro decided to acquire Provilla.
My Practices
• Differential Compression of Executables (DCE) is a mathematical problem for a FOTA system.
• Differential Compression in general:
where T and R are general files. T stands for target and R for reference.
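A minimal Python sketch of the general idea, assuming a general-purpose compressor primed with R can stand in for a real differential compressor: zlib accepts a preset dictionary (zdict), so the delta for T can refer back to byte strings already present in R. The function names make_delta and apply_delta are hypothetical, and zlib only looks back 32 KB into the dictionary, so this illustrates the concept rather than a production FOTA encoder.

```python
import zlib


def make_delta(target: bytes, reference: bytes) -> bytes:
    """Encode target T as a delta against reference R.

    R is supplied as a preset dictionary, so byte runs of T that also
    occur in R are emitted as short back-references.
    """
    comp = zlib.compressobj(level=9, zdict=reference)
    return comp.compress(target) + comp.flush()


def apply_delta(delta: bytes, reference: bytes) -> bytes:
    """Rebuild T from the delta and the same reference R."""
    decomp = zlib.decompressobj(zdict=reference)
    return decomp.decompress(delta) + decomp.flush()
```

Both sides must hold the identical R: the sender ships only the delta, and the receiver reconstructs T with apply_delta(delta, R).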
My Practices
• Differential Compression for FOTA:
My Practices
• If T and R are executable files, we should achieve a better diff rate than for general files, according to information theory.
– How can we achieve this?
• Mathematical Modeling:
– To optimize the differential compression, one way is to reduce the differences between the two files.
– We need to figure out what the code changes are between two versions of a software executable.
• Primary code change: instructions are altered due to source code changes.
• Secondary code change: an instruction is altered at the byte level due to code changes elsewhere.
• We use JUMP as an example to illustrate the concept. A JUMP instruction is a few bytes that encode the distance between the source and destination.
My Practices
• Mathematical Model:
– Secondary code change for JUMP
My Practices
• Mathematical Modeling:
– Secondary Change Removal for JUMP:
• The secondary code change causes instr1 ≠ instr2.
• Given the file R, if we can derive instr2 from instr1, we can replace instr1 in R with instr2.
• Applying the same substitution to all such instructions in R transforms R into another file, denoted P(R, H(R,T)), where H stands for hints. We have the new formal presentation:
My Practices
• Mathematical Modeling:
– Let us start with the common code blocks between two versions; we can usually identify the common blocks across versions.
My Practices
• Mathematical Modeling:
– How to derive a new JUMP instruction from an old one?
• instr2 = Encode(destAddr2 – srcAddr2)
• destAddr2 – srcAddr2
  = (destAddr1 – srcAddr1) + (destAddr2 – srcAddr2) – (destAddr1 – srcAddr1)
  = Decode(instr1) + (destAddr2 – destAddr1) – (srcAddr2 – srcAddr1)
  = Decode(instr1) + (destBlkAddr2 – destBlkAddr1) – (srcBlkAddr2 – srcBlkAddr1)
• The last step holds because the instruction and its destination each lie inside a common block, so each address shifts by exactly as much as its block does.
– instr2 = Encode(Decode(instr1) + (destBlkAddr2 – destBlkAddr1) – (srcBlkAddr2 – srcBlkAddr1))
– We can do the same for other instructions, such as those that contain data pointers.
– All such instructions, JUMP or data pointers, are called profitable instructions.
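The derivation above can be sketched in Python. The encoding is an assumption made purely for illustration: a JUMP operand is modeled as a 4-byte little-endian signed relative offset, so encode and decode are hypothetical stand-ins for a real instruction set, and the block addresses are assumed to come from a prior common-block matching step.

```python
import struct


def decode(instr: bytes) -> int:
    """Relative offset (dest - src) stored in a JUMP operand."""
    return struct.unpack("<i", instr)[0]


def encode(offset: int) -> bytes:
    """JUMP operand bytes for a given relative offset."""
    return struct.pack("<i", offset)


def derive_jump(instr1: bytes,
                src_blk_addr1: int, src_blk_addr2: int,
                dest_blk_addr1: int, dest_blk_addr2: int) -> bytes:
    """Predict instr2 from instr1 using only the two block shifts."""
    shift = ((dest_blk_addr2 - dest_blk_addr1)
             - (src_blk_addr2 - src_blk_addr1))
    return encode(decode(instr1) + shift)


# Old version: jump at 0x1010 (block at 0x1000) to 0x2020 (block at 0x2000).
instr1 = encode(0x2020 - 0x1010)
# New version: source block moved to 0x1040, destination block to 0x2100.
instr2 = derive_jump(instr1, 0x1000, 0x1040, 0x2000, 0x2100)
```

Because instr2 can be recomputed from instr1 and the block addresses, rewriting instr1 in R before diffing removes this secondary change from the delta.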
My Practices
• A solution is an algorithm that identifies all the profitable instructions and removes the secondary code changes accordingly.
• US patent 7089270 provides one of the solutions
– Liwei Ren et al., Processing software image for use in generating difference files.
My Practices
• Summary:
– An optimized differential compression technology was developed based on the algorithm described in patent 7089270.
– InnoPath™ significantly enhanced its flagship product DeltaUpgrade™ by integrating this advanced technology.
– InnoPath™ won many new customer deals due to its technical advantage over its competitors.
– This technology supports 30 million mobile phones.
My Practices
• Other than NDDI and DCE, there are many math problems in practice that I have worked on:
– Subgraph isomorphism
– Multi-value binary search
– RegEx pattern optimization
– Keyword proximity match
• An interesting extension is the minimal M-color enclosing circle problem.
– Remote differential compression
– Malware clustering and detection
– ……
Conclusions
• Mathematics can make a difference in industry through mathematical models and algorithms.
• Mathematicians can contribute significantly to society by inventing novel technologies, building useful products and creating job opportunities.
• "There has never been a better time to be a mathematician." (James R. Schatz, Chief of Math Research Group, NSA)