Data Mining: Crossing the Chasm

Post on 23-Feb-2016


Rakesh Agrawal, IBM Almaden Research Center

Thesis

• The greatest challenge facing data mining is to make the transition from being an early market technology to mainstream technology

• We have the opportunity to make this transition successful

Outline

• Chasm in the technology adoption life cycle, à la Geoffrey Moore†

• Experience with Quest/Intelligent Miner
• Ideas for successful chasm crossing

† Geoffrey A Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com

Technology Adoption Life Cycle

• Innovators (Techies): Try it!
• Early Adopters (Visionaries): Get ahead of the herd!
• Early Majority (Pragmatists): Stick with the herd!
• Late Majority (Conservatives): Hold on!
• Laggards (Skeptics): No way!

Psychographic profile of each group is different

Innovators: Technology Enthusiasts

• Intrigued by any fundamental advance in technology

• Like to alpha test new products
• Can ignore the missing elements
• Want access to top technologists
• Want no-profit pricing (preferably free)

Gatekeepers to early adopters

Early Adopters: Visionaries

• Driven by vision of dramatic competitive advantage via revolutionary breakthroughs

• Great imagination for strategic applications
• Not so price-sensitive
• Want rapid time to market
• Demand high degree of customization

Fund the development of early market

Early Majority: Pragmatists

• Want sustainable productivity improvement through evolutionary change

• Astute managers of mission-critical apps
• Understand real-world issues and tradeoffs
• Focus on proven applications; want to see the solution in production

Bulwark of the mainstream market

Late Majority: Conservatives

• Want to stay even with the competition
• Risk averse
• Price sensitive
• Need completely pre-assembled solutions

Extend technology life cycles

Laggards: Skeptics

• Driven to maintain status quo
• Good at debunking marketing hype
• Disbelieve productivity-improvement arguments
• Can be formidable opposition to early adoption of a technology

Retard the development of high-tech markets

Crack in the curve

[Figure labels: Early Market, Chasm, Mainstream Market]

The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.

Visionaries vs. Pragmatists

Visionaries:
• Adventurous
• First strike capability
• Early buy-in
• State of the art
• Think big
• Spend big

Pragmatists:
• Prudent
• Staying power
• Wait-and-see
• Industry standard
• Manage expectation
• Spend to budget

Is data mining following this curve?

• Yes!!!
• My personal viewpoint based on Quest/Intelligent Miner experience

Quest

• Started as a skunkworks project in the early nineties
• Inspired by needs articulated by industry visionaries:
– Transaction data collected over a long period
– Current tools/SQL don’t cut it
– About ready to throw the data away

Approach

• Examine “real” applications
• Identify operations that cut across applications
• Design fast, scalable algorithms for each operation
• Develop applications by composing operations

Operations

• New operations (completeness, scalability):
– Associations
– Sequential patterns
– Similar time series
• Adopted from Statistics/Learning (scalability):
– Classification
– Clustering
– Deviations

http://www.almaden.ibm.com/cs/quest
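The "completeness" goal for associations can be illustrated with a minimal level-wise search in the Apriori style: it returns every itemset whose support count meets the threshold, not a sample of them. This is a sketch for small in-memory data only, not the Quest or Intelligent Miner implementation; the function name `frequent_itemsets` and the basket data are made up for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidates: unions of two frequent (k-1)-itemsets, kept only if
        # every (k-1)-subset is itself frequent (the pruning step).
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One pass over the data to count each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


# Illustrative data (hypothetical).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
itemsets = frequent_itemsets(baskets, min_support=3)
```

With these five baskets and a threshold of 3, the result holds the four single items plus the pair {bread, milk}; every other pair falls below the threshold.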

Bringing Quest to market

• Visionaries who inspired Quest did not become first customers:
– Wanted evidence that the technology “worked”

• Frustrating attempts to interest major IBM customers:
– Integration with existing applications
– Too-far-out technology
– Resistance from in-house analytic groups

First hits

• Small information-based companies who provided data in exchange for free results

• CIO who wanted to be seen as the technology pioneer in his industry

• CIO who wanted the success story to feature in the company’s annual report

Led to the formation of a group offering services using Quest

Characteristics of engagements

• Mostly associations and sequential patterns
• Completeness a big plus
• Unanticipated uses
• Feedback for further development

Into the product land

• Formation of a small “out-of-plan” product group to productize Quest
• Facilitated by a closet mathematician
• Successes of the services group used for market validation
• Continued development and infusion of technology

Intelligent Miner

• Serious product
• Integrates technologies from various groups
• Fast, scalable, runs on multiple platforms
• Several “early market” success stories

http://www.software.ibm.com/data/iminer/

Are we in the chasm?

• Perceived to be sophisticated technology, usable only by specialists
• Long, expensive projects
• Stand-alone, loosely coupled with data infrastructures
• Difficult to infuse into existing mission-critical applications

Chasm Crossing

• Personal speculations on some technical challenges

• Do not imply IBM research/product directions

XML-based Data Mining Standard (1)

• Model Building:
– A pair of standard DTDs for each operation
– Interchangeable library of operator implementations

[Figure labels: Data Specs, Parameters, Operator, Model, Library, Standard DTD (twice)]

Ack: Mattos, Pirahesh, Schwenkries
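One way to picture the model-building idea is an operator that reads XML parameters and a data spec and emits an XML model, with implementations looked up in a library by operation name. The sketch below assumes that reading; every element and attribute name (`parameters`, `minSupport`, `dataSpec`, `itemField`, `model`, `item`) is hypothetical and not taken from any published DTD, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical documents: the element and attribute names are invented
# for illustration only.
PARAMETERS_XML = """
<parameters operation="frequentItems">
  <minSupport>2</minSupport>
</parameters>
"""

DATA_SPEC_XML = """
<dataSpec>
  <itemField>basket</itemField>
</dataSpec>
"""


def frequent_items_operator(parameters_xml, data_spec_xml, records):
    """Build a model: items that occur in at least minSupport records."""
    min_support = int(ET.fromstring(parameters_xml).findtext("minSupport"))
    item_field = ET.fromstring(data_spec_xml).findtext("itemField")
    counts = Counter(item for rec in records for item in set(rec[item_field]))
    model = ET.Element("model", operation="frequentItems")
    for item, count in sorted(counts.items()):
        if count >= min_support:
            ET.SubElement(model, "item", name=item, support=str(count))
    return ET.tostring(model, encoding="unicode")


# The "library": any implementation that accepts the same inputs and
# produces the same model format could be swapped in under this key.
OPERATOR_LIBRARY = {"frequentItems": frequent_items_operator}

records = [
    {"basket": ["bread", "milk"]},
    {"basket": ["bread", "beer"]},
    {"basket": ["bread", "milk", "beer"]},
]
operation = ET.fromstring(PARAMETERS_XML).get("operation")
model_xml = OPERATOR_LIBRARY[operation](PARAMETERS_XML, DATA_SPEC_XML, records)
```

Because the caller sees only the XML inputs and the XML model, replacing `frequent_items_operator` with a faster implementation would not change the calling code.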

XML-based Data Mining Standard (2)

• Model Deployment:
– Mapping XML object provides mapping between names and format in the model object and the data record
– Model could have been developed on a different system

[Figure labels: Model, Data Record, Mapping, Application, Library, Result, Standard DTDs, Standard DTD]
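A rough sketch of the deployment side, under the assumption that the mapping object translates both field names and value formats between the model and the data record. The model here is a one-rule classifier chosen for brevity; all element and attribute names, and the `scale` attribute in particular, are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical model, possibly built on a different system.
MODEL_XML = """
<model operation="classification">
  <rule field="income" threshold="50000" above="high" below="low"/>
</model>
"""

# Hypothetical mapping: the model's "income" field corresponds to the
# record's "annual_salary_k" field, which is stored in thousands.
MAPPING_XML = """
<mapping>
  <field model="income" record="annual_salary_k" scale="1000"/>
</mapping>
"""


def apply_model(model_xml, mapping_xml, record):
    """Score one data record using the model and the name/format mapping."""
    rule = ET.fromstring(model_xml).find("rule")
    mapping = {f.get("model"): f for f in ET.fromstring(mapping_xml).iter("field")}
    field = mapping[rule.get("field")]
    # Translate the record's name and format into the model's terms.
    value = float(record[field.get("record")]) * float(field.get("scale", "1"))
    if value >= float(rule.get("threshold")):
        return rule.get("above")
    return rule.get("below")


result_high = apply_model(MODEL_XML, MAPPING_XML, {"annual_salary_k": 72})
result_low = apply_model(MODEL_XML, MAPPING_XML, {"annual_salary_k": "31.5"})
```

The application never needs to know how the model was built; it needs only the model, the mapping, and the record.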

Implications

• Standard interfaces for application developers to incorporate data mining

• Coupling with relational databases
– mappings from DTDs to relational schemas
– implementation using existing infrastructure
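As one possible illustration of the relational coupling, an XML rule model can be shredded into a table and then applied with ordinary SQL joins, so the database engine does the work. The sketch uses Python's built-in `sqlite3` module as a stand-in for a relational database; the model format, table names, and data are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical association-rule model.
MODEL_XML = """
<model operation="associations">
  <rule antecedent="bread" consequent="milk" confidence="0.75"/>
  <rule antecedent="diapers" consequent="beer" confidence="0.67"/>
</model>
"""

conn = sqlite3.connect(":memory:")

# Map the model's rule elements onto a relational schema.
conn.execute("CREATE TABLE rules (antecedent TEXT, consequent TEXT, confidence REAL)")
conn.executemany(
    "INSERT INTO rules VALUES (?, ?, ?)",
    [
        (r.get("antecedent"), r.get("consequent"), float(r.get("confidence")))
        for r in ET.fromstring(MODEL_XML).iter("rule")
    ],
)

# Application data already living in the database.
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", "bread"), ("bob", "diapers"), ("bob", "milk")],
)

# Apply the model with a join: suggest each rule's consequent to customers
# who bought the antecedent but not yet the consequent.
recommendations = conn.execute(
    """
    SELECT p.customer, r.consequent
    FROM purchases AS p
    JOIN rules AS r ON p.item = r.antecedent
    WHERE r.consequent NOT IN (
        SELECT item FROM purchases WHERE customer = p.customer
    )
    ORDER BY p.customer
    """
).fetchall()
```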

Data Mining Benchmarks

• UC Irvine repository
• Generating synthetic benchmarks modeled after real data sets is a hard problem
– How to map names into meaningful literals
– How to preserve empirical distributions

Ack: Srikant, Ullman
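A naive baseline helps show why this is hard. The sketch below resamples each column independently from its observed frequencies and replaces real values with opaque literals. It preserves only per-column (marginal) distributions, loses all cross-column correlations, and makes no attempt at meaningful literals, which are exactly the open issues listed above. The function name `synthesize` and the sample records are hypothetical.

```python
import random
from collections import Counter


def synthesize(records, n, seed=0):
    """Generate n synthetic records from per-column empirical frequencies."""
    rng = random.Random(seed)
    columns = list(records[0].keys())
    synthetic = {}
    for col in columns:
        counts = Counter(r[col] for r in records)
        values = sorted(counts)
        # Replace each real value with an opaque literal such as "city_0".
        alias = {v: f"{col}_{i}" for i, v in enumerate(values)}
        weights = [counts[v] for v in values]
        # Sample with probability proportional to observed frequency.
        drawn = rng.choices(values, weights=weights, k=n)
        synthetic[col] = [alias[v] for v in drawn]
    return [{col: synthetic[col][i] for col in columns} for i in range(n)]


# Illustrative "real" data (hypothetical).
real = [
    {"city": "Austin", "plan": "basic"},
    {"city": "Austin", "plan": "premium"},
    {"city": "Austin", "plan": "basic"},
    {"city": "Boston", "plan": "basic"},
]
fake = synthesize(real, n=10000, seed=1)
austin_share = sum(1 for r in fake if r["city"] == "city_0") / len(fake)
```

Here `austin_share` should land near the real share of 0.75, while the relationship between city and plan in the real data is not carried over.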

Auto-focus data mining

• Automatic parameter tuning
• Automatic algorithm selection (à la join method selection in database query optimization)

Ack: Andreas Arning
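As a toy illustration of automatic parameter tuning, the system rather than the user could pick the support threshold, for example the smallest threshold that keeps the number of reported patterns within a budget. The sketch below does this by binary search, relying on the fact that the pattern count never increases as the threshold rises. The function names and the "pattern budget" criterion are assumptions made for this example, and patterns are limited to single items to keep it short.

```python
from collections import Counter


def count_frequent_items(transactions, min_support):
    """Number of items appearing in at least min_support transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    return sum(1 for c in counts.values() if c >= min_support)


def tune_min_support(transactions, max_patterns):
    """Smallest min_support that yields at most max_patterns patterns."""
    lo, hi = 1, len(transactions) + 1  # at hi, no item can be frequent
    while lo < hi:
        mid = (lo + hi) // 2
        if count_frequent_items(transactions, mid) <= max_patterns:
            hi = mid  # budget met; try a lower threshold
        else:
            lo = mid + 1  # too many patterns; raise the threshold
    return lo


# Illustrative data (hypothetical).
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
tight = tune_min_support(baskets, max_patterns=2)
loose = tune_min_support(baskets, max_patterns=4)
```

Algorithm selection would go a step further and compare estimated costs of alternative algorithms, much as a query optimizer compares join methods; that part is not sketched here.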

Web: Greatest opportunity

• Huge collection of data (e.g. Yahoo collecting ~50GB every day)

• Universal digital distribution medium makes data mining results actionable in fundamentally new ways

• But watch for privacy pitfall

Privacy-preserving data mining

• Technical vs. legislated solutions
• Implication for data mining algorithms when some fields of a data record have been fudged according to the user’s privacy sensitivity

Ack: R. Srikant
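One simple way to picture "fudged" fields is additive noise: each user adds zero-mean random noise to a value before releasing it, and the miner corrects aggregate statistics for the known noise distribution. The sketch below assumes uniform noise with a single shared sensitivity level and recovers only the mean and variance; it is an illustration of the idea, not a specific published algorithm, and the names and data are hypothetical.

```python
import random
import statistics


def perturb(value, sensitivity, rng):
    """User side: add uniform noise drawn from [-sensitivity, +sensitivity]."""
    return value + rng.uniform(-sensitivity, sensitivity)


def estimate_mean_and_variance(perturbed, sensitivity):
    """Miner side: correct aggregates for the known noise distribution.

    Uniform noise on [-a, a] has mean 0 and variance a**2 / 3, so the mean
    needs no correction and the noise variance is subtracted.
    """
    mean = statistics.fmean(perturbed)
    variance = statistics.pvariance(perturbed) - sensitivity ** 2 / 3
    return mean, max(variance, 0.0)


# Illustrative data (hypothetical): true ages the miner never sees.
rng = random.Random(42)
true_ages = [rng.gauss(40, 10) for _ in range(20000)]
released = [perturb(age, sensitivity=20, rng=rng) for age in true_ages]

est_mean, est_var = estimate_mean_and_variance(released, sensitivity=20)
true_mean = statistics.fmean(true_ages)
true_var = statistics.pvariance(true_ages)
```

Individual released values can be off by up to 20 years, yet the estimated mean and variance stay close to the true ones. Algorithms that need the full distribution, or that must handle a different sensitivity per user, require more than this.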

Personalization

• Internet might provide for the first time tools necessary for users to capture information about themselves and to selectively release this information†

• Will we be providing these tools?

† John Hagel, Marc Singer. Net Worth. Harvard Business School Press.

What about Association Rules?

• Very long patterns
• Separating wheat from chaff
• Principled introduction of domain knowledge

What else?

• Formal foundations of data mining

Summary

• Closely couple data mining with database systems

• Embed data mining into applications

• Focus on web

• Standard interfaces
• Benchmarks
• Auto focussing
• Personalization
• Privacy

Concluding remarks

• Data mining, a great technology
– Combination of intriguing theoretical questions with large commercial interest in the technology

• Poised for transitioning into mainstream technology

• Will we rise to the challenge as a community?

Acknowledgments

Arning Arnold Bayardo Baur Bollinger Brodbeck

Baune Carey Chandra Cody Faloutsos Gardner

Gehrke Ghosh Greissl Gruhl Grove Gunopulos

Gupta Haas Ho Imielinski Iyer Lent

Leyman Lin Lingenfelder Mason McPherson Megiddo

Mehta Miranda Psaila Raghavan Rissanen Sawhney

Sarawagi Schwenkries Schkolnick Shafer Shim Somani

Srikant Staub Swami Traiger Vu Zait