IBM Software Group
Nigel Freeman, Content Discovery specialist - IBM Software Group
May 2006
Information is Everywhere: Managing Information for Discovery and Search
Agenda
Too much information – drowning or swimming?
IBM is going beyond mere ‘search’…
• IBM Content Discovery architecture
Content Integration services: making connections between existing systems
• Information Integration Content Edition – overview
Enterprise Search: not the same as Internet search
• What do you need from Enterprise Search and text analytics middleware?
• OmniFind – overview
Text Analysis: Unstructured Information Management Architecture (UIMA)
Contextual Delivery, Information Accelerators to generate customer solutions
• WebSphere Content Discovery Server – overview
IBM Content Discovery products, summary
Customer Examples
Drowning in information, or swimming?
Organisations today are faced with an ever-growing abundance of information.
The lack of proper systems to access and manage their collective wisdom can cripple an organisation: not being able to find the relevant information when it is needed, or finding it too late, translates into bad decisions, missed opportunities, and time and money wasted reinventing information that already exists.
“It is clear that we are all drowning in a sea of information. The challenge is to learn to swim in that sea, rather than drown in it.” - from a study by University of California, Berkeley School of Information Management and Systems
By implementing cutting-edge systems for organizing and accessing information, organisations can promote growth at significantly reduced cost.
“An enterprise with 1,000 knowledge workers wastes $48,000 per week – $2.5 million per year – due to an inability to locate and retrieve information.” - The High Cost of Not Finding Information, IDC
IBM w3 advertisement “w3 personalisation…”
The problem…
Information is isolated in multiple silos …
Independent systems: Customer Service, Council Tax, Social Services, Education, Leisure Services, Planning, Housing
… and the vast majority is unstructured
Examples
• Office documents
• Images
• Web pages
• E-mail
• Audio & video
• Free-form text fields (comments/notes)
Where It Exists
• File servers
• Websites
• Portals
• ECM systems
• Collaborative systems
• Databases (BLOBs and free-form text fields)
Typical search experience is not good enough
“Loan”
I need help finding a loan for college
Typical Online Experience
Burden of discovery is on the end user!
There is inherent tension between business and IT
Line-of-Business Owners and Project Leads
• Must deliver information to their specific customers, partners and employees to facilitate business process
• Care most about best-of-breed functionality and direct control over the end user experience
IT Architects and CIOs
• Must make information available from across the enterprise in a secure and standard format
• Care most about achieving leverage and reuse, with a low total cost of ownership
Search App 1 Search App 2 Search App 3
Enterprise Search Infrastructure
The IBM Approach: Content Discovery
Information is isolated in multiple silos
Native, bi-directional access ensures all assets are available and content can be continually improved
Much of it is unstructured, limiting its use
Uncovering the inherent meaning of unstructured content can enhance search relevance, giving new levels of business insight
Traditional search is a bottleneck to facilitating action
Understanding user intent and application context allows organizations to get the right information to the right people at the right time
IT wants standards but business wants control
Complete solutions built on a Service Oriented Architecture allow organisations to balance the needs of business and IT
Going Beyond “Search” to “Find”
IBM Content Discovery Architecture
Content Discovery
Analysis & Discovery Services
Content Integration Services: Broad content access and native integration for secure read and write access
Search & Indexing: Scalable search capability with sophisticated indexing and retrieval
Text Analysis (UIMA): Extract knowledge and meaning, for greater relevance and insight
Contextual Delivery: Understand user intent and context, to guide action and navigate large result sets
Information Accelerators: Industry vocabularies and solution templates shorten deployment time
The Problem: Multiple Silos of Content
[Chart: number of content repositories per organisation. Categories: 1 repository, 2-5 repositories, 6-10 repositories, 10-15 repositories, more than 15 repositories, don't know. Values as extracted, pairing with categories not preserved: 36%, 14%, 25%, 17%, 5%, 4%]
Survey base: 81 North American decision-makers (multiple responses accepted)
Source: “The Future of Content in the Enterprise,” Connie Moore and Robert Markham
WebSphere II Content Edition
SOA, enterprise-class integration architecture for “content”
Single interface to multiple content sources and workflow systems
Many “out of the box” connectors and toolkit for custom connectors
Two-way access to expose underlying functionality
Adds cross-repository services such as federated search, event services, single sign-on, etc.
“Out of the box” client, development components and APIs for building custom applications
CALL CENTER, COMPLIANCE, SELF-SERVICE, CRM, WEBSITES
Lets you work with content from multiple disparate content sources as if it were stored in one unified system
Content Integration Services: Seamless Access to Distributed Content from Business Applications
Display associated metadata with the ability to preview a document and update content or properties
Provide a single point of access to all documents associated with the customer, regardless of where they are stored
WebSphere II Content Edition: Integration Services
Many Out-of-the-Box Connectors
Pre-built and fully supported real-time, bi-directional connectors
Exposes content, workflow and functionality of underlying systems
Available for most major commercial systems, including…
Connector SDK for custom systems
INTEGRATION SERVICES
Documentum Content Server, FileNet Content Services, FileNet Image Services, FileNet P8 Content Manager, FileNet P8 Business Process Manager, Hummingbird DM, IBM Content Manager, IBM Content Manager OnDemand, IBM Portal Document Manager, Lotus Domino Document Manager, IBM Lotus Notes, IBM WebSphere MQ Workflow, Interwoven Teamsite Content Server, Microsoft Index Server, OpenText Livelink Enterprise Server, Stellent Content Server, File Systems, Lab Services, Partner Connectors
WebSphere II Content Edition: Federation Services
Meta Data Mapping
• Common schema across different systems
Federated Search
• Single search interface across multiple disparate systems
Virtual Repository
• Single, unified view of distributed content
• Consolidated view of work tasks from multiple workflow systems
Subscription Event Services
• Subscription-based notification of changes to content, across multiple repositories
View Services
• Convert content on-the-fly to browser-readable formats (e.g. PDF, HTML)
Single Sign-On (SSO) authentication
• Native, and integration with LDAP and Active Directory
INTEGRATION SERVICES
FEDERATION SERVICES
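The metadata mapping and federated search ideas on this slide can be illustrated with a minimal sketch. It is not the WebSphere II Content Edition API: the repository names, field names and the functions `to_common` and `federated_search` are all hypothetical, and the in-memory lists stand in for real connectors.

```python
# Hypothetical repositories whose native metadata uses different field names.
REPOSITORIES = {
    "repo_a": [{"doc_title": "Loan policy", "created_by": "ann"}],
    "repo_b": [{"name": "Student loan FAQ", "owner": "bob"},
               {"name": "Leisure centre hours", "owner": "cat"}],
}

# Assumed mapping from each repository's native fields to one common schema.
FIELD_MAPS = {
    "repo_a": {"doc_title": "title", "created_by": "author"},
    "repo_b": {"name": "title", "owner": "author"},
}


def to_common(repo_name, record):
    """Translate a native record into the common schema and tag its source."""
    mapping = FIELD_MAPS[repo_name]
    common = {mapping[key]: value for key, value in record.items() if key in mapping}
    common["source"] = repo_name
    return common


def federated_search(repositories, term):
    """Run one title search across every repository and merge the hits."""
    term = term.lower()
    hits = []
    for repo_name, records in repositories.items():
        for record in records:
            item = to_common(repo_name, record)
            if term in item.get("title", "").lower():
                hits.append(item)
    return hits


results = federated_search(REPOSITORIES, "loan")
```

The caller sees one result list with uniform `title` and `author` fields, regardless of which system held the document.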
WebSphere II Content Edition: Developer Services
Federated Client
• Complete out-of-the-box UI for working with distributed content
• Includes key functionality and a highly usable interface
Web Components
• Accelerates time to market for custom applications
• Development components plug into web applications
• Completely customizable look and feel
• Includes JSR 168 compliant portlets
WebSphere II Content Edition API
• Complete access to content and workflow functionality
• Easy to use Java API and SOAP-based Web Services API
INTEGRATION SERVICES
DEVELOPER SERVICES
FEDERATION SERVICES
IBM Federated Records Management
…the application of records management to distributed content
Consists of IBM DB2 Records Manager, WebSphere II Content Edition, FRM Solution Components*
Key Features
• Central policy mgmt on distributed content
• “Touchless” records declaration
• Federated search for discovery operations
• Two-way, consistent UI to content systems
Business Value
• Reduce risk with centralized RM policies
• Accelerate time to compliance
• Reduce discovery costs
• Consolidate over a phased timeframe
• Provide a “future proof” infrastructure
[Diagram: DB2 Records Manager working with DB2 Content Manager and other content repositories (DCTM, FILE, OTEX, HUMC, …). Two options are shown: leave records in native repository, or move records to strategic repository at declaration]
*Services Offering
OmniFind: it’s not Google…
…because Intranet Search is different from Internet Search
Corporate intranets are smaller … but it’s more difficult to return highly relevant results
• Less content in a corporate intranet … lower chance for perfectly matching document
• Less well linked – fewer links and anchor text cues – so Page Ranking isn’t the answer
• The heterogeneous nature (both in form and size) makes search precision difficult
Search and content management are the top two capabilities expected by 289 Portal customers
Q26: For which solutions do you plan to keep your existing tool, and for which would you like the portal to provide?
Solution: Intend to keep existing tool / Would like Portal to provide
Search: 32% / 68%
Content management: 39% / 61%
Reporting: 40% / 60%
Authentication/single sign on: 41% / 59%
Process automation/workflow: 42% / 59%
Collaboration: 43% / 57%
Directory: 43% / 57%
Enterprise application integration (EAI): 46% / 54%
Taxonomy: 52% / 48%
Activity Tracking: 60% / 41%
Application server: 63% / 37%
Desktop productivity (spreadsheet, word processing, etc.): 68% / 32%
Windows desktop: 79% / 21%
* Base = Those with portal solutions implemented, planned or under evaluation.
Reference: Enterprise Portal Purchase and Usage Characteristics, Final Report, META Group Multi-Client Study, November 2003
WebSphere II OmniFind Edition
Crawl Index Search
Excellent search quality
Complements and uses IBM’s offerings in portal, content management, and Information Integration
Crawls a broad range of enterprise data sources
Leverages systems’ own security mechanisms
Open architecture (UIMA) for text analytics and semantic queries
Rich multilingual capabilities
Keyword search
Semantic search
Text analysis
Key Technologies
Sources of Enterprise Content
Crawling: Scalable Web crawler, Data Source crawlers, Custom Crawlers
Parsing/Tokenizing: HTML / XML, 200+ Doc Filters, Advanced Linguistics
Categorization (optional)
Indexing: Global Analysis, Static Ranking, Store
Searching: Dynamic & Admin-influenced ranking, Fielded Search, Parametric Search, Semantic search
Search Applications
Text Analytics, Partner Apps, UIMA
Security
OmniFind Crawlers
Web content
• HTTP / HTTPS
• News groups (NNTP)
• WebSphere Portal portlets and Portal Document Manager
Collaboration
• Lotus Notes / Domino databases, Domino.Doc, QuickPlace
• MS Exchange public folders
Windows and Unix file systems
• Over 250 file formats: PDF, MS Word / Excel / PowerPoint, Lotus SmartSuite, etc.
Enterprise Content Management systems
• DB2 Content Manager
• Via WebSphere Information Integrator Content Edition: FileNet Content Services, FileNet P8, Documentum, Hummingbird DM, OpenText LiveLink and more in future
Relational data sources
• DB2 family (DB2, Informix, DB2 for z/OS)
• WS Information Integrator relational data sources (Oracle, Informix, MS SQL Server, Sybase)
• Federated access to LDAP and JDBC
Data Listener API for custom crawlers
[Diagram labels: II Standard Edition, Content Manager, QuickPlace, Domino, Domino.doc, MS Exchange, Windows File System, Unix File System, Websites, Newsgroups, Data Listener, II Content Edition, SQL Server]
OmniFind Security
Security can be set at Collection level or Document level
OmniFind uses the application’s own security for Access-Control Lists for the following data sources:
Lotus Notes / Domino
Domino Document Manager
QuickPlace
WebSphere Portal Document Manager
Portal pages
FileNet CS
Windows File System
Documentum
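As an illustration of the difference between collection-level and document-level security, here is a minimal sketch. It does not describe how OmniFind implements security internally: the `IndexedDoc` type, the `search` function and the idea of storing ACL principals alongside each indexed document are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    doc_id: str
    collection: str
    text: str
    allowed: frozenset  # assumed: users/groups taken from the source system's ACL


def search(index, query, principals, collections):
    """Return ids of matching documents the caller is entitled to see.

    principals:  set of user and group names for the current user
    collections: set of collections the user may search (collection level)
    """
    q = query.lower()
    return [
        doc.doc_id
        for doc in index
        if doc.collection in collections   # collection-level check
        and doc.allowed & principals       # document-level check
        and q in doc.text.lower()
    ]


INDEX = [
    IndexedDoc("n1", "notes", "Pump design review", frozenset({"engineers"})),
    IndexedDoc("n2", "notes", "Pump pricing", frozenset({"sales"})),
    IndexedDoc("h1", "hr", "Pump operator salaries", frozenset({"hr"})),
]

visible = search(INDEX, "pump", {"alice", "engineers"}, {"notes", "hr"})
```

Here the user may search both collections, but only the document whose ACL names one of her principals is returned.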
Language Support in OmniFind
Linguistic Support
The document language is detected automatically and used for language-specific result filtering at search time. Language-specific base form computation (e.g. “mouse” for “mice”) is provided.
OmniFind has linguistic support for: Chinese (Simplified & Traditional), Czech, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Japanese, Korean, Norwegian (Bokmal & Nynorsk), Polish, Portuguese, Russian, Spanish, Swedish
Automatic language detection also works for Arabic, Hebrew, Hungarian and Turkish (but no base form support yet).
Basic Support
Text is segmented using either white space information (for simple text languages) or n-grams (for complex text languages). If simple and complex script languages are mixed in one document, the best segmentation strategy (either white space or n-gram) is selected for each individual script range within the document. Basic support processing should work for all languages. No language limitation is built into OmniFind. IBM tests basic support for the following list of languages:
Simple Text Languages (STL): Albanian, Bulgarian, Belarusian, Catalan, Croatian, Estonian, Hungarian, Icelandic, Indonesian, Kazakh, Latvian, Lithuanian, Macedonian, Malay, Romanian, Serbian (Cyrillic & Latin), Slovak, Slovenian, Turkish, Ukrainian
Complex Text Languages (CTL): Arabic, Bengali, Gujarati, Hebrew, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu, Thai, Vietnamese
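The basic-support segmentation described above (white space for simple text, n-grams for complex text, chosen per script range) can be sketched as follows. This is an illustration only, not OmniFind's implementation: the function names are hypothetical, only the Thai Unicode block is treated as a complex script, and bigrams (n = 2) are an assumed default.

```python
from itertools import groupby

# Assumption for illustration: only the Thai block counts as "complex" script.
COMPLEX_RANGES = [(0x0E00, 0x0E7F)]


def is_complex(ch):
    """True if the character falls in a script range needing n-gram segmentation."""
    return any(lo <= ord(ch) <= hi for lo, hi in COMPLEX_RANGES)


def ngrams(run, n=2):
    """Overlapping character n-grams; a short run is returned whole."""
    if len(run) <= n:
        return [run]
    return [run[i:i + n] for i in range(len(run) - n + 1)]


def segment(text, n=2):
    """Split on white space, then pick a strategy for each script range."""
    tokens = []
    for word in text.split():
        for complex_run, chars in groupby(word, key=is_complex):
            run = "".join(chars)
            tokens.extend(ngrams(run, n) if complex_run else [run])
    return tokens
```

For example, `segment("hello world")` yields the two white-space tokens, while a run of four Thai characters yields three overlapping bigrams.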
Search & Indexing Services: Simple “Google” Style Search for Enterprise Content
Out-of-the-box search application provides “Google”-style results list with paging
• relevancy ranking, date, field values
• site collapse
• customizable look and feel
Configurable ‘Quick links’ provide immediate access to predetermined relevant sites, documents or applications
Broad support for searching across enterprise content sources
“Did you mean?” synonym expansion provides one click access to other potentially relevant queries or can be used for spelling correction
Unstructured Information Management Architecture (UIMA)
Text Analysis Services: Leveraging Knowledge Buried in Unstructured Information
Most BI implementations ignore knowledge buried within free form text
• They can only report on predefined structured data, such as problem codes…
• Problem descriptions, technician comments, call center notes and customer correspondence can contain a lot of the supporting details required for true insights
Text Analysis Services: Extract Knowledge From Unstructured Information
Identify concepts, entities and facts buried in unstructured content
• Determine underlying issues or problems, parts referenced and actions from technician or customer service notes, customer surveys, consumer review sites and other sources
PART 1: Fuel Pump
PART 2: Fuel Filter
PART 3: Wiring Harness
PART 4: Wiring Harness Cover
PROBLEM 1: Corrosion (PART 3: Wiring Harness)
ACTION 1: Replace (PART 1: Fuel Pump, PART 2: Fuel Filter)
ACTION 2: Remove (PART 4: Wiring Harness Cover)
Extracted knowledge can now be sent to a search engine, database or delivered as a service to rules processing engines and other business applications
• Provide broader access through more simplified search and browse interfaces
Text Analysis Services: Leveraging Knowledge Buried in Unstructured Information
Report on facts extracted from unstructured information
• Show other parts referenced, underlying root problems or issues, and actions taken…
• Create alerts to be notified of specified findings or thresholds
Provide simplified search interface extending access to broader set of users
• Easily find information about claims involving a fuel pump…
• See all of the other parts, problems and actions referenced in the warranty claim
UIMA: Unstructured Information Management Architecture
A “plug and play” framework for advanced text analysis components
[Diagram: WebSphere II OmniFind Edition text analysis pipeline. Stages: Identify Language, Find Words & Roots, Categorization, Plug-In Annotators. Output: Extracted Metadata and Facts, feeding a Search Index, Search Application, Text Data Warehouse, Rules Engine, Reports, ...any Application]
UIMA framework allows “Annotators” to add value to text:
• find words specific to an industry, from dictionary or by rules
• add further information around these terms, like Latitude/Longitude for places
• allow indexed and annotated results to go to other processes / systems as well as to a Search Engine, for further analysis or semantic search
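The annotator idea can be sketched in a few lines: each plug-in component reads the document text and adds typed annotations that later stages, or downstream applications, can consume. The sketch below is not the UIMA API; `Annotation`, `Document`, `dictionary_annotator` and `run_pipeline` are hypothetical names, and the dictionaries are invented to mirror the warranty-claim example used earlier in the deck.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    label: str   # e.g. PART, PROBLEM, ACTION
    begin: int   # character offset where the match starts
    end: int     # character offset just past the match
    text: str    # the covered text


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def dictionary_annotator(label, terms):
    """Build an annotator that marks every occurrence of the given terms."""
    # Longest terms first, so "wiring harness cover" wins over "wiring harness".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)

    def annotate(doc):
        for match in pattern.finditer(doc.text):
            doc.annotations.append(
                Annotation(label, match.start(), match.end(), match.group(0)))

    return annotate


def run_pipeline(doc, annotators):
    """Apply each plug-in annotator in turn to the shared document."""
    for annotate in annotators:
        annotate(doc)
    return doc


note = ("Corrosion on wiring harness. Replaced fuel pump and fuel filter; "
        "removed wiring harness cover.")
pipeline = [
    dictionary_annotator("PART", ["fuel pump", "fuel filter",
                                  "wiring harness", "wiring harness cover"]),
    dictionary_annotator("PROBLEM", ["corrosion"]),
    dictionary_annotator("ACTION", ["replaced", "removed"]),
]
doc = run_pipeline(Document(note), pipeline)
facts = [(a.label, a.text.lower())
         for a in sorted(doc.annotations, key=lambda a: a.begin)]
```

The resulting `facts` list could then be written to a search index, a database or a rules engine, which is the hand-off the slide describes.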
WebSphere Content Discovery Server (iPhrase)
WCDS demo on-screen “WCDS Self Service demo.exe”
WebSphere Content Discovery for Self Service
Understands user intent and provides actionable response
Embed rich HTML responses within result
Interactive promotion guides action
Contextual Delivery Services: Integration into Contact Centres facilitates faster Problem Resolution
Launch query for possible resolutions directly from Siebel Call Center…
…leverage context and customer info to automatically find most relevant content
Return integration enables creation of new solutions based on findings
Enable agents to easily filter content by source, product and other attributes
Contextual Delivery Services: Business User Control
Empower business managers to easily refine the end-user experience
Monitor end-user behavior and effectiveness of business rules
IBM Product Offerings
WebSphere Content Edition (Content Integration): Integrating Content from Multiple Sources into Business Applications
WebSphere OmniFind Edition (Search & Indexing, Text Analytics): Infrastructure for Enterprise Search and Text Analytics
WebSphere Content Discovery Server (Contextual Delivery): Business Driven Search Applications
Customer Examples
Wachovia improved business effectiveness and addressed compliance issues by providing integrated view of all content
Growth through Acquisition
Challenge
Access and work with content from multiple repositories following mergers
Deliver repository independent customer service, brokerage and workflow applications
Benefits
Greater accessibility resulted in 50-fold increase in number of content retrievals
$2.3 million savings within 2 years for a 64% return on initial investment
$1 million savings for each additional business unit implementing content integration services
Business executives have immediate access to newly acquired systems
Content Integration
IFPMA makes it easier for doctors and patients to research clinical trial information worldwide
Challenge
Doctors and patients need to find info about all clinical trials sponsored by the pharmaceutical industry
Unstructured information from multiple companies and clinical trials registries
Benefits
Enables searching by disease area, medicine name or trial location
Recognizes medical and geographical synonyms across multiple languages, without manual indexing
Allows doctors and patients to find trials they can join and review summarized results
Search & Indexing
Text Analytics
CBI Engineering increased productivity by allowing employees to access Lotus Notes from their intranet search solution
Challenge
Need for improved search relevancy across file system and Lotus Notes to make engineers more productive
Must respect security already defined within Lotus Notes
Benefits
Common search framework for intranet, file system and Lotus Notes content
Engineers able to seamlessly access native Notes documents from intranet search results
Allowed CBI to provide broad content access while honoring stringent native repository security
Search & Indexing
IBM Workplace for Customer Support (Lotus Premium Support) increased customer satisfaction and productivity with Content Discovery
Challenge
Revitalize customer interest in using lower cost online support channel
Streamline customer self-sufficiency while continuing to deliver personalized service from IBM support staff
Benefits
Increased customer satisfaction through the delivery of relevant information in 3 clicks or less
Unified content from disparate repositories to simplify problem resolution
Enabled resolution of repetitive product problems in less than five minutes
Decreased number of problem management reports submitted
Personalization enables results to be automatically limited to customer owned products
Customers can escalate and preserve context
Enables searching across multiple content stores and easy user navigation
Contextual Delivery
Summary
Getting the right information to the right people at the right time is a key element of achieving Information On Demand
IBM is building this capability around a portfolio of:
• Content Integration
• Text Analytics
• Search & Indexing
• Contextual Delivery
• Information Accelerators
IBM Content Discovery brings these capabilities together to help organizations drive measurable results for their business
The IBM Content Discovery software portfolio
WebSphere Content Discovery Server
• Allows organizations to: quickly deploy business driven solutions that increase revenue and reduce support costs
• By providing: a rich understanding of user intent and application context to help people quickly find the information they need to make purchases, answer questions, and solve problems
• Example initiatives: eCommerce, Self-Service websites
WebSphere II OmniFind Edition
• Allows organizations to: implement a single search architecture to underpin enterprise portal and BI initiatives
• By providing: robust enterprise search capabilities and a text analytics foundation able to uncover the inherent meaning of large volumes of content from around the globe
• Example initiatives: Issues Analytics, Intranet Search
WebSphere II Content Edition
• Allows organizations to: manage, leverage and extend their enterprise content without painful ripping and replacing
• By providing: virtual access to dozens of content silos via a single interface to increase productivity, manage risk, and lower development costs
• Example initiatives: Records Management, M&A Content Migration
OmniFind - Linguistic Analysis
Linguistic processing when adding document to index
• Determines language of document
• Tokenizes text
• Creates index using tokens
Linguistic processing performed during search
• Query string segmented, analyzed, searched in index
Stop word removal – removing “a”, “the”, etc.
Character normalization
• Normalization performed in Unicode
• Case normalization – finding documents with “USA” when searching with “usa”
• Umlaut normalization – finding documents with “schoen” when searching with “schön”
• Accent removal – finding documents with “é” when searching for “e”
• Other diacritics removal – finding documents with “ç” when searching for “c”
• Ligature expansion – finding documents with “Æ” when searching for “ae”
• Normalization works in both directions
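One way to get normalization that "works in both directions" is to reduce both indexed tokens and query tokens to the same normalized key. The sketch below shows that approach with Python's standard `unicodedata` module. It is an illustration, not OmniFind's algorithm: the expansion table, the stop-word list and the function names are assumptions, and the umlaut expansions follow German convention.

```python
import unicodedata

# Assumed expansion table: German umlauts and common ligatures.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss", "æ": "ae", "œ": "oe"}
STOP_WORDS = {"a", "an", "the"}


def normalize(token):
    """Map a token to the key used for both indexing and querying."""
    # Compose first so umlauts are single code points, then fold case.
    token = unicodedata.normalize("NFC", token).casefold()
    # Umlaut and ligature expansion must run before accents are stripped.
    token = "".join(EXPANSIONS.get(ch, ch) for ch in token)
    # Decompose and drop combining marks (accents, cedillas, and so on).
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def index_terms(text):
    """Tokenize on white space, normalize, and drop stop words."""
    terms = (normalize(tok) for tok in text.split())
    return [t for t in terms if t not in STOP_WORDS]
```

Because `normalize("Schön")` and `normalize("schoen")` produce the same key, a query in either spelling matches a document in either spelling.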
OmniFind - Linguistic Analysis
Recognize documents in a wide range of languages
• Arabic, Chinese (traditional and simplified), Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hungarian, Italian, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish, Swedish, Turkish
Dictionary-based linguistic support for documents in recognized languages
• Word segmentation
• Stemming, find “mice” when searching for “mouse”
• Break contractions into parts, make “wouldn’t” into “would” and “not”
• Clitics, a form of contractions, make “l’avenue” into “le” and “avenue”
• Recognize non-alphabetic characters as part of or separate from a lexical unit, e.g., URLs, dates
• Recognize abbreviations
• Recognize end of sentence for sentence segmentation
Basic support for documents not in a recognized language
• Word segmentation via white space or blanks, and n-gram segmentation
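The dictionary-based steps listed above (base forms, contractions, clitics) can be illustrated with a toy lookup. The three dictionaries below contain only the examples from the slide and are assumptions for the sketch; a real linguistic engine would use full per-language dictionaries, and `analyze` is a hypothetical function name.

```python
# Toy dictionaries holding only the slide's own examples.
BASE_FORMS = {"mice": "mouse"}
CONTRACTIONS = {"wouldn't": ["would", "not"]}
CLITIC_PREFIXES = {"l'": "le"}


def analyze(token):
    """Return the list of base-form tokens for one input token."""
    # Treat the typographic apostrophe like the plain one.
    token = token.lower().replace("\u2019", "'")
    if token in CONTRACTIONS:
        return list(CONTRACTIONS[token])
    for prefix, full_form in CLITIC_PREFIXES.items():
        if token.startswith(prefix) and len(token) > len(prefix):
            rest = token[len(prefix):]
            return [full_form, BASE_FORMS.get(rest, rest)]
    return [BASE_FORMS.get(token, token)]
```

Indexing the output of `analyze` rather than the raw token is what lets a search for “mouse” find a document containing “mice”.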