Date post: | 08-Apr-2017 |
Category: | Technology |
Upload: | abbyy |
ABBYY Compreno
Driving Impact from Unstructured Information Analytics
<NAME><DATE>
ABBYY Worldwide
Global
16 offices with more than 1,250 employees in Europe, the USA, Asia, Australia and Russia
Innovative
27% of revenue invested in R&D, more than 400 developers and scientists
Reliable
Connected
Trusted partner to over 1000 companies in more than 150 countries around the world
Successful
> 40 million software users process more than 9 billion pages per year with ABBYY products
Enabling
Recognise, capture, (translate), analyse – we transform information into action
Strong and independent core technology that evolves with the needs of the digital revolution
Digital Universe
2.5 exabytes of data generated every day = 2.5 million terabytes = 2.5 x 10^18 bytes (source: Northwestern University, 2016)
Majority (ca. 80%) is unstructured
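The unit conversion on this slide can be checked with a few lines of Python. This is only a sanity check of the arithmetic, assuming decimal (SI) units and taking the "ca. 80%" share at face value; the variable names are illustrative.

```python
# Decimal (SI) byte units, as used on the slide.
TERABYTE = 10**12
EXABYTE = 10**18

# 2.5 exabytes per day, kept as an exact integer.
daily_bytes = 25 * EXABYTE // 10
daily_terabytes = daily_bytes // TERABYTE

# Roughly 80% of that volume is unstructured, according to the slide.
unstructured_bytes = daily_bytes * 80 // 100

print(f"{daily_bytes:.1e} bytes per day")
print(f"{daily_terabytes:,} terabytes per day")
print(f"{unstructured_bytes:.1e} bytes of unstructured data per day")
```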
● 1.4 x 10^14 Word pages
● 3.5 x 10^13 PPT slides
● 2 x 10^13 PDF pages (image & text)
● 2 x 10^14 emails
● 4 x 10^13 scanned pages
● 3 x 10^13 images (.tiff)
● 1.4 x 10^16 .txt files
(source for average file sizes: netdocuments.com, 2016)
Reports, brochures, datasheets, presentations, research documents, service documents, pricelists, process descriptions, project descriptions, product feature specifications, customer communication, accident/security reports, contracts, email, web texts, articles in magazines, complete intranets …
Unstructured Content I
What do unstructured documents have in common?
● They are composed in natural language
What is the problem with natural language?
● Complex to analyse and summarise
● Has a structure, but one that is not standardised (different people use different terms, expressions and syntax to talk about the same thing)
● Content is unexpected and cannot be processed with rules
● Limited or no metadata
Unstructured Content II
● The computer does not know what the document is about and there is no source to get this information from
● Information is “locked” within documents
● Information that may be valuable, confidential, business-critical or defensibly deletable is difficult to find and manage
There is no business value in content that can’t be analysed or found
Natural language requires dedicated processing technology
ABBYY Compreno
What is it? Natural Language Processing (NLP) technology
What does it do? Advanced automated text analysis
● Gathers information about a document from the document
● Understands meaning of words within context
● Reveals relationships between words
● Builds stories across documents
● Extracts insights and intelligence from unstructured text
How Compreno works: Key components
Semantics: Semantic analysis is used to interpret syntactic structures in terms of universal, language-independent concepts and their relations.
Syntax: Identifies formal relations among words in a sentence or across several sentences. The system analyses a text and builds a tree of syntactic relations.
Statistics: Data gleaned from parallel and monolingual corpora are used for training the analysis algorithms and for verifying and expanding the formal descriptions available to the system.
ABBYY Compreno: Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and events from texts
What is classification?
To go from … to
Mammals | Birds | Reptiles | Fish
Categorisation based on particular shared features
How document classification works: Three main steps
Training: Set up model, define categories, select/collect training documents, train model, choose best algorithm
Test and tune: Analyse test results, eliminate mistakes, adjust training set, retrain model
Classification: Deploy model to production, classify documents
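The three steps (training, test and tune, classification) can be sketched with a toy text classifier. This is a minimal illustration using a plain multinomial naive Bayes model written from scratch; it is not the algorithm used by ABBYY, and the sample texts, category names and function names are all made up for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(samples):
    """Training: count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in samples:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return {"words": word_counts, "docs": doc_counts, "vocab": vocabulary}


def classify(model, text):
    """Classification: return the category with the highest log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_category, best_score = None, -math.inf
    for category, n_docs in model["docs"].items():
        counts = model["words"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for token in tokenize(text):
            # Laplace smoothing so unseen words do not zero out a category.
            score += math.log((counts[token] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def evaluate(model, control_set):
    """Test and tune: share of control documents that are classified correctly."""
    hits = sum(classify(model, text) == category for text, category in control_set)
    return hits / len(control_set)


training_set = [
    ("invoice total amount due payment", "invoice"),
    ("payment due invoice number amount", "invoice"),
    ("agreement between the parties hereby", "contract"),
    ("the parties agree to the terms of this agreement", "contract"),
]
control_set = [
    ("amount due on this invoice", "invoice"),
    ("terms agreed by both parties", "contract"),
]

model = train(training_set)
accuracy = evaluate(model, control_set)
print(accuracy, classify(model, "please settle the invoice payment"))
```

If the accuracy on the control set is too low, the training set is adjusted and the model retrained before it is used on new documents.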
Document classification – Why?
Essential step in information management
Enable advanced analysis and decision-making
Generate business value
Why is classification not as easy as it seems?
Building up a reliable classification workflow is difficult…
Big Content
Technical challenges
- Big training sets
- Complex algorithms
- Difficult to integrate
Business challenges
- Traditional classification methods don’t do the job
- High investment in building and maintaining the required rule sets and classification schemes (classification expert knowledge)
New, dedicated processing methods required
Unstructured documents
ABBYY Smart Classifier
● Text classification module for organising unstructured documents
● Assigns unseen documents to predefined categories based on statistical, morphological and semantic analysis
● Uses supervised machine learning to produce a classification model from sample inputs
● Classification creates metadata derived from the document context
Next generation document classification
Unstructured information processing
● Unlock information
● Make content searchable, accessible and retrievable
Automated classification
● High speed
● Constant quality
● No manual work
Semantic-based classification
● Deep text analysis techniques employed for even more accurate classification
Smart Classifier features and values
Machine learning
● System learns automatically based on the training documents
● No particular knowledge required to set up classification
● No specification of rules necessary
● Small training sets
Automatic algorithm optimisation
● Selection of the best-performing algorithm for each document set
Smart Classifier features and values
Simple UI
● No specific knowledge required to create a model, train the system and launch a classification workflow
Input document formats and languages
● Process content regardless of original format
● OCR for processing of images
● 39 classification languages
IT integration of Smart Classifier: Leverage existing systems and infrastructure
Smart Classifier Workflows
Create and deploy classification model
01 | Category definition and selection of sample documents
02 | Setup of classification model
03 | Model training
04 | Model testing, quality evaluation and tuning
05 | Deployment to production
Document classification workflow
01| Category definition and selection of sample documents
● Category = a group of documents that have particular shared features
● Category definition is a management decision; no special IT skills required
● Content and process experts select representative documents for each category
● Minimum: 10 documents per category
● For reliable statistics: approx. 100 documents per category
● Representative sample of documents
  ● Documents must be typical of their category: the more representative of its category a document is, the better the model will perform (garbage in, garbage out)
  ● The proportion of documents assigned to each category should be the same as in the collection of documents to be classified
● Smart Classifier accepts many formats: plain text, Office, HTML, XML and PDF (image formats are submitted to OCR)
● Folder structure: each (sub-)category = dedicated (sub-)folder
● Create a training set and a control set and save them as ZIP files
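The folder convention and the ZIP packaging described above can be sketched with the Python standard library. This is an illustration only: the category names, file names and helper functions are hypothetical, and the exact archive layout that Smart Classifier expects should be taken from its documentation.

```python
import tempfile
import zipfile
from pathlib import Path

MIN_DOCS_PER_CATEGORY = 10  # minimum stated on the slide


def category_counts(root):
    """Count documents per category; each sub-folder of root is one category."""
    root = Path(root)
    return {
        folder.name: sum(1 for f in folder.rglob("*") if f.is_file())
        for folder in sorted(root.iterdir())
        if folder.is_dir()
    }


def zip_document_set(root, zip_path):
    """Write the folder tree under root to a ZIP file, keeping category folders as paths."""
    root = Path(root)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(root.rglob("*")):
            if file.is_file():
                archive.write(file, file.relative_to(root).as_posix())


# Demo with a throwaway folder tree (category names are made up).
with tempfile.TemporaryDirectory() as tmp:
    training_root = Path(tmp) / "training_set"
    for category, n_docs in {"contracts": 12, "invoices": 3}.items():
        (training_root / category).mkdir(parents=True)
        for i in range(n_docs):
            (training_root / category / f"doc_{i}.txt").write_text("sample text")

    counts = category_counts(training_root)
    # Categories that fall below the stated minimum need more sample documents.
    too_small = [name for name, n in counts.items() if n < MIN_DOCS_PER_CATEGORY]

    zip_path = Path(tmp) / "training_set.zip"
    zip_document_set(training_root, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archived_names = archive.namelist()

print(counts, too_small, len(archived_names))
```

The same two helpers can be run on the control set folder to produce the second ZIP file.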
02| Setup of classification model
● The classification model defines how, and by which categories, document classification will be performed.
● Model creation via Model Editor web UI or REST API (code samples included in documentation)
● Set parameters
  ● Document language (39 languages supported)
  ● Category assignment (which category will be assigned to the document if more than one was returned as a candidate category)
  ● Quality criteria (trade-off between precision and recall)
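One way to picture the category assignment and quality criteria settings is a confidence threshold applied to the candidate categories. The sketch below is an assumption made for illustration, not the behaviour or API of the Model Editor: the function name, the `min_confidence` parameter and the scores are hypothetical.

```python
def assign_category(candidates, min_confidence=0.5):
    """Pick one category from a {category: confidence} mapping.

    Returns None when no candidate reaches min_confidence. Raising the
    threshold favours precision (fewer wrong assignments), lowering it
    favours recall (fewer documents left unassigned).
    """
    if not candidates:
        return None
    best = max(candidates, key=candidates.get)
    return best if candidates[best] >= min_confidence else None


candidates = {"contract": 0.62, "invoice": 0.31}
print(assign_category(candidates))
print(assign_category(candidates, min_confidence=0.8))
```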
02| Setup of classification model: Model Editor web interface
03| Model training
● Load training documents
● Train classification model
● Machine learning
  ● The system automatically identifies and uses the most relevant features from the training documents for creating the classification model
04| Model testing, quality evaluation and tuning
● Load and test control set to determine whether the training process was successful
● Classification results in control set must meet expectations before model can be deployed
● Model Editor provides instant visibility of each document within a classification project
● Source text and keywords picked by the algorithms can be analysed and checked
● Terms that should be ignored during classification can be added to a stop word list
● Analyse: F-measure, precision, recall
● Debug: Confidence level, selected keywords
● Adjust: Inclusiveness, stop words, documents in classes (re-assign category)
● Upload further training/control documents
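The three quality measures named above have standard definitions, which can be computed per category from the control set results. The sketch below uses the usual formulas (precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure as their harmonic mean); the labels are invented, and the Model Editor may report these figures in a different form.

```python
def precision_recall_f1(expected, predicted, category):
    """Compute precision, recall and F-measure (F1) for one category."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(e == category and p == category for e, p in pairs)
    false_pos = sum(e != category and p == category for e, p in pairs)
    false_neg = sum(e == category and p != category for e, p in pairs)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Expected categories of the control set versus what the model returned.
expected = ["invoice", "invoice", "invoice", "contract", "contract"]
predicted = ["invoice", "invoice", "contract", "contract", "invoice"]

precision, recall, f1 = precision_recall_f1(expected, predicted, "invoice")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```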
05| Deployment to production
● When the model is deployed it becomes available via the Compreno REST API
● If you make changes to the model, it needs to be retrained for changes to become effective
Document classification workflow
Once the system is set up and a classification model is published for operation, incoming classification tasks will be accepted
01| A new document classification task is created
02| The document is converted into an internal format
03| The document is classified
04| The document classification results are saved
05| The task is completed
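The five task steps can be modelled as a simple sequence of states. The sketch below only illustrates the order of events; the class names, the placeholder classifier and the in-memory result store are assumptions and do not reflect the Compreno REST API or its internal document format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class TaskState(Enum):
    CREATED = "01 task created"
    CONVERTED = "02 document converted"
    CLASSIFIED = "03 document classified"
    SAVED = "04 results saved"
    COMPLETED = "05 task completed"


@dataclass
class ClassificationTask:
    task_id: str
    raw_document: bytes
    text: str = ""
    category: Optional[str] = None
    history: List[TaskState] = field(default_factory=lambda: [TaskState.CREATED])


def run_task(task: ClassificationTask,
             classify: Callable[[str], str],
             results: Dict[str, str]) -> ClassificationTask:
    """Walk one task through steps 02 to 05 in order."""
    # 02: stand-in for the internal format; here simply decoded text.
    task.text = task.raw_document.decode("utf-8")
    task.history.append(TaskState.CONVERTED)
    # 03: any callable that maps text to a category name.
    task.category = classify(task.text)
    task.history.append(TaskState.CLASSIFIED)
    # 04: persist the result; a dict stands in for real storage.
    results[task.task_id] = task.category
    task.history.append(TaskState.SAVED)
    # 05: mark the task as finished.
    task.history.append(TaskState.COMPLETED)
    return task


def toy_classifier(text: str) -> str:
    """Placeholder rule, not a trained model."""
    return "invoice" if "invoice" in text.lower() else "other"


results: Dict[str, str] = {}
task = run_task(ClassificationTask("task-1", b"Invoice no. 42"), toy_classifier, results)
print([state.name for state in task.history], results)
```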
Document classification workflow
● Classification results in Model Editor
Smart Classifier application scenarios
Enterprise content management and its subdomains
Archiving, records management (Information Governance), document management, enterprise search
● Classification of incoming and stored documents
● Definition of category-based access rights and retention policies
● Search enhancement
Smart Classifier application scenarios: Information lifecycle
Create | Capture | Manage | Store | Archive | Dispose
Classification of incoming documents: Add documents to the system that have a value, i.e. are enhanced with metadata
Classification for aid in risk mitigation: Category-based document access rights
Category-based disposal policy
Classification for aid in compliance: Category-based retention policy
Classification to improve enterprise search systems: Add class to search index
Category-based routing and distribution
Post-process
● Classification for metadata correction
● Classification of legacy content for data improvement
Smart Classifier application scenarios
Data migration
● Organise content before, during or after migration
Client support
● Category-based prioritisation and routing of client issues shorten response times
eDiscovery
● Quickly gather and prepare documents
Mailroom
● Automatically select the most suitable processing workflow
E-mail management
● Additional metadata facilitates and accelerates routing
Smart Classifier benefits: For all enterprises
Create access to information
Efficient information management
Aid compliance & risk mitigation
Cost efficiency
Smart Classifier benefits: For ISVs
Create better customer applications
Quick ROI
Smart Classifier benefits: For BPOs
Accelerate business processes
Easier cost calculation
ABBYY Compreno: Platform for document understanding
Core uses of Compreno technology
● Classify unstructured documents
● Identify and extract entities, facts and events from texts
ABBYY InfoExtractor SDK
● Information extraction module for processing natural language texts
● Natively processes unstructured documents and accesses the embedded textual information
● Identifies different facts, entities and the relationship between them
● Automatically extracts critical data
● Combines related data into facts
How InfoExtractor works I: From text to semantics
Lexical analysis: Convert a sequence of characters into a sequence of words
Morphological analysis: Analyse the structure of words and parts of words
Syntactic parsing: Determine the structure of the input text; understand how concepts relate to one another within one or more sentences
Semantic parsing: Contextual analysis = obtaining and representing the meaning of a sentence
Universal Semantic Hierarchy: Language-independent hierarchy of concepts to reflect the meaning and relations of words and sentences
Derive the meaning of a sentence by understanding the context and the “speaker's” intent. An ontology is a formal representation of concepts and the relationships between those concepts.
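The lexical analysis step (turning a sequence of characters into a sequence of words) is the simplest stage to show in code. The sketch below is a deliberately naive regular-expression tokenizer, given only as an illustration; it is not how InfoExtractor tokenizes text, and it does not attempt the morphological, syntactic or semantic stages.

```python
import re

# A word is a run of letters, digits or underscores, optionally continued by an
# apostrophe part, so that "speaker's" stays one token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’]\w+)*")


def tokenize(text):
    """Split raw text into word tokens, dropping whitespace and punctuation."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("Some people work with PDF documents but not all employees do.")
print(tokens)
```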
How InfoExtractor works II: Identify relationships between words
Connect entities with other entities and facts, even if the words that define them are replaced with pronouns or omitted in the text
Example: The company has denied reports it is preparing to default on its loans if it cannot reach agreement on its bailout terms with international creditors
Get the complete story
How InfoExtractor works III: Define the contextual meaning of a word
Gather only relevant facts
How InfoExtractor works IV: Detect omitted words
Example: Some people work with PDF documents but not all employees do.
Don’t miss any valuable facts
InfoExtractor features and values
Natural Language Processing
● Understand the meaning of words and relations between them
Extraction of entities and events
● Extract the facts and story lines embedded in unstructured information
● Persons, organisations, dates
● Deals, purchases, employment details
Identify relationships between entities and events
● Contracting parties, subject of the contract, financial figures
InfoExtractor features and values
Basic and custom ontologies
● Basic ontologies including widely used words
● Custom ontologies for industry solutions
Customised entities for specific cases
● Custom ontology dictionaries to extract complicated examples of entities (e.g. Asian names or companies)
Input document formats and languages
● Work with text regardless of source
● English, Russian, German
● OCR embedded for image processing
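The idea of a custom dictionary that helps to find hard-to-recognise entity names can be pictured as a lookup over known surface forms. The sketch below is a heavily simplified stand-in: InfoExtractor works on semantic structures rather than string matching, and the dictionary entries, the sample sentence and the function name are assumptions chosen for illustration.

```python
import re

# Hypothetical custom dictionary: surface form -> entity type.
CUSTOM_DICTIONARY = {
    "ABBYY Europe GmbH": "Organisation",
    "ABBYY": "Organisation",
    "Northwestern University": "Organisation",
}


def find_entities(text, dictionary):
    """Return (start, surface form, entity type) tuples, sorted by position."""
    found = []  # entries are (start, end, name, type)
    # Longest entries first, so a full company name beats a shorter entry inside it.
    for name in sorted(dictionary, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            overlaps = any(match.start() < end and start < match.end()
                           for start, end, _, _ in found)
            if not overlaps:
                found.append((match.start(), match.end(), name, dictionary[name]))
    return [(start, name, kind) for start, _, name, kind in sorted(found)]


text = "ABBYY Europe GmbH cites figures from Northwestern University."
entities = find_entities(text, CUSTOM_DICTIONARY)
print(entities)
```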
IT Integration of InfoExtractor
The information extraction process
InfoExtractor application scenarios
Contract Management
● Use Case: Mass contract ingestion
● Document Type: Contract
● Customer: ISVs, Service Providers
● Benefit: Extend service offering & increase revenues
Customer On-Boarding
● Use Case: Capture & upload customer information at point of entry into the system
● Document Type: Statutory documents, contracts
● Customer: Banks, insurance companies
● Benefit: Accelerate document processing
InfoExtractor application scenarios
Applicant Tracking
● Use Case: Tag and upload CVs to improve search
● Document Type: CV
● Customer: HR departments
● Benefit: Minimise resources required to process all the necessary CVs
Credit Risk Mitigation
● Use Case: Decide on providing loans; check various sources of information on potential loan customers
● Document Type: Contracts, statutory documents, court decisions
● Customer: Banks
● Benefit: Accelerate document processing
InfoExtractor benefits: Get decision-critical information with less cost and effort
Intelligence and insights
Aid predictive decision making
Uncover hidden risks
Cost efficiency
Use analytics to create new value out of existing and new data
Get the big picture by connecting entities, facts and events across documents
Accelerate and automate content upload and analysis to optimise manual processes
Take critical decisions faster based on relevant information
Summary
Good classification and information extraction let organisations solve tasks they are not capable of solving at the moment
Smart Classifier and InfoExtractor make document classification and information extraction simple
Licensing
● Smart Classifier and InfoExtractor are available for testing via a time- and volume-limited trial license
● Different license models
  ● Perpetual with software maintenance
  ● Subscription (yearly)
  ● OEM licensing
● Standard license model based on renewable peak volume
● Backend can be scaled up
<Name>
<Name>@abbyy.com
ABBYY Europe GmbH
Elsenheimerstraße 49
80687 Munich
Germany
www.abbyy.com
Thank You