Introducing Compreno - Natural Language Processing Technology


ABBYY Compreno

Driving Impact from Unstructured Information Analytics

<NAME><DATE>

ABBYY Worldwide


Global

16 offices with more than 1,250 employees in Europe, the USA, Asia, Australia and Russia

Innovative

27% of revenue invested in R&D, more than 400 developers and scientists

Reliable

Connected

Trusted partner to over 1000 companies in more than 150 countries around the world

Successful

> 40 million software users process more than 9 billion pages per year with ABBYY products

Enabling

Recognise, capture, (translate), analyse – we transform information into action

Strong and independent core technology that evolves with the needs of the digital revolution

Digital Universe

2.5 exabytes of data generated every day = 2.5 million terabytes = 2.5 × 10^18 bytes (source: Northwestern University, 2016)

Majority (ca. 80%) is unstructured
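The unit conversion above can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal (SI) units and treats the "ca. 80%" share as exactly 80%, which the slide does not claim.

```python
# Decimal (SI) storage units; an assumption, the slide does not say SI or binary.
TERABYTE = 10**12
EXABYTE = 10**18

# 2.5 exabytes per day, written with integers to avoid float rounding.
daily_bytes = 25 * EXABYTE // 10

# The same volume expressed in terabytes.
daily_terabytes = daily_bytes // TERABYTE

# Share of the daily volume that is unstructured, taking "ca. 80%" as exactly 80%.
unstructured_bytes = daily_bytes * 80 // 100

print(f"{daily_terabytes:,} TB per day")
print(f"{unstructured_bytes // EXABYTE} EB of it unstructured")
```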

● 1.4 × 10^14 Word pages
● 3.5 × 10^13 PPT slides
● 2 × 10^13 PDF pages (image & text)
● 2 × 10^14 emails
● 4 × 10^13 scanned pages
● 3 × 10^13 images (.tiff)
● 1.4 × 10^16 .txt files
(source for average file sizes: netdocuments.com, 2016)

Reports, brochures, datasheets, presentations, research documents, service documents, pricelists, process descriptions, project descriptions, product feature specifications, customer communication, accident/security reports, contracts, email, web texts, articles in magazines, complete intranets …

Unstructured Content I

What do unstructured documents have in common?
● They are composed in natural language

What is the problem with natural language?
● Complex to analyse and summarise
● Has a structure, but one that is not standardised (different people use different terms, expressions and syntax to talk about the same thing)
● Content is unexpected and cannot be processed with rules
● Limited or no metadata

Unstructured Content II

● The computer does not know what the document is about and there is no source to get this information from

● Information is “locked” within documents
● Information that may be valuable, confidential, business-critical or defensibly deletable, but is difficult to find and manage

There is no business value in content that can’t be analysed or found

Natural language requires dedicated processing technology


ABBYY Compreno

Confidential

What is it?
Natural Language Processing (NLP) technology

What does it do?
Advanced automated text analysis
● Gathers information about a document from the document
● Understands meaning of words within context
● Reveals relationships between words
● Builds stories across documents
● Extracts insights and intelligence from unstructured text

How Compreno works: Key Components

Semantics: Semantic analysis is used to interpret syntactic structures in terms of universal, language-independent concepts and their relations.

Syntax: Identifies formal relations among words in a sentence or across several sentences. The system analyzes a text and builds a tree of syntactic relations.

Statistics: Data gleaned from parallel and monolingual corpora are used for training the analysis algorithms and for verifying and expanding the formal descriptions available to the system.


ABBYY Compreno: Platform for document understanding

Core uses of Compreno technology

● Classify unstructured documents

● Identify and extract entities, facts and events from texts


What is classification? To go from …

What is classification? … to

Mammals, Birds, Reptiles, Fish

Categorisation based on particular shared features

How document classification works: Three main steps

Training
Set up model, define categories, select/collect training documents, train model, choose best algorithm

Test and tune
Analyse test results, eliminate mistakes, adjust training set, retrain model

Classification
Deploy model to production, classify documents
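The three steps above can be illustrated with a toy text classifier. This is a minimal sketch, not Smart Classifier's actual method: it uses a hand-written multinomial Naive Bayes model as a stand-in, and the function names and sample documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labelled_docs):
    """Step 1, training: count words and documents per category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        word_counts[category].update(tokenize(text))
        doc_counts[category] += 1
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Step 3, classification: pick the category with the highest log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def evaluate(model, control_docs):
    """Step 2, test and tune: share of control documents classified correctly."""
    correct = sum(classify(model, text) == category for text, category in control_docs)
    return correct / len(control_docs)


# Invented sample documents, for illustration only.
training_set = [
    ("invoice total amount due payment", "invoice"),
    ("payment invoice number amount", "invoice"),
    ("contract party agreement termination clause", "contract"),
    ("agreement between parties contract term", "contract"),
]
control_set = [
    ("amount due on this invoice", "invoice"),
    ("termination of the agreement", "contract"),
]

model = train(training_set)
accuracy = evaluate(model, control_set)
predicted = classify(model, "please settle the invoice amount")
print(accuracy, predicted)
```

In a real project the control set would be much larger, and a poor accuracy would lead back to step 1 with an adjusted training set.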

Document classification – Why?


Essential step in information management

Enable advanced analysis and decision-making

Generate business value

Why is classification not as easy as it seems?
Building up a reliable classification workflow is difficult …

Big Content

Technical challenges
- Big training sets
- Complex algorithms
- Difficult to integrate

Business challenges
- Traditional classification methods don’t do the job
- High investments for building and maintaining the rule sets and classification schemes required (classification expert knowledge)

New, dedicated processing methods required

Unstructured documents

ABBYY Smart Classifier

● Text classification module for organising unstructured documents
● Assigns unseen documents to predefined categories based on statistical, morphological and semantic analysis
● Uses supervised machine learning to produce a classification model from sample inputs
● Classification creates metadata derived from the document context

Next generation document classification

Unstructured information processing
● Unlock information
● Make content searchable, accessible and retrievable

Automated classification
● High speed
● Constant quality
● No manual work

Semantic-based classification
● Deep text analysis techniques employed for even more accurate classification

Smart Classifier features and values


Machine learning
● System learns automatically based on the training documents
● No particular knowledge required to set up classification
● No specification of rules necessary
● Small training sets

Automatic algorithm optimisation
● Selection of the best-performing algorithm for each document set


Simple UI
● No specific knowledge required to create a model, train the system and launch a classification workflow

Input document formats and languages
● Process content regardless of original format
● OCR for processing of images
● 39 classification languages

IT Integration of Smart Classifier: Leverage existing systems and infrastructure

Smart Classifier Workflows


Create and deploy classification model

01 | Category definition and selection of sample documents

02 | Setup of classification model

03 | Model training

04 | Model testing, quality evaluation and tuning

05 | Deployment to production

Document classification workflow

01 | Category definition and selection of sample documents

● Category = a group of documents that have particular shared features
● Category definition is a management decision; no special IT skills required
● Content and process experts select representative documents for each category
● Minimum: 10 documents per category
● For reliable statistics: approx. 100 documents per category
● Representative sample of documents
  ● Documents must be typical of their category: the more representative of the respective category a document is, the better the model will perform (garbage in, garbage out)
  ● The proportion of documents assigned to each category should be the same as in the collection of documents to be classified
● Smart Classifier accepts many formats (plain text, Office, HTML, XML, PDF; image formats are submitted to OCR)
● Folder structure: each (sub-)category = dedicated (sub-)folder
● Create training set and control set and save them as ZIP files
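The folder and ZIP preparation described above could be scripted roughly as follows. This is a sketch under assumptions: the function name, the 80/20 split and the archive layout (category folder as the path inside the ZIP) are illustrative choices, not requirements documented for Smart Classifier.

```python
import random
import zipfile
from pathlib import Path


def split_and_zip(source_dir, training_zip, control_zip, control_share=0.2, seed=0):
    """Split each category folder into training and control files and zip both sets.

    Splitting per category keeps the category proportions similar in both sets.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    source = Path(source_dir)
    with zipfile.ZipFile(training_zip, "w") as train_zf, \
            zipfile.ZipFile(control_zip, "w") as control_zf:
        for category_dir in sorted(p for p in source.iterdir() if p.is_dir()):
            files = sorted(f for f in category_dir.rglob("*") if f.is_file())
            rng.shuffle(files)
            # Hold out at least one file per category when there is more than one.
            n_control = max(1, round(len(files) * control_share)) if len(files) > 1 else 0
            for index, file_path in enumerate(files):
                target = control_zf if index < n_control else train_zf
                # Keep the (sub-)category folders as the path inside the archive.
                target.write(file_path, file_path.relative_to(source).as_posix())
```

A call such as `split_and_zip("samples", "training.zip", "control.zip")` would expect one sub-folder per category under the hypothetical `samples` directory.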


02 | Setup of classification model

● The classification model defines how and by which categories document classification will be performed
● Model creation via Model Editor web UI or REST API (code samples included in documentation)
● Set parameters
  ● Document language (39 languages supported)
  ● Category assignment (which category will be assigned to the document if more than one was returned as a candidate category)
  ● Quality criteria (trade-off between precision and recall)
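The category assignment and quality parameters above can be pictured as a simple decision rule. The sketch below is hypothetical: `assign_category` and `min_confidence` are invented names, not Smart Classifier settings. It only shows the general idea that a stricter threshold usually trades recall for precision.

```python
def assign_category(candidates, min_confidence=0.5):
    """Return the most confident candidate category, or None if none is confident enough.

    candidates maps category name -> confidence in the range 0..1.
    """
    if not candidates:
        return None
    category, confidence = max(candidates.items(), key=lambda item: item[1])
    return category if confidence >= min_confidence else None


# Invented candidate scores for one document.
candidates = {"invoice": 0.7, "contract": 0.6}

lenient = assign_category(candidates, min_confidence=0.5)
strict = assign_category(candidates, min_confidence=0.8)
print(lenient, strict)
```

With the lenient threshold the document is assigned to a category; with the strict one it is left unassigned, so fewer documents are classified but those that are tend to be classified correctly.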


02 | Setup of classification model: Model Editor web interface

03 | Model training

● Load training documents
● Train classification model
● Machine learning
  ● The system automatically identifies and uses the most relevant features from the training documents for creating the classification model

04 | Model testing, quality evaluation and tuning

● Load and test control set to determine whether the training process was successful
● Classification results in the control set must meet expectations before the model can be deployed

● Model Editor provides instant visibility of each document within a classification project
● Source text and keywords picked by the algorithms can be analysed and checked
● Terms that should be ignored during classification can be added to a stop word list

● Analyse: F-measure, precision, recall
● Debug: Confidence level, selected keywords
● Adjust: Inclusiveness, stop words, documents in classes (re-assign category)
● Upload further training/control documents
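The quality measures named above have standard definitions that are easy to compute for a control set. The following is a minimal sketch; it assumes "F-measure" means the balanced F1 score, and the labels are invented sample data.

```python
def precision_recall_f1(true_labels, predicted_labels, category):
    """Compute precision, recall and F1 for one category."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(t == category and p == category for t, p in pairs)
    false_pos = sum(t != category and p == category for t, p in pairs)
    false_neg = sum(t == category and p != category for t, p in pairs)
    # Precision: how many documents assigned to the category really belong to it.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: how many documents of the category were actually found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Invented control-set labels, for illustration only.
true_labels = ["invoice", "invoice", "contract", "contract", "invoice"]
predicted_labels = ["invoice", "contract", "contract", "invoice", "invoice"]

precision, recall, f1 = precision_recall_f1(true_labels, predicted_labels, "invoice")
print(round(precision, 3), round(recall, 3), round(f1, 3))
```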


05 | Deployment to production

● When the model is deployed it becomes available via the Compreno REST API

● If you make changes to the model, it needs to be retrained for changes to become effective


Once the system is set up and a classification model is published for operation, incoming classification tasks will be accepted:
01 | A new document classification task is created
02 | The document is converted into an internal format
03 | The document is classified
04 | The document classification results are saved
05 | The task is completed
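The five task stages listed above form a simple linear state machine. The sketch below models that sequence only; the class and state names are invented for illustration and do not reflect the product's internal API.

```python
from enum import Enum


class TaskState(Enum):
    """The five stages of a classification task, in order."""
    CREATED = 1
    CONVERTED = 2
    CLASSIFIED = 3
    RESULTS_SAVED = 4
    COMPLETED = 5


class ClassificationTask:
    """A task that can only move forward, one stage at a time."""

    def __init__(self, document_name):
        self.document_name = document_name
        self.state = TaskState.CREATED
        self.history = [self.state]

    def advance(self):
        """Move to the next stage; a completed task cannot advance further."""
        if self.state is TaskState.COMPLETED:
            raise ValueError("task is already completed")
        self.state = TaskState(self.state.value + 1)
        self.history.append(self.state)
        return self.state


task = ClassificationTask("report.pdf")
while task.state is not TaskState.COMPLETED:
    task.advance()
print([state.name for state in task.history])
```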



● Classification results in Model Editor

Smart Classifier application scenarios

Enterprise content management and its subdomains
Archiving, records management (Information Governance), document management, enterprise search

● Classification of incoming and stored documents
● Definition of category-based access rights and retention policies
● Search enhancement

Smart Classifier application scenarios: Information lifecycle

Create, Capture, Manage, Store, Archive, Dispose

Classification of incoming documents

Add documents to the system that have a value, i.e. are enhanced with metadata

Classification for aid in risk mitigation
Category-based document access rights

Category-based disposal policy

Classification for aid in compliance
Category-based retention policy

Classification to improve enterprise search systems

Add class to search index

Category-based routing and distribution

Post-process
• Classification for metadata correction
• Classification of legacy content for data improvement

Smart Classifier application scenarios

Data migration
● Organise content before, during or after migration

Client support
● Category-based prioritisation and routing of client issues shorten response times

eDiscovery
● Quickly gather and prepare documents

Mailroom
● Automatically select the most suitable processing workflow

E-mail management
● Additional metadata facilitates and accelerates routing

Smart Classifier benefits: For all enterprises

Create access to information

Efficient information management

Aid compliance & risk mitigation

Cost efficiency

Smart Classifier benefits: For ISVs

Create better customer applications

Quick ROI

Smart Classifier benefits: For BPOs

Accelerate business processes

Easier cost calculation

ABBYY Compreno: Platform for document understanding

Core uses of Compreno technology

● Classify unstructured documents

● Identify and extract entities, facts and events from texts


ABBYY InfoExtractor SDK

● Information extraction module for processing natural language texts
● Natively processes unstructured documents and accesses the embedded textual information
● Identifies different facts, entities and the relationships between them
● Automatically extracts critical data
● Combines related data into facts

How InfoExtractor works I: From text to semantics

Lexical analysis: Convert sequence of characters into sequence of words

Morphological analysis: Analyse the structure of words and parts of words

Syntactic parsing: Determine the structure of the input text; understand how concepts relate to one another within one or more sentences

Semantic parsing: Contextual analysis = obtaining and representing the meaning of a sentence

Universal Semantic Hierarchy: Language-independent hierarchy of concepts to reflect the meaning and relations of words and sentences

Derive the meaning of a sentence by understanding the context and the “speaker's” intent.

An ontology is a formal representation of concepts and the relationships between those concepts.
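The first two stages named above, lexical and morphological analysis, can be illustrated with a toy example. This is a deliberately crude sketch for English only: the regular expression and the suffix list are simplifying assumptions and are unrelated to how Compreno performs these analyses.

```python
import re

# A tiny, hypothetical suffix list; real morphological analysis uses full dictionaries.
SUFFIXES = ("ing", "ed", "s")


def lexical_analysis(text):
    """Convert a sequence of characters into a sequence of word and number tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+", text)


def morphological_analysis(word):
    """Split a word into a (stem, suffix) pair using naive suffix stripping."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Only strip the suffix if at least three characters of stem remain.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)], suffix
    return lower, ""


tokens = lexical_analysis("The system is linking words and building trees.")
analysed = [morphological_analysis(token) for token in tokens]
print(tokens)
print(analysed)
```

Syntactic and semantic parsing build on this output and are far beyond what a short example can show.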

How InfoExtractor works II: Identify relationships between words

Connect entities with other entities and facts, even if the words that define them are replaced with pronouns or omitted in the text

Example: The company has denied reports it is preparing to default on its loans if it cannot reach agreement on its bailout terms with international creditors

Get the complete story

How InfoExtractor works III: Define the contextual meaning of a word

Gather only relevant facts

How InfoExtractor works IV: Detect omitted words

Example: Some people work with PDF documents but not all employees do.

Don’t miss any valuable facts

InfoExtractor features and values


Natural Language Processing
● Understand the meaning of words and relations between them

Extraction of entities and events
● Extract the facts and story lines embedded in unstructured information
● Persons, organisations, dates
● Deals, purchases, employment details

Identify relationships between entities and events
● Contracting parties, subject of the contract, financial figures
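One way to picture the extraction output described above is as entities linked into facts through named roles. The data model below is a hypothetical sketch, not InfoExtractor's output format; all class names, role names and sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str  # e.g. "Person", "Organisation", "Date"
    text: str


@dataclass
class Fact:
    kind: str  # e.g. "Contract", "Purchase", "Employment"
    document: str  # where the fact was found
    roles: dict = field(default_factory=dict)  # role name -> Entity


def facts_involving(facts, entity):
    """Return every fact in which the given entity plays some role."""
    return [fact for fact in facts if entity in fact.roles.values()]


# Made-up sample data, for illustration only.
supplier = Entity("Organisation", "Supplier Ltd")
customer = Entity("Organisation", "Customer GmbH")
amount = Entity("Money", "EUR 10,000")

facts = [
    Fact("Contract", "contract.pdf",
         {"party_a": supplier, "party_b": customer, "value": amount}),
    Fact("Purchase", "news.txt", {"buyer": customer, "seller": supplier}),
    Fact("Employment", "cv.docx",
         {"employer": customer, "employee": Entity("Person", "Jane Doe")}),
]

supplier_facts = facts_involving(facts, supplier)
print([fact.kind for fact in supplier_facts])
```

Looking up all facts for one entity across several documents is the kind of cross-document connection the slides refer to as the "story".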

InfoExtractor features and values

Basic and custom ontologies
● Basic ontologies including widely used words
● Custom ontologies for industry solutions

Customized entities for specific cases
● Custom ontology dictionaries to extract complicated examples of entities (e.g. Asian names or companies)

Input document formats and languages
● Work with text regardless of source
● English, Russian, German
● OCR embedded for image processing

IT Integration of InfoExtractor

The information extraction process


InfoExtractor application scenarios

Contract Management
● Use Case: Mass contract ingestion
● Document Type: Contract
● Customer: ISVs, Service Providers
● Benefit: Extend service offering & increase revenues

Customer On-Boarding
● Use Case: Capture & upload customer information at point of entry into the system
● Document Type: Statutory documents, contracts
● Customer: Banks, insurance companies
● Benefit: Accelerate document processing

InfoExtractor application scenarios

Applicant Tracking
● Use Case: Tag and upload CVs to improve search
● Document Type: CV
● Customer: HR departments
● Benefit: Minimise resources required to process all the necessary CVs

Credit Risk Mitigation
● Use Case: Decide on providing loans; check various sources of information on potential loan customers
● Document Type: Contracts, statutory documents, court decisions
● Customer: Banks
● Benefit: Accelerate document processing

InfoExtractor benefits: Get decision-critical information at lower cost and with less effort

Intelligence and insights

Aid predictive decision making

Uncover hidden risks

Cost efficiency

Use analytics to create new value out of existing and new data

Get the big picture by connecting entities, facts and events across documents

Accelerate and automate content upload and analysis to optimise manual processes

Take critical decisions faster based on relevant information

Summary

Good classification and information extraction let organisations solve tasks they are not capable of solving at the moment

Smart Classifier and InfoExtractor make document classification and information extraction simple

Licensing

● Smart Classifier and InfoExtractor are available for testing via a time- and volume-limited trial license

● Different license models
  ● Perpetual with software maintenance
  ● Subscription (yearly)
  ● OEM licensing

● Standard license model based on renewable peak volume

● Backend can be scaled up


<Name><Name>@abbyy.com

ABBYY Europe GmbH
Elsenheimerstraße 49
80687 Munich
Germany

www.abbyy.com


Thank You

Date post:	08-Apr-2017
Category:	Technology
Upload:	abbyy