ABBYY FlexiCapture & Designing a High-Performance System at #ABBYYSummit16

Post on 08-Feb-2017

ABBYY Technology Summit 2016

ABBYY NAHQ, 2016

Pierre van der Westhuizen

© ABBYY Confidential

Introduction: Designing High Performance Systems

• Upscaling FlexiCapture, identifying bottlenecks, and optimizing system performance

• Performance Metrics and testing

• FlexiCapture Performance at a glance

– Scaling

– Up to 3 million pages per day

– High Fault Tolerance and Availability

Agenda: Designing High Performance Systems

• Introduction

• *Architecture of FlexiCapture

• Component Interaction Walkthrough

• Defining Performance Metrics

• Scaling of Systems (Demo, Medium, Large)

• *Optimizing: Processing Stations, Scanning Stations and Workflow

• Optimal Values of and Limitations on System Performance

• System Monitoring and Bottleneck Detection

• Performance Testing

• *Improving your current system

Architecture

Architecture – Server Side

• Application Level

– Application Server

– Licensing Server

• Processing Level

– Processing Server

• Data Storage

– Database

– File Storage

Architecture – Client Side

• User Stations

– Scanning

– Verification

• Processing Stations

• Administration/Monitoring Web Console

• Project Setup Station

Component Interaction

Performance Metrics

Defining Performance Metrics

We measure performance in volumes processed per period of time.

Define target performance using performance metrics:

• The required processing time

• Processing volumes

Parameters that shape workload

• Average batch size in pages

• Image color mode: color, grayscale, black-and-white

• Pages per day (i.e. 24 hours), average/peak

• Pages per hour, average/peak

• Average document size in pages

• Number of scanning operators

• Number of verification operators

• Document storage time

Scaling

Demo System

Medium System

Medium System: Application Server

1. Fast network

2. Fast connection to FileStorage and Database

3. Fast CPU

4. RAM

Medium System: Processing, Licensing Servers

For redundancy, see the FlexiCapture System Administrator Guide.

Medium System: Database Server

• More RAM

• Fast HDD

• Avoid Mirroring

• Separate Data and Logs

• Index Updates

Medium System: File Storage

Read-write and capacity requirements depend on:

• Average and peak number of pages processed per day. Speed required for 10,000 pages/hr: 10,000 / 3,600 ≈ 2.8 pages/sec, so 2.8 x 3 MB = 8.4 MB/sec

• Amount of time that documents are stored, e.g. 16 x 100,000 grayscale images x 3 MB (average file size for a grayscale image) = 4.8 TB of data

NOTE: We strongly recommend something like RAID 10
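
The two File Storage estimates on this slide can be reproduced with a short calculation. This is a minimal sketch: the function names are illustrative, the 3 MB average page size comes from the slide, and the multiplier 16 is used as given because the slide does not state its unit.

```python
def required_throughput_mb_per_sec(pages_per_hour: float, page_size_mb: float = 3.0) -> float:
    """Sustained write rate needed to keep up with the given page rate."""
    pages_per_sec = pages_per_hour / 3600
    return pages_per_sec * page_size_mb


def required_capacity_tb(multiplier: float, pages: float, page_size_mb: float = 3.0) -> float:
    """Storage for multiplier x pages images, using 1 TB = 1,000,000 MB."""
    return multiplier * pages * page_size_mb / 1_000_000


print(round(required_throughput_mb_per_sec(10_000), 1))  # 8.3
print(round(required_capacity_tb(16, 100_000), 1))       # 4.8
```

The throughput prints 8.3 rather than the slide's 8.4 because the slide rounds the page rate up to 2.8 pages/sec before multiplying.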

Large System

Optimizing Processing Stations

Processing Stations

• Tune each station

• Add more stations

Processing Station: Hardware

• 1 process per core

• 16 cores max

• 1 GB RAM per core

Processing speed greatly depends on the CPU speed and the hard disk read-write speed.

Processing Station: TEMP Folder

Scenario: 100-page batches on an 8-core station

• 100 pages x 3 MB = 300 MB per batch

• 8 cores means 8 simultaneous executive processes

• TEMP folder space required is 300 MB x 8 = 2.4 GB

• Allocate 2 GB per core + 2.4 GB for TEMP = 18.4 GB RAM
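
The TEMP folder scenario can be sketched as a small calculation. The function names are illustrative, and the sketch assumes, as the slide's total implies, that the TEMP folder is counted against RAM and that 1 GB = 1,000 MB.

```python
def temp_space_gb(pages_per_batch: int, cores: int, page_size_mb: float = 3.0) -> float:
    """TEMP space when every core works on one batch at the same time."""
    return pages_per_batch * page_size_mb * cores / 1000


def station_ram_gb(pages_per_batch: int, cores: int, ram_per_core_gb: float = 2.0) -> float:
    """RAM for the processes plus the TEMP folder (assumed to be held in RAM)."""
    return cores * ram_per_core_gb + temp_space_gb(pages_per_batch, cores)


print(round(temp_space_gb(100, 8), 1))   # 2.4
print(round(station_ram_gb(100, 8), 1))  # 18.4
```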

Calculate Number of Processing Stations

Estimate the required number of processing cores

Measure how long it takes one core to process one batch, using an 8-core Processing Station.

Process:

1. Create 24 copies of a typical batch

2. Put all batches in the FlexiCapture hotfolder

3. Start the timer when the first import task is created

4. Stop the timer after the last result has been exported to the backend

Result: 15 minutes elapsed.

• Each core has processed 3 batches

• Time to process 1 batch is about 5 minutes

• If a batch has 69 pages, it takes about 4.35 seconds to process 1 page (300 seconds / 69 pages)
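
The measurement above reduces to a few divisions. A minimal sketch, with an illustrative function name, that assumes the batches are spread evenly across the cores:

```python
def seconds_per_page(elapsed_minutes: float, batches: int, cores: int, pages_per_batch: int) -> float:
    """Per-core processing time for one page, derived from a timed test run."""
    batches_per_core = batches / cores
    seconds_per_batch = elapsed_minutes * 60 / batches_per_core
    return seconds_per_batch / pages_per_batch


print(round(seconds_per_page(15, 24, 8, 69), 2))  # 4.35
```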

Estimate the required number of processing cores Cont’d

Estimate desired number of cores

Assume you need to process P pages in T time. We already know from the above that 1 core needs t time to process 1 page. Hence, you need N = (P x t ) / T cores.

Example: process 200,000 pages in 8 hours (28,800 seconds). One core takes 4.35 seconds to process 1 page, so N = 200,000 x 4.35 / 28,800 ≈ 30.2, rounded up to 31 cores. Four Processing Stations with 8 cores each (32 cores in total) will be sufficient.
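
The formula N = (P x t) / T and the station count can be sketched as follows. The function names are illustrative, and rounding up to whole cores and whole stations is an assumption consistent with the slide's result.

```python
import math


def required_cores(pages: int, seconds_per_page: float, window_seconds: float) -> int:
    """N = (P x t) / T, rounded up to a whole number of cores."""
    return math.ceil(pages * seconds_per_page / window_seconds)


def required_stations(cores: int, cores_per_station: int = 8) -> int:
    """Number of Processing Stations needed to provide the given cores."""
    return math.ceil(cores / cores_per_station)


cores = required_cores(200_000, 4.35, 8 * 3600)
print(cores, required_stations(cores))  # 31 4
```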

Processing Cores – Limiting Factors

• The total load on the infrastructure that may result in bottlenecks

– Server Hardware

– Network

– Shared Resources (Database, External Services)

• The number of processing cores that can be served by the Processing Server

– Max 120 cores

Monitor Free Processing Cores on Processing Server

Optimizing: Scanning Stations

Scanning Stations

• Performance Limits

– Scanner Speed

– Data Transfer Bandwidth

• Separate Network Interface for Scanning

• Set up scan settings

– Color Mode

– Remove Blanks

• Schedule Image Uploads

Optimizing: Workflow

Workflow

• Avoid too many stages

• The slowest stage limits the performance

• Do not produce tasks that are too small when parallelizing processing at a stage
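
The point that the slowest stage limits performance can be illustrated with a small sketch. The stage names and pages-per-hour figures below are hypothetical and are not measurements from the presentation.

```python
def pipeline_throughput(stage_rates: dict) -> tuple:
    """Return (bottleneck stage, pages/hour) for a sequential workflow."""
    stage = min(stage_rates, key=stage_rates.get)
    return stage, stage_rates[stage]


# Hypothetical sustained rates per stage, in pages per hour.
rates = {"import": 12_000, "recognition": 7_000, "verification": 9_000, "export": 15_000}
print(pipeline_throughput(rates))  # ('recognition', 7000)
```

In this example, adding capacity to any stage other than recognition would not raise the overall rate.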

Optimal Values and Limits

Optimal Values of and Limitations on the System Performance 1

Factor | Optimal values & limitations
System performance in pages per 24 hours | Able to process:
Demo | up to 20,000 black-and-white or up to 1,000 color pages per 24 hours
Medium | up to 1 million black-and-white or up to 300,000 color pages per 24 hours, using a farm of regular computers
Large | up to 3 million black-and-white or up to 1 million color pages per 24 hours
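
As a rough aid, the limits above can be turned into a lookup that picks the smallest configuration able to handle a target daily volume. This is a sketch only: the function name and the "bw"/"color" keys are illustrative, and the table gives no separate figure for grayscale.

```python
# Upper limits in pages per 24 hours, taken from the table above.
LIMITS = {
    "Demo": {"bw": 20_000, "color": 1_000},
    "Medium": {"bw": 1_000_000, "color": 300_000},
    "Large": {"bw": 3_000_000, "color": 1_000_000},
}


def smallest_configuration(pages_per_day: int, mode: str):
    """Return the first configuration whose limit covers the volume, else None."""
    for name, limits in LIMITS.items():  # dicts keep insertion order: Demo, Medium, Large
        if pages_per_day <= limits[mode]:
            return name
    return None


print(smallest_configuration(200_000, "bw"))     # Medium
print(smallest_configuration(500_000, "color"))  # Large
```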

Optimal Values of and Limitations on the System Performance 2

Factor | Optimal values & limitations
Number of scanning operators | FlexiCapture is able to host 1,000 scanning operators.
Number of verification operators | FlexiCapture is able to host 300 verification operators.
Number of Processing Stations | We used up to 120 cores in total for all Processing Stations.
Number of cores per Processing Station | Regular disk drive: up to 8 cores. Fast disk drive: up to 16 cores. RAM drive: up to 32 cores.
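
The core limits in this table can be checked against a planned farm with a few lines of code. A minimal sketch: the function name, the disk-type keys, and the warning wording are all illustrative.

```python
MAX_TOTAL_CORES = 120  # total across all Processing Stations, per the table
MAX_CORES_PER_STATION = {"regular": 8, "fast": 16, "ram": 32}  # by TEMP drive type


def check_farm(stations: list) -> list:
    """Return warnings for a list of (cores, disk_type) Processing Stations."""
    warnings = []
    for index, (cores, disk_type) in enumerate(stations, start=1):
        limit = MAX_CORES_PER_STATION[disk_type]
        if cores > limit:
            warnings.append(f"station {index}: {cores} cores exceeds {limit} for a {disk_type} drive")
    total = sum(cores for cores, _ in stations)
    if total > MAX_TOTAL_CORES:
        warnings.append(f"farm: {total} cores exceeds {MAX_TOTAL_CORES} in total")
    return warnings


print(check_farm([(8, "regular")] * 4))  # []
print(check_farm([(16, "regular"), (32, "ram")]))
```

The second call reports one warning, because 16 cores exceeds the limit of 8 for a regular disk drive.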

Optimal Values of and Limitations on the System Performance 3

Factor | Optimal values & limitations
Number of pages in a batch | Optimal value is from 10 to 1,000 pages in a batch.
Number of pages in a document | Optimal value is up to 100 pages in a document.
Number of pages, documents, and batches in the system | This depends heavily on the hardware used. For a Large configuration, up to 100,000 batches, or 1 million documents, or 10 million pages is normal.

Optimal Values of and Limitations on the System Performance 4

Factor | Optimal values & limitations
Data storage time | Typically, pages, documents, batches, and event log records are stored in the system for up to 2 weeks. Statistics for reporting can be stored for years with no impact on performance.

Performance Testing

• Single Entry Point Project

– Import from Scanning Station

– Pre-processing

– Recognition

– Export

– Processed

System Monitoring and Bottleneck Detection

• Document processing monitoring via the Administration and Monitoring Console

• Hardware monitoring for each FlexiCapture server component using various Windows Performance Monitor counters

– Memory

– CPU

– Hard Disk

– Network

– IIS

– SQL Server

Setting up Performance Counters

• Monitor FlexiCapture state and search for bottlenecks

– Performance Monitor utility

• Recorded by Processing Server

• mmc /32 perfmon.msc

• Enable Counters

Improving Your Current System

• Separate your Servers

– App Server

– Processing Server/License Server

– Processing Station

– Database Server

Improving your current System Cont’d

• Database

– More RAM is better (at least 4 GB)

– Fast Drives

– Data and Logs on Separate Drives

– Autogrowth – 100MB Increments

– Simple Recovery Model

– Maintenance Plans: Backups (Truncate Transaction Log), Indexes

Improving your current System Cont’d

• Application Server

– IIS Logs

– Caching

– Fast Hard Drive

– App Server Recycling Pool

– Number of Threads

– 2 NICs 1GB/s

Consider Load Balancing

Improving your current System Cont’d

• File Storage

– Disable Search indexing and anti-virus scanning of FileStorage

– Do not store images in SQL database

– 1GB/s access

• Batches

– Purging (2 weeks or less)

– Limit big batches (fewer than 100 pages per batch)

• Input and Output

– Put input directory on Server

– Separate location for Export and Hotfolders

Improving your current System Cont’d

• Networking

– Network Speed

– Switching

– VLANs

– Network Interfaces

Final Summary

• Architecture

– Medium System

• How to Optimize Processing Stations

– RAM

• Improving your current system

– Separate your Servers!

Questions
