ABBYY Technology Summit 2016
ABBYY NAHQ, 2016
Pierre van der Westhuizen
© ABBYY Confidential
Introduction: Designing High Performance Systems
• Upscale FlexiCapture, identify bottlenecks, and optimize system performance
• Performance Metrics and testing
• FlexiCapture Performance at a glance
– Scaling
– Up to 3 million pages per day
– High Fault Tolerance and Availability
Agenda: Designing High Performance Systems
• Introduction
• *Architecture of FlexiCapture
• Component Interaction Walkthrough
• Defining Performance Metrics
• Scaling of Systems (Demo, Medium, Large)
• *Optimizing: Processing Stations, Scanning Stations and Workflow
• Optimal Values of and Limitations on System Performance
• System Monitoring and Bottleneck Detection
• Performance testing
• *Improving your current system
Architecture
Architecture – Server Side
• Application Level
– Application Server
– Licensing Server
• Processing Level
– Processing Server
• Data Storage
– Database
– File Storage
Architecture – Client Side
• User Stations
– Scanning
– Verification
• Processing Stations
• Administration/Monitoring Web Console
• Project Setup Station
Component Interaction
Performance Metrics
Defining Performance Metrics
We measure performance in volumes processed per period of time.
Define target performance using performance metrics:
• The required processing time
• Processing volumes
Parameters that shape workload
• Average batch size in pages
• Image color mode: color, grayscale, black-and-white
• Pages per day (i.e. 24 hours), average/peak
• Pages per hour, average/peak
• Average document size in pages
• Number of scanning operators
• Number of verification operators
• Document storage time
Scaling
Demo System
Medium System
Medium System: Application Server
1. Fast network
2. Fast connection to File Storage and Database
3. Fast CPU
4. RAM
Medium System: Processing, Licensing Servers
For redundancy, see the FlexiCapture System Administrator Guide.
Medium System: Database Server
• More RAM
• Fast HDD
• Avoid Mirroring
• Separate Data and Logs
• Index Updates
Medium System: File Storage
Read-write and capacity requirements depend on:
• Average and peak pages processed per day
– Speed required for 10,000 pages/hr: about 2.8 pages/sec x 3 MB per page = 8.4 MB/sec
• Amount of time that documents are stored
– e.g. 16 x 100,000 grayscale images x 3 MB (average file size for a grayscale image) = 4.8 TB of data
NOTE: We strongly recommend RAID 10 or a comparable configuration.
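The two estimates above can be reproduced with a short Python sketch. The function names and the 1 TB = 1,000,000 MB convention are assumptions made for illustration; nothing here is part of FlexiCapture itself.

```python
def required_throughput_mb_per_s(pages_per_hour: float, avg_page_mb: float) -> float:
    """Sustained read-write rate the file storage must handle, in MB/sec."""
    pages_per_second = pages_per_hour / 3600
    return pages_per_second * avg_page_mb


def required_capacity_tb(stored_pages: int, avg_page_mb: float) -> float:
    """Space needed for the retained images, in TB (assumes 1 TB = 1,000,000 MB)."""
    return stored_pages * avg_page_mb / 1_000_000


if __name__ == "__main__":
    # Figures taken from the slide: 10,000 pages/hr, 3 MB per grayscale page.
    print(f"{required_throughput_mb_per_s(10_000, 3):.1f} MB/sec")
    # 16 x 100,000 stored grayscale images, as in the slide's example.
    print(f"{required_capacity_tb(16 * 100_000, 3):.1f} TB")
```

Without intermediate rounding the throughput comes out slightly under the slide's figure, because the slide rounds to 2.8 pages/sec before multiplying.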
Large System
Optimizing Processing Stations
Processing Stations
• Tune each station
• Add more stations
Processing Station: Hardware
• 1 Process per core
• 16 cores max
• 1GB RAM per core
Processing speed greatly depends on the CPU speed and the Hard Disk read-write speed.
Processing Station: TEMP Folder
Scenario: 100-page batches on an 8-core Station
• 100 pages x 3 MB = 300 MB
• 8 cores means 8 simultaneous executive processes
• TEMP folder space required is 300 MB x 8 = 2.4 GB
• Allocate 2 GB per core + 2.4 GB for TEMP = 18.4 GB RAM
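As a rough sketch, the same sizing can be written in Python. The function names are hypothetical, and the sketch follows the slide's arithmetic in adding the TEMP space to the RAM total (which assumes TEMP is held in memory) and in using 1 GB = 1,000 MB.

```python
def temp_space_gb(pages_per_batch: int, avg_page_mb: float, cores: int) -> float:
    """TEMP space needed when every core works on one batch at a time, in GB."""
    batch_mb = pages_per_batch * avg_page_mb
    return batch_mb * cores / 1000  # assumes 1 GB = 1,000 MB, as on the slide


def station_ram_gb(cores: int, ram_per_core_gb: float, temp_gb: float) -> float:
    """RAM to allocate: per-core working memory plus the TEMP space."""
    return cores * ram_per_core_gb + temp_gb


if __name__ == "__main__":
    temp = temp_space_gb(pages_per_batch=100, avg_page_mb=3, cores=8)
    print(f"TEMP: {temp:.1f} GB")
    print(f"RAM:  {station_ram_gb(cores=8, ram_per_core_gb=2, temp_gb=temp):.1f} GB")
```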
Calculate Number of Processing Stations
Estimate the required number of processing cores
Measure how long it takes one core to process one batch
Test setup: 8-core Processing Station
Process:
1. Create 24 copies of a typical batch
2. Put all batches in the FlexiCapture hotfolder
3. Start the timer when the first import task is created
4. Stop the timer after the last result has been exported to the backend
Result: 15 minutes elapsed
• Each core has processed 3 batches
• Time to process 1 batch is about 5 minutes
• If a batch has 69 pages, it takes about 4.35 seconds to process 1 page
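The per-page time can be derived from the timing run with a small Python sketch. The function name is hypothetical, and the calculation assumes the batches are spread evenly across the cores and that each core handles its batches one after another.

```python
def seconds_per_page(elapsed_minutes: float, batches: int, cores: int,
                     pages_per_batch: int) -> float:
    """Time one core needs for one page, derived from a timed test run."""
    batches_per_core = batches / cores            # 24 / 8 = 3 in the slide's run
    seconds_per_batch = elapsed_minutes * 60 / batches_per_core
    return seconds_per_batch / pages_per_batch


if __name__ == "__main__":
    # Figures from the slide: 24 batches of 69 pages, 8 cores, 15 minutes.
    t = seconds_per_page(elapsed_minutes=15, batches=24, cores=8, pages_per_batch=69)
    print(f"{t:.2f} seconds per page")
```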
Estimate the required number of processing cores Cont’d
Estimate desired number of cores
Assume you need to process P pages in T time. We already know from the above that 1 core needs t time to process 1 page. Hence, you need N = (P x t) / T cores.
Example:
• 200,000 pages in 8 hours = 28,800 seconds
• We know 1 core takes 4.35 seconds to process 1 page
• 200,000 x 4.35 / 28,800 ≈ 30.2, rounded up to 31 cores
=> 4 Processing Stations with 8 cores each (32 cores in total) will be sufficient.
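The formula N = (P x t) / T can be sketched in Python as follows. The function names are made up for illustration; rounding up to whole cores and whole stations is an assumption that matches the slide's example.

```python
import math


def cores_needed(pages: int, seconds_per_page: float, window_seconds: float) -> int:
    """N = (P x t) / T, rounded up to a whole number of cores."""
    return math.ceil(pages * seconds_per_page / window_seconds)


def stations_needed(cores: int, cores_per_station: int) -> int:
    """Whole Processing Stations required to supply the given number of cores."""
    return math.ceil(cores / cores_per_station)


if __name__ == "__main__":
    # Figures from the slide: 200,000 pages in 8 hours at 4.35 seconds per page.
    n = cores_needed(pages=200_000, seconds_per_page=4.35, window_seconds=8 * 3600)
    print(n, "cores")
    print(stations_needed(n, cores_per_station=8), "stations")
```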
Processing Cores – Limiting Factors
• The total load on the infrastructure that may result in bottlenecks
– Server Hardware
– Network
– Shared Resources (Database, External Services)
• The number of processing cores that can be served by the Processing Server
– Max 120 cores
Monitor Free Processing Cores on Processing Server
Optimizing: Scanning Stations
Scanning Stations
• Performance Limits
– Scanner Speed
– Data Transfer Bandwidth
• Separate Network Interface for Scanning
• Set up scan settings
– Color Mode
– Remove Blanks
• Schedule Image Uploads
Optimizing: Workflow
Workflow
• Avoid too many stages
• The slowest stage limits the performance
• Do not produce tasks that are too small when parallelizing processing at a stage
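To illustrate why the slowest stage limits performance: end-to-end throughput cannot exceed the throughput of the slowest stage. The Python sketch below uses made-up stage names and rates in pages per second; they are illustrative assumptions, not measurements.

```python
def bottleneck(stage_rates: dict[str, float]) -> tuple[str, float]:
    """Return the slowest stage and its rate, which caps the whole workflow."""
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


if __name__ == "__main__":
    # Hypothetical per-stage throughput in pages per second.
    rates = {"Import": 5.0, "Recognition": 2.3, "Export": 4.0}
    stage, rate = bottleneck(rates)
    print(f"Workflow is limited by {stage} at {rate} pages/sec")
```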
Optimal Values and Limits
Optimal Values of and Limitations on the System Performance 1
Factor: Optimal values & limitations
• System performance in pages per 24 hours, able to process:
– Demo: up to 20,000 black-and-white or up to 1,000 color pages per 24 hours
– Medium: up to 1 mln black-and-white or up to 300,000 color pages per 24 hours, using a farm of regular computers
– Large: up to 3 mln black-and-white or up to 1 mln color pages per 24 hours
Optimal Values of and Limitations on the System Performance 2
Factor: Optimal values & limitations
• Number of scanning operators: FlexiCapture is able to host 1,000 scanning operators.
• Number of verification operators: FlexiCapture is able to host 300 verification operators.
• Number of Processing Stations: We used up to 120 cores in total for all Processing Stations.
• Number of cores per Processing Station:
– regular disk drive: up to 8 cores
– fast disk drive: up to 16 cores
– RAM drive: up to 32 cores
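A planned set of Processing Stations can be checked against these limits with a short Python sketch. The drive-type keys, the function name, and the warning wording are assumptions for illustration; only the numeric limits come from the slides.

```python
MAX_TOTAL_CORES = 120
# Per-station core limits by drive type, taken from the slide.
MAX_CORES_PER_STATION = {"regular": 8, "fast": 16, "ram": 32}


def check_stations(stations: list[tuple[str, int]]) -> list[str]:
    """Return warnings for a list of (drive_type, cores) station definitions."""
    warnings = []
    for index, (drive, cores) in enumerate(stations, start=1):
        limit = MAX_CORES_PER_STATION[drive]
        if cores > limit:
            warnings.append(
                f"station {index}: {cores} cores exceeds {limit} for a {drive} drive"
            )
    total = sum(cores for _, cores in stations)
    if total > MAX_TOTAL_CORES:
        warnings.append(f"total of {total} cores exceeds {MAX_TOTAL_CORES}")
    return warnings


if __name__ == "__main__":
    # Four 8-core stations on regular disks: within both limits.
    print(check_stations([("regular", 8)] * 4))
    # One station with too many cores for a regular disk.
    print(check_stations([("regular", 16)]))
```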
Optimal Values of and Limitations on the System Performance 3
Factor: Optimal values & limitations
• Number of pages in a Batch: Optimal value is from 10 to 1000 pages in a batch
• Number of pages in a Document: Optimal value is up to 100 pages in a document
• Number of pages, documents, and batches in the system: This highly depends on the hardware used. For a Large configuration, up to 100,000 batches, or 1 mln documents, or 10 mln pages is normal.
Optimal Values of and Limitations on the System Performance 4
Factor: Optimal values & limitations
• Data storage time: Typically, pages, documents, batches and event log records are stored in the system for up to 2 weeks. Statistics for reporting can be stored for years with no impact on performance.
Performance Testing
• Single Entry Point Project
– Import from Scanning Station
– Pre-processing
– Recognition
– Export
– Processed
System Monitoring and Bottleneck Detection
• Document processing monitoring via the Administration and Monitoring Console
• Hardware monitoring for each FlexiCapture server component using various Windows Performance Monitor counters
– Memory
– CPU
– Hard Disk
– Network
– IIS
– SQL Server
Setting up Performance Counters
• Monitor FlexiCapture state and search for bottlenecks
– Performance Monitor utility
• Recorded by Processing Server
• mmc /32 perfmon.msc
• Enable Counters
Improving Your Current System
• Separate your Servers
– App Server
– Processing Server/License Server
– Processing Station
– Database Server
Improving your current System Cont’d
• Database
– More RAM is better – At least 4GB
– Fast Drives
– Data and Logs on Separate Drives
– Autogrowth – 100MB Increments
– Simple Recovery Model
– Maintenance Plans
  • Backups – Truncate Transaction Log
  • Indexes
Improving your current System Cont’d
• Application Server
– IIS Logs
– Caching
– Fast Hard Drive
– App Server Recycling Pool
– Number of Threads
– 2 NICs 1GB/s
Consider Load Balancing
Improving your current System Cont’d
• File Storage
– Disable Search indexing and anti-virus scanning of FileStorage
– Do not store images in SQL database
– 1GB/s access
• Batches
– Purging (2 weeks or less)
– Limit Big batches (Less than 100 pages per batch)
• Input and Output
– Put input directory on Server
– Separate location for Export and Hotfolders
Improving your current System Cont’d
• Networking
– Network Speed
– Switching
– VLANs
– Network Interfaces
Final Summary
• Architecture
– Medium System
• How to Optimize Processing Stations
– RAM
• Improving your current system
– Separate your Servers!
Questions