Scaling Massive Content Stores in the Cloud
CloudExpo New York – June 2016
@johnnewton – Alfresco Founder & CTO
Alfresco Customers
Government, Financial Services, Healthcare, Manufacturing, Corporate
Somewhere in a secret underground location
someone is trying to store…
One Billion Documents!!!
http://www.warnerbros.com/austin-powers-international-man-mystery
Some have attempted before … and failed
Content Use Cases at Scale
• Enterprise Document Library
• Loans & Policies
• Claims & Case Processing
• Transaction & Logistics Records
• Research & Analysis
• Real-time Video
• Internet of Things
• Medical & Personnel Records
• Government Records & Archives
• Discovery & Litigation
Content Management Applications
• Document Library
• Image Management
• File Sync & Share
• Search & Retrieval
• Business Process Management
• Records Management
• Case Management
• Media Management
• Information Archiving
Content vs. Data vs. Files vs. EFSS
Data Files EFSS Content and ECM
Content Architecture as a Big Data Problem
Files / Renditions
Metadata
Directory
Categories
Relationships
Indexes
Search
Activities
Security
People
APIs
Processes / Tasks
Rules
Semantics
Types
Content Object
Access Create – Manage – Distribute – Use
Context
Database
Distributed FS
Database
Solr / ElasticSearch
Content at Scale in the Enterprise
• Users at Scale
• Concurrency
• Content Count
• Read/Write Throughput
• Geographic Distribution
• Volume Size
The Problem with Traditional Approaches
• Provisioning and Administration
• Geographic Distribution
• Lack of Agility
• Lack of Redundancy
• Lack of Elasticity
Content Management Architecture
Alfresco Share
Alfresco Repository
Alfresco SOLR
Activiti Workflow Engine
Database
FS Content Store
Indexes
S3
RDS
EBS or Ephemeral
PIOPS EBS
(or Glacier)
EC2
Scaling in Tiers
• Alfresco Transformation Server (x 2)
• Alfresco Solr with Alfresco Local Repo (Index Tracking) (x 2)
• Alfresco Repository (x 2)
• Alfresco Share (x 2)
• Alfresco Activiti Suite (x 2)
Data Meta-Model
[Diagram: example node tree] A, B, C, D: Folder, Folder, Doc, Doc, rendition
[Diagram: meta-model] Class (Type, Aspect), Property, Association, Child Association, Constraint
[Diagram: example model]
• Folder (Type): name (Property), contains (Child Association)
• Document (Type): name, content (Property), rendition (Association)
• Auditable (Aspect): who by, when (Property)
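The meta-model on this slide can be sketched in a few lines of Python. This is only an illustration of the concepts named on the slide (Type, Aspect, Property, Association, Child Association), not Alfresco's actual model API. The class names `ClassDef`, `TypeDef`, `AspectDef` and the helper `effective_properties` are hypothetical, the target of the `rendition` association is assumed to be another Document, and the Auditable properties are simplified to `who` and `when`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ClassDef:
    """A class in the meta-model: the common base of Type and Aspect."""
    name: str
    properties: List[str] = field(default_factory=list)
    # association name -> name of the target class
    associations: Dict[str, str] = field(default_factory=dict)
    # child association name -> name of the child class
    child_associations: Dict[str, str] = field(default_factory=dict)


class TypeDef(ClassDef):
    """The single type of a node, e.g. Folder or Document."""


class AspectDef(ClassDef):
    """A reusable bundle of properties that can be attached to a node."""


# The example model from the slide (association targets are assumptions).
folder = TypeDef("Folder", properties=["name"],
                 child_associations={"contains": "Document"})
document = TypeDef("Document", properties=["name", "content"],
                   associations={"rendition": "Document"})
auditable = AspectDef("Auditable", properties=["who", "when"])


def effective_properties(type_def: TypeDef, aspects: List[AspectDef]) -> List[str]:
    """Properties a node carries: those of its type plus those of its aspects."""
    props = list(type_def.properties)
    for aspect in aspects:
        props.extend(p for p in aspect.properties if p not in props)
    return props


print(effective_properties(document, [auditable]))
```

A Document with the Auditable aspect applied then carries both its own properties and the audit properties, which is the reason metadata volume grows much faster than the document count.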
Model Metadata Organization
1 Billion 15 Billion
Next Generation Relational Architectures
AZ 1 AZ 2
EBS mirror
EBS mirror
Amazon S3
EBS
Standby Instance
Primary Instance
AZ 1 AZ 3
Amazon S3
Primary Instance
AZ 2
Replica Instance
• Highly-available — synchronous vs. asynchronous replication
• Significantly more efficient use of network I/O
• Self-healing, Fault-tolerant, Instant crash recovery
MySQL with standby Next Generation DBMS
async
4/6 quorum
PiTR
Sequential write
Sequential write
Distributed writes
Amazon Elastic Block Store (EBS)
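The "4/6 quorum" label can be read as: a write counts as durable once four of six storage copies have acknowledged it. The sketch below illustrates only that counting rule. The function names are hypothetical, and the code does not model how the database actually ships or repairs data.

```python
from typing import Sequence

COPIES = 6        # total storage copies, read from the "4/6 quorum" label
WRITE_QUORUM = 4  # acknowledgements needed before a write counts as durable


def write_is_durable(acks: Sequence[bool], quorum: int = WRITE_QUORUM) -> bool:
    """Return True once at least `quorum` copies have acknowledged the write."""
    return sum(1 for ok in acks if ok) >= quorum


def tolerated_failures(copies: int = COPIES, quorum: int = WRITE_QUORUM) -> int:
    """Number of copies that can be unavailable while writes still succeed."""
    return copies - quorum


# Two of six copies are slow or down: the write still completes.
print(write_is_durable([True, True, True, True, False, False]))
print(tolerated_failures())
```

Because the writer waits for the fastest four copies rather than for every copy, a slow or failed copy does not stall the write path, in contrast to the mirrored sequential writes in the standby layout.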
Index and Search Architecture
Full-Text Query
Metadata Query
Facets & Buckets
Security Filters
Results Processing
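As a rough illustration of the "Security Filters" step on the query side, the sketch below drops search hits the current user may not read. The hit layout (a `readers` list per hit) and the function name `filter_by_acl` are assumptions made for this example, not the way Alfresco or Solr represent ACLs.

```python
from typing import Dict, Iterable, List, Set


def filter_by_acl(hits: Iterable[Dict], user_authorities: Set[str]) -> List[Dict]:
    """Keep only hits whose reader set overlaps the user's authorities."""
    return [hit for hit in hits if user_authorities & set(hit["readers"])]


hits = [
    {"id": "doc-1", "readers": ["GROUP_claims"]},
    {"id": "doc-2", "readers": ["GROUP_legal"]},
    {"id": "doc-3", "readers": ["GROUP_claims", "GROUP_legal"]},
]
visible = filter_by_acl(hits, {"alice", "GROUP_claims"})
print([hit["id"] for hit in visible])
```

At a billion documents, filtering like this is only practical when it runs inside the index as part of the query rather than as a post-processing pass over results.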
Credit: Ryan Tobora
ThinkBig, Teradata
http://thinkbig.teradata.com/solrcloud-terminology/
Text Extraction
Metadata Injection & Path Processing
Shingles
ACL Processing
Results Process
Term-hit Highlighting
x 20 instances
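"Shingles" in the indexing pipeline are word n-grams: adjacent tokens indexed together so that phrase queries can match a single term. A minimal sketch, assuming simple whitespace tokenization and lowercasing (a real analyzer chain does considerably more):

```python
from typing import List


def shingles(text: str, size: int = 2) -> List[str]:
    """Return the word n-grams ("shingles") of `text` with `size` tokens each."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


print(shingles("One Billion Documents"))
```

Shingles trade a larger index for faster phrase matching, which matters when the index is spread across many instances.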
Storage Layer
File Storage Architecture
In Place
AWS Import/Export
Direct Streaming
Aurora EBS
Metadata Content Metadata
Content
Archive Layer
S3 Amazon Glacier
Metadata Content
File System
Protocols APIs
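The slide separates a storage layer from an archive layer. One way such a split could be driven is by how recently content was accessed. The sketch below is purely illustrative: the one-year threshold, the layer names as strings, and the function `choose_layer` are all assumptions and do not come from the slides.

```python
from datetime import date, timedelta

# Hypothetical policy: content untouched for a year moves to the archive layer.
ARCHIVE_AFTER = timedelta(days=365)


def choose_layer(last_accessed: date, today: date,
                 archive_after: timedelta = ARCHIVE_AFTER) -> str:
    """Return "archive" for content not accessed within `archive_after`, else "storage"."""
    return "archive" if today - last_accessed > archive_after else "storage"


print(choose_layer(date(2015, 1, 1), date(2016, 6, 1)))
print(choose_layer(date(2016, 5, 1), date(2016, 6, 1)))
```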
BM4 Test Execution Environment – 1.2B Docs
• UI Test x 20 m3.2xlarge: simulate 500 users (Selenium / Firefox, 1 hour constant load, 10 sec think time)
• ELB
• Alfresco x 10 c3.2xlarge: Alfresco with Share and Repo
• Solr x 20 m3.2xlarge: Sharded Solr Cloud
• Aurora x 1 db.r3.xlarge
sites    folders      files            transactions   dbSize (GB)
10,804   1,168,206    1,168,206,000    15,475,064     3,185
Simulate AWS Import/Export (in place)
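With the index sharded across 20 Solr instances, each document has to land on exactly one shard. The slides do not say how documents are assigned, so the hash-based routing below is an assumption chosen for simplicity. `shard_for` and `average_docs_per_shard` are hypothetical helper names.

```python
import zlib

NUM_SHARDS = 20  # "Solr x 20" in the test environment


def shard_for(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a document id to a shard index using a stable hash (assumed routing)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def average_docs_per_shard(total_docs: int, num_shards: int = NUM_SHARDS) -> float:
    """Average number of documents each shard holds under an even spread."""
    return total_docs / num_shards


print(shard_for("doc-42"))
print(average_docs_per_shard(1_168_206_000))
```

Under an even spread, each of the 20 shards would index roughly 58 million of the 1,168,206,000 files listed in the table.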
Benchmark Results
• Document load rate: 1,000 documents per second (with 10 nodes)
• 3 Million per Hour!
• Load rate stayed consistent even after passing the 1 billion document mark
• Sub-second login times and good response times for other actions
• Open Library: 4.5s
• Page Results: 1s
• Navigate to Site: 2.3s
• Aurora indexes used efficiently at 3.2TB
• No indications of any size-related bottlenecks with 1.1 Billion Documents
• CPU loads:
• Database: 8-10%
• Alfresco (each of 10 nodes): 25-30%
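A quick back-of-the-envelope check of these figures, using only numbers from the slides. It assumes the 1,000 documents per second rate held for the whole load, which the slides suggest but do not state as a total load time. Note that 1,000 per second works out to 3.6 million per hour, so the "3 Million per Hour" bullet reads as a rounded figure.

```python
DOCS_PER_SECOND = 1_000      # load rate with 10 nodes
NODES = 10
TOTAL_FILES = 1_168_206_000  # "files" column of the test data table
TOTAL_FOLDERS = 1_168_206    # "folders" column of the test data table

docs_per_hour = DOCS_PER_SECOND * 3600              # documents loaded per hour
docs_per_second_per_node = DOCS_PER_SECOND / NODES  # share of the rate per node
load_days = TOTAL_FILES / DOCS_PER_SECOND / 86_400  # days to load every file
files_per_folder = TOTAL_FILES // TOTAL_FOLDERS     # average folder size

print(docs_per_hour, docs_per_second_per_node, round(load_days, 1), files_per_folder)
```

At that rate the full data set takes roughly two weeks of continuous loading, with each node handling about 100 documents per second.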
What a Difference
ECM ECM ECM
Search Search Search
FS FS FS
Hardware Hardware Hardware
Load Balancer
DR Plan
HSM HSM HSM
3-6 Months
Questionable Scale
Little Redundancy
Lots of $$$
< 30 mins
10x Faster
Fault-Tolerant
Open, Cost Effective
ELB
Alfresco Alfresco Alfresco
Solr Solr Solr
S3
EC2 EC2 EC2
AZ1 AZ2 AZ3
EBS EBS EBS
Well, what am I supposed to do with all this frickin’ hardware?!!