Date posted: 15-Jan-2015
Category: Technology
Uploaded by: regunath-balasubramanian
Big Data at Aadhaar
Dr. Pramod K Varma ([email protected]), Twitter: @pramodkvarma
Regunath Balasubramanian ([email protected]), Twitter: @RegunathB
Aadhaar at a Glance
India
• 1.2 billion residents
– 640,000 villages, ~60% live under $2/day
– ~75% literacy, <3% pay Income Tax, <20% have banking access
– ~800 million mobile connections, ~200-300 mn migrant workers
• Govt. spends about $25-40 bn on direct subsidies
– Residents have no standard identity document
– Most programs plagued with ghost and multiple identities, causing leakage of 30-40%
Vision
• Create a common "national identity" for every "resident"
– Biometric-backed identity to eliminate duplicates
– "Verifiable online identity" for portability
• Applications ecosystem using open APIs
– Aadhaar-enabled bank account and payment platform
– Aadhaar-enabled electronic, paperless KYC
Aadhaar System
• Enrolment
– One time in a person's lifetime
– Minimal demographics
– Multi-modal biometrics (fingerprints, iris)
– 12-digit unique Aadhaar number assigned
• Authentication
– Verify "you are who you claim to be"
– Open API based
– Multi-device, multi-factor, multi-modal
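The slide does not detail it, but the 12-digit Aadhaar number ends in a Verhoeff check digit (11 base digits + 1 checksum), which catches all single-digit and adjacent-transposition typos. A minimal sketch using the standard dihedral-group D5 tables; this is illustrative, not UIDAI code:

```python
# Multiplication table of the dihedral group D5 (identity 0;
# entries 5-9 are reflections, which compose by subtraction).
D = [[0] * 10 for _ in range(10)]
for j in range(10):
    for k in range(10):
        jr, jf = j % 5, j >= 5          # rotation part, reflection flag
        kr, kf = k % 5, k >= 5
        r = (jr - kr) % 5 if jf else (jr + kr) % 5
        D[j][k] = r + 5 * (jf != kf)

# Position-dependent permutation: P[i] is the base permutation
# applied i times (it cycles with period 8).
BASE = [1, 5, 7, 6, 2, 8, 3, 0, 9, 4]
P = [list(range(10))]
for _ in range(7):
    P.append([BASE[x] for x in P[-1]])

INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]    # group inverse of each element

def _checksum(digits: str, shift: int = 0) -> int:
    c = 0
    for i, ch in enumerate(reversed(digits)):
        c = D[c][P[(i + shift) % 8][int(ch)]]
    return c

def check_digit(base11: str) -> int:
    """Check digit to append to an 11-digit base number."""
    return INV[_checksum(base11, shift=1)]

def validate(full12: str) -> bool:
    """True iff the full number (base digits + check digit) is consistent."""
    return _checksum(full12) == 0
```

Validation folds each digit through the group table; appending `INV` of the running product makes the whole string fold to the identity, by associativity.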
Architecture Principles
• Design for scale
– Every component needs to scale to large volumes
– Millions of transactions and billions of records
– Accommodate failure and design for recovery
• Open architecture
– Use of open standards to ensure interoperability
– Allow the ecosystem to build libraries against standard APIs
– Use of open-source technologies wherever prudent
• Security
– End-to-end security of resident data
– Use of open source
– Data privacy handling (API and data anonymization)
Designed for Scale
• Horizontal scalability for all components
– "Open scale-out" is the key
– Distributed computing on commodity hardware
– Distributed data store and data partitioning
– Horizontal scaling of the "data store" is a must!
– Use of the right data store for the right purpose
• No single point of bottleneck for scaling
• Asynchronous processing throughout the system
– Allows loose coupling of various components
– Allows independent component-level scaling
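Data partitioning of the kind listed above is usually driven by a deterministic key-to-shard mapping. A hypothetical sketch (the slides do not describe Aadhaar's actual sharding scheme) of hash-based routing over commodity shards:

```python
import hashlib

def shard_for(record_key: str, num_shards: int) -> int:
    """Map a record key (e.g. an enrolment ID) to a shard index.

    Hashing gives a stable, roughly uniform spread of records, so no
    single shard becomes a bottleneck as volume grows.
    """
    digest = hashlib.sha1(record_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the mapping depends only on the key, any stateless front-end node can route a request without coordination, which keeps the scale-out "open".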
Enrolment Volume
• 600 to 800 million UIDs in 4 years
– 1 million a day
– 200+ trillion biometric matches every day!!!
• ~5 MB per resident
– Maps to about 10-15 PB of raw data (2048-bit PKI encrypted!)
– About 30 TB of I/O every day
– Replication and backup across DCs of about 5+ TB of incremental data every day
– Lifecycle updates and new enrolments will continue forever
• Additional process data
– Several million events on average moving through async channels (some persistent, some transient)
– Needing complete update and insert guarantees across data stores
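The headline figures above follow from simple arithmetic. A back-of-envelope sketch; the ~200M dedup gallery and ~3x replication factor are assumptions chosen to reproduce the slide's numbers:

```python
# Dedup: every new enrolment is matched against the existing gallery.
daily_enrolments = 1_000_000
gallery = 200_000_000                     # assumed existing residents
matches_per_day = daily_enrolments * gallery
assert matches_per_day == 200 * 10**12    # 200+ trillion matches/day

# Incremental data shipped across DCs each day.
per_resident = 5 * 10**6                  # ~5 MB of packet data
incremental = daily_enrolments * per_resident
assert incremental == 5 * 10**12          # ~5 TB/day

# Total raw footprint at full scale, assuming ~3x replication.
total_raw = 800_000_000 * per_resident * 3
assert 10 * 10**15 <= total_raw <= 15 * 10**15   # the slide's 10-15 PB
```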
Authentication Volume
• 100+ million authentications per day (over 10 hrs)
– Possible high variance between peak and average
– Sub-second response
– Guaranteed audits
• Multi-DC architecture
– All changes need to be propagated from enrolment data stores to all authentication sites
• Authentication request is about 4 KB
– 100 million authentications a day
– 1 billion audit records in 10 days (30+ billion a year)
– 4 TB of encrypted audit logs in 10 days
– Audit writes must be guaranteed
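These volumes also reduce to straightforward arithmetic, sketched below; the 10-hour window and ~4 KB request size are taken from the slide:

```python
daily_auths = 100_000_000
window_s = 10 * 3600                       # 10-hour service window
avg_tps = daily_auths / window_s
assert round(avg_tps) == 2778              # ~2.8K auths/sec average
# (peaks will be far higher, hence the variance caveat)

request_bytes = 4 * 1024                   # ~4 KB per request
audit_10_days = daily_auths * 10
assert audit_10_days == 10**9              # 1 billion audit records
audit_bytes = audit_10_days * request_bytes
assert abs(audit_bytes - 4 * 10**12) < 10**11   # ~4 TB of audit logs
assert daily_auths * 365 > 30 * 10**9      # 30+ billion records a year
```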
Open APIs
• Aadhaar Services
– Core Authentication API, with supporting Best Finger Detection and OTP Request APIs
– New services being built on top
• Aadhaar Open Standards for plug-n-play
– Biometric Device API
– Biometric SDK API
– Biometric Identification System API
– Transliteration API for Indian languages
Implementation Patterns & Technologies
• Principles
– POJO-based application implementation
– Light-weight, custom application container
– HTTP gateway for APIs
• Compute patterns
– Data locality
– Distributed compute (within an OS process and across processes)
• Compute architectures
– SEDA (Staged Event-Driven Architecture)
– Master-Worker compute grid
• Data access types
– High-throughput streaming: bio-dedupe, analytics
– High volume, moderate latency: workflow, UID records
– High volume, low latency: auth, demo-dedupe, search (eAadhaar, KYC)
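The SEDA pattern named above decomposes processing into stages, each with its own event queue and worker pool, so stages are loosely coupled and independently scalable. A minimal sketch using stdlib threads and queues; illustrative only, not the actual Aadhaar container:

```python
import queue
import threading

class Stage:
    """One SEDA stage: an inbound queue drained by a small worker pool."""

    def __init__(self, name, handler, out_queue=None, workers=2):
        self.name = name
        self.handler = handler          # function applied to each event
        self.inbox = queue.Queue()
        self.out_queue = out_queue      # next stage's inbox, if any
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            event = self.inbox.get()
            result = self.handler(event)
            if self.out_queue is not None:
                self.out_queue.put(result)
            self.inbox.task_done()

# Two-stage pipeline: parse raw input, then validate the parsed record.
results = queue.Queue()
validate = Stage("validate", lambda e: {**e, "valid": e["age"] >= 0},
                 out_queue=results)
parse = Stage("parse", lambda raw: {"age": int(raw)},
              out_queue=validate.inbox)

for raw in ["34", "7", "51"]:
    parse.inbox.put(raw)
parse.inbox.join()
validate.inbox.join()
```

Because each stage owns its queue, a slow stage backs up locally instead of stalling the whole pipeline, and its worker count can be tuned independently.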
Aadhaar Data Stores (data consistency challenges)

• MongoDB cluster (all enrolment records/documents: demographics + photo)
– Shards 1-5
– Low-latency indexed reads (documents per sec); high-latency random search (seconds per read)
• MySQL: UID master (sharded) and Enrolment DB (all UID-generated records: demographics only, track & trace, enrolment status)
– Low-latency indexed reads (milliseconds per read); high-latency random search (seconds per read)
• Solr cluster (all enrolment records/documents: selected demographics only)
– Shards 0, 2, 6, 9, a, d, f
– Low-latency indexed reads (documents per sec); low-latency random search (documents per sec)
• HDFS (all raw packets)
– Data Nodes 1-20
– High read throughput (MB per sec); high-latency reads (seconds per read)
• HBase (all enrolment biometric templates)
– Region Servers 1-20
– High read throughput (MB per sec); low-to-medium latency reads (milliseconds per read)
• NFS (all archived raw packets)
– LUNs 1-4
– Moderate read throughput; high-latency reads (seconds per read)
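The store-per-workload assignment above can be read as a lookup from access pattern to backing store. A sketch; the store names and access patterns come from the slide, while the pattern labels and function are illustrative:

```python
# "Right data store for the right purpose", as a routing table.
STORE_BY_ACCESS_PATTERN = {
    "document-read-indexed":     "MongoDB",  # demographics + photo
    "record-read-indexed":       "MySQL",    # UID master, track & trace
    "random-search-low-latency": "Solr",     # selected demographics
    "raw-packet-streaming":      "HDFS",     # all raw packets
    "template-read-low-latency": "HBase",    # biometric templates
    "archive":                   "NFS",      # archived raw packets
}

def store_for(pattern: str) -> str:
    """Pick the backing store for a given access pattern."""
    return STORE_BY_ACCESS_PATTERN[pattern]
```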
Aadhaar Architecture
• Work distribution using SEDA & messaging
• Ability to scale within a JVM and across JVMs
• Recovery through check-pointing
• Sync HTTP-based Auth gateway
• Protocol Buffers & XML payloads
• Sharded clusters
• Near real-time data delivery to warehouse
• Nightly data-sets used to build dashboards, data marts and reports
• Real-time monitoring using Events
Deployment Monitoring
Learnings
• Make everything API based
• Everything fails (hardware, software, network, storage)
– The system must recover, retry transactions, and sort of self-heal
• Security and privacy should not be an afterthought
• Scalability does not come from one product
• Open scale-out is the only way to go
– Heterogeneous, multi-vendor, commodity compute, growing in linear fashion. Nothing else can adapt!
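The "everything fails, retry and self-heal" learning is commonly implemented as retry with exponential backoff, so transient faults are absorbed without hammering a struggling dependency. An illustrative policy sketch, not the Aadhaar code:

```python
import time

def with_retries(op, attempts=5, base_delay=0.01):
    """Run op(), retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise                      # exhausted: surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Doubling the delay each attempt spaces retries out, giving the failing component (disk, network, peer service) time to recover before the next try.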