Demo, May 2005
Privacy Preserving Database Application Testing
Xintao Wu, Yongge Wang, Yuliang Zheng, UNC Charlotte
Overview
• Milestone
  - Initial investigation from May 2002 to Dec 2002
  - Officially started in Sept 2003, supported by NSF CCR-0310974 (200k, Sept 2003 – August 2005)
  - The prototype system was finished in April 2005, developed using C++ and Oracle with 22K lines of source code
  - Demo at several banks, May 2005 …
• Personnel
  - Faculty: Xintao Wu, Yongge Wang, Yuliang Zheng
  - Current graduate students: Songtao Guo, Ying Wu, Chintan Sanghvi, Guodong Jiao
  - Previous graduate students: Jing Jin, Amol Kedar
  - Several senior undergraduate students
• More Info
  - http://www.cs.uncc.edu/~xwu/privacy
  - [email protected]
Motivation
• To generate synthetic data for DB application testing, especially performance testing.
  - Many applications involve large-scale databases with sensitive information.
  - Complete testing is essential for database applications to function correctly and to provide acceptable performance.
Our Approach
• To generate synthetic databases based on a-priori knowledge about the current production databases
  - The needed a-priori knowledge is generally available from the ER model, DDL, and data dictionary: schema, data integrity rules, and basic statistical information
  - Detailed statistical information can be extracted if the original data or samples from the production database are available
  - The generated data can be of realistic volume or any other volume
  - Better controllability, observability, and privacy
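The idea of generating data from a-priori knowledge can be illustrated with a minimal Python sketch. It is not the prototype's implementation (which was written in C++ with Oracle). The table, column names, value distribution, and the `generate_rows` and `satisfies_rules` helpers are all hypothetical, chosen only to show schema, one integrity rule, and basic statistics driving a generator.

```python
import random

# Hypothetical a-priori knowledge for one table: column domains, one marginal
# distribution, and a value range. None of these names come from the original system.
KNOWLEDGE = {
    "branch_dist": {"north": 0.5, "south": 0.3, "east": 0.2},  # basic statistics
    "balance_range": (0.0, 10_000.0),                          # domain from the data dictionary
}

def satisfies_rules(row):
    """Hypothetical integrity rule: balances lie inside the declared domain."""
    low, high = KNOWLEDGE["balance_range"]
    return low <= row["balance"] <= high

def generate_rows(n, seed=0):
    """Generate n synthetic rows; n can be a realistic volume or any other volume."""
    rng = random.Random(seed)  # seeded, so a test database can be reproduced
    branches = list(KNOWLEDGE["branch_dist"])
    weights = list(KNOWLEDGE["branch_dist"].values())
    low, high = KNOWLEDGE["balance_range"]
    rows = []
    next_id = 1
    while len(rows) < n:
        row = {
            "account_id": next_id,  # sequential ids keep the key unique
            "branch": rng.choices(branches, weights=weights, k=1)[0],
            "balance": round(rng.uniform(low, high), 2),
        }
        if satisfies_rules(row):  # keep only rows that satisfy the rule
            rows.append(row)
            next_id += 1
    return rows
```

Calling `generate_rows(1000)` yields 1000 rows drawn without touching any production record, which is where the controllability and privacy benefits listed above come from.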
Three Characteristics of Synthetic Data
• Valid
  - The synthetic data needs to satisfy all the same constraints and business rules as the live data
  - Necessary for functional testing
• Privacy preserving
  - No disclosure of any confidential information that needs to be protected
• Resembling real data
  - The synthetic data needs to have statistical distributions or patterns similar to those of the live data
  - Necessary for performance testing, as the statistical nature of the data determines query performance
  - We will show that if the data distributions are not similar, the execution time of the same workload may be totally different.
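A small Python sketch, using the standard-library `sqlite3` module rather than the Oracle setup of the prototype, suggests why the distribution matters. The `account` table, its `branch` column, and the `matching_rows` helper are hypothetical. The sketch counts the rows selected by one fixed predicate instead of timing it, on the assumption that the number of qualifying rows is a reasonable proxy for the work a query has to do.

```python
import random
import sqlite3

def matching_rows(weights, n=10_000, seed=0):
    """Load n rows whose branch values follow `weights`, then count the rows
    selected by one fixed predicate (hypothetical table and column names)."""
    rng = random.Random(seed)
    values = rng.choices(["north", "south", "east"], weights=weights, k=n)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (branch TEXT)")
    conn.executemany("INSERT INTO account VALUES (?)", [(v,) for v in values])
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM account WHERE branch = 'north'"
    ).fetchone()
    conn.close()
    return count

if __name__ == "__main__":
    # Same table size and same query; only the value distribution differs.
    print("uniform:", matching_rows([1, 1, 1]))
    print("skewed: ", matching_rows([98, 1, 1]))
```

With uniform weights roughly a third of the rows qualify, while with the skewed weights nearly all of them do, so a test database with the wrong distribution would exercise the same workload very differently.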
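Architecture
[Figure: system architecture. Inputs: ER, Data, DDL, Catalog. Extracted knowledge: R, NR, S. Components: Schema & Domain Filter (producing Schema', Domain'), Disclosure Assessment, Performance Assessment, General Location Model, Data Generator. Output: synthetic database.]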