University of Namur, Faculté d'informatique
PReCISE Research Center - Database Engineering Group
www.info.fundp.ac.be/libd
- A (sort of) spatio-temporal view of DB reverse engineering -
Jean-Luc Hainaut
February 5, 2014 Stevens Award lecture WCRE-CSMR 2014
Data matters most
but where has all the semantics gone?
• Introduction
• Understanding data semantics
• Data models
• Tracing data semantics
• Recovering hidden data semantics
• Is data semantics recovery that important, actually?
• Summary and conclusions
Introduction
Objectives of the lecture
1. To study the concept of data semantics in business applications
2. To identify and evaluate the techniques used to represent data semantics
3. To observe how these techniques have evolved in time and in different cultures
4. To discuss the methods used to recover the semantics lost when poor representation techniques have been used
The role of data in business applications
Axioms on databases
1. The database is a picture of the application domain
• Its schema is a model of the static structures of the domain
• Its data describe the current state (or a sequence of states) of the domain
2. The database is designed independently of the application programs
The database is designed before the application programs
3. The evolution of the database schema reflects the evolution of the functional requirements
4. The database is described by (at least) two schemas:
• the conceptual schema: abstract, platform-independent
formalism: ER model, conceptual UML class diagrams
• the logical schema: concrete, platform-dependent
formalism: SQL2, Java classes
There exists a bidirectional mapping between both.
The role of data in business applications
Meta-axioms on axioms on databases
1. The axioms are often ignored by developers
- ignore = "How interesting! I didn't know them"
- ignore = "I know them but they do not suit my way of working"
3. The biggest violation of the axioms concerns the existence and role of the conceptual schema
Understanding data semantics
Experimental approach and first conclusions
Preliminary question
Same data, different structures

Table T:
C1   | C2     | C3
C400 | Darwen | London
B512 | Owens  | NY
S144 | Garcia | Madrid

Table CUSTOMER:
CustID | Name   | City
C400   | Darwen | London
B512   | Owens  | NY
S144   | Garcia | Madrid

Table T:
C
C400 Darwen London
B512 Owens NY
S144 Garcia Madrid

To what extent does each of these data sets express the semantics of the data?
Motivating example. 1. Reading data from a COBOL file (1970)

external file
SELECT FILE1 ASSIGN TO "FILE1.DAT"
    ORGANIZATION IS INDEXED
    ACCESS MODE IS DYNAMIC
    RECORD KEY IS RKEY.
FD FILE1.
01 REC.
    02 RKEY PIC X(12).
    02 RINFO PIC X(100).

File content (record REC):
RKEY | RINFO
C400 | Darwen London
B512 | Owens NY
S144 | Garcia Madrid

application code (COBOL)
WORKING-STORAGE SECTION.
01 CUSTOMER.
    02 CustID PIC X(12).
    02 Name PIC X(60).
    02 City PIC X(40).

READ FILE1 INTO CUSTOMER.

REC (RKEY, RINFO) is read into CUSTOMER (CustID, Name, City):
CustID = B512, Name = Owens, City = NY
Motivating example. 1. Reading data from a COBOL file (1970)
REC (RKEY, RINFO) is read into CUSTOMER (CustID, Name, City).
Where has data semantics been defined?
• In file description (10%) - [unique key, key data type]
• In application code (93%).
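The COBOL example above can be mimicked in a few lines of Python. This is only a sketch under assumptions: the dictionary FILE1 stands in for the indexed file, and the helper names make_record and read_customer are hypothetical. It shows that the file knows only a 12-character key and a 100-character payload, while the split into Name and City exists solely in the reading code.

```python
# Hypothetical in-memory stand-in for FILE1.DAT: each record is a 12-character
# key (RKEY) followed by a 100-character opaque payload (RINFO).
def make_record(key: str, name: str, city: str) -> str:
    # The writing program packs Name (60 chars) and City (40 chars) into RINFO.
    return key.ljust(12) + name.ljust(60) + city.ljust(40)

FILE1 = {
    rec[:12].strip(): rec
    for rec in (
        make_record("C400", "Darwen", "London"),
        make_record("B512", "Owens", "NY"),
        make_record("S144", "Garcia", "Madrid"),
    )
}

def read_customer(key: str) -> dict:
    """Mimic READ FILE1 INTO CUSTOMER: the field layout lives only in this code."""
    rec = FILE1[key]
    return {
        "CustID": rec[0:12].strip(),
        "Name": rec[12:72].strip(),
        "City": rec[72:112].strip(),
    }

customer = read_customer("B512")
print(customer)
```

Any other program reading the same file has to repeat the same offsets, which is why most of the semantics ends up in application code.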
Motivating example. 2. Reading data from an RDB (1980+)

Relational DB
create table CUSTOMER(
    CustID char(12) not null,
    Name char(60) not null,
    City char(40) not null,
    primary key (CustID));

Table CUSTOMER:
CustID | Name   | City
C400   | Darwen | London
B512   | Owens  | NY
S144   | Garcia | Madrid

application code (C)
string v1;
string v2;
string v3;

select * into v1, v2, v3 from CUSTOMER where CustID = 'B512'

CUSTOMER (CustID, Name, City) is read into v1, v2, v3:
v1 = B512, v2 = Owens, v3 = NY
Motivating example. 2. Reading data from an RDB (1980+)
CUSTOMER (CustID, Name, City) is read into v1, v2, v3.
Where has data semantics been defined?
• In DB schema (100%)
• In application code (3%) - [data type].
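The same experiment can be sketched in Python, assuming SQLite (through the standard sqlite3 module) as a stand-in for the relational DBMS of the slide. The program declares nothing about the table: the column names come back from the schema through cursor.description. Note that SQLite does not enforce the declared char lengths, so this illustrates where names and keys are defined, not length checking.

```python
import sqlite3

# SQLite in memory is an assumption made for the illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    create table CUSTOMER(
        CustID char(12) not null,
        Name   char(60) not null,
        City   char(40) not null,
        primary key (CustID))
    """
)
conn.executemany(
    "insert into CUSTOMER values (?, ?, ?)",
    [("C400", "Darwen", "London"), ("B512", "Owens", "NY"), ("S144", "Garcia", "Madrid")],
)

cur = conn.execute("select * from CUSTOMER where CustID = ?", ("B512",))
# The names of the result columns are supplied by the schema, not by the program.
column_names = [d[0] for d in cur.description]
v1, v2, v3 = cur.fetchone()
print(column_names, v1, v2, v3)
```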
What does data semantics mean?
A tentative practical definition
Data semantics is the knowledge defined by all the non-technical, domain-dependent information that allows us to understand, to use and to manage the data.
Where can we find traces of data semantics?
• in the application code (reading from file)
• in the DB schema (reading from DB)
A first (trivial) observation
1. Expressiveness: DDL is the most appropriate language to declare data structures and constraints
2. Language independence: DDL is independent of application programming languages
3. Uniqueness: the schema is unique and centralized
4. Integration with data: the schema is a part of the database (no risk of losing it!)
5. Program independence: the schema is independent of application programs
6. Stability: the schema must be changed only when the application domain evolves.
It is best to express data semantics in the database schema
However, things are not always that simple (e.g., COBOL files)
Only data structures are explicit in application programs:
• record name
• field name
• field data type
Additional constraints generally are controlled by the application code:
• where?
• in which way?
• in all the modules processing the data?
Understanding data semantics by analyzing the program code can be much more complex than expected.
However, things are not always that simple (e.g., RDB)
Only standard integrity constraints can be coded through the DDL (SQL2):
• not null
• uniqueness
• referential integrity
Additional constraints must be coded through generic means:
• check predicates
• triggers
• stored procedures
Understanding data semantics by reading the database schema can be less easy than expected.
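The two families of constraints can be contrasted in a small Python sketch, assuming SQLite as the DBMS. Everything beyond the CUSTOMER columns is hypothetical and chosen only for the illustration: the ORDERS table, the overdraft limit in the check predicate and the business rule in the trigger. Not null, uniqueness and referential integrity are declared; the other two rules need a check predicate and a trigger, and a reader of the schema has to interpret them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite checks foreign keys only when enabled
conn.executescript(
    """
    create table CUSTOMER(
        CustID  char(12) not null primary key,
        Name    char(60) not null,
        Account integer  not null check (Account >= -1000)  -- hypothetical limit
    );
    create table ORDERS(                                     -- hypothetical table
        OrdID  char(10) not null primary key,
        CustID char(12) not null references CUSTOMER(CustID)
    );
    -- Hypothetical business rule: no new order for an overdrawn customer.
    create trigger NO_ORDER_IF_OVERDRAWN before insert on ORDERS
    begin
        select raise(abort, 'customer account is overdrawn')
        where (select Account from CUSTOMER where CustID = new.CustID) < 0;
    end;
    insert into CUSTOMER values ('C400', 'Darwen', -124);
    insert into CUSTOMER values ('B512', 'Owens', 5509);
    """
)

def rejected(sql: str) -> bool:
    """Return True when the DBMS itself refuses the statement."""
    try:
        with conn:
            conn.execute(sql)
        return False
    except sqlite3.DatabaseError:
        return True

attempts = {
    "not null": "insert into CUSTOMER values ('X1', null, 0)",
    "uniqueness": "insert into CUSTOMER values ('B512', 'Dup', 0)",
    "referential integrity": "insert into ORDERS values ('O1', 'ZZZZ')",
    "check predicate": "insert into CUSTOMER values ('S144', 'Garcia', -5000)",
    "trigger": "insert into ORDERS values ('O2', 'C400')",
    "valid order": "insert into ORDERS values ('O3', 'B512')",
}
outcome = {label: rejected(sql) for label, sql in attempts.items()}
print(outcome)
```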
Data models
Data models: abstraction hierarchy
Reminder on the database design process - The standard view
User requirements
-> Information analysis -> Conceptual schema
-> Logical design -> Logical (RDB) schema
-> Physical design -> Physical (DB2) schema
-> Coding -> SQL-DDL code
999. Data semantics and data models
The way data semantics is expressed in a database depends on its data model
Conceptual models
• ER (*)
• UML class diagrams
Logical models
• Record-oriented models:
  • files
  • legacy DBMS (IMS, CODASYL)
  • RDB (*)
• Key-Value models:
  • NoSQL (*)
  • CSV
• Structured object models:
  • OO
  • NoSQL
  • JSON (*)
  • XML
ER conceptual model
Abstract, platform-independent information description
The world is perceived as:
- sets of entities,
- properties that characterize entities,
- relationships holding between entities
A conceptual schema can be translated into several logical, DBMS-dependent, schemas
Example schema:
CUSTOMER (CustID, Name, City), id: CustID
ORDER (OrdID, DateOrd, Account), id: OrdID
place: relationship type between ORDER (1-1) and CUSTOMER (0-N)
Relational data model (schema-based, 1NF)
Examples: Oracle, DB2, SQL Server, MySQL, PostgreSQL, etc.
• Domain-dependent schema
• Schema and data are hierarchically distinct
• Values are aggregated into rows
• The semantics is explicit in the schema (part of!)
• The semantics is managed/controlled by the DBMS
metadata (column names) and data (rows):
CustID | Name   | City   | Account
C400   | Darwen | London | -124
B512   | Owens  | NY     | 5509
S144   | Garcia | Madrid | 0
Key-Value data model (schema-less, triples, 1NF)
Examples: Oracle NoSQL, BerkeleyDB, Voldemort, Riak, Redis
• Domain-independent schema
• Metadata mixed with data
• Elementary Key-Value
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
meta-metadata (column names), metadata (ATTRIBUTE column) and data (VALUE column):
ENTITY | ATTRIBUTE | VALUE
90317  | CustID    | C400
90317  | Name      | Darwen
90317  | City      | London
90317  | Account   | -124
59731  | CustID    | B512
59731  | Name      | Owens
59731  | City      | NY
59731  | Account   | 5509
66830  | CustID    | S144
66830  | Name      | Garcia
66830  | City      | Madrid
66830  | Account   | 0
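A minimal Python sketch of what "managed by application programs or middleware" means for the triples above. The function names rebuild_records and discovered_attributes are hypothetical, and storing every value as text is an assumption. The store only knows ENTITY, ATTRIBUTE and VALUE; the application has to regroup the triples into customers and can learn the attribute names only by scanning the data.

```python
from collections import defaultdict

# (ENTITY, ATTRIBUTE, VALUE) triples as on the slide; values kept as text (assumption).
TRIPLES = [
    ("90317", "CustID", "C400"), ("90317", "Name", "Darwen"),
    ("90317", "City", "London"), ("90317", "Account", "-124"),
    ("59731", "CustID", "B512"), ("59731", "Name", "Owens"),
    ("59731", "City", "NY"), ("59731", "Account", "5509"),
    ("66830", "CustID", "S144"), ("66830", "Name", "Garcia"),
    ("66830", "City", "Madrid"), ("66830", "Account", "0"),
]

def rebuild_records(triples):
    """Regroup the triples by entity: this is the application's job, not the store's."""
    records = defaultdict(dict)
    for entity, attribute, value in triples:
        records[entity][attribute] = value
    return dict(records)

def discovered_attributes(triples):
    """The only available 'schema' is the set of attribute names found in the data."""
    return sorted({attribute for _, attribute, _ in triples})

records = rebuild_records(TRIPLES)
print(discovered_attributes(TRIPLES))
print(records["59731"])
```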
Structured object data models (schema-less, NF2)
Examples: CouchDB, MongoDB (BSON), SimpleDB
• Domain-independent schema
• Metadata mixed with data
• Key-Value pairs aggregated into objects (here in JSON)
• The semantics is explicit in the data
• The semantics is managed/controlled by application programs or middleware
meta-metadata (column names), metadata (keys) mixed with data (values):
ENTITY | ATTRIBUTES
90317  | {"CustID": "C400", "Name": "Darwen", "City": "London", "Account": 124}
59731  | {"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}
66830  | {"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}
Tracing data semantics
In the real world, where is semantics expressed?
We have identified two places: DB schema and application code.
Are there other places?
Architectural framework
Places where semantics can be expressed:
• User interface (data structure, labels, help and error messages)
• Application code (data structures, procedural code)
• Class schema
• Object/Relational mapping
• DB logical schema (global schema, views)
• Data
• Documentation (text, structured, ontology)
Semantics in the documentation
Documentation (text, structured, ontology)
Functional documentation (should include the conceptual schema)
Technical documentation (should include the logical schema)
Drawback: the documentation is often
• obsolete,
• incomplete,
• inconsistent,
• missing
8. Semantics in the DB schema
DB logical schema
- global logical schema
- views
The logical schema is DBMS-dependent.
It is a more or less faithful implementation of the conceptual schema.
Some views can be more detailed than the logical schema.
Drawbacks
• not a conceptual schema
• additional constraints not always trivial to identify and to understand
10. Semantics in the class schema
Class schema, mapped to the DB logical schema
Bidirectional relation/object transformation.
Solving the impedance mismatch problem
The class schema is seen as the domain model.
It is implemented in a relational database, which ensures object persistence.
The DB schema itself is hidden and may bear little semantics.
Drawbacks
• inappropriate formalism
• poor change propagation mechanism (if any)
• semantics in the application and not in the DB
• data model not easily shared by several applications
11. Semantics in the application code
Application code
- data structures
- procedural code
Internal data structures may be more explicit than the DB schema.
Data integrity constraints are checked by the application code.
Data semantics can be understood from the way programs process the data.
However, program analysis is far from trivial:
• size (millions of LOC)
• architectural complexity
• algorithmic complexity
• data flow complexity
• creative data processing
Drawbacks
• redundancies (a constraint may be checked in many places)
• distributed traces (potential inconsistencies)
12. Semantics in the GUI
User interface
- data structure
- labels
- help, error messages
The UI often is a view on a part of the database.
This view is intended for users, so it is user-friendly.
It provides useful hints about the constraints and meaning of data:
• data structure (data types, aggregates)
• explicit labels
• sample data
• informative help and error messages
Drawbacks
• distributed control (potential inconsistencies)
• does not cover all the database objects
13. Semantics in the data (record-oriented models)
In standard models
Data analysis: finding relationships among data
• uniqueness
• data types
• inclusion properties (foreign keys)
• etc.
Main strategy
• validating hypotheses
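Hypothesis validation on data can be sketched in Python as follows. The ORDERS rows are invented for the illustration and the helper names is_unique and is_included are hypothetical. Keep in mind that data can only refute a hypothesis: a column that happens to be unique today is a candidate identifier, not a proven one.

```python
CUSTOMER = [
    {"CustID": "C400", "Name": "Darwen", "City": "London"},
    {"CustID": "B512", "Name": "Owens", "City": "NY"},
    {"CustID": "S144", "Name": "Garcia", "City": "Madrid"},
]
# Hypothetical order rows, invented for the illustration.
ORDERS = [
    {"OrdID": "O1", "CustID": "C400"},
    {"OrdID": "O2", "CustID": "C400"},
    {"OrdID": "O3", "CustID": "S144"},
]

def is_unique(rows, column):
    """Uniqueness test: no value of the column appears twice."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))

def is_included(child_rows, child_col, parent_rows, parent_col):
    """Inclusion test: every child value also appears in the parent column."""
    parent_values = {row[parent_col] for row in parent_rows}
    return all(row[child_col] in parent_values for row in child_rows)

hypotheses = {
    "CUSTOMER.CustID is an identifier": is_unique(CUSTOMER, "CustID"),
    "ORDERS.CustID is an identifier": is_unique(ORDERS, "CustID"),
    "ORDERS.CustID references CUSTOMER.CustID": is_included(
        ORDERS, "CustID", CUSTOMER, "CustID"
    ),
}
for statement, holds in hypotheses.items():
    print(statement, "->", "not refuted" if holds else "refuted")
```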
13. Semantics in the data (alternative models)
In alternative (schema-less) models
Metadata extraction
But also data analysis as in standard models
Experience
• none. Too new.
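As a rough idea of what metadata extraction could look like on JSON documents, here is a Python sketch. It is an assumption about one possible approach, not a method described in the lecture, and the function name extract_metadata is hypothetical. It collects the keys found in the documents together with the value types observed for each key.

```python
import json

# Documents copied from the structured object example of the lecture.
DOCUMENTS = [
    '{"CustID": "C400", "Name": "Darwen", "City": "London", "Account": 124}',
    '{"CustID": "B512", "Name": "Owens", "City": "NY", "Account": 5509}',
    '{"CustID": "S144", "Name": "Garcia", "City": "Madrid", "Account": 0}',
]

def extract_metadata(documents):
    """Map each key found in the documents to the set of observed value types."""
    schema = {}
    for text in documents:
        for key, value in json.loads(text).items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema

metadata = extract_metadata(DOCUMENTS)
print(metadata)
```

A key observed with several types, or present in only some documents, is exactly the kind of finding that then calls for the data analysis used in standard models.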
Recovering hidden data semantics: database reverse engineering
DB reverse engineering
Definition
Reverse engineering a piece of software consists, among other things, in recovering or reconstructing its functional and technical specifications, starting mainly from the source text of the programs. Recovering these specifications is generally intended to redocument, convert, refactor, maintain or extend existing applications.
Database reverse engineering is that part of Information System Engineering that addresses the problems and techniques related to the recovery of the conceptual and logical schemas of files and databases of existing systems.
DB reverse engineering
DB reverse engineering methodology
• Project planning
• Source management
• Physical extraction
• Logical extraction
• Conceptualization
• Pilot
• Full project
Resulting schemas: Logical (RDB) schema, Conceptual schema
DB reverse engineering
DB reverse engineering methodology
• Project planning
• Source management
• Physical extraction
• Logical extraction: schema analysis, data analysis, program analysis, class analysis, UI analysis, others
• Conceptualization: de-optimization, untranslation, normalization
• Pilot
• Full project
Is data semantics recovery that important, actually?
Yes
Definitely!
Can you prove it? At least I can show you an example
Example: database application migration
Porting a complete existing application, or some of its components, to another, generally more modern, platform.
For a database: changing its DMS. A popular example: migrating the legacy set of files of a business application to an RDBMS.
Two main approaches:
• physical approach
• semantic approach
Physical database migration
Database migration
The physical, or one-to-one, migration strategy is the cheapest but also the worst approach since it deeply degrades the final structure.
Requires no knowledge of data semantics
Very popular
COBOL code -> Physical extraction -> Physical (file) schema -> Transform -> Physical (DB2) schema -> Coding -> SQL-DDL code
Physical database migration
physical (one-to-one) migration

Source (COBOL):
SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
    02 CUST-ID PIC X(12).
    02 CUST-INFO PIC X(80).
    02 CUST-HIST PIC X(1000).

Source schema: CUSTOMER (CUST-ID: char (12), CUST-INFO: char (80), CUST-HIST: char (1000)), id: CUST-ID
Target schema: CUSTOMER (CUST_ID: char (12), CUST_INFO: char (80), CUST_HIST: char (1000)), id: CUST_ID

Target (SQL):
create table CUSTOMER(
    CUST_ID char(12) not null,
    CUST_INFO char(80) not null,
    CUST_HIST char(1000) not null,
    primary key (CUST_ID));

no added value
Semantic database migration
Database migration
Semantic approach: based on an in-depth understanding of the semantics of source data.
Provides a high quality result. Strong basis for the future.
Requires a complete, up-to-date knowledge of the DB
Reverse engineering:
IDMS-DDL code (or COBOL code) -> Physical extraction -> Physical (IDMS) schema -> Logical extraction -> Logical (DBTG) schema -> Conceptualization -> Conceptual schema
Forward engineering:
Conceptual schema -> Logical design -> Logical (RDB) schema -> Physical design -> Physical (DB2) schema -> Coding -> SQL-DDL code
Semantic database migration (1)
semantic migration (refinement)

Source (COBOL):
SELECT CUST-FILE ASSIGN TO "CUST.DAT"
    ORGANIZATION IS INDEXED
    RECORD KEY IS CUST-ID.
FD CUST-FILE.
01 CUSTOMER.
    02 CUST-ID PIC X(12).
    02 CUST-INFO PIC X(80).
    02 CUST-HIST PIC X(1000).

Refined schema:
CUSTOMER
    CUST-ID: char (12)
    CUST-INFO: compound (70)
        NAME: char (20)
        ADDRESS: char (40)
        STATUS: char (10)
    CUST-HIST-PURCH[0-100] array: compound (10)
        ITEM: num (5)
        TOTAL: num (5)
    id: CUST-ID
    id(CUST-HIST-PURCH): ITEM

After extracting the array into a record type:
CUSTOMER (0-100) record (1-1) CUST-HIST-PURCH
CUSTOMER
    CUST-ID: char (12)
    CUST-INFO: compound (70)
        NAME: char (20)
        ADDRESS: char (40)
        STATUS: char (10)
    id: CUST-ID
CUST-HIST-PURCH
    Index: index (4)
    ITEM: num (5)
    TOTAL: num (5)
    id: record.CUSTOMER, ITEM
    id': record.CUSTOMER, Index
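The refinement of CUST-HIST can be illustrated by a short Python sketch. The slot layout (100 occurrences of 10 characters, ITEM num (5) followed by TOTAL num (5)) comes from the refined schema; the encoding details are assumptions made for the example: digits are zero-padded, unused slots are blank, and Index is counted from 1. The function name decode_cust_hist is hypothetical.

```python
SLOT = 10        # one CUST-HIST-PURCH occurrence: ITEM num (5) + TOTAL num (5)
MAX_SLOTS = 100  # CUST-HIST is PIC X(1000)

def decode_cust_hist(cust_hist: str):
    """Split the opaque CUST-HIST field into purchase records."""
    purchases = []
    for index in range(MAX_SLOTS):
        slot = cust_hist[index * SLOT:(index + 1) * SLOT]
        if not slot.strip():  # assumption: unused slots are blank
            continue
        purchases.append(
            {"Index": index + 1, "ITEM": int(slot[:5]), "TOTAL": int(slot[5:])}
        )
    return purchases

# Hypothetical field content: two purchases, the rest of the 1000 characters blank.
sample = ("00042" "00003" + "00107" "00012").ljust(1000)
purchases = decode_cust_hist(sample)
print(purchases)
```

Until this layout is recovered from the programs that write the field, nothing in the file description reveals that CUST-HIST holds a list of purchases.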
Semantic database migration (2)
semantic migration (SQL translation)

Source schema:
CUSTOMER (0-100) record (1-1) CUST-HIST-PURCH
CUSTOMER (CUST-ID: char (12), CUST-INFO: compound (70) (NAME: char (20), ADDRESS: char (40), STATUS: char (10))), id: CUST-ID
CUST-HIST-PURCH (ITEM: num (5), Index: index (4), TOTAL: num (5)), id: record.CUSTOMER, ITEM; id': record.CUSTOMER, Index

Relational schema:
CUSTOMER (CUST_ID, CUS_NAME, CUS_ADDRESS, CUS_STATUS), id: CUST_ID
CUST_HIST_PURCH (CUST_ID, ITEM, CINDEX, TOTAL), id: CUST_ID, ITEM; id': CUST_ID, CINDEX; ref: CUST_ID
No more than 100 CUST_HIST_PURCH per CUSTOMER

create table CUSTOMER(
    CUST_ID char(12) not null,
    CUST_NAME char(28) not null,
    CUST_ADDRESS char(60) not null,
    CUST_STATUS char(2) not null,
    primary key (CUST_ID));

create table CUST_HIST_PURCH(
    CUST_ID char(12) not null,
    ITEM char(10) not null,
    CINDEX smallint not null check(CINDEX <= 100),
    TOTAL smallint not null,
    primary key (CUST_ID,ITEM),
    unique (CUST_ID,CINDEX),
    foreign key (CUST_ID) references CUSTOMER);

Normalized DB
Database migration - Synthesis

physical migration
create table CUSTOMER(
    CUST_ID char(12) not null,
    CUST_INFO char(80) not null,
    CUST_HIST char(1000) not null,
    primary key (CUST_ID));

semantic migration
create table CUSTOMER(
    CUST_ID char(12) not null,
    CUST_NAME char(28) not null,
    CUST_ADDRESS char(60) not null,
    CUST_STATUS char(2) not null,
    primary key (CUST_ID));

create table CUST_HIST_PURCH(
    CUST_ID char(12) not null,
    ITEM char(10) not null,
    CINDEX smallint not null check(CINDEX <= 100),
    TOTAL smallint not null,
    primary key (CUST_ID,ITEM),
    unique (CUST_ID,CINDEX),
    foreign key (CUST_ID) references CUSTOMER);
Evolution
new application: compute total sales per item

After physical migration:
CUSTOMER (CUST-ID: char (12), CUST-INFO: char (80), CUST-HIST: char (1000)), id: CUST-ID
• where is the required information?
• how to extract it from the CUSTOMER table?
• who will develop the (C, Java, VB) program?
• … and when?

After semantic migration:
CUSTOMER (CUST_ID, CUS_NAME, CUS_ADDRESS, CUS_STATUS), id: CUST_ID
CUST_HIST_PURCH (CUST_ID, ITEM, CINDEX, TOTAL), id: CUST_ID, ITEM; id': CUST_ID, CINDEX; ref: CUST_ID

select ITEM, sum(TOTAL)
from CUST_HIST_PURCH
group by ITEM;

• where: clearly visible + documentation if needed
• how: just name the columns
• who: by any non expert
• when: immediately, 2 minutes
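To make the contrast concrete, the query on the semantically migrated schema can be run as follows. This is a sketch assuming SQLite through Python's sqlite3 module; the inserted rows are invented for the illustration, and an "order by" is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table CUSTOMER(
        CUST_ID      char(12) not null,
        CUST_NAME    char(28) not null,
        CUST_ADDRESS char(60) not null,
        CUST_STATUS  char(2)  not null,
        primary key (CUST_ID));
    create table CUST_HIST_PURCH(
        CUST_ID char(12) not null,
        ITEM    char(10) not null,
        CINDEX  smallint not null check (CINDEX <= 100),
        TOTAL   smallint not null,
        primary key (CUST_ID, ITEM),
        unique (CUST_ID, CINDEX),
        foreign key (CUST_ID) references CUSTOMER);
    -- Hypothetical sample rows.
    insert into CUSTOMER values ('C400', 'Darwen', 'London', 'A');
    insert into CUSTOMER values ('B512', 'Owens', 'NY', 'A');
    insert into CUST_HIST_PURCH values ('C400', '00042', 1, 3);
    insert into CUST_HIST_PURCH values ('C400', '00107', 2, 12);
    insert into CUST_HIST_PURCH values ('B512', '00042', 1, 5);
    """
)

totals = conn.execute(
    "select ITEM, sum(TOTAL) from CUST_HIST_PURCH group by ITEM order by ITEM"
).fetchall()
print(totals)
```

With the physically migrated table, the same result would first require a program that knows how CUST_HIST is laid out.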
Summary and conclusions
Some mundane observations
• Theories (e.g., text books) teach that the conceptual schema must be the unique expression of data semantics. In an ideal world, the conceptual schema exists, and all the other artefacts (DB schemas, UML diagrams, views, class schema, programs, UI) derive from it and each capture a part of this semantics.
• However, the real world doesn't learn from theories. Most often, the conceptual schema does not exist so that only the other artefacts bear traces of the data semantics.
• Identifying, extracting, understanding and merging these traces to rebuild the conceptual schema are the very goals of database reverse engineering.
Cultural aspects of data semantics expression
1. Small personal application
Mainly non-professional developers. Intuitive, bottom-up, incremental development. Weak culture in DB.
Data semantics: in the UI, in application code
2. Database (record-oriented) data-intensive processing
Professional developers. Disciplined, top-down development. Strong culture in DB.
Data semantics: in the DB schema (including additional constraints).
3. OO data-intensive processing
Professional developers. OO minded. Disciplined, top-down development. Weak culture in DB.
Data semantics: in the class schema (through O/RM middleware).
4. Big data
(Semi-)Professional developers. Low complexity applications. RDB discarded as old-style (however NewSQL DBMS are lurking!)
Data semantics: simple, loose (few constraints); metadata in data
Evolution of data semantics expression
1950 - 1975: file-oriented processing
Semantics in record schema and application code
1968 - 1990: hierarchical/network database processing
Semantics in DB schema
1980 - ?: relational database processing
Semantics in DB schema
1990 - 2000: object-oriented DB processing
Semantics in DB schema and application code (methods)
2000 - ?: object-relational DB processing
Semantics in DB schema
2000 - ?: O/RM processing
Semantics in class schema
2011 - ?: NewSQL DB processing
Semantics in DB schema
2005 - ?: NoSQL DB processing
Semantics in data and in application code
[Figure labels: Quality of DS representation; prog; DB]
Some conclusions
Quite often, developers see the database as a mere repository for the data used and created by programs:
• "the database offers persistence services for the business logic layer"
• "the database is an implementation of the program classes"
So, the database is directly dependent on the current state of program architecture.
This view entails many problems when long term maintenance and evolution are concerned. When the program changes, the database schema often must be modified accordingly, even if its semantics does not change.
This delights researchers in system evolution but leaves practitioners less enthusiastic.
The view of the database as a model of the application domain ensures a great stability of business systems.
Is the database culture still living among today's developers?
Thanks
Abstract of the lecture
The role of databases may sometimes appear controversial since they are mere basic services for a significant part of the software engineering community (the transparent "persistence layer") while they are the central component of business applications for the database community. In this lecture, we examine the evolution of the database/program balance both in time (from the early sixties to a foreseeable future) and in space (technologies, communities) from the data semantics point of view. In particular, we analyze and compare how and where data semantics has been located and implemented in each of these contexts. Current development practices tend to migrate semantics from the database (as was usual in the eighties and nineties) to the application logic (e.g., O/RM, NoSQL DB managers), a trend that may be seen as a regression reminiscent of the infancy of business application development, when files were dedicated to one application. Finally, the lecture describes how data semantics can be recovered in these scenarios.