Post on 30-Dec-2015
Behshid Behkamal
Data Quality Definition
Quality of Linked Data
Data Quality Dimensions
Data Quality Model
Some definitions
data quality: degree to which the characteristics of data satisfy stated and implied needs when used under specified conditions
data quality characteristic: category of data quality attributes that bears on data quality
data quality measure: variable to which a value is assigned as the result of measurement of a data quality characteristic
Classification of Data Quality Problems
[Figure: data quality problems are divided into single-source and multi-source problems; each branch is divided into schema-related and instance-specific problems, which can occur at the attribute, record, record type, and source levels.]
Data Quality Dimensions – 2003 [2]
Task independent: reflect states of the data without the contextual knowledge of the application, and can be applied to any data set, regardless of the tasks at hand.
Task dependent: developed in specific application contexts; they include the organization's business rules, company and government regulations, and constraints provided by the database administrator.
Dimensions of Data Quality – 2005 [3]
Process: Dimensions of DQ related to the generation, assembly, description and maintenance of data
- Reliability (with several sub-dimensions), Metadata, Security and Confidentiality.
Data: Dimensions of DQ specifically associated with the data themselves.
- Record/table level: Accuracy, Completeness, Consistency and Validity.
- Database level: Identifiability and Joinability.
User: Dimensions of DQ related to use and users
- Accessibility, Interpretability, Relevance and Timeliness.
Dimensions of Data Quality – 2006 [4]
Depth of Data Quality
• Accuracy
• Completeness
• Validity
• Currentness
Width of Data Quality
• Consistency
• Integration
Dimensions of Data Quality – 2008 [5]
User-based: Consistent representation, Interpretability, Ease of understanding, Concise representation, Timeliness, Completeness, Value-added, Relevance, Appropriateness, Meaningfulness, Lack of confusion, Arrangement, Readable, Reasonable
System: Data deficiency, Design deficiencies, Operation deficiencies
Inherent IQ: Accuracy, Cost, Objectivity, Believability, Reputation, Accessibility, Correctness, Unambiguous, Consistency
Intuitive: Precision, Reliability, Freedom from bias
ISO/IEC 25012 Data Quality Model – 2008 [6]
The ISO/IEC 25012 data quality model groups quality attributes into fifteen characteristics, considered from two points of view:
– Inherent data quality refers to the data itself, in particular to:
- data domain values and possible restrictions
- relationships of data values
- metadata
– System dependent data quality depends on the technological domain in which data are used:
- computer systems' components such as hardware devices (precision)
- computer system software (recoverability)
- other software (portability)
Inherent data quality
From the inherent point of view, data quality refers to data itself, in particular to:
data domain values and possible restrictions (e.g. business rules governing the quality required for the characteristic in a given application);
relationships of data values (e.g. consistency);
metadata.
System dependent data quality
System dependent data quality refers to the degree to which data quality is reached and preserved within a computer system when data is used under specified conditions.
From this point of view data quality depends on the technological domain in which data are used; it is achieved by the capabilities of:
computer systems' components such as hardware devices (e.g. to make data available or to obtain the required precision);
computer system software (e.g. backup software to achieve recoverability);
other software (e.g. migration tools to achieve portability).
1. Accuracy
The degree to which data has attributes that correctly represent the true value of the intended attribute of a concept or event in a specific context of use.
– Syntactic accuracy
– Semantic accuracy
Measurement Function A/B
– A: records in which all attributes are accurate; B: total records in a dataset
– A: number of records with the specified field syntactically accurate; B: number of records
– A: attribute values that are accurate; B: records × attributes
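The three A/B variants above can be sketched in Python as follows. This is only an illustration, not part of ISO/IEC 25012: the sample records, the trusted `reference` copy, the allowed `domain` of city names, and all function names are assumptions made for the example.

```python
from typing import Dict, List, Set

Record = Dict[str, str]

def record_accuracy(records: List[Record], reference: List[Record]) -> float:
    """A/B: A = records whose attributes all match the reference, B = total records."""
    accurate = sum(1 for rec, ref in zip(records, reference) if rec == ref)
    return accurate / len(records)

def syntactic_accuracy(records: List[Record], field: str, domain: Set[str]) -> float:
    """A/B: A = records whose `field` value belongs to the allowed domain, B = total records."""
    accurate = sum(1 for rec in records if rec.get(field) in domain)
    return accurate / len(records)

def cell_accuracy(records: List[Record], reference: List[Record]) -> float:
    """A/B: A = attribute values that match the reference, B = records x attributes."""
    total = sum(len(ref) for ref in reference)
    accurate = sum(
        1
        for rec, ref in zip(records, reference)
        for key in ref
        if rec.get(key) == ref[key]
    )
    return accurate / total

# Hypothetical sample data: the second record misspells the city.
records = [{"name": "Ana", "city": "Madrid"}, {"name": "Bob", "city": "Pariss"}]
reference = [{"name": "Ana", "city": "Madrid"}, {"name": "Bob", "city": "Paris"}]

print(record_accuracy(records, reference))                       # 0.5
print(syntactic_accuracy(records, "city", {"Madrid", "Paris"}))  # 0.5
print(cell_accuracy(records, reference))                         # 0.75
```

The record-level ratio penalizes a record for a single wrong value, while the cell-level ratio gives partial credit, so the two can differ noticeably on wide tables.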
2. Completeness
The degree to which subject data associated with an entity has values for all expected attributes and related entity instances in a specific context of use.
Measurement Function A/B
– A: records with no missing attribute; B: total records in a dataset
– A: number of data required for the particular context in the data file; B: number of data in the specified particular context of intended use
– A: attribute fields containing values; B: records × attributes
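A minimal sketch of the record-level and field-level completeness ratios listed above. The sample data, the rule that `None` and the empty string count as missing, and the function names are assumptions for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

def is_missing(value: Optional[str]) -> bool:
    """Assumption: None and the empty string both count as a missing value."""
    return value is None or value == ""

def record_completeness(records: List[Record], attributes: List[str]) -> float:
    """A/B: A = records with no missing attribute, B = total records."""
    complete = sum(
        1 for rec in records
        if not any(is_missing(rec.get(attr)) for attr in attributes)
    )
    return complete / len(records)

def field_completeness(records: List[Record], attributes: List[str]) -> float:
    """A/B: A = attribute fields containing values, B = records x attributes."""
    filled = sum(
        1 for rec in records for attr in attributes
        if not is_missing(rec.get(attr))
    )
    return filled / (len(records) * len(attributes))

# Hypothetical sample data: two of the four records lack an e-mail address.
attributes = ["name", "email"]
records = [
    {"name": "Ana", "email": "ana@example.org"},
    {"name": "Bob", "email": ""},
    {"name": "Eve"},
    {"name": "Dan", "email": "dan@example.org"},
]

print(record_completeness(records, attributes))  # 0.5
print(field_completeness(records, attributes))   # 0.75
```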
3. Consistency
The degree to which data has attributes that are free from contradiction and are coherent with other data in a specific context of use.
A particular case of inconsistency is represented by synonyms: a dictionary of terms used to define data could be useful to avoid it.
EXAMPLE: An employee's birth date cannot be later than his “recruitment date”.
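The birth date rule in the example can be checked mechanically. The sketch below assumes hypothetical employee records with `id`, `birth_date` and `recruitment_date` fields; it is one way to express such a rule, not a prescribed method.

```python
from datetime import date
from typing import Dict, List

def inconsistent_employees(employees: List[Dict]) -> List[int]:
    """Return the ids of employees whose birth date is later than their recruitment date."""
    return [
        emp["id"]
        for emp in employees
        if emp["birth_date"] > emp["recruitment_date"]
    ]

# Hypothetical sample data: employee 2 violates the rule.
employees = [
    {"id": 1, "birth_date": date(1980, 5, 1), "recruitment_date": date(2005, 9, 1)},
    {"id": 2, "birth_date": date(1990, 1, 1), "recruitment_date": date(1985, 3, 15)},
]

print(inconsistent_employees(employees))  # [2]
```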
4. Credibility (validity)
Validity is a weakened but more readily measured form of accuracy.
Attribute values may be valid without being correct, but not vice versa.
An attribute value is valid if it falls in some externally defined, domain-knowledge-dependent set of values.
Validity can range from:
– Mechanical (Example: 18/19/2002 is not a well-formed date and therefore not a valid one)
– Logical (Example: -5 is not a valid age)
– Domain-derived (Example: 1234 pounds is not a valid weight for a person)
– Task dependent (Example: 16:12 may be a valid time in one database but not in another)
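The first three kinds of validity check can be sketched as small predicates. The date format `%d/%m/%Y` and the 1000-pound upper bound are assumptions chosen for this example; a real system would take them from its own domain rules.

```python
from datetime import datetime

def is_wellformed_date(text: str, fmt: str = "%d/%m/%Y") -> bool:
    """Mechanical check: does the text parse as a calendar date in the assumed format?"""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False

def is_valid_age(age: int) -> bool:
    """Logical check: an age cannot be negative."""
    return age >= 0

def is_plausible_weight(pounds: float, max_pounds: float = 1000.0) -> bool:
    """Domain-derived check; the upper bound is an illustrative assumption."""
    return 0 < pounds <= max_pounds

print(is_wellformed_date("18/19/2002"))  # False: there is no month 19 (or 18)
print(is_valid_age(-5))                  # False
print(is_plausible_weight(1234))         # False
```

A value that passes all three checks is valid but may still be incorrect, which is why validity is only a weakened form of accuracy.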
5. Currentness
The degree to which data has attributes that are of the right age in a specific context of use.
EXAMPLE: The timetable of a railway station must be updated with the frequency required to allow passengers to take a train even if the scheduled time or platform changes.
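One simple way to test "the right age" is to compare the time of the last update against a maximum age allowed by the context. The five-minute limit, the timestamps, and the names below are assumptions for illustration only.

```python
from datetime import datetime, timedelta

def is_current(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """True if the data is no older than the age allowed in this context."""
    return now - last_updated <= max_age

# Hypothetical timetable entries with the time of their last update.
now = datetime(2015, 12, 30, 12, 0)
max_age = timedelta(minutes=5)
timetable_updates = {
    "platform_1": datetime(2015, 12, 30, 11, 58),
    "platform_2": datetime(2015, 12, 30, 9, 0),
}

stale = [
    name for name, updated in timetable_updates.items()
    if not is_current(updated, now, max_age)
]
print(stale)  # ['platform_2']
```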
6. Accessibility
The degree to which data can be accessed in a specific context of use, particularly by people who need supporting technology or special configuration because of some disability.
EXAMPLE: Data that should be managed by a screen reader cannot be stored as an image.
Inherent Data Quality Measure for Sound data accessibility
• Measurement Function A/B
A = number of data stored only as “sound” (e.g. without a textual representation of sound)
B = number of data values representing a sound
System Dependent Data Quality Measure for Multi-channel data accessibility
• Measurement Function A/B
A = number of data that the differently abled user successfully accesses
B = number of data available
7. Compliance
The degree to which data has attributes that adhere to standards, conventions or regulations in force and similar rules relating to data quality in a specific context of use.
EXAMPLE: Credit risk data of a bank must comply with specific laws and standards.
Inherent Data Quality Measure for Privacy law non-conformity: values
• Measurement Function A
A = number of items that do not conform to privacy law statements due to data content
System Dependent Data Quality Measure for Privacy law non-conformity: architecture
• Measurement Function A
A = number of items that do not conform to privacy law statements due to technical architecture failures
8. Confidentiality
The degree to which data has attributes that ensure that it is only accessible and interpretable by authorized users in a specific context of use.
Confidentiality is an aspect of information security (together with availability and integrity) as defined in ISO/IEC 13335-1:2004.
EXAMPLE: Data that refers to personal or confidential information like health or profit must be accessed only by authorized users or should be written in secret code.
Inherent Data Quality Measure for Encryption usage
• Measurement Function A/B
A = number of database fields encrypted
B = number of fields with an encryption requisite
System Dependent Data Quality Measure for Non-vulnerability
• Measurement Function 1 - A/B
A = number of successful penetrations during formal penetration tests
B = number of penetrations attempted
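The two confidentiality measures are plain ratios and can be sketched directly. The field names and counts below are hypothetical, and only fields that actually require encryption are counted in A.

```python
from typing import Set

def encryption_usage(encrypted_fields: Set[str], fields_requiring_encryption: Set[str]) -> float:
    """A/B: A = required fields that are encrypted, B = fields with an encryption requisite."""
    covered = encrypted_fields & fields_requiring_encryption
    return len(covered) / len(fields_requiring_encryption)

def non_vulnerability(successful_penetrations: int, attempted_penetrations: int) -> float:
    """1 - A/B: A = successful penetrations, B = penetrations attempted."""
    return 1 - successful_penetrations / attempted_penetrations

# Hypothetical schema: two fields must be encrypted, one of them is.
print(encryption_usage({"ssn", "nickname"}, {"ssn", "salary"}))  # 0.5
# Hypothetical penetration test: 1 success out of 4 attempts.
print(non_vulnerability(1, 4))                                   # 0.75
```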
9. Efficiency
The degree to which data has attributes that can be processed and provide the expected levels of performance by using the appropriate amounts and types of resources in a specific context of use.
EXAMPLE: Using more space than necessary to store data can cause waste of storage, memory and time.
Inherent Data Quality Measure for Numbers stored as strings
• Measurement Function A
A = number of data stored as strings
System Dependent Data Quality Measure for Wasted space
• Measurement Function Σ(B - A)
A = benchmarked average space for efficient data storage of a database
B = used space for data in any physical files of the database
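A rough sketch of both efficiency measures. Treating any string that `float()` accepts as a "number stored as a string" is a simplifying assumption, and the file sizes are invented for the example.

```python
from typing import Iterable, List, Tuple

def is_numeric_string(value: object) -> bool:
    """True if the value is a string that Python can parse as a number (rough test)."""
    if not isinstance(value, str):
        return False
    try:
        float(value)
        return True
    except ValueError:
        return False

def numbers_stored_as_strings(values: Iterable[object]) -> int:
    """A = number of data stored as strings although they hold numbers."""
    return sum(1 for value in values if is_numeric_string(value))

def wasted_space(files: List[Tuple[int, int]]) -> int:
    """Sum of (B - A) over files: A = benchmarked space, B = used space."""
    return sum(used - benchmark for benchmark, used in files)

print(numbers_stored_as_strings(["42", 7, "3.14", "abc"]))  # 2
# Hypothetical (benchmark, used) sizes in megabytes for two physical files.
print(wasted_space([(100, 150), (200, 220)]))               # 70
```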
10. Precision
The degree to which data has attributes that are exact or that provide discrimination in a specific context of use.
Look for rounding errors. Example: a precision of 5 decimal places allows different functionality than a precision of 2 decimal places.
Precision in latitude and longitude declarations: values must contain seconds in the Degree/Minute/Second system.
Inherent Data Quality Measure for Precision of data values
• Measurement Function A/B
A = number of data values with the requested precision
B = total number of data values
System Dependent Data Quality Measure for Precision of fields of a database
• Measurement Function A/B
A = number of data fields of the database defined with the requested precision
B = total number of data fields of the database
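The inherent precision measure can be sketched by counting decimal places. The example assumes values arrive as decimal text (so trailing zeros are preserved) and are finite numbers; the coordinates are made-up sample values.

```python
from decimal import Decimal
from typing import List

def decimal_places(text: str) -> int:
    """Number of digits after the decimal point in a finite decimal literal."""
    exponent = Decimal(text).as_tuple().exponent
    return max(0, -exponent)

def precision_ratio(values: List[str], required_places: int) -> float:
    """A/B: A = values with at least the requested precision, B = total values."""
    precise = sum(1 for value in values if decimal_places(value) >= required_places)
    return precise / len(values)

# Hypothetical coordinates: two values carry 5 decimal places, two only 1.
values = ["40.41678", "-3.70379", "40.4", "-3.7"]
print(precision_ratio(values, 5))  # 0.5
```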
11. Traceability
The degree to which data has attributes that provide an audit trail of access to the data and of any changes made to the data in a specific context of use.
EXAMPLE: Public administrations must keep information about the access executed by users for investigating who read/wrote confidential data.
Inherent Data Quality Measure for Traceability of values
• Measurement Function A/B
A = number of data for which required traceability of values is available
B = number of data items for which traceability is tested
System Dependent Data Quality Measure for Automatic traceability
• Measurement Function A
A = number of data items traced automatically (using system capabilities)
12. Understandability
The degree to which data has attributes that enable it to be read and interpreted by users, and are expressed in appropriate languages, symbols and units in a specific context of use.
Some information about data understandability is provided by metadata.
EXAMPLE: To represent a State (within a country), the standard acronym is more understandable than a numeric code.
Inherent Data Quality Measure for Master data understandability due to existing metadata
• Measurement Function A/B
A = number of data of master data files with existing metadata
B = number of data of master data files
System Dependent Data Quality Measure for Master data understandability due to linked metadata
• Measurement Function A/B
A = number of fields having metadata automatically linked to related data
B = total number of fields
13. Availability
The degree to which data has attributes that enable it to be retrieved by authorized users and/or applications in a specific context of use.
A particular case of availability is concurrent access (both to read and to update data) by more than one user and/or application.
Another case of availability is the capability of data to be available for a specific period of time.
System Dependent Data Quality Measure for Data items availability
• Measurement Function A/B
A = number of data items available during backup/restore activities
B = number of data items of backup/restore procedures
14. Portability
The degree to which data has attributes that enable it to be installed, replaced or moved from one system to another preserving the existing quality in a specific context of use.
System Dependent Data Quality Measure for Data portability
• Measurement Function A/B
A = number of data that preserved the existing quality attribute after the migration to a different computer system
B = number of data migrated
15. Recoverability
The degree to which data has attributes that enable it to maintain and preserve a specified level of operations and quality, even in the event of failure, in a specific context of use.
Recoverability can be provided by features like commit/synch point, rollback (fault-tolerance capability) or by backup-recovery mechanisms.
EXAMPLE: When a media device has a failure, data stored in that device should be recoverable.
System Dependent Data Quality Measure for Recoverability
• Measurement Function A/B
A = number of data items successfully backed up/restored during backup/restore operation
B = number of data items of backup/restore procedures
Credibility (or validity)
[3] Measurement Function A/B
A: records for which all entries are valid
B: total records in a dataset
[5] Measurement Function A/B
A = number of data certified by internal audit after obtaining credit risk information data
B = number of data used to obtain credit risk information
[6] Measurement Function A/B
A: attribute values that are valid
B: records × attributes
[7] Look for artificial keys, identity values and system-generated keys, and apply at least one business key to a data grouping (say, in a data mart) or to a row occurrence for a registry-type data group (an inventory list such as a list of persons or a list of vehicles).
Understandability
[5] Measurement Function A/B
A = number of data of master data files with existing metadata
B = number of data of master data files
[7] Look for lack of referential integrity where the same attributes are used in various tables
Look for loss of history data with no record of previous values
Understandability according to Ref#2
Look for consistency of business types that an organization is licensed for and related types of returns or transactional consistencies
Look for lack of referential integrity where the same attributes are used in various tables
For uniquely traceable items such as serial numbers or particular licensed item identifiers, check whether the same item can be involved with another item at the same time.
This applies to ownership, involvement, and lineage.
Look for loss of history data with no record of previous values
Linked Data
The goal of the Semantic Web, or Web of Data: processing data directly or indirectly by machines
Linked Data provides the means to reach the goal
Refers to data published on the Web in such a way that:
– It is machine-readable
– Its meaning is explicitly defined
– It is linked to other datasets
– It can be linked to/from external datasets
Quality Characteristics of Linked Data
According to the definition of Linked Data:
– Compliance
- HTTP URIs to identify resources
- HTTP protocol to retrieve resources
– Understandability
- It is machine-readable
- Its meaning is explicitly defined
– Portability
- RDF data model to represent resources (any application that understands the model can consume any data source published based on the model)
- It can be linked to/from other datasets
Classification of Quality Characteristics in Linked Data
Inherent data quality
– Accuracy
– Validity
– Precision
Context related
– Completeness
– Currentness
System dependent
– Accessibility
– Traceability
– Recoverability
– Availability
– Efficiency
– Confidentiality (privacy protection and licensing in Linked Data)
Consistency
– One of the biggest challenges in Linked Data is data fusion
Data Fusion
Process of integrating multiple data items representing the same real-world object into a single, consistent, and clean representation.
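As a minimal sketch of this process, the function below fuses several descriptions of the same object by keeping, for each attribute, the most frequent non-missing value. Majority voting is just one possible conflict-resolution strategy chosen for illustration, and the sample records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

def fuse(records: List[Record]) -> Dict[str, str]:
    """Merge records describing one real-world object into a single representation."""
    fused: Dict[str, str] = {}
    keys = {key for record in records for key in record}
    for key in sorted(keys):
        # Collect every non-missing value reported for this attribute.
        values = [record[key] for record in records if record.get(key) is not None]
        if values:
            # Keep the value reported most often (first seen wins a tie).
            fused[key] = Counter(values).most_common(1)[0][0]
    return fused

# Hypothetical descriptions of the same country from three sources.
sources = [
    {"name": "Spain", "capital": "Madrid", "currency": None},
    {"name": "Spain", "capital": "Madrid", "currency": "EUR"},
    {"name": "Kingdom of Spain", "capital": "Madrid"},
]

print(fuse(sources))
# {'capital': 'Madrid', 'currency': 'EUR', 'name': 'Spain'}
```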
Co-reference
A single URI identifies more than one resource
– Example: a number of people in DBLP with the same name are incorrectly identified as being the same person.
Multiple URIs identify the same resource
– Different datasets use their own URIs to identify the same resource. People and places are entities which suffer from URI multiplicity.
– Example: Spain has at least four URIs:
1. http://dbpedia.org/resource/Spain
2. http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain
3. http://sws.geonames.org/2510769
4. http://www4.wiwiss.fu-berlin.de/eurostat/resource/countries/Espa%C3%B1a
Author Disambiguation [7]
1. Single author having multiple identities (variation in the spelling)
– ‘Hugh Glaser’
– ‘H. Glaser’
– ‘Glaser, H.’
2. Many authors who share the same name
Author Disambiguation …
– Solutions: citation matching, name matching, name equivalence identification
– All of them involve some form of string matching and word sense disambiguation.
– They help in identifying names with different spellings or written in different formats.
– Disambiguating authors with exactly the same name remains a challenge.
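A very small name-matching sketch, applied to the three spellings of the same author listed earlier. Reducing a name to (surname, first initial) is an assumption made for illustration and is not the method of [7]; it also shows the remaining challenge, since two different authors with the same surname and initial would collapse to the same key.

```python
from typing import Tuple

def name_key(name: str) -> Tuple[str, str]:
    """Reduce an author name to (surname, first initial), lowercased."""
    name = name.strip()
    if "," in name:
        # "Surname, Given" format
        surname, given = (part.strip() for part in name.split(",", 1))
    else:
        # "Given Surname" format: the last token is taken as the surname
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return (surname.lower(), given[:1].lower())

variants = ["Hugh Glaser", "H. Glaser", "Glaser, H."]
keys = {name_key(variant) for variant in variants}
print(keys)  # {('glaser', 'h')}
```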
Consistent Reference Services [8]
The CRS introduces the concept of a bundle to group together resources that have been deemed to refer to the same concept within a given context.
Different bundles may be used to group together URIs of the same resource in different contexts.
For example, there may be a bundle containing all of the URIs about a person in the context of institution 1; and another bundle containing all of the URIs about the same person in the context of institution 2.
Each CRS can use different algorithms to identify equivalent resources.
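The bundle idea can be illustrated with a toy in-memory structure: one service object per context, each holding sets of URIs deemed equivalent. The class and method names are hypothetical and do not reflect the actual CRS interface described in [8].

```python
from typing import List, Set

class BundleStore:
    """Toy stand-in for one CRS: groups URIs believed to denote the same resource."""

    def __init__(self) -> None:
        self._bundles: List[Set[str]] = []

    def add_equivalence(self, uri_a: str, uri_b: str) -> None:
        """Record that two URIs are equivalent, merging any bundles they touch."""
        merged = {uri_a, uri_b}
        untouched = []
        for bundle in self._bundles:
            if bundle & merged:
                merged |= bundle
            else:
                untouched.append(bundle)
        untouched.append(merged)
        self._bundles = untouched

    def bundle_of(self, uri: str) -> Set[str]:
        """Return every URI bundled with `uri` (just the URI itself if unknown)."""
        for bundle in self._bundles:
            if uri in bundle:
                return set(bundle)
        return {uri}

# One store per context; here, a single context for country URIs.
countries = BundleStore()
countries.add_equivalence("http://dbpedia.org/resource/Spain",
                          "http://sws.geonames.org/2510769")
countries.add_equivalence("http://sws.geonames.org/2510769",
                          "http://www4.wiwiss.fu-berlin.de/factbook/resource/Spain")
print(len(countries.bundle_of("http://dbpedia.org/resource/Spain")))  # 3
```

Keeping a separate store per context mirrors the point that the same person or place may be bundled differently by different institutions.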
An Entity Name System for Linking Semantic Web Data [9]
An Entity Name System (ENS) might play for the Semantic Web the role that the DNS played for interlinking hypertexts on the Web.
Interlinking Distributed Social Graphs [10]
1. Export social data contained within data silos (Facebook, Twitter, MySpace) into the same semantic form.
2. Link person instances from separate social networks referring to the same real-world person.
3. Publish a decentralized linked social graph.
1. Markus Helfert, Managing and Measuring Data Quality in Data Warehousing, Institute of Information Management, University of St. Gallen, 2001.
2. Leo L. Pipino, Yang W. Lee, and Richard Y. Wang, Data Quality Assessment, 2003.
3. Alan F. Karr and Ashish P. Sanil, Data Quality: A Statistical Perspective, 2005.
4. Kyung-Seok Ryu, Joo-Seok Park, and Jae-Hong Park, A Data Quality Management Maturity Model, ETRI Journal, Vol. 28, No. 2, 2006, 191-204.
5. Ying Su, Zhanming Jin, A Methodology for Information Quality Assessment in Data Warehousing, ICC 2008 proceedings, IEEE Communications Society.
6. ISO/IEC 25012 - Data Quality Model, Final Draft: 2008-11-04
7. Afraz Jaffri, Hugh Glaser, Ian C. Millard, URI Disambiguation in the Context of Linked Data, LDOW2008, China.
8. Hugh Glaser, Afraz Jaffri, Ian C. Millard, Managing Co-reference on the Semantic Web, LDOW2009, Spain.
9. Paolo Bouquet, Heiko Stoermer, Daniele Cordioli, An Entity Name System for Linking Semantic Web Data, LDOW2008, China.
10. Matthew Rowe, Interlinking Distributed Social Graphs, LDOW2009, Spain.