Partitioning – A Uniform Model for Data Mining
Anne Denton, Qin Ding, William Jockheck, Qiang Ding, and William Perrizo
Motivation
  Databases and data warehouses are currently separate systems
  Why?
    Standard answer: Details, details, details …
    Our answer: Fundamental issue of representation
Relations Revisited
  R(A1, A2, …, AN)
  Set of tuples
  Any choices at a fundamental level? Yes!
  Duality between
    Element-based representation
    Space-based representation
Duality
  Element-based representation:
    Standard representation of tuples with all their attributes
  Space-based representation:
    The existence (count?) of a tuple is represented in its attribute space
Similar Dualities in Physics
  Particle: particles can be represented by their position
  Field (more fundamental level): particles can be 1 values in a grid of locations
Space-Based Representation
  Consider standard tuples as vectors in the space of attribute domains
  Represent each possible attribute combination as one bit:
    1 if data item is present
    0 if it isn’t
  Allowing counts could be useful for projections (?)
Space-Based Representation as a Partition
  Partitions are mutually exclusive and collectively exhaustive sets of elements
  The space-based representation partitions attribute space into two sets:
    Data item present in database (1)
    Data item not present (0)
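The two-set partition above can be sketched in a few lines. This is a minimal illustration only: the relation, its attribute domains, and the names `domains`, `tuples`, and `space` are hypothetical and not taken from the slides.

```python
from itertools import product

# Hypothetical toy relation R(A1, A2) with small integer domains.
domains = [range(3), range(4)]        # A1 in {0,1,2}, A2 in {0,1,2,3}
tuples = {(0, 1), (1, 1), (2, 3)}     # element-based representation

# Space-based representation: one bit per point of the attribute space.
space = {point: int(point in tuples) for point in product(*domains)}

# The bits split the attribute space into two mutually exclusive,
# collectively exhaustive sets.
present = {p for p, bit in space.items() if bit == 1}
absent = {p for p, bit in space.items() if bit == 0}

print(len(space), len(present), len(absent))  # 12 3 9
```

Note that the space-based form stores 12 bits for 3 tuples, which is why the later slides turn to compression.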
Usefulness of Space-Based Representation
  No indexes needed: instant value-based access
  Index locking becomes dimensional locking
  Aggregation very easy due to value-based ordering
  Selections become “and”s
  What experience do we have with space-based representations?
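As a rough sketch of how selections could become “and”s, the attribute space can be flattened into a bit vector and each selection condition expressed as a mask. The 4x4 space, the row-major layout, and the helper names (`point_bit`, `mask_a1`, `mask_a2`, `decode`) are assumptions made for this illustration.

```python
SIZE = 4  # hypothetical domain size of both A1 and A2

def point_bit(a1, a2):
    """Bit position of point (a1, a2) in a row-major flattened space."""
    return 1 << (a1 * SIZE + a2)

# Space-based representation of a toy relation, held in one Python int.
data = 0
for t in [(0, 1), (2, 1), (2, 3), (3, 0)]:
    data |= point_bit(*t)

def mask_a1(value):
    """All points of the space with A1 == value."""
    return sum(point_bit(value, a2) for a2 in range(SIZE))

def mask_a2(value):
    """All points of the space with A2 == value."""
    return sum(point_bit(a1, value) for a1 in range(SIZE))

def decode(bits):
    """Turn a bit vector back into a list of (a1, a2) points."""
    return [(i // SIZE, i % SIZE) for i in range(SIZE * SIZE) if (bits >> i) & 1]

sel = data & mask_a1(2)                  # selection A1 = 2
both = data & mask_a1(2) & mask_a2(1)    # selection A1 = 2 and A2 = 1
count = bin(sel).count("1")              # aggregation: count of matches

print(decode(sel), decode(both), count)  # [(2, 1), (2, 3)] [(2, 1)] 2
```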
Data Cube Representation
  One value (e.g., sales) given in the space of the key attributes
  Space-based with respect to key attributes
  Element-based with respect to non-key attributes
Properties of the Domain Space
  Ideally space should have distance, norm, etc.
    Especially important for data mining
  Does that make sense for all domains?
  Can any domain be mapped to integers?
Can All Domains Be Mapped to Integers?
  Simplistic answer: yes!
    All information in a computer is saved as bits
    Any sequence of bits can be interpreted as an integer
  Problems
    Order may be irrelevant, e.g., hair color
    Order may be wrong, e.g., sign bit for int
    Even if order is correct, spacing may vary, e.g., float (solution in paper: intervalization)
    Domains may be very large, e.g., movies
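Two of the problems listed above can be checked directly with Python's standard `struct` module. The helper name `bits_as_unsigned` is hypothetical; the sketch assumes 32-bit two's complement integers and IEEE 754 single-precision floats.

```python
import struct

def bits_as_unsigned(fmt, value):
    """Reinterpret the raw big-endian bytes of `value` as an unsigned integer."""
    raw = struct.pack(">" + fmt, value)
    return int.from_bytes(raw, "big")

# Wrong order: -1 < 1, but the sign bit makes the bit pattern of -1 larger.
neg = bits_as_unsigned("i", -1)
pos = bits_as_unsigned("i", 1)
print(neg > pos)  # True

# Varying spacing: equal steps in value give unequal steps in bit pattern.
f1, f2, f3 = (bits_as_unsigned("f", v) for v in (1.0, 2.0, 3.0))
print(f2 - f1, f3 - f2)  # 8388608 4194304
```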
Categorical Attributes (Irrelevant Order)
  We need more than one attribute for an appropriate representation
  Data mining solution: 1 attribute per domain value
  Our solution: 1 attribute per bit slice
    Values are corners of a hypercube in log(domain size) dimensions
    Distances are given through the MAX metric
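A small sketch of the bit-slice encoding for a categorical attribute follows. The hair-color domain and the function names are hypothetical, and the logarithm is assumed to be base 2. Under the MAX metric every pair of distinct values ends up at distance 1, so no artificial order is imposed.

```python
from itertools import combinations
from math import ceil, log2

colors = ["black", "brown", "blond", "red"]     # hypothetical domain
n_bits = max(1, ceil(log2(len(colors))))        # log2(domain size) dimensions

def encode(value):
    """Map a categorical value to a corner of the n_bits-dimensional hypercube."""
    code = colors.index(value)
    return tuple((code >> (n_bits - 1 - i)) & 1 for i in range(n_bits))

def max_distance(u, v):
    """MAX (Chebyshev) metric between two hypercube corners."""
    return max(abs(a - b) for a, b in zip(u, v))

for a, b in combinations(colors, 2):
    print(a, b, max_distance(encode(a), encode(b)))  # distance is 1 for every pair
```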
Fundamental Partition (Space-Based Representation)
  # of dimensions = number of attributes
  # of represented points = product of all domain sizes
    Exponential in number of dimensions!
  We badly need compression!
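To make the growth concrete, the number of represented points can be computed for an increasing number of attributes. The domain size of 256 (one byte per attribute) is an assumption chosen only for illustration.

```python
from math import prod

def space_size(domain_sizes):
    """Number of points in the attribute space: product of all domain sizes."""
    return prod(domain_sizes)

# One to four attributes, each with a hypothetical domain of 256 values.
sizes = [space_size([256] * n) for n in range(1, 5)]
print(sizes)  # [256, 65536, 16777216, 4294967296]
```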
How Do We Handle Size?
  Problem is exponential in # of attributes
  How can we reduce # of attributes?
  Review normalization:
    We can decompose a relation into a set of relations, each of which contains the entire key and one other attribute
    This decomposition is lossless and dependency preserving (BCNF relations only)
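The decomposition can be sketched as follows, assuming a toy relation with a single-attribute key. The attribute names and the helpers `decompose` and `rejoin` are hypothetical; rejoining on the key recovers the original rows, which illustrates the lossless property for this example.

```python
rows = [
    {"id": 1, "grade": 3, "credits": 4},
    {"id": 2, "grade": 1, "credits": 2},
    {"id": 3, "grade": 3, "credits": 2},
]
key = ("id",)

def decompose(rows, key):
    """Split into one relation per non-key attribute, each keeping the full key."""
    non_key = [a for a in rows[0] if a not in key]
    return {a: [{**{k: r[k] for k in key}, a: r[a]} for r in rows] for a in non_key}

def rejoin(parts, key):
    """Natural join of the decomposed relations on the key."""
    merged = {}
    for part in parts.values():
        for r in part:
            k = tuple(r[a] for a in key)
            merged.setdefault(k, {}).update(r)
    return list(merged.values())

parts = decompose(rows, key)
print(parts["grade"][0])           # {'id': 1, 'grade': 3}
print(rejoin(parts, key) == rows)  # True
```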
Compression for Non-Key Attributes
  Fundamental partition contains only one non-zero data point in any non-key dimension
  Represent number by bit slices
  Note: This works for numerical and categorical attributes
  Original values can be regained by “and”ing
    Example: 5 (binary 101) is bit 0 & bit 1’ & bit 2 (’ denotes the complement)
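The example for the value 5 can be reproduced with a short sketch. It assumes 3-bit values, takes bit 0 as the most significant bit (consistent with the example, though not stated explicitly), and uses hypothetical names and data.

```python
N_BITS = 3
values = [5, 2, 7, 5, 0]   # hypothetical non-key attribute, one value per key position

def bit_slices(values, n_bits):
    """slices[i][row] is bit i of values[row]; bit 0 is the most significant."""
    return [[(v >> (n_bits - 1 - i)) & 1 for v in values] for i in range(n_bits)]

def rows_with_value(slices, value):
    """AND each slice, or its complement, to find the rows holding `value`."""
    n_bits = len(slices)
    result = [1] * len(slices[0])
    for i, s in enumerate(slices):
        want = (value >> (n_bits - 1 - i)) & 1
        result = [r & (b if want else 1 - b) for r, b in zip(result, s)]
    return result

slices = bit_slices(values, N_BITS)
# 5 = binary 101: bit 0 & bit 1' & bit 2
print(rows_with_value(slices, 5))  # [1, 0, 0, 1, 0]
```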
Concept Hierarchies
  Bit-sliced representations have significant benefits beyond compression
  Bit slices can be combined into concept hierarchies:
    Highest level: bit 0
    Next level: bit 0 & bit 1
    Next level: bit 0 & bit 1 & bit 2
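One way to read the levels above is that each level groups values by a longer prefix of their bits. The sketch below assumes 3-bit values with bit 0 as the most significant bit; the function names and sample data are hypothetical.

```python
from collections import defaultdict

N_BITS = 3

def level_key(value, level):
    """Prefix made of bits 0 .. level-1 (bit 0 is the most significant)."""
    return value >> (N_BITS - level)

def rollup(values, level):
    """Group values by their concept at the given hierarchy level."""
    groups = defaultdict(list)
    for v in values:
        groups[level_key(v, level)].append(v)
    return dict(groups)

values = [5, 2, 7, 5, 0]
print(rollup(values, 1))  # {1: [5, 7, 5], 0: [2, 0]}
print(rollup(values, 2))  # {2: [5, 5], 1: [2], 3: [7], 0: [0]}
```

Level 1 splits the domain into two halves, level 2 into quarters, and level 3 distinguishes every value.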
Compression for Key Attributes
  Database state-independent compression could lead to information loss (counts > 1)
  Database state-dependent compression:
    Tree structure that eliminates pure subtrees => P-trees
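The idea of eliminating pure subtrees can be illustrated with a simplified quadrant tree over a two-dimensional bit grid. This is a sketch under stated assumptions, not the authors' actual P-tree layout: the node format (a dict with `count` and `children`), the quadrant order, and the sample grid are all hypothetical.

```python
def build(grid, r0, c0, size):
    """Build a quadrant tree; a pure quadrant (all 0s or all 1s) is not subdivided."""
    ones = sum(grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))
    if ones == 0 or ones == size * size:
        return {"count": ones, "children": None}   # pure subtree eliminated
    half = size // 2
    children = [build(grid, r0 + dr, c0 + dc, half)
                for dr in (0, half) for dc in (0, half)]
    return {"count": ones, "children": children}

def node_count(node):
    """Number of nodes actually stored in the tree."""
    if node["children"] is None:
        return 1
    return 1 + sum(node_count(c) for c in node["children"])

grid = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tree = build(grid, 0, 0, 4)
print(tree["count"], node_count(tree))  # 7 9
```

A full quadrant tree over the same 4x4 grid would hold 21 nodes; here three pure quadrants collapse to single nodes, leaving 9.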
Other Ideas
  Compression is better if attribute values are dense within their domain
  We could use extent domain
    Compression good
    Problems with insertion
      Reorganization of storage
      Index locking has to be reintroduced
  …
How Good Is Compression So Far?
  If all domains are “dense”, i.e., all values occur
    Size can easily be smaller than original relation
  If non-key attributes are “sparse”
    Not usually a problem: good compression
    Problems only in extreme cases
      E.g., movies as attribute values!
  If key attributes are “sparse”
    Larger potential for problems, but also large potential for benefit (see data cubes)
Are Key Attributes Usually Sparse?
  Many key attributes are dense (“structure” attributes as keys)
    Automatically generated IDs are usually sequential
    x and y in spatial data mining
    Time in data streams
  Keys in tables that represent relationships tend to be sparse (feature attributes as keys)
    Student / course offering / grade
    Data cubes!
What Have We Gained? (Database Aspects)
  Data simultaneously acts as index
    No separate index locking (unless extent domain is used)
  All information saved as bit patterns
    Easy “select”
  Other database operations discussed in class
What Have We Gained? (Feature Attribute Keys)
  Direct mining possible on relations with feature attribute keys
    E.g., student / course offering / grade
    Rollup can be defined, etc.
  Clustering, classification, ARM can make use of proximity inherent in representation
  Bit-wise representation provides concept hierarchy for non-key attributes
  Tree structure provides concept hierarchy for key attributes
What Have We Gained? (Structure Attribute Keys)
  For relations with structure attribute keys, mining requires “and”ing, which produces counts for feature attributes
  Bit-wise representation provides concept hierarchy for non-key attributes
  Duality: concept hierarchies in this representation map exactly to tree structure when the attribute is a key
Mapping Concept Hierarchies: Bit Slices <-> Tree
  P-tree: take key attributes, e.g., x and y, and bit-interleave them:
    x = 1 0 0 1
    y = 1 1 0 1
    interleaved = 1 1 0 1 0 0 1 1
  Each consecutive pair of these digits forms a level in the P-tree, or a level in a concept hierarchy
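The interleaving of the x and y bits shown above can be reproduced with a short sketch. The function names are hypothetical, and the x-before-y order within each pair follows the example on the slide.

```python
def interleave(x_bits, y_bits):
    """Alternate the bits of x and y, most significant first: x0 y0 x1 y1 ..."""
    out = []
    for xb, yb in zip(x_bits, y_bits):
        out += [xb, yb]
    return out

def levels(bits):
    """Each consecutive (x, y) bit pair selects one quadrant, i.e. one tree level."""
    return [tuple(bits[i:i + 2]) for i in range(0, len(bits), 2)]

x = [1, 0, 0, 1]
y = [1, 1, 0, 1]
z = interleave(x, y)
print(z)          # [1, 1, 0, 1, 0, 0, 1, 1]
print(levels(z))  # [(1, 1), (0, 1), (0, 0), (1, 1)]
```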
How Could We Use That Duality?
  Join with other relations and project off key attributes (Meta P-trees)
  Can we do that?
    We lose uniqueness
    We can use 1 to represent 1 or more tuples (equivalent to relational algebra)
    Or we can introduce counts
      Can be useful for data mining
      Need for non-duplicate-eliminating counts also exists in other applications
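The two options for projecting off key attributes can be contrasted in a few lines. The student / course / grade rows are hypothetical sample data; a set stands in for the duplicate-eliminating projection and `collections.Counter` for the count-preserving one.

```python
from collections import Counter

# Hypothetical relation: (student, course offering, grade)
rows = [("s1", "c1", "A"), ("s2", "c1", "B"), ("s3", "c2", "A")]

# Project off the key attributes, keeping only the grade.
grades = [grade for _, _, grade in rows]

presence = set(grades)    # 1 represents "1 or more tuples" (relational algebra)
counts = Counter(grades)  # non-duplicate-eliminating counts

print(sorted(presence))   # ['A', 'B']
print(counts["A"], counts["B"])  # 2 1
```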
How Do Hierarchies Benefit Us in Databases?
  Multi-granularity locking
  Subtrees form suitable units for storage in a block
  Fast access!
    Proportional to
      # of levels in tree
      # of bits for bit slices
Summary
  Space-based representation has many benefits
    Value-based access and storage
    No separate index needed
    Rollups easy
  P-trees
    Follow from systematic compression
    Benefits from concept hierarchies