
Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases


Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

A Dissertation
Submitted in partial fulfillment
for the award of the Degree of

    Master of Technology

In Department of Information Technology

    (With specialization in Information Communication)

Supervisor:
Mr. Santosh Kumar Singh, Associate Prof.

Submitted By:
Mehul Mahrishi, Enrollment No.: SGVU091543463

    Suresh Gyan Vihar University


Candidate's Declaration

I hereby declare that the work being presented in the dissertation entitled "Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases", in partial fulfillment for the award of the Degree of Master of Technology in the Department of Information Technology with Specialization in Information Communication, and submitted to the Department of Information Technology, Suresh Gyan Vihar University, is a record of my own investigations carried out under the guidance of Mr. S.K. Singh, Department of Information Technology.

    I have not submitted the matter presented in this project/seminar anywhere for

    the award of any other Degree.

(Name and Signature of Candidate):
Mehul Mahrishi, Information Communication (M. Tech IC)
Enrolment No.: SGVU091543463

Countersigned by:
Mr. Santosh Kumar Singh, Supervisor


DETAILS OF CANDIDATE, SUPERVISOR(S) AND EXAMINER

Name of Candidate: Mehul Mahrishi    Roll No.: 104511
Deptt. of Study: M. Tech. (Information Communication)
Enrolment No.: SGVU091543463
Thesis Title: Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

Supervisor(s) and Examiners Recommended (with office address including contact numbers, email ID):

Supervisor    Co-Supervisor
Internal Examiner: 1    2    3
Signature with Date
Programme Coordinator    Dean / Principal


This certifies that the thesis entitled

Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

is submitted by

Mehul Mahrishi (SGVU091543463)

IV semester, M.Tech (IC), in the year 2011, in partial fulfillment of the Degree of Master of Technology in Information Communication,

SURESH GYAN VIHAR UNIVERSITY, JAIPUR.

Signature of Supervisor

Date:

Place:


    Acknowledgement

Foremost, I would like to express my sincere gratitude to my advisor and mentor Mr. S.K. Singh for the continuous support of my study and research, and for his patience, motivation, enthusiasm, and knowledge. His guidance helped me throughout the research and writing of this thesis. Besides my advisor, I would like to thank the rest of my thesis committee, especially Mr. Vibhakar Pathak, for their encouragement, insightful comments, and hard questions.

My sincere thanks also go to Dr. S.L. Surana (Principal, SKIT), Dr. C.M. Choudhary (HOD CS, SKIT) and Dr. Anil Chaudhary (HOD IT, SKIT), for supporting my advanced studies, providing opportunities in their groups, and leading me to work on diverse, exciting projects. My special thanks to Mr. Mukesh Gupta (Reader, SKIT) for his invaluable advice, which helped me take this decision.

I thank my fellow mates Anita Shrotriya, Devendra Kr. Sharma, Vipin Jain, the Singh brothers, and Kamal Hiran for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had in the last two years.

Last but not least, I would like to thank my family members: my parents (Mukesh & Madhulika Mahrishi), uncle & aunt (Pushpanshu & Seema Mahrishi), brothers (Mridul & Harshit) and my grandmothers for their faith and for supporting me throughout my life.

    (Mehul Mahrishi)


    Contents

    List of Tables iv

    List of Figures v

    Notations vi

    Abstract vii

CHAPTER 1 Introduction 1-4
1.1 Introduction 1
1.2 Objective 1
1.3 Motivation 2
1.4 Research Contribution 3
1.5 Dissertation Outline 3

CHAPTER 2 Theories 5-23
2.1 Introduction 5
2.1.1 On-Line Transaction Processing 6
2.1.2 Query Intensive Applications 7
2.2 The Rise of Columnar Database 8
2.3 Definitions 10
2.4 Row Oriented Execution 12
2.4.1 Vertical Partitioning 12
2.4.2 Index-Only Plans 12
2.4.3 Materialized Views 13
2.5 Column Oriented Database 13
2.5.1 Compression 13
2.5.2 Late Materialization 14
2.5.3 Block Iteration 14
2.5.4 Invisible Joins 14
2.6 Query Execution in Row vs. Column Oriented Database 15
2.7 Compression 17
2.8 Conventional Compression 18
2.8.1 Domain Compression 19
2.8.2 Attribute Compression 20
2.9 Layout of Compressed Tuples 21

CHAPTER 3 Methodology 24-31
3.1 Introduction 24
3.2 Reasons for Data Compression 25
3.3 Compression Scheme 28
3.4 Query Execution 30
3.5 Decompression 30
3.6 Prerequisites 30

CHAPTER 4 Results & Discussions 32-44
4.1 Introduction 32
4.2 Anonymization 33
4.2.1 Problem Definition & Contribution 34
4.2.2 Quality Measure of Anonymization 36
4.2.3 Conclusion 36
4.3 Domain Compression through Binary Conversion 36
4.3.1 Encoding of Distinct Values 36
4.3.2 Paired Encoding 38
4.4 Add-ons on Compression 40
4.4.1 Functional Dependencies 40
4.4.2 Primary Keys 42
4.4.3 Few Distinct Values 42
4.5 Limitations 43
4.6 Conclusion 43

CHAPTER 5 Conclusion & Future Work 45-47
5.1 Conclusion 45
5.2 Future Work 46

APPENDIX I Infobright 48-62

References & Bibliography 63-67


    List of Tables

    TABLES TITLE PAGE

    2.1 A typical Row-oriented Database 6

    2.2 Table representing Column storing of data 10

3.1 Employee table with type and cardinality 28

3.2 Code Table Example 29

    3.3 Query execution 30

    4.1 Published Table 34

    4.2 View of published table by Global recording 35

    4.3 An instance of relation Student 37

    4.4 Representing Stage 1 of compression technique 38

    4.5 Representing Stage 1 with binary compression 38

    4.6 Representing Stage 2 compression 39

    4.7 Representing Stage 2 compression coupling 40

    4.8 Representing functional dependency based coupling 41

    4.9 Number of distinct values in each column 41

    4.10 Representing test case 1 42

    4.11 Representing test case 2 42


    List of Figures & Graphs

    FIGURE TITLE PAGE

    Figure 2.1 OLTP Access 6

    Figure 2.2 OLAP Access 7

    Figure 2.3 Column based data storage 11

    Figure 2.4 Layout of Compressed Tuple 23

    Graph I.1 Representing Load time comparison 61

    Graph I.2 Representing Table size comparison 61

    Graph I.3 Representing query execution comparison 61


    Notations

    DBMS : Database Management System

    RDBMS : Relational Database Management System

    OLTP : Online Transactional Processing

    SQL : Structured Query Language

    ICE : Infobright Community Edition

    IEE : Infobright Enterprise Edition

TB : Terabytes


    Abstract

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. This research work proposes the compression and abstraction of data using existing compression algorithms. Although the technique is general and simple, it is my strong belief that it is particularly advantageous for data warehousing. Through this study, we propose two algorithms. The first algorithm describes the concept of compression of domains at the attribute level, and we call it Attribute Domain Compression. This algorithm can be implemented on both row and columnar databases. The idea behind the algorithm is to reduce the size of large databases so as to store them optimally. The second algorithm is also applicable to both kinds of databases but will work optimally for columnar databases. The idea behind the algorithm is to generalize the tuple domains by assigning a value, say n, such that all other n-1 tuples, or at least as many as possible, can be identified.


    Chapter 1

    Introduction

1.1 Introduction

Large operational data and information is stored by different vendors and organizations in warehouses. Most of it is useful only when it is shared and analyzed with other related data. However, this kind of data often contains personal details which must be hidden from users with limited privileges. The data can only be released when individuals are unidentifiable.

Moreover, business intelligence and analytical application queries are generally based on the selection of particular attributes of a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation.

1.2 Objective

The main aim of the research is to propose a compression algorithm that is based on the concepts of attribute domain compression. The data is recorded globally so that the concept of data abstraction can be preserved.


We will use the concepts of two existing algorithms:

1. The first algorithm describes the concept of compression of domains at the attribute level, and we call it Attribute Domain Compression. This algorithm can be implemented on both row and columnar databases. The idea behind the algorithm is to reduce the size of large databases so as to store them optimally.

2. The second algorithm is also applicable to both kinds of databases but will work optimally for columnar databases. The idea behind the algorithm is to generalize the tuple domains by assigning a value, say n, such that all other n-1 tuples, or at least as many as possible, can be identified.

    1.3 Motivation

Data compression has been a very popular topic in the research literature, and there is a large amount of work on this subject. The most obvious reason to consider compression in a database context is to reduce the space required on disk. However, the motivation behind this research is whether the processing time of queries can be improved by reducing the amount of data that needs to be read from disk using a compression technique.

Recently, there has been a revival of interest in employing compression techniques to improve performance in a database, which also led me to choose this as my topic of study. Data compression currently exists in the main database engines, with different approaches adopted in each of them.


    1.4 Research Contribution

In order to evaluate the performance speedup obtained with compression, a subset of the queries was executed with the following configurations:

1. No compression

2. Proposed compression

3. Categories compression and descriptions compression

We then study the two major compression algorithms present in row-oriented databases, i.e., n-anonymization and domain encoding by binary compression.

Finally, the report studies two complex algorithms and embeds them to form a final optimal algorithm for domain compression. The report also presents examples performed practically on a column-oriented platform named Infobright.

    1.5 Dissertation Outline

This research work focuses on the development of a compression algorithm for columnar databases using the Infobright tool. We start in Chapter 2 by documenting the theories that are relevant for understanding columnar databases and how compression is implemented on databases by various existing techniques. In Chapter 3, we study a compression technique and implement it through query execution over a MySQL database. This work concludes Dissertation Part I. Chapter 4 discusses the framework to facilitate the development of the algorithm for columnar databases and introduces two concepts: global recording anonymization and binary encoded domain compression. We conclude this chapter by developing a compression algorithm by


combining these two concepts. After successful implementation of the compression algorithm, it is tested and the output is displayed graphically. Finally, Chapter 5 illustrates familiarity with the Infobright tool. Some basic queries and their execution are learned on an existing columnar database. Infobright is not just a database but contains an inbuilt platform for compression algorithms that can be implemented on a DB.


    Chapter 2

    Theories

    2.1. Introduction

Most information systems available today are implemented using commercially available database management system (DBMS) products. A DBMS is software which manages the data stored in an information system, provides privacy and privileges to users, facilitates concurrent access by multiple users, and provides recovery from system failures without loss of system integrity. The relational database is the most commonly used DBMS; it organizes the data into different relations.

Each relational database is a collection of inter-related data organized in a matrix with rows and columns. Each column represents an attribute of the particular entity that is converted into the database table, while each row of the matrix, generally called a tuple, represents one set of values for those attributes. Each row in a table represents a set of related data, and every row in the table has the same structure.

For example, in a table that represents employees, each row would represent a single employee. Columns might represent things like the employee's name, street address, SSN, etc. In a table that represents the relationship of employees with departments, each row would relate one employee to one department.

    Table 2.1 A Typical Row oriented Database

    Column 1 Column 2 Column 3

    Row 1 Row1 & Column 1 Row1 & Column 2 Row1 & Column 3

    Row 2 Row2 & Column 1 Row2 & Column 2 Row2 & Column 3

2.1.1 On-Line Transactional Processing

The popularity of RDBMS is mainly due to their support of on-line transactional processing (OLTP). Typical OLTP systems include student management systems, bank databases, etc. The queries include, for example, inserting a new record for a subject that is assigned to a student. These applications involve little or no analysis of data and serve the use of an information system for data preservation and querying. An OLTP query is of short duration and requires minimal database resources. [3]

Figure 2.1 represents an OLTP process in which two queries, insert and lookup, are executed on a student table.


    Figure 2.1 OLTP Access

2.1.2 Query Intensive Applications

In the mid-1990s, a new era of data management arose which was query-specific and involved large, complex data volumes. Examples of such query-specific DBMS applications are OLAP and data mining.

OLAP

This tool summarizes data from large data volumes and presents query results using 2-D or 3-D graphics. An OLAP query is of the form "Give the % comparison between the marks of all students in B. Tech and in M. Tech." The answer to such a query would generally be in the form of a graph or chart. Such 3-D and 2-D visualizations of data are called Data Cubes.

Figure 2.2 represents the access pattern of OLAP, which requires only a few attributes to be processed but access to a huge volume of data. It must be noted that the number of queries executed per second in OLAP is much lower than in OLTP.


    Figure 2.2 OLAP Access

Data Mining

Data mining is now a more demanding application of databases. It is also known as "repeated OLAP". The objective of data mining is to locate subgroups, which requires mean values or statistical analysis of the data to get a result. A typical example of a data mining query is "Find the dangerous drivers in a car insurance customer database." It is left to the data mining tool to determine what the characteristics of that dangerous customer group are [3]. This is done typically by combining statistical analysis and automated search techniques similar to those used in artificial intelligence.

2.2. The rise of Columnar Database

The roots of column-store DBMSs can be traced back to the 1970s, when transposed files were first studied, followed by investigations of vertical partitioning as a form of table attribute clustering. By the mid-1980s, the advantages of a fully decomposed storage model (DSM, a predecessor of column stores) over NSM (traditional row-based storage) were documented. [4]


The relational databases present today are designed predominantly to handle online transactional processing (OLTP) applications. A transaction (e.g., an online purchase of a laptop through an internet dealer) typically maps to one or more rows in a relational database, and all traditional RDBMS designs are based on a per-row paradigm. For transaction-based systems, this architecture is well suited to handle the input of incoming data.

Data warehouses are used in almost every large organization, and research states that their size doubles every three years. Moreover, the hourly workload of these warehouses is huge: approximately 20 lakh (2 million) SQL statements are encountered hourly. [7]

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. However, for applications that are very read-intensive and selective in the information being requested, the OLTP database design isn't a model that typically holds up well. [6] Business intelligence and analytical application queries often analyze selected attributes in a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation.

The column-oriented database, generally known as a columnar database, reinvents how data is stored in databases. Storing data in such a fashion increases the probability of storing adjacent records on disk and hence the odds of compression. This architecture suggests a model in which inserting and deleting transactional data are done by a row-based system, but selective queries that are only interested in a few columns of a table are handled by the columnar approach.


Different methodologies such as indexing, materialized views, horizontal partitioning, etc. are provided by row-oriented databases and offer better ways of query execution, but they also have disadvantages of their own. For example, in business intelligence/analytic environments, the ad-hoc nature of the workload makes it nearly impossible to predict which columns will need indexing, so tables end up either over-indexed (which causes load and maintenance issues) or not properly indexed, and so many queries end up running much slower than desired.

2.3. Definitions

"A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row." (Wikipedia [23])

It must always be remembered that the columnar database is only an approach to how data is stored in memory; it doesn't define any architectural implementation of the database. Rather, it follows the traditional database architecture.

    Table 2.2 Table representing Column storing of data

    SNO SNAME SSN CITY

    S1 MEHUL 200 JAIPUR

    S2 VIPIN 201 HINDON

    S3 DEVENDRA 300 KEKRI


S4 ANITA 302 AJMER

S5 PALWIN 202 GANGANAGAR

The data would be stored on disk or in memory something like:

S1S2S3S4S5MEHULVIPINDEVENDRAANITAPALWIN200201300302202JAIPURHINDONKEKRIAJMERGANGANAGAR

This is in contrast to the traditional row-based approach, in which the data is stored more like this:

S1MEHUL200JAIPURS2VIPIN201HINDONS3DEVENDRA300KEKRIS4ANITA302AJMERS5PALWIN202GANGANAGAR

The above example also shows that a columnar database can be highly compressed; moreover, it is self-indexing, and hence aggregate functions such as MIN, MAX, AVG, and COUNT can be performed efficiently.

    Figure 2.3 Column based data storage


Clearly, the goal of a columnar database is to perform write and read operations to and from hard disk storage efficiently, in order to speed up the time it takes to return a query. In the above example, all the column 1 values are physically together, followed by all the column 2 values, etc. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record [1]. This allows individual data elements, such as customer name for instance, to be accessed in columns as a group, rather than individually row-by-row.

    2.4. Row Oriented Execution

    In this section, we discuss several different techniques that can be used to implement

    a column-database design in a commercial row-oriented DBMS.

2.4.1 Vertical Partitioning

The most straightforward way to emulate a column-store approach in a row-store is to fully vertically partition each relation. This approach creates one physical table for each column in the logical schema, where the i-th table has two columns: one with values from column i of the logical schema and one with the corresponding value in the position column. Queries are then rewritten to perform joins on the position attribute when fetching multiple columns from the same relation.
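The idea can be sketched in a few lines of Python (the relation and column names below are illustrative assumptions, not part of our experimental schema):

# Sketch: emulating a column store in a row store via full vertical partitioning.
rows = [("S1", "MEHUL", "JAIPUR"),
        ("S2", "VIPIN", "HINDON"),
        ("S3", "DEVENDRA", "KEKRI")]

# One physical (position, value) table per logical column.
sno = [(pos, r[0]) for pos, r in enumerate(rows)]
sname = [(pos, r[1]) for pos, r in enumerate(rows)]
city = [(pos, r[2]) for pos, r in enumerate(rows)]

# Fetching two columns of the same relation becomes a join on position.
name_by_pos = dict(sname)
result = [(name_by_pos[pos], value) for pos, value in city]
print(result)  # [('MEHUL', 'JAIPUR'), ('VIPIN', 'HINDON'), ('DEVENDRA', 'KEKRI')]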

2.4.2 Index-only plans

The vertical partitioning approach has two problems. Firstly, it requires the position attribute to be stored in every column, which wastes space and disk bandwidth; secondly, most row-stores store a relatively large header on every tuple, which further wastes space. [7] To remove these problems, we use another approach called index-only plans. In this approach the base relations are stored using a standard row-oriented design, but an additional unclustered B+Tree index is added on every column of every table.

2.4.3 Materialized Views

The third approach we consider uses materialized views. In this approach, we create an optimal set of materialized views for every query flight in the workload, where the optimal view for a given flight has only the columns needed to answer queries in that flight. We do not pre-join columns from different tables in these views.

    2.5 Column Oriented Execution

    In this section, we review three common optimizations used to improve performance

    in column-oriented database systems.

    2.5.1 Compression

    Compressing data using column-oriented compression algorithms and keeping data in

    this compressed format as it is operated upon has been shown to improve query

    performance by up to an order of magnitude. Storing data in columns allows all of the

    names to be stored together, all of the phone numbers together, etc. Certainly phone

    numbers are more similar to each other than surrounding text fields like e-mail

    addresses or names. Further, if the data is sorted by one of the columns, that column

    will be super-compressible.


    2.5.2 Late Materialization

    In a column-store, information about a logical entity (e.g., a person) is stored in

    multiple locations on disk (e.g. name, e-mail address, phone number, etc. are all

stored in separate columns), whereas in a row store such information is usually co-located in a single row of a table. [7]

    At some point in most query plans, data from multiple columns must be combined

    together into rows of information about an entity. Consequently, this join-like

    materialization of tuples (also called tuple construction) is an extremely common

    operation in a column store.
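A minimal sketch of this tuple construction step, using made-up column contents and an assumed set of surviving positions:

# Sketch: late materialization, stitching separate columns back into tuples.
name = ["MEHUL", "VIPIN", "DEVENDRA"]
phone = ["200", "201", "300"]
city = ["JAIPUR", "HINDON", "KEKRI"]

# Positions that survived earlier column-wise predicate evaluation (assumed).
positions = [0, 2]

# Tuples are constructed only now, at the point where whole rows are needed.
tuples = [(name[p], phone[p], city[p]) for p in positions]
print(tuples)  # [('MEHUL', '200', 'JAIPUR'), ('DEVENDRA', '300', 'KEKRI')]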

    2.5.3 Block Iteration

    In order to process a series of tuples, row-stores first iterate through each tuple, and

    then need to extract the needed attributes from these tuples through a tuple

    representation interface.

    In contrast to row-stores, in all column-stores, blocks of values from the same column

    are sent to an operator in a single function call. Further, no attribute extraction is

    needed, and if the column is fixed-width, these values can be iterated through directly

    as an array. Operating on data as an array not only minimizes per-tuple overhead, but

    it also exploits potential for parallelism on modern CPUs, as loop-pipelining

    techniques can be used. [2-5]

2.5.4 Invisible joins

Queries over data warehouses, particularly those modeled with star schemas, often have the following structure:


1. Restrict the set of tuples in the fact table using selection predicates on one (or many) dimension tables.

2. Then, perform some aggregation on the restricted fact table, often grouping by other dimension table attributes.

    Thus, joins between the fact table and dimension tables need to be performed for each

    selection predicate and for each aggregate grouping.

    As an alternative to these query plans, we introduce a technique we call the invisible

    join that can be used in column-oriented databases for foreign-key/primary-key joins.

    It works by rewriting joins into predicates on the foreign key columns in the fact

    table. These predicates can be evaluated either by using a hash lookup (in which case

    a hash join is simulated), or by using more advanced methods which are beyond the

    scope of our study. [1]
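The hash-lookup variant can be sketched as follows (the fact and dimension data are invented for illustration):

# Sketch: an invisible join evaluated as a predicate on a foreign-key column.
# Dimension table: customer key -> city; selection predicate: city = 'JAIPUR'.
dim_city = {1: "JAIPUR", 2: "HINDON", 3: "JAIPUR", 4: "KEKRI"}

# Hash structure: the dimension keys satisfying the selection predicate.
matching_keys = {k for k, city in dim_city.items() if city == "JAIPUR"}

# Fact table foreign-key column; the join is rewritten as a membership test.
fact_customer_fk = [1, 2, 1, 3, 4, 3, 1]
qualifying = [pos for pos, fk in enumerate(fact_customer_fk) if fk in matching_keys]
print(qualifying)  # positions [0, 2, 3, 5, 6] pass on to the aggregation step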

2.6. Query execution in Row vs. Column oriented database

When talking about the performance of databases, query execution is the most important factor, and it can individually determine the performance of the database, whether it is row-based or column-based. We illustrate the concept with a simple example.

Suppose there are 1000 rows in a database table and the following executor loop is run over it:

Until no more {
    Get a row out of the buffer manager
    Evaluate the row
    Pass it onward if it satisfies the predicate
}

Notice that the inner loop of the executor is called 1000 times for our query above, once per row. Since the overhead of the inner loop largely determines performance, a row-store executor will take CPU time proportional to the number of rows required to evaluate the query.

In contrast, in a column-store executor the inner loop is:

Until no more {
    Pick up a column
    Evaluate the column
    Pass on a row range
}

    Notice that the inner loop is called once per column, not once per row. Also, notice

    that the algorithm complexity of processing a row is about the same as processing a

    column. [17]

Hence, the column store will consume vastly fewer CPU resources, because its inner loop is executed once per column, and there are far fewer columns than rows involved in evaluating a typical query.
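The contrast can be made concrete with a short Python sketch (illustrative data; the row loop body runs once per row, while the column variant makes one pass over a single column):

# Sketch: row-at-a-time vs. column-at-a-time predicate evaluation.
rows = [(i, i % 7) for i in range(1000)]  # 1000 illustrative (id, value) rows
col = [value for _, value in rows]        # the same data stored as one column

# Row store: the inner loop (and its per-tuple overhead) executes 1000 times.
row_result = [r for r in rows if r[1] == 3]

# Column store: a single column-wide operation yields the qualifying positions.
col_result = [pos for pos, value in enumerate(col) if value == 3]
assert [r[0] for r in row_result] == col_result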


2.7. Compression

Data compression in databases has always been a very popular and interesting topic for database researchers, and there is a lot of work in this area. The most obvious reason for compression in any context is to reduce the space required on disk, and so it is in databases. However, another important goal is to improve the processing time of queries by reducing the amount of data that needs to be read from disk.

Long after the evolution of databases, there has been a revival of interest in compression to improve the quality and performance of databases. Data compression currently exists in the main database engines, with different approaches adopted in each of them. It is generally accepted that, due to the greater similarity and redundancy of data within columns, column stores provide superior compression, and therefore require less storage hardware and perform faster because, among other things, they read less data from the disk [17]. Moreover, the compression ratio is higher in a columnar database because the entries within a column are similar to each other.

Both Huffman encoding and arithmetic encoding are based on the statistical distribution of the frequencies of symbols appearing in the data. Huffman coding assigns a shorter compression code to a frequent symbol and a longer compression code to an infrequent symbol. For example, if there are four symbols a, b, c, and d, with probabilities 13/16, 1/16, 1/16, and 1/16, then 2 bits are needed to represent each symbol without compression.


A possible Huffman coding is the following: a = 0, b = 10, c = 110, d = 111.

As a result, the average length of a compressed symbol equals 1 × 13/16 + 2 × 1/16 + 3 × 1/16 + 3 × 1/16 = 21/16 ≈ 1.3 bits.

    Arithmetic encoding is similar to Huffman encoding except that it assigns an interval

    to the whole input string based on the statistical distribution. [7]
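The arithmetic above can be verified with a short sketch using exactly the code table given in the text:

# Sketch: average compressed-symbol length for the Huffman code above.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
prob = {"a": 13/16, "b": 1/16, "c": 1/16, "d": 1/16}
avg_bits = sum(prob[s] * len(code[s]) for s in code)
print(avg_bits)  # 1.3125, i.e. roughly 1.3 bits instead of 2 bits per symbol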

2.8. Conventional Compression

Database compression techniques are applied to gain performance by decreasing the size of a database and increasing its input/output and query performance. The basic concept behind compression is that it compacts the storage and keeps the data adjacent, and therefore reduces the size and the number of transfers. This section demonstrates two different classes of compression in databases:

a. Domain Compression

b. Attribute Compression

The classes are equally implementable in column- or row-based database approaches. Queries that are executed on compressed data are seen to be more efficient than queries executed over a decompressed database [8]. In the sections below, we discuss each of the above classes in detail.


2.8.1 Domain Compression

Under this class we discuss three compression techniques: numeric compression in the presence of NULL values, string compression, and dictionary-based compression. Since all three techniques are applicable to domain compression, we confine ourselves to the compression of the domains of attributes.

Numeric Compression in the presence of NULL values

This compression technique is used to compress attributes of numeric type, such as integers, that contain some NULL values in their domain. The basic idea is that consecutive zeros or blanks of a tuple in the table are removed, and a description of how many there were and where they existed is given at the end [13]. To eliminate the difference in attribute size caused by NULL values, it is sometimes recommended to encode the data bit-wise, i.e., an integer of 4 bytes is replaced by 4 bits.

    For example:

    Bit value for 1= 0001

    Bit value for 2= 0011

    Bit value for 3= 0111

    Bit value for 4= 1111

    And all 0s for the value 0
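A small sketch of this bit-wise idea (treating NULL as an all-zero code is an assumption made for illustration; a real scheme would also record where the NULLs occurred, as described above):

# Sketch: replacing 4-byte integers in the range 0-4 by 4-bit codes.
CODES = {0: "0000", 1: "0001", 2: "0011", 3: "0111", 4: "1111", None: "0000"}

def encode(values):
    # Every value, including NULL, occupies exactly 4 bits, so NULLs no
    # longer cause tuples of the same table to differ in size.
    return "".join(CODES[v] for v in values)

print(encode([1, None, 3, 0, 4]))  # 0001 0000 0111 0000 1111 (shown spaced)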

String Compression

Strings in a database are represented by the CHAR data type, and their compression has already been proposed and implemented in SQL through the VARCHAR data type. An extension of conventional string compression is provided by this technique: the suggestion is that after converting the CHAR type to VARCHAR, it is further compressed in a second stage by any given compression algorithm, such as Huffman coding, the LZW algorithm, etc. [24]

Dictionary Encoding

This type of encoding technique uses a special type of data structure called a dictionary. It is very effective in circumstances where the attribute takes a limited set of values that repeat many times [14]. The dictionary encoding algorithm first calculates the number of bits, X, needed to encode a single attribute of the column (which can be calculated directly from the number of unique values of the attribute). It then calculates how many of these X-bit encoded values can fit in 1, 2, 3, or 4 bytes. For example, if an attribute has 32 values, it can be encoded in 5 bits, so 1 of these values can fit in 1 byte, 3 in 2 bytes, 4 in 3 bytes, or 6 in 4 bytes.
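The sizing calculation can be sketched as follows (the function names are ours):

import math

# Sketch: the dictionary-encoding bit-width calculation described above.
def bits_needed(num_distinct):
    # X: bits needed to encode one attribute value of the column.
    return max(1, math.ceil(math.log2(num_distinct)))

def values_per_bytes(x_bits):
    # How many X-bit encoded values fit in 1, 2, 3 or 4 bytes.
    return {nbytes: (nbytes * 8) // x_bits for nbytes in (1, 2, 3, 4)}

x = bits_needed(32)            # 32 distinct values -> 5 bits
print(x, values_per_bytes(x))  # 5 {1: 1, 2: 3, 3: 4, 4: 6}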

2.8.2 Attribute Compression

All these compression techniques are designed especially for data warehouses, where a huge amount of data is stored, usually composed of a large number of textual attributes with low cardinality. In this section, however, we demonstrate techniques which can also be used in conventional databases such as MySQL, SQL Server, etc. [5]

The main objective of this technique is to allow encoding to reduce the space occupied by dimension tables with a large number of rows, reducing the total space occupied and leading to consequent gains in performance.

Under this class we discuss two compression techniques: compression of categories and compression of comments.


Compression of Categories

Categories are textual attributes with low cardinality. Examples of category attributes are: city, country, type of product, etc.

Category coding is done through the following steps:

1. The data in the attribute is analysed and a frequency histogram is built.

2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed.

3. The codes table and necessary metadata are written to the database.

4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

    2.9 Layout of Compressed Tuples

    Figure 2.4 shows the overall layout of a compressed tuple [7]. The figure shows that a

    tuple can be composed of up to five parts:

    1. The first part of a tuple keeps the (compressed) values of all fields that are

    compressed using dictionary-based compression or any other fixed length

    compression technique. [5-7]

    2. The second part keeps the encoded length information of all fields compressed

    using a variable-length compression technique such as the numerical

    compression techniques described above.

    3. The third part contains the values of (uncompressed) fields of fixed length;

    e.g., integers, doubles, CHARs, but not VARCHARs or CHARs that were

    turned into VARCHARs as a result of compression.


    4. The fourth part contains the compressed values of fields that were compressed

    using a variable-length compression technique; for example, compressed

    integers, doubles, or dates. The fourth part would also contain the compressed

    value of the size of a VARCHAR field if this value was chosen to be

    compressed. (If the size information of a VARCHAR field is not compressed,

    then it is stored in the third part of a tuple as a fixed-length, uncompressed

    integer value.)

    5. The fifth part of a tuple, finally, contains the string values (compressed or not

    compressed) of VARCHAR fields.

While all this sounds quite complicated, the separation into five different parts is very natural. First of all, it makes sense to separate fixed-sized and variable-sized parts of tuples, and this separation is standard in most database systems today. The first three parts of a tuple are fixed-sized, which means that they have the same size for every tuple of a table. As a result, compression information and/or the value of a field can be retrieved directly from these parts without further address calculations [24]. In particular, uncompressed integer, double, date, etc. fields can be accessed directly regardless of whether other fields are compressed or not [5]. Furthermore, it makes sense to pack all the length codes of compressed fields together, because we will exploit this bundling in our fast decoding algorithm, as we will see soon.


    Figure 2.4 Layout of Compressed Tuple

Finally, we separate small variable-length (compressed) fields from potentially large variable-length string fields, because the length information of small fields can be encoded in less than a byte, whereas the length information of large fields is encoded in a two-step process. Obviously, not every tuple of the database consists of all five parts [5]. For example, tuples that have no compressed fields consist only of the third and, maybe, the fifth part. Furthermore, keep in mind that all tuples of the same table have the same layout and consist of the same number of parts, because all the tuples of a table are compressed using the same techniques.


    Chapter 3

    Methodology

3.1 Introduction

Applying the compression techniques discussed in Chapter 2, queries are executed on a platform in which query rewriting and data decompression are done when necessary. The rewriting overhead is very small, and the approach produces much better results when compared with uncompressed queries on the same platform. This chapter demonstrates the different compression methods that are applied to the tables and then compares the results graphically as well as in tabular form.

It must be noted that only queries with a WHERE clause need to be rewritten, because plain selection and projection operations don't require searching for a particular tuple of a particular attribute.

Although the capacity of data storage has increased greatly, a similar improvement in disk access speed has not happened. On the other hand, the speed of RAM and of CPUs has improved. This technological trend led to the use of data


    compression, trading some execution overhead (to compress and decompress data) for

    the reduction of space occupied by data.

Compression techniques work both statically and dynamically, i.e., data may be compressed when it is stored on disk or handled in compressed form while queries are executed. In databases, and particularly in warehouses, the reduction in the size of the data obtained by compression normally gains speed, as the extra cost in execution time (to compress and decompress the data) is compensated by the reduction in the size of the data that has to be read from/stored on the disks. [1]

3.2 Reasons for Data Compression

Data compression in data warehouses is particularly interesting for two main reasons:

1) The quantity of data in a warehouse is huge, and hence compression is more suitable and preferred than in normal databases.

2) Data warehouses are used for querying only (i.e., only read accesses, as data warehouse updates are done offline), which means that compression overhead is not relevant.

Furthermore, if data is compressed using techniques that allow searching over the compressed data, then the gains in performance can be quite significant, as the decompression operation is done only when strictly necessary.

    In spite of the potential advantages of compression in databases, most of the

    commercial relational database management systems (DBMS) either do not have

    compression or just provide data compression at the physical layer (i.e., database

    blocks), which is not flexible enough to become a real advantage. Flexibility in

    database compression is essential, as the data that could be advantageously

    compressed is frequently mixed in the same table with data whose compression is not


    particularly helpful. Nonetheless, recent work on attribute-level compression methods

    has shown that compression can improve the performance of database systems in

    read-intensive environments such as data warehouses. [18]

    Data compression and data coding techniques transform a given set of data into a new

    set of data containing the same information, but occupying less space than the original

    data (ideally, the minimum space possible). Data compression is heavily used in data

    transmission and data storage. In fact, reducing the amount of data to be transmitted

    (or stored) is equivalent to the increase of the bandwidth of the transmission channel

    (or the size of the storage device).

The first data compression proposals appeared in the late 1940s and early 1950s, notably the coding scheme proposed by D. Huffman, and these earlier proposals have evolved dramatically since then [7]. The main emphasis of previous work has been on the compression of numerical attributes, where coding techniques have been employed to reduce the length of integers, floating point numbers, and dates. However, string attributes (i.e., attributes of type CHAR(n) or VARCHAR(n) in SQL) often comprise a large portion of database records and thus have a significant impact on query performance.

The compression of data in databases offers two main advantages: 1. less space occupied by data, and 2. potentially better query response time.

If the benefit in terms of storage is easily understandable, the gain in performance is not so obvious. This gain is due to the fact that less data has to be read from storage, which is clearly the most time-consuming operation during query processing. The most interesting use of data compression and codification techniques in databases is surely in data warehouses, given the huge amount of data normally involved and their clear orientation towards query processing. As in data warehouses all the insertions


and updates are done during the update window, when the data warehouse is not available to users, off-line compression algorithms are more adequate, as the gain in query response time usually compensates for the extra cost of codifying the data before it is loaded into the data warehouse. In fact, off-line compression algorithms optimize the decompression time, which normally implies more cost in the compression process. The technique presented in this report follows these ideas, as it takes advantage of the specific features of data warehouses to optimize the use of traditional text compression techniques.

In addition to the observations regarding when to use each of the various compression schemes, our results also illustrate the following important points:

1. Physical database design should be aware of the compression subsystem. Performance is improved by compression schemes that take advantage of data locality. Queries on columns in projections with secondary and tertiary sort orders perform well, and it is generally beneficial to have low-cardinality columns serve as the leftmost sort orders in the projection (to increase the average run-lengths of columns to the right). The more order and locality in a column, the better the database performs. It is a good idea to operate directly on compressed data.

2. The optimizer needs to be aware of the performance implications of operating directly on compressed data in its cost models. Further, cost models that only take into account I/O costs will likely perform poorly in the context of column-oriented systems, since CPU cost is often the dominant factor.

3.3 Compression Scheme

Compression is done through the following steps:

1. The attributes are analyzed and a frequency histogram is built.

2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed. [5]

3. The codes table and necessary metadata are written to the database.

4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

The example of an employee table below illustrates the compression technique:

Table 3.1 Employee table with type and cardinality

Attribute name    Attribute Type    Cardinality
SSN               TEXT              1000000
EMP_NAME          VARCHAR(20)       500
EMP_ADD           TEXT              200
EMP_SEX           CHAR              2
EMP_SAL           INTEGER           5000
EMP_DOB           DATE              50
EMP_CITY          TEXT              95000
EMP_REMARKS       TEXT              600

Table 3.1 presents an example of typical attributes of a client dimension in a data warehouse, which may be a large dimension in many businesses (e.g., e-business). We can find several attributes that are candidates for coding, such as: EMP_NAME, EMP_ADD, EMP_SEX, EMP_SAL, EMP_DOB, EMP_CITY, and EMP_REMARKS.


Table 3.2 Code Table Example

City name      City Postal Code    Code
DELHI          011                 00000010
MUMBAI         022                 00000100
KOLKATA        033                 00000110
CHENNAI        044                 00001000
BANGALORE      080                 00001000 00001000
JAIPUR         0141                00000110 00000110
COIMBATORE     0422                00001000 00001000 00001000
COCHIN         0484                00010000 00010000 00010000

Assuming that we want to code the EMP_CITY attribute, a possible resulting codes table is shown in Table 3.2. The codes are represented in binary to better convey the idea. As the attribute has more than 256 distinct values, we have one-byte codes to represent the 256 most frequent values (e.g., Delhi and Mumbai) and two-byte codes to represent the less frequent values (e.g., Jaipur and Bangalore). The values shown in Table 3.2 (represented in binary) would be the ones stored in the database, instead of the larger original values. For example, instead of storing "Jaipur", which corresponds to 6 ASCII characters, we just store the two-byte binary code 00000110 00000110.
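Steps 1-4 of the scheme can be sketched as follows (the data and the one-byte threshold are scaled down for illustration; a real implementation must also keep one-byte and two-byte codes distinguishable when scanning):

from collections import Counter

# Sketch: building a frequency-based code table for a category attribute.
def build_code_table(values, one_byte_slots=256):
    freq = Counter(values)                    # step 1: frequency histogram
    ranked = [v for v, _ in freq.most_common()]
    table = {}
    for rank, v in enumerate(ranked):         # step 2: assign codes
        if rank < one_byte_slots:
            table[v] = bytes([rank])          # most frequent: one-byte code
        else:
            hi, lo = divmod(rank - one_byte_slots, 256)
            table[v] = bytes([hi, lo])        # less frequent: two-byte code
    return table                              # step 3 would persist this table

cities = ["DELHI"] * 5 + ["MUMBAI"] * 3 + ["JAIPUR"]
codes = build_code_table(cities, one_byte_slots=2)
compressed = [codes[c] for c in cities]       # step 4: replace the values
print(codes)  # {'DELHI': b'\x00', 'MUMBAI': b'\x01', 'JAIPUR': b'\x00\x00'}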

    3.4 Query Execution


Query rewriting is necessary for queries where the coded attributes are used in the WHERE clause for filtering. In these queries the values used to filter the result must be replaced by the corresponding coded values. Below is a simple example of the type of query rewriting needed: the value JAIPUR is replaced by the corresponding code, fetched from the codes table shown in Table 3.2.

Table 3.3 Query execution

Original Query:
Select EMP_NAME From EMPLOYEE Where EMP_CITY = JAIPUR

Modified Query:
Select EMP_NAME From EMPLOYEE Where EMP_CITY = 00000110 00000110
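The rewriting itself amounts to a lookup in the codes table followed by a literal substitution; a naive sketch (the helper function and its string-based substitution are our own simplification):

# Sketch: rewriting a WHERE-clause literal on a coded attribute.
code_table = {"JAIPUR": "00000110 00000110"}  # from Table 3.2

def rewrite(query, attribute, literal):
    # Replace the plain value with its code fetched from the codes table.
    return query.replace(f"{attribute} = {literal}",
                         f"{attribute} = {code_table[literal]}")

q = "Select EMP_NAME From EMPLOYEE Where EMP_CITY = JAIPUR"
print(rewrite(q, "EMP_CITY", "JAIPUR"))
# Select EMP_NAME From EMPLOYEE Where EMP_CITY = 00000110 00000110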

    3.5 Decompression

The decompression of the attributes is only performed when the coded attributes appear in the query select list. In these cases the query is executed, and afterwards the result set is processed in order to decompress the attributes that contain compressed values. As typical data warehousing queries return small result sets, the decompression time represents a very small fraction of the total query execution time.

    3.6 Prerequisites

The goal of the experiments performed is to measure the gains in storage and performance obtained using the proposed technique.

The experiments were divided into two phases. In the first phase only categories compression was used. In the second phase we used categories compression in conjunction with descriptions compression.


    Chapter 4

    Results & Discussions

    4.1 Introduction

Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, inspiring new data compression techniques in database systems that trade reduced disk I/O against additional CPU overhead for the compression and decompression of data.

Following the development of the compression technique in Chapter 3, I propose a compression algorithm which integrates domain and attribute compression, based on dictionary-based anonymization, and implements global recoding generalization.

In this chapter, I demonstrate how to compress data so as to achieve better performance than conventional database systems. We address the following two issues.

First, we implement a newly proposed N-Anonymization technique embedded with global recoding generalization. After evaluation, the report presents the algorithm for data compression and finally demonstrates that our approach gives comparable results to the existing algorithms.


Second, we use the binary-encoded pairing of attributes for data compression discussed in the previous chapter for string compression in the database, and modify it so that it intelligently selects the most effective compression method for string-valued attributes.

Moreover, we also use the concepts of data hiding and equivalent sets before compressing the data, so that the private information of users is not revealed publicly.

    4.2 Anonymization

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. N-Anonymity is a major technique to de-identify a data set. The idea behind the technique is to choose a value n and ensure that every tuple is identical to at least n-1 other tuples (or at least to as many other tuples as possible) on the potentially identifying attributes.

The intensity of protection increases as n increases. One way to produce n identical tuples within the identifiable attributes is to generalize values within the attributes, for example, removing city and street information in an address attribute [6].
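This property can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries and the identifying attributes are known; it simply verifies that every equivalent set has at least n members.

    from collections import Counter

    def is_n_anonymous(table, identifier_attributes, n):
        """True if every combination of identifier values occurs at least n times."""
        groups = Counter(tuple(row[a] for a in identifier_attributes) for row in table)
        return all(count >= n for count in groups.values())

    table = [
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    ]
    print(is_n_anonymous(table, ["Gender", "Age", "Postcode"], 2))  # True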

There are many ways in which data de-identification can be done, and one of the most appropriate approaches is generalization. Generalization techniques include global recoding generalization, multidimensional recoding generalization, and local recoding generalization [15].

Global recoding generalization maps the current domain of an attribute to a more general domain. For example, ages are mapped from years to 10-year intervals.


Multidimensional recoding generalization maps a set of values to another set of values, some or all of which are more general than the corresponding pre-mapping values. For example, {male, 32, divorced} is mapped to {male, [30, 40), unknown}. Local recoding generalization modifies some values in one or more attributes to values in more general domains [6].
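As a small illustration of global recoding, the following is a sketch under the simplifying assumption of fixed hierarchies (it is not the generalization engine itself):

    def generalize_age(age):
        low = (age // 10) * 10
        return "%d-%d" % (low, low + 9)      # e.g. 32 -> "30-39"

    def generalize_postcode(postcode, keep):
        # Digits beyond `keep` are masked, approximating the hierarchy
        # {302033, 3020*, 30**, 3***, *} used later in the text.
        return postcode[:keep] + "*" * (len(postcode) - keep)

    # Global recoding replaces EVERY occurrence in the column, so the whole
    # domain moves to the more general one at once.
    print([generalize_age(a) for a in [10, 20, 32, 50, 70]])
    print(generalize_postcode("302033", 4))   # "3020**"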

    4.2.1 Problem definition and Contribution

From the very beginning we have made clear that our objective is to make every tuple of a published table identical to at least n-1 other tuples. Identity-related attributes are those which potentially identify individuals in a table. For example, the record of an old male in a rural area with the postcode 302033 is unique in Table 4.1, and hence his problem of asthma may be revealed if the table is published. To preserve his privacy, we may generalize the Gender and Postcode attribute values such that each value combination in the attribute set {Gender, Age, Postcode} has at least two occurrences.

Table 4.1 Published Table

No.  Gender  Age    Postcode  Problem
01   Male    Young  302020    Heart
02   Male    Old    302033    Asthma
03   Female  Young  302015    Obesity
04   Female  Young  302015    Obesity

    A view after this generalization is given in Table 4.2. Since various countries use

    different postcode schemes, we adopt a simplified postcode scheme, where its

    hierarchy {302033, 3020*, 30**, 3***, *} corresponds to {rural, city, region, state,

    unknown}, respectively.


Table 4.2 View of published table by global recoding

No.  Gender  Age    Postcode  Problem
01   *       Young  3020*     Heart
02   *       Old    3020*     Asthma
03   *       Young  3020*     Obesity
04   *       Young  3020*     Obesity

Identifier attribute set: A set of attributes that potentially identifies the individuals in a table is an identifier attribute set. For example, the attribute set {Gender, Age, Postcode} in Table 4.1 is an identifier attribute set.

Equivalent set: An equivalent set of a table with respect to an attribute set is the set of all tuples in the table containing identical values for that attribute set. For example, tuples 03 and 04 of Table 4.1 form an equivalent set with respect to the attribute set {Gender, Age, Postcode, Problem}. Table 4.2 is the 2-anonymity view of Table 4.1, since each tuple is meant to be indistinguishable from at least one other tuple on the identifier attributes.

    4.2.2 Quality measure of Anonymization

After this study we can conclude that the larger the equivalent sets, the easier the compression; the cost of anonymization is therefore a function of equivalent set size. On the basis of this observation, we measure the quality of an anonymization by the average equivalent set size:

    C_AVG = (total number of records) / (number of equivalent sets)        (4.1)
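Under the assumption that Eq. (4.1) denotes this average equivalent set size, it can be computed as follows (a sketch; the table contents are illustrative):

    from collections import Counter

    def average_equivalent_set_size(table, identifier_attributes):
        groups = Counter(tuple(row[a] for a in identifier_attributes) for row in table)
        return float(len(table)) / len(groups)

    table = [
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Old",   "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    ]
    # Two equivalent sets (sizes 3 and 1) over 4 records -> 4 / 2 = 2.0
    print(average_equivalent_set_size(table, ["Gender", "Age", "Postcode"]))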


    4.2.3 Conclusion

Another name for global recoding is domain generalization, because generalization happens at the domain level: a specific domain is replaced by a more general domain. There are no mixed values from different domains in a table generalized by global recoding. When an attribute value is generalized, every occurrence of the value is replaced by the new generalized value. A global recoding method may therefore over-generalize a table. An example of global recoding is given in Table 4.2: the two attributes Gender and Postcode are generalized, and all gender information has been lost. It is not necessary to generalize the Gender and Postcode attributes as a whole, so we say that the global recoding method over-generalizes this table.

    4.3 Domain compression through binary conversion

We integrate two key methods, namely binary encoding of distinct values and pairwise encoding of attributes, to build our compression technique.

    4.3.1 Encoding of Distinct values

This compression technique is based on the assumption that the published table contains attributes with few distinct domain values, and that these values repeat over the huge number of tuples present in the database. Therefore, binary encoding of the distinct values of each attribute, followed by representation of the tuple values in each column of the relation with the corresponding encoded values, transforms the entire relation into bits and thus compresses it [16].

We first find the number of distinct values in each column and encode the data into bits accordingly. For example, consider the instance given below, which represents the two major attributes of a relation Patients.


Table 4.3 An instance of relation Patients

Age  Problem
10   Cough & Cold
20   Cough & Cold
30   Obesity
50   Diabetes
70   Asthma

Now if we adopt the concept of N-Anonymization with global recoding (see Section 4.2), we can map the current domain of the attributes to a more general domain. For example, Age can be mapped into 10-year intervals, as shown in Table 4.4.

To examine the compression benefit achieved by this method, assume that Age is of integer type. If there are 50 patients, the total storage required by the Age attribute will be 50 * sizeof(int) = 50 * 4 = 200 bytes [9].

With our compression technique, suppose the 50-patient table contains 9 distinct values for Age; we then need ceil(log2(9)) = 4 bits to represent each value in the Age field. It is easy to calculate that we would need 50 * 4 bits = 200 bits = 25 bytes, which is considerably less [9].

We call this Stage 1 of our compression, which transforms one column into bits. If we apply this compression to all columns of the table, the result will be significant.


Table 4.4 Representing Stage 1 of the compression technique

Age     Problem
10-20   Cough & Cold
30-40   Obesity
50-60   Diabetes
70-100  Asthma

Table 4.5 Representing Stage 1 with binary compression

Age  Problem
00   Cough & Cold
01   Obesity
10   Diabetes
11   Asthma
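A minimal sketch of Stage 1, assuming each column is dictionary-encoded with ceil(log2(k)) bits for k distinct values:

    from math import ceil, log2

    def encode_column(values):
        """Map each distinct value to a fixed-width binary code."""
        distinct = sorted(set(values), key=str)
        width = max(1, int(ceil(log2(max(len(distinct), 2)))))
        codes = {v: format(i, "0%db" % width) for i, v in enumerate(distinct)}
        return [codes[v] for v in values], codes

    encoded, codebook = encode_column(["10-20", "30-40", "50-60", "70-100"])
    print(codebook)   # {'10-20': '00', '30-40': '01', '50-60': '10', '70-100': '11'}
    print(encoded)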

    4.3.2 Paired Encoding

It can easily be seen from the above example that, besides optimizing the memory requirement of the relation, the encoding technique is also helpful in reducing redundancy (repeated values) in the relation. Moreover, it is likely that there are few distinct values of the pair (column1, column2) taken together, just as there are few distinct values of column1 or column2 alone. We can then represent the two columns together as a single column, with pair values transformed according to the encoding. This constitutes Stage 2 of our compression, in which we use the bit-encoded database from Stage 1 as input and compress it further by coupling columns in pairs of two, applying the distinct-pairs technique outlined above.


To examine the further compression advantage achieved, suppose that we couple the Age and Problem columns. In Table 4.3 there are 5 distinct pairs: (10, Cough & Cold), (20, Cough & Cold), (30, Obesity), (50, Diabetes) and (70, Asthma). After the generalization of Table 4.4 these reduce to 4 distinct pairs, so our upper bound is ceil(log2(4)) = 2 bits. Table 4.6 shows the result of Stage 2 compression.

Table 4.6 Representing Stage 2 compression

Age  Problem
00   00
01   01
10   10
11   11

After compressing the attributes, pairing or coupling of attributes is done. All the columns are coupled in pairs of two in a similar manner. If the database contains an even number of columns this is straightforward; if the number of columns is odd, we can intelligently choose one column to be left unpaired.

Table 4.7 Representing Stage 2 compression coupling

Age-Problem
00
01
10
11


After this compression technique is applied, the space required can easily be calculated:

Before compression: 5*(4) + 4*(4) = 36 bytes
After compression and coupling: 4*2 = 8 bits
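The pairing step itself is a direct extension of the Stage 1 encoding: treat two columns as one column of value pairs and dictionary-encode the distinct pairs. A self-contained sketch:

    from math import ceil, log2

    def encode_pairs(column_a, column_b):
        """Dictionary-encode the distinct (a, b) pairs of two columns."""
        pairs = list(zip(column_a, column_b))
        distinct = sorted(set(pairs), key=str)
        width = max(1, int(ceil(log2(max(len(distinct), 2)))))
        codes = {p: format(i, "0%db" % width) for i, p in enumerate(distinct)}
        return [codes[p] for p in pairs], codes

    ages     = ["10-20", "10-20", "30-40", "50-60", "70-100"]
    problems = ["Cough & Cold", "Cough & Cold", "Obesity", "Diabetes", "Asthma"]
    encoded, codebook = encode_pairs(ages, problems)
    print(len(codebook))   # 4 distinct pairs -> 2-bit codes
    print(encoded)         # ['00', '00', '01', '10', '11']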

    4.4 Add-ons to compression

After performing successful compression over relations and domains, some conclusions were derived by varying the coupling of attributes with each other. Some of these possibilities are discussed in the following points.

    4.4.1 Functional Dependencies

A functional dependency exists between attributes and states that, given a relation R, a set of attributes Y in R is said to be functionally dependent on another set of attributes X if and only if each value of X is associated with at most one value of Y. This implies that the attributes in set X determine the values of the attributes in set Y [15]. By rearranging the attributes we found that coupling columns related in a way similar to functional dependencies gives better compression results.

Table 4.8 shows an example of functional-dependency-based compression.

Table 4.8 Representing functional dependency based coupling

Name     Gender  Age  Problem
Harshit  M       10   Cough & Cold
Naman    M       20   Cough & Cold
Aman     M       30   Obesity
Rajiv    M       50   Diabetes
Rajni    F       70   Asthma

Two different test cases were used to check the level of compression. Test case 1 couples the attributes {(Name, Age), (Gender, Problem)}; the individual and coupled distinct-value counts are shown in Tables 4.9 and 4.10. In test case 2, coupling is done with the attributes {(Name, Gender), (Age, Problem)}, as shown in Table 4.11.

Table 4.9 Number of distinct values in each column

Column name  Distinct values
Name         19
Gender       2
Age          19
Problem      19

Table 4.10 Test case 1

Column name      Distinct values
Name, Age        285
Gender, Problem  35

Table 4.11 Test case 2

Column name    Distinct values
Name, Gender   22
Age, Problem   312
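The test cases above can be reproduced by simply counting distinct pairs; a sketch, with hypothetical rows standing in for the experimental table:

    from math import ceil, log2

    def distinct_pairs(rows, col_a, col_b):
        return len({(row[col_a], row[col_b]) for row in rows})

    def bits_needed(k):
        return max(1, int(ceil(log2(max(k, 2)))))

    rows = [
        {"Name": "Harshit", "Gender": "M", "Age": 10, "Problem": "Cough & Cold"},
        {"Name": "Rajni",   "Gender": "F", "Age": 70, "Problem": "Asthma"},
    ]
    for pair in [("Name", "Age"), ("Gender", "Problem")]:
        k = distinct_pairs(rows, *pair)
        print(pair, "->", k, "distinct pairs,", bits_needed(k), "bits per value")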


    4.4.2 Primary Key

A primary key is an attribute that uniquely identifies a row in a table. The observation regarding the primary key is that coupling the primary key column with a column having a large number of distinct values is advantageous, because each primary key value is itself distinct, and hence the resulting number of distinct tuples of the combination of the two will always be equal to the number of primary key values in the table, whichever partner column is chosen; pairing it with a column that already needs a wide code therefore wastes the least width.

    4.4.3 Few distinct values

Sometimes a database contains columns with very few distinct values. For example, a Gender attribute will always contain either male or female in its domain. Therefore it is recommended that such attributes be coupled with attributes that contain a large number of distinct values. For example, consider 4 attributes {name, gender, age, problem} with distinct counts name = 200, gender = 2, age = 200 and problem = 20. The coupling {gender, name} and {age, problem} gives at most 200*2 + 200*20 = 4400 distinct tuples, whereas the coupling {gender, problem} and {name, age} gives at most 2*20 + 200*200 = 40040 distinct tuples.
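This recommendation can be expressed as a small search over the possible pairings, using the product of per-column distinct counts as an upper bound on the distinct tuples of each pair; a sketch with the counts from the example above:

    distinct = {"name": 200, "gender": 2, "age": 200, "problem": 20}

    # The three ways to split four columns into two pairs.
    candidates = [
        [("name", "gender"), ("age", "problem")],
        [("name", "age"), ("gender", "problem")],
        [("name", "problem"), ("gender", "age")],
    ]

    def estimated_tuples(pairing):
        """Upper bound on distinct tuples: product of distinct counts per pair."""
        return sum(distinct[a] * distinct[b] for a, b in pairing)

    for pairing in candidates:
        print(pairing, "->", estimated_tuples(pairing))
    # {gender, name} with {age, problem} gives 4400, while {name, age} with
    # {gender, problem} gives 40040: the few-distinct column should be
    # paired with a many-distinct one.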

    4.5 Limitations

Two of the most often cited disadvantages of our approach are write operations and tuple construction. Write operations are generally considered problematic for two reasons:

- Inserted tuples have to be broken up into their component attributes, and each attribute must be written separately; and
- densely packed, compressed columns make it expensive to move or update tuples in place.


Tuple construction adds a similar overhead at read time. In addition, the degree of attribute coupling must be determined, i.e. we need to decide the point at which the extra compression achieved is not worth the performance overhead involved.


    Chapter 5

    Conclusion & Future Work

    5.1 Conclusion

In this thesis we studied how compression techniques can be used to improve database performance. After comparing existing methods, we also proposed an algorithm for compressing columnar databases. We studied the following research issues:

Compressing different domains of databases: We studied how different domains of a database, such as varchar, int and NULL values, can be dealt with while compressing a database. Compared to existing compression methods, our approach considers the heterogeneous nature of string attributes, and uses a comprehensive strategy to choose the most effective encoding level for each string attribute. Our experimental results show that using HDE methods achieves a better compression ratio than using any single existing method, and that HDE also achieves the best balance between I/O saving and decompression overhead.


Compression-aware query optimization: We observed that deciding when to decompress string attributes is a crucial issue for query performance. A traditional optimizer, even when enhanced with a cost model that takes both the I/O benefits of compression and the CPU overhead of decompression into account, does not necessarily produce good plans. My experiments show that the combination of effective compression methods and compression-aware query optimization is crucial for query performance; the use of our compression methods and optimization algorithms achieves up to an order of magnitude improvement in query performance over existing techniques. This significant gain suggests that a compressed database system should have its query optimizer modified for better performance.

Compressing query results: We proposed how to use domain knowledge about the query to improve the effect of compression on query results. Our approach uses a combination of compression methods, and we represented such combinations using an algebraic framework.

    5.2 Future Work

There are several interesting future directions for this research work.

    Compression-aware query optimization: First, it would be interesting to study how

    caching of intermediate (decompressed) results can reduce the overhead of transient

    decompression. Second, we plan to study how our compression techniques can handle

    updates. Third, we will study the impact of hash join on our query optimization work.


Result compression: We plan to explore the joint optimization problem of query plans and compression plans. Currently, the compression optimization is based on the query plan returned by the query optimizer. However, the overall cost of a combination of a query plan and a compression plan is different from the cost of the query plan alone. For instance, a more expensive query plan may sort the result in an order such that the sorted-normalization method can be applied and the overall cost will be lower.


    APPENDIX I

    Infobright

    I.1 Introduction

The demand for business analytics and intelligence has grown dramatically across all industries. This demand is outpacing the availability of the technical expertise and budgets needed to implement it successfully. Infobright helps solve these problems by providing a solution that implements and manages a scalable analytic database.

    Infobright offers two versions of their software: Infobright Community Edition (ICE)

    and Infobright Enterprise Edition (IEE). ICE is an open source product that can be

    freely downloaded. IEE is the commercial version of the software. It offers enhanced

    features that are often necessary for production and operational support.

The Infobright database is designed as an analytic database. It can handle business-driven, ad-hoc queries in a fraction of the time the same queries would take on a transactional database. Infobright achieves its high analytic performance by organizing the data in columns instead of rows.


Infobright combines a columnar database with its Knowledge Grid architecture to

    deliver a self-managing, self-tuning database optimized for analytics. Infobright

    eliminates the need to create indexes, partition data, or do any manual tuning to

    achieve fast response for queries and reports.

    The Infobright database resolves complex analytic queries without the need for

    traditional indexes, data partitioning, projections, manual tuning or specific schemas.

    Instead, the Knowledge Grid architecture automatically creates and stores the

    information needed to quickly resolve these queries. Infobright organizes the data into

two layers: the compressed data itself, stored in segments called Data Packs, and

    information about the data which comprises the components of the Knowledge Grid.

    For each query, the Infobright Granular Engine uses the information in the

    Knowledge Grid to determine which Data Packs are relevant to the query before

    decompressing any data.

    Infobright technology is based on the following concepts:

    Column orientation

    Data Packs

    Knowledge Grid

    The Granular Computing Engine

    I.2 Infobright Architecture

    Column Orientation

Infobright is, at its core, a highly compressed column-oriented database. This means that instead of the data being stored row-by-row, it is stored column-by-column. There are many advantages to column orientation, including the ability to do


more efficient data compression, because each column stores a single data type (as opposed to rows, which typically contain several data types), allowing compression to be optimized for each particular data type. Infobright, which organizes each column into Data Packs (as described below), achieves greater compression than other column-oriented databases, as it applies a compression algorithm based on the content of each Data Pack, not just the column.

    Most queries only involve a subset of the columns of the tables and so a column-

    oriented database focuses on retrieving only the data that is required.

    Data Packs and the Knowledge Grid

Data is stored in Data Packs, each holding 65K (65,536) values of a column. Data Pack Nodes contain a set of statistics about the

    data that is stored and compressed in each of the Data Packs. Knowledge Nodes

    provide a further set of metadata related to Data Packs or column relationships.

    Together, Data Pack Nodes and Knowledge Nodes form the Knowledge Grid. Unlike

    traditional database indexes, they are not manually created, and require no ongoing

    "care and feeding". Instead, they are created and managed automatically by the

    system. In essence, they create a high level view of the entire content of the database.

    This is what makes Infobright so well-suited for ad hoc analytics, unlike other

    databases that require pre-work such as indexes, projections, partitioning or aggregate

    tables in order to deliver fast query performance.

    Granular Computing Engine

The Granular Engine uses the Knowledge Grid information to optimize query processing. The goal is to eliminate or significantly reduce the amount of data that needs to be decompressed and accessed to answer a query. IEE can often resolve a query from the Knowledge Grid alone, without decompressing the underlying Data Packs.


I.3 Key Benefits

Infobright is compatible with major Business Intelligence tools such as Jaspersoft, Actuate/BIRT, Cognos, Business Objects, Microstrategy, Pentaho and others.

    High performance and scalability

    Infobright loads data extremely fast - up to 280GB/hour.

    Infobright's columnar approach results in fast response times for

    complex analytic queries.

As your database grows, query and load performance remain constant.

    Infobright scales up to 50TB of data.

    Low Cost

    The cost of Infobright is very low compared to closed source,

    proprietary solutions.

    Using Infobright eliminates the need for complex hardware

    infrastructure.

    Infobright runs on low cost, industry standard servers. A single server

    can scale to support 50TB of data.

    Infobright's industry-leading data compression (10:1 up to 40:1)

    significantly reduces the amount of storage required.

    I.4 MySQL Integration

    MySQL is the world's most popular open source database software, with over 11

    million active installations. Infobright brings scalable analytics to MySQL users

    through its integration as a MySQL storage engine. If your MySQL database is

    growing and query performance is suffering, Infobright is the ideal choice.


    Many users of MySQL turn to Infobright as their data volumes and analytic needs

    grow since Infobright offers exceptional query performance for analytic applications

against large amounts of data. Migrating from MySQL's MyISAM storage engine, or

    other MySQL storage engines, to the Infobright column-oriented analytic database

    is quite straightforward.

    Infobright contains a bundled version of MySQL and installing Infobright installs a

    new instance of MySQL along with Infobright's Optimizer, Knowledge Grid, the

    Infobright Loader and the underlying columnar storage architecture. This installation

also includes MySQL's MyISAM storage engine. Unlike other storage engines that

    work with MySQL, it is not necessary to have an existing MySQL installation nor can

    Infobright be added to an existing MySQL Server installation. When installing

    Infobright, the assumption is that any previously existing MySQL or MyISAM

    database will exist in a separate installation of MySQL, installed in a different

    directory with a unique data path, configuration files, socket and port values.

    In the data warehouse marketplace, the database must integrate with a variety of tools.

    By integrating with MySQL, Infobright leverages the extensive tool connectivity

    provided by MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.).

    It also enables MySQL users to leverage the mature, tested BI tools with which

    they're already familiar. You'll also benefit from MySQL's legendary ease of use and

    low maintenance requirements.

    Infobright-MySQL integration includes the following features:

    Industry standard interfaces that include ODBC, JDBC, C API, PHP,

    Visual Basic, Ruby, Perl and Python;


    Comprehensive management services and utilities;

    Robust connectivity with BI tools such as Actuate/BIRT, Business

    Objects, Cognos, Microstrategy, Pentaho, Jaspersoft and SAS.

    I.5 Practical Implementation

    Infobright neither needs nor allows the manual creation of performance structures

    with duplicated data such as indexes or table partitioning based on expected usage

patterns of the data. When preparing the MySQL schema definition for execution in Infobright, the first thing to do is simplify the schema. This means removing all references to indexes and other constraints expressed as indexes, including PRIMARY and FOREIGN KEYs, and UNIQUE and CHECK constraints.

In addition, due to Infobright's extremely high query performance levels on large

    volumes of data, one should consider removing all aggregate, reporting and summary

    tables that may be in the data model as they are unnecessary.

I have done some practical work with an existing airline database whose tables have many columns. Basic SQL queries were executed to check the performance of the database; these are ad-hoc queries, i.e. any column can be accessed by them.

The airline database was then tested with two existing database management systems, Infobright and MySQL. I created a table with a large number of columns (around 50) of different data types, and then filled the columns using LOAD DATA INFILE instead of individual INSERT statements.
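The load step can be scripted; the following is a hedged sketch using the mysql-connector-python driver against the table defined below, where the host, port, credentials and file path are illustrative placeholders (Infobright's bundled MySQL instance listens on its own port rather than the standard 3306).

    import mysql.connector

    # Connection parameters are placeholders, not the actual test setup.
    conn = mysql.connector.connect(host="127.0.0.1", port=5029,
                                   user="root", database="airline")
    cursor = conn.cursor()
    cursor.execute(
        "LOAD DATA INFILE '/tmp/airline.csv' "
        "INTO TABLE airline_info "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )
    conn.commit()
    cursor.close()
    conn.close()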


    Creating table airline_info

    CREATE TABLE `airline_info` (

    `Year` year(4) DEFAULT NULL,

    `Quarter` tinyint(4) DEFAULT NULL,

    `Month` tinyint(4) DEFAULT NULL,

    `DayofMonth` tinyint(4) DEFAULT NULL,

    `DayOfWeek` tinyint(4) DEFAULT NULL,

    `FlightDate` date DEFAULT NULL,

    `UniqueCarrier` char(7) DEFAULT NULL,

    `AirlineID` int(11) DEFAULT NULL,

    `Carrier` char(2) DEFAULT NULL,

    `TailNum` varchar(50) DEFAULT NULL,

    `FlightNum` varchar(10) DEFAULT NULL,

    `Origin` char(5) DEFAULT NULL,

    `OriginCityName` varchar(100) DEFAULT NULL,

    `OriginState` char(2) DEFAULT NULL,

    `OriginStateFips` varchar(10) DEFAULT NULL,

    `OriginStateName` varchar(100) DEFAULT NULL,

    `OriginWac` int(11) DEFAULT NULL,

    `Dest` char(5) DEFAULT NULL,

    `DestCityName` varchar(100) DEFAULT NULL,

    `DestState` char(2) DEFAULT NULL,

    `DestStateFips` varchar(10) DEFAULT NULL,

    `DestStateName` varchar(100) DEFAULT NULL,

    `DestWac` int(11) DEFAULT NULL,

    `CRSDepTime` int(11) DEFAULT NULL,

    `DepTime` int(11) DEFAULT NULL,

    `DepDelay` int(11) DEFAULT NULL,

    `DepDelayMinutes` int(11) DEFAULT NULL,

    `DepDel15` int(11) DEFAULT NULL,

    `DepartureDelayGroups` int(11) DEFAULT NULL,
