
Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases


Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

A Dissertation
Submitted in partial fulfillment
for the award of the Degree of

    Master of Technology

In Department of Information Technology

    (With specialization in Information Communication)

Supervisor:
Mr. Santosh Kumar Singh, Associate Prof.

Submitted By:
Mehul Mahrishi, Enrollment No.: SGVU091543463

    Suresh Gyan Vihar University


Candidate's Declaration

I hereby declare that the work being presented in the dissertation entitled "Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases", in partial fulfillment for the award of the Degree of Master of Technology in the Department of Information Technology with Specialization in Information Communication, and submitted to the Department of Information Technology, Suresh Gyan Vihar University, is a record of my own investigations carried out under the guidance of Mr. S.K. Singh, Department of Information Technology.

    I have not submitted the matter presented in this project/seminar anywhere for

    the award of any other Degree.

(Name and Signature of Candidate):
Mehul Mahrishi, Information Communication (M. Tech IC)
Enrolment No.: SGVU091543463

Countersigned by:
Mr. Santosh Kumar Singh, Supervisor


DETAILS OF CANDIDATE, SUPERVISOR(S) AND EXAMINER

Name of Candidate: Mehul Mahrishi    Roll No.: 104511
Deptt. of Study: M. Tech. (Information Communication)
Enrolment No.: SGVU091543463
Thesis Title: Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

Supervisor(s) and Examiners Recommended (with office address including contact numbers, email ID):

Supervisor    Co-Supervisor
Internal Examiner: 1    2    3
Signature with Date
Programme Coordinator    Dean / Principal


This certifies that the thesis entitled

Globally Recorded Binary Encoded Domain Compression Algorithm in Column Oriented Databases

is submitted by

Mehul Mahrishi (SGVU091543463)

IV semester, M.Tech (IC), in the year 2011, in partial fulfillment of the Degree of Master of Technology in Information Communication,

SURESH GYAN VIHAR UNIVERSITY, JAIPUR.

Signature of Supervisor

Date:

Place:


    Acknowledgement

Foremost, I would like to express my sincere gratitude to my advisor and mentor Mr. S.K. Singh for the continuous support of my study and research, and for his patience, motivation, enthusiasm, and knowledge. His guidance helped me throughout the research and writing of this thesis. Besides my advisor, I would like to thank the rest of my thesis committee, especially Mr. Vibhakar Pathak, for their encouragement, insightful comments, and hard questions.

My sincere thanks also go to Dr. S.L. Surana (Principal, SKIT), Dr. C.M. Choudhary (HOD CS, SKIT) and Dr. Anil Chaudhary (HOD IT, SKIT), for supporting my advanced studies, providing opportunities in their groups, and leading me to work on diverse, exciting projects. My special thanks to Mr. Mukesh Gupta (Reader, SKIT) for his invaluable advice, which helped me take this decision.

I thank my fellow mates Anita Shrotriya, Devendra Kr. Sharma, Vipin Jain, the Singh brothers, and Kamal Hiran for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had in the last two years.

Last but not least, I would like to thank my family members: my parents (Mukesh & Madhulika Mahrishi), uncle & aunt (Pushpanshu & Seema Mahrishi), brothers (Mridul & Harshit) and my grandmothers for their faith and for supporting me throughout my life.

    (Mehul Mahrishi)


    Contents

    List of Tables iv

    List of Figures v

    Notations vi

    Abstract vii

CHAPTER 1 Introduction 1-4
1.1 Introduction 1
1.2 Objective 1
1.3 Motivation 2
1.4 Research Contribution 3
1.5 Dissertation Outline 3

CHAPTER 2 Theories 5-23
2.1 Introduction 5
2.1.1 On-Line Transaction Processing 6
2.1.2 Query Intensive Applications 7
2.2 The Rise of Columnar Database 8
2.3 Definitions 10
2.4 Row Oriented Execution 12
2.4.1 Vertical Partitioning 12
2.4.2 Index-Only Plans 12
2.4.3 Materialized Views 13
2.5 Column Oriented Database 13
2.5.1 Compression 13
2.5.2 Late Materialization 14
2.5.3 Block Iteration 14
2.5.4 Invisible Joins 14
2.6 Query Execution in Row vs. Column Oriented Database 15
2.7 Compression 17
2.8 Conventional Compression 18
2.8.1 Domain Compression 19
2.8.2 Attribute Compression 20
2.9 Layout of Compressed Tuples 21

CHAPTER 3 Methodology 24-31
3.1 Introduction 24
3.2 Reasons for Data Compression 25
3.3 Compression Scheme 28
3.4 Query Execution 30
3.5 Decompression 30
3.6 Prerequisites 30

CHAPTER 4 Results & Discussions 32-44
4.1 Introduction 32
4.2 Anonymization 33
4.2.1 Problem Definition & Contribution 34
4.2.2 Quality Measure of Anonymization 36
4.2.3 Conclusion 36
4.3 Domain Compression through Binary Conversion 36
4.3.1 Encoding of Distinct Values 36
4.3.2 Paired Encoding 38
4.4 Add-ons on Compression 40
4.4.1 Functional Dependencies 40
4.4.2 Primary Keys 42
4.4.3 Few Distinct Values 42
4.5 Limitations 43
4.6 Conclusion 43

CHAPTER 5 Conclusion & Future Work 45-47
5.1 Conclusion 45
5.2 Future Work 46

APPENDIX I Infobright 48-62

References & Bibliography 63-67


    List of Tables

    TABLES TITLE PAGE

    2.1 A typical Row-oriented Database 6

    2.2 Table representing Column storing of data 10

3.1 Employee table with type and cardinality 28

3.2 Code Table Example 29

    3.3 Query execution 30

    4.1 Published Table 34

    4.2 View of published table by Global recording 35

    4.3 An instance of relation Student 37

    4.4 Representing Stage 1 of compression technique 38

    4.5 Representing Stage 1 with binary compression 38

    4.6 Representing Stage 2 compression 39

    4.7 Representing Stage 2 compression coupling 40

    4.8 Representing functional dependency based coupling 41

    4.9 Number of distinct values in each column 41

    4.10 Representing test case 1 42

    4.11 Representing test case 2 42


    List of Figures & Graphs

    FIGURE TITLE PAGE

    Figure 2.1 OLTP Access 6

    Figure 2.2 OLAP Access 7

    Figure 2.3 Column based data storage 11

    Figure 2.4 Layout of Compressed Tuple 23

    Graph I.1 Representing Load time comparison 61

    Graph I.2 Representing Table size comparison 61

    Graph I.3 Representing query execution comparison 61


    Notations

    DBMS : Database Management System

    RDBMS : Relational Database Management System

    OLTP : Online Transactional Processing

    SQL : Structured Query Language

    ICE : Infobright Community Edition

    IEE : Infobright Enterprise Edition

TB : Terabytes


    Abstract

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. This research work proposes the compression and abstraction of data using existing compression algorithms. Although the technique is general and simple, it is my strong belief that it is particularly advantageous for data warehousing. Through this study, we propose two algorithms. The first algorithm describes the concept of compression of domains at the attribute level, and we call it Attribute Domain Compression. This algorithm can be implemented on both row and columnar databases. The idea behind the algorithm is to reduce the size of large databases so as to store them optimally. The second algorithm is also applicable to both kinds of databases but will work optimally for columnar databases. The idea behind the algorithm is to generalize the tuple domains by assigning a value, say n, such that all other n-1 tuples, or at least as many as possible, can be identified.


    Chapter 1

    Introduction

1.1 Introduction

Large operational data and information is stored by different vendors and organizations in warehouses. Most of it is useful only when it is shared and analyzed with other related data. However, this kind of data often contains personal details which must be hidden from users with limited privileges. The data can only be released when individuals are unidentifiable.

Moreover, business intelligence and analytical application queries are generally based on the selection of particular attributes of a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation.

1.2 Objective

The main aim of the research is to propose a compression algorithm that is based on the concepts of attribute domain compression. The data is recorded globally so that the concept of data abstraction can be preserved.


We will use the concepts of two existing algorithms:

1. The first algorithm describes the concept of compression of domains at the attribute level, and we call it Attribute Domain Compression. This algorithm can be implemented on both row and columnar databases. The idea behind the algorithm is to reduce the size of large databases so as to store them optimally.

2. The second algorithm is also applicable to both kinds of databases but will work optimally for columnar databases. The idea behind the algorithm is to generalize the tuple domains by assigning a value, say n, such that all other n-1 tuples, or at least as many as possible, can be identified.

    1.3 Motivation

Data compression has been a very popular topic in the research literature, and there is a large amount of work on this subject. The most obvious reason to consider compression in a database context is to reduce the space required on disk. However, the motivation behind this research is whether the processing time of queries can be improved by reducing the amount of data that needs to be read from disk using a compression technique.

Recently, there has been a revival of interest in employing compression techniques to improve performance in a database, which also led me to choose this as my topic of study. Data compression currently exists in the main database engines, with different approaches adopted in each of them.


    1.4 Research Contribution

In order to evaluate the performance speedup obtained with compression, a subset of the queries was executed with the following configurations:

1. No compression

2. Proposed compression

3. Categories compression and descriptions compression

We then study the two major compression algorithms present in row-oriented databases, i.e., n-anonymization and domain encoding by binary compression.

Finally, the report studies two complex algorithms and embeds them to form a final optimal algorithm for domain compression. The report also presents examples performed practically on a column-oriented platform named Infobright.

    1.5 Dissertation Outline

This research work focuses on the development of a compression algorithm for columnar databases using the Infobright tool. We start in Chapter 2 by documenting the theories that are relevant for understanding columnar databases and how compression is implemented on databases by various existing techniques. In Chapter 3, we study a compression technique and implement it through query execution over a MySQL database. This work concludes Dissertation Part I. Chapter 4 discusses the framework to facilitate the development of the algorithm for columnar databases and introduces two concepts: global recording anonymization and binary encoded domain compression. We conclude this chapter by developing a compression algorithm by


combining these two concepts. After successful implementation of the compression algorithm, it is tested and the output is displayed graphically. Finally, Chapter 5 illustrates familiarity with the Infobright tool. Some basic queries and their execution are learned on an existing columnar database. Infobright is not just a database but contains an inbuilt platform for compression algorithms that can be implemented on a DB.


    Chapter 2

    Theories

    2.1. Introduction

Most information systems available today are implemented using commercially available database management system (DBMS) products. A DBMS is software which manages the data stored in an information system, provides privacy and privileges to users, facilitates concurrent access by multiple users, and provides recovery from system failures without loss of system integrity. The relational database is the most commonly used DBMS; it organizes the data into different relations.

Each relational database is a collection of inter-related data organized in a matrix with rows and columns. Each column represents an attribute of the particular entity that is converted into the database table, while each row of the matrix, generally called a tuple, represents one set of values for those attributes. Each row in a table represents a set of related data, and every row in the table has the same structure.

For example, in a table that represents employees, each row would represent a single employee. Columns might represent things like the employee's name, street address, SSN, etc. In a table that represents the relationship of employees with departments, each row would relate one employee to one department.

    Table 2.1 A Typical Row oriented Database

    Column 1 Column 2 Column 3

    Row 1 Row1 & Column 1 Row1 & Column 2 Row1 & Column 3

    Row 2 Row2 & Column 1 Row2 & Column 2 Row2 & Column 3

2.1.1 On-Line Transactional Processing

The popularity of RDBMS is mainly due to their support of on-line transactional processing (OLTP). Typical OLTP systems include student management systems, bank databases, etc. The queries include, for example, inserting a new record for a subject that is assigned to a student. These applications involve little or no analysis of data and serve the use of an information system for data preservation and querying. An OLTP query is of short duration and requires minimal database resources. [3]

Figure 2.1 represents an OLTP process in which two queries, insert and lookup, are executed on a student table.


    Figure 2.1 OLTP Access

2.1.2 Query Intensive Applications

In the mid-1990s, a new era of data management arose which was query-specific and involved large, complex data volumes. Examples of such query-specific DBMS applications are OLAP and data mining.

OLAP

This tool summarizes data from large data volumes and presents query results using 2-D or 3-D graphics. An OLAP query is of the form "Give the % comparison between the marks of all students in B. Tech and in M. Tech." The answer to such a query would generally be in the form of a graph or chart. Such 3-D and 2-D visualizations of data are called Data Cubes.

Figure 2.2 represents the access pattern of OLAP, which requires only a few attributes to be processed but access to a huge volume of data. It must be noted that the number of queries executed per second in OLAP is much lower than in OLTP.


    Figure 2.2 OLAP Access

Data Mining

Data mining is now a more demanding application of databases. It is also known as "repeated OLAP". The objective of data mining is to locate subgroups, which requires mean values or statistical analysis of the data to get a result. A typical example of a data mining query is "Find the dangerous drivers in a car insurance customer database." It is left to the data mining tool to determine what the characteristics of that dangerous customer group are [3]. This is done typically by combining statistical analysis and automated search techniques similar to those used in artificial intelligence.

2.2. The rise of Columnar Database

The roots of column-store DBMSs can be traced back to the 1970s, when transposed files were first studied, followed by investigations of vertical partitioning as a form of table attribute clustering. By the mid-1980s, the advantages of a fully decomposed storage model (DSM, a predecessor of column stores) over NSM (traditional row-based storage) were documented. [4]


The relational databases present today are designed predominantly to handle online transactional processing (OLTP) applications. A transaction (e.g., an online purchase of a laptop through an internet dealer) typically maps to one or more rows in a relational database, and all traditional RDBMS designs are based on a per-row paradigm. For transaction-based systems, this architecture is well suited to handle the input of incoming data.

Data warehouses are used in almost every large organization, and research states that their size doubles every three years. Moreover, the hourly workload of these warehouses is huge: approximately 20 lakh (2 million) SQL statements are encountered hourly. [7]

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. However, for applications that are very read-intensive and selective in the information being requested, the OLTP database design isn't a model that typically holds up well. [6] Business intelligence and analytical application queries often analyze selected attributes in a database. The simplicity and performance characteristics of the columnar approach provide a cost-effective implementation.

The column-oriented database, generally known as a columnar database, reinvents how data is stored in databases. Storing data in such a fashion increases the probability of storing adjacent records on disk and hence the odds of compression. This architecture suggests a model in which inserting and deleting transactional data are done by a row-based system, but selective queries that are only interested in a few columns of a table are handled by the columnar approach.


Different methodologies such as indexing, materialized views, horizontal partitioning, etc. are provided by row-oriented databases and offer better ways of query execution, but they also have disadvantages of their own. For example, in business intelligence/analytic environments, the ad-hoc nature of the workload makes it nearly impossible to predict which columns will need indexing, so tables end up either over-indexed (which causes load and maintenance issues) or not properly indexed, and so many queries end up running much slower than desired.

2.3. Definitions

"A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row." (Wikipedia [23])

It must always be remembered that the columnar database is only an approach to how data is stored in memory; it doesn't define any architectural implementation of the database. Rather, it follows the traditional database architecture.

    Table 2.2 Table representing Column storing of data

    SNO SNAME SSN CITY

    S1 MEHUL 200 JAIPUR

    S2 VIPIN 201 HINDON

    S3 DEVENDRA 300 KEKRI


S4 ANITA 302 AJMER

S5 PALWIN 202 GANGANAGAR

The data would be stored on disk or in memory something like:

S1S2S3S4S5MEHULVIPINDEVENDRAANITAPALWIN200201300302202JAIPURHINDONKEKRIAJMERGANGANAGAR

This is in contrast to the traditional row-based approach, in which the data is stored more like this:

S1MEHUL200JAIPURS2VIPIN201HINDONS3DEVENDRA300KEKRIS4ANITA302AJMERS5PALWIN202GANGANAGAR

The above example also shows that a columnar database can be highly compressed; moreover, it is self-indexing, and hence aggregate functions such as MIN, MAX, AVG, and COUNT can be performed efficiently.

    Figure 2.3 Column based data storage


Clearly, the goal of a columnar database is to perform write and read operations to and from hard disk storage efficiently, in order to speed up the time it takes to return a query. In the above example, all the column 1 values are physically together, followed by all the column 2 values, etc. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record [1]. This allows individual data elements, such as customer name for instance, to be accessed in columns as a group, rather than individually row-by-row.

    2.4. Row Oriented Execution

    In this section, we discuss several different techniques that can be used to implement

    a column-database design in a commercial row-oriented DBMS.

2.4.1 Vertical Partitioning

The most straightforward way to emulate a column-store approach in a row-store is to fully vertically partition each relation. This approach creates one physical table for each column in the logical schema, where the i-th table has two columns: one with values from column i of the logical schema and one with the corresponding value in the position column. Queries are then rewritten to perform joins on the position attribute when fetching multiple columns from the same relation.
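The idea can be sketched in a few lines of Python (the relation and column names below are illustrative assumptions, not part of our experimental schema):

# Sketch: emulating a column store in a row store via full vertical partitioning.
rows = [("S1", "MEHUL", "JAIPUR"),
        ("S2", "VIPIN", "HINDON"),
        ("S3", "DEVENDRA", "KEKRI")]

# One physical (position, value) table per logical column.
sno = [(pos, r[0]) for pos, r in enumerate(rows)]
sname = [(pos, r[1]) for pos, r in enumerate(rows)]
city = [(pos, r[2]) for pos, r in enumerate(rows)]

# Fetching two columns of the same relation becomes a join on position.
name_by_pos = dict(sname)
result = [(name_by_pos[pos], value) for pos, value in city]
print(result)  # [('MEHUL', 'JAIPUR'), ('VIPIN', 'HINDON'), ('DEVENDRA', 'KEKRI')]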

2.4.2 Index-only plans

The vertical partitioning approach has two problems. Firstly, it requires the position attribute to be stored in every column, which wastes space and disk bandwidth; secondly, most row-stores store a relatively large header on every tuple, which further wastes space. [7] To remove these problems, we use another approach called index-only plans. In this approach the base relations are stored using a standard row-oriented design, but an additional unclustered B+Tree index is added on every column of every table.

2.4.3 Materialized Views

The third approach we consider uses materialized views. In this approach, we create an optimal set of materialized views for every query flight in the workload, where the optimal view for a given flight has only the columns needed to answer queries in that flight. We do not pre-join columns from different tables in these views.

    2.5 Column Oriented Execution

    In this section, we review three common optimizations used to improve performance

    in column-oriented database systems.

    2.5.1 Compression

    Compressing data using column-oriented compression algorithms and keeping data in

    this compressed format as it is operated upon has been shown to improve query

    performance by up to an order of magnitude. Storing data in columns allows all of the

    names to be stored together, all of the phone numbers together, etc. Certainly phone

    numbers are more similar to each other than surrounding text fields like e-mail

    addresses or names. Further, if the data is sorted by one of the columns, that column

    will be super-compressible.


    2.5.2 Late Materialization

    In a column-store, information about a logical entity (e.g., a person) is stored in

    multiple locations on disk (e.g. name, e-mail address, phone number, etc. are all

stored in separate columns), whereas in a row store such information is usually co-located in a single row of a table. [7]

    At some point in most query plans, data from multiple columns must be combined

    together into rows of information about an entity. Consequently, this join-like

    materialization of tuples (also called tuple construction) is an extremely common

    operation in a column store.
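A minimal sketch of this tuple construction step, using made-up column contents and an assumed set of surviving positions:

# Sketch: late materialization, stitching separate columns back into tuples.
name = ["MEHUL", "VIPIN", "DEVENDRA"]
phone = ["200", "201", "300"]
city = ["JAIPUR", "HINDON", "KEKRI"]

# Positions that survived earlier column-wise predicate evaluation (assumed).
positions = [0, 2]

# Tuples are constructed only now, at the point where whole rows are needed.
tuples = [(name[p], phone[p], city[p]) for p in positions]
print(tuples)  # [('MEHUL', '200', 'JAIPUR'), ('DEVENDRA', '300', 'KEKRI')]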

    2.5.3 Block Iteration

    In order to process a series of tuples, row-stores first iterate through each tuple, and

    then need to extract the needed attributes from these tuples through a tuple

    representation interface.

    In contrast to row-stores, in all column-stores, blocks of values from the same column

    are sent to an operator in a single function call. Further, no attribute extraction is

    needed, and if the column is fixed-width, these values can be iterated through directly

    as an array. Operating on data as an array not only minimizes per-tuple overhead, but

    it also exploits potential for parallelism on modern CPUs, as loop-pipelining

    techniques can be used. [2-5]

2.5.4 Invisible joins

Queries over data warehouses, particularly those modeled with star schemas, often have the following structure:


1. Restrict the set of tuples in the fact table using selection predicates on one (or many) dimension tables.

2. Then, perform some aggregation on the restricted fact table, often grouping by other dimension table attributes.

    Thus, joins between the fact table and dimension tables need to be performed for each

    selection predicate and for each aggregate grouping.

    As an alternative to these query plans, we introduce a technique we call the invisible

    join that can be used in column-oriented databases for foreign-key/primary-key joins.

    It works by rewriting joins into predicates on the foreign key columns in the fact

    table. These predicates can be evaluated either by using a hash lookup (in which case

    a hash join is simulated), or by using more advanced methods which are beyond the

    scope of our study. [1]
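The hash-lookup variant can be sketched as follows (the fact and dimension data are invented for illustration):

# Sketch: an invisible join evaluated as a predicate on a foreign-key column.
# Dimension table: customer key -> city; selection predicate: city = 'JAIPUR'.
dim_city = {1: "JAIPUR", 2: "HINDON", 3: "JAIPUR", 4: "KEKRI"}

# Hash structure: the dimension keys satisfying the selection predicate.
matching_keys = {k for k, city in dim_city.items() if city == "JAIPUR"}

# Fact table foreign-key column; the join is rewritten as a membership test.
fact_customer_fk = [1, 2, 1, 3, 4, 3, 1]
qualifying = [pos for pos, fk in enumerate(fact_customer_fk) if fk in matching_keys]
print(qualifying)  # positions [0, 2, 3, 5, 6] pass on to the aggregation step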

2.6. Query execution in Row vs. Column oriented database

When talking about the performance of databases, query execution is the most important factor, and it can individually determine the performance of the database, whether it is row-based or column-based. We illustrate the concept with a simple example.

Suppose there are 1000 rows in a database table and the following executor loop is run over it:

Until no more {
    Get a row out of the buffer manager
    Evaluate the row
    Pass it onward if it satisfies the predicate
}

Notice that the inner loop of the executor is called 1000 times for our query above, once per row. Since the overhead of the inner loop largely determines performance, a row-store executor will take CPU time proportional to the number of rows required to evaluate the query.

In contrast, in a column-store executor the inner loop is:

Until no more {
    Pick up a column
    Evaluate the column
    Pass on a row range
}

    Notice that the inner loop is called once per column, not once per row. Also, notice

    that the algorithm complexity of processing a row is about the same as processing a

    column. [17]

Hence, the column store will consume vastly fewer CPU resources, because its inner loop is executed once per column, and there are far fewer columns than rows involved in evaluating a typical query.
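The contrast can be made concrete with a short Python sketch (illustrative data; the row loop body runs once per row, while the column variant makes one pass over a single column):

# Sketch: row-at-a-time vs. column-at-a-time predicate evaluation.
rows = [(i, i % 7) for i in range(1000)]  # 1000 illustrative (id, value) rows
col = [value for _, value in rows]        # the same data stored as one column

# Row store: the inner loop (and its per-tuple overhead) executes 1000 times.
row_result = [r for r in rows if r[1] == 3]

# Column store: a single column-wide operation yields the qualifying positions.
col_result = [pos for pos, value in enumerate(col) if value == 3]
assert [r[0] for r in row_result] == col_result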


2.7. Compression

Data compression in databases has always been a very popular and interesting topic for database researchers, and there is a lot of work in this area. The most obvious reason for compression in any context is to reduce the space required on disk, and so it is in databases. However, another important goal is to improve the processing time of queries by reducing the amount of data that needs to be read from disk.

Long after the evolution of databases, there has been a revival of interest in compression to improve the quality and performance of databases. Data compression currently exists in the main database engines, with different approaches adopted in each of them. It is generally accepted that, due to the greater similarity and redundancy of data within columns, column stores provide superior compression, and therefore require less storage hardware and perform faster because, among other things, they read less data from the disk [17]. Moreover, the compression ratio is higher in a columnar database because the entries within a column are similar to each other.

Both Huffman encoding and arithmetic encoding are based on the statistical distribution of the frequencies of symbols appearing in the data. Huffman coding assigns a shorter compression code to a frequent symbol and a longer compression code to an infrequent symbol. For example, if there are four symbols a, b, c, and d, with probabilities 13/16, 1/16, 1/16, and 1/16, then 2 bits are needed to represent each symbol without compression.


A possible Huffman coding is the following: a = 0, b = 10, c = 110, d = 111.

As a result, the average length of a compressed symbol equals 1 × 13/16 + 2 × 1/16 + 3 × 1/16 + 3 × 1/16 = 21/16 ≈ 1.3 bits.

    Arithmetic encoding is similar to Huffman encoding except that it assigns an interval

    to the whole input string based on the statistical distribution. [7]
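The arithmetic above can be verified with a short sketch using exactly the code table given in the text:

# Sketch: average compressed-symbol length for the Huffman code above.
code = {"a": "0", "b": "10", "c": "110", "d": "111"}
prob = {"a": 13/16, "b": 1/16, "c": 1/16, "d": 1/16}
avg_bits = sum(prob[s] * len(code[s]) for s in code)
print(avg_bits)  # 1.3125, i.e. roughly 1.3 bits instead of 2 bits per symbol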

2.8. Conventional Compression

Database compression techniques are applied to gain performance by decreasing the size of a database and increasing its input/output and query performance. The basic concept behind compression is that it compacts the storage and keeps the data adjacent, and therefore reduces the size and the number of transfers. This section demonstrates two different classes of compression in databases:

a. Domain Compression

b. Attribute Compression

The classes are equally implementable in column- or row-based database approaches. Queries that are executed on compressed data are seen to be more efficient than queries executed over a decompressed database [8]. In the sections below, we discuss each of the above classes in detail.


2.8.1 Domain Compression

Under this class we discuss three compression techniques: numeric compression in the presence of NULL values, string compression, and dictionary-based compression. Since all three techniques are applicable to domain compression, we confine ourselves to the compression of the domains of attributes.

Numeric Compression in the presence of NULL values

This compression technique is used to compress attributes of numeric type, such as integers, that contain some NULL values in their domain. The basic idea is that consecutive zeros or blanks of a tuple in the table are removed, and a description of how many there were and where they existed is given at the end [13]. To eliminate the difference in attribute size caused by NULL values, it is sometimes recommended to encode the data bit-wise, i.e., an integer of 4 bytes is replaced by 4 bits.

    For example:

    Bit value for 1= 0001

    Bit value for 2= 0011

    Bit value for 3= 0111

    Bit value for 4= 1111

    And all 0s for the value 0
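A small sketch of this bit-wise idea (treating NULL as an all-zero code is an assumption made for illustration; a real scheme would also record where the NULLs occurred, as described above):

# Sketch: replacing 4-byte integers in the range 0-4 by 4-bit codes.
CODES = {0: "0000", 1: "0001", 2: "0011", 3: "0111", 4: "1111", None: "0000"}

def encode(values):
    # Every value, including NULL, occupies exactly 4 bits, so NULLs no
    # longer cause tuples of the same table to differ in size.
    return "".join(CODES[v] for v in values)

print(encode([1, None, 3, 0, 4]))  # 0001 0000 0111 0000 1111 (shown spaced)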

String Compression

Strings in a database are represented by the CHAR data type, and their compression has already been proposed and implemented in SQL through the VARCHAR data type. An extension of conventional string compression is provided by this technique: the suggestion is that after converting the CHAR type to VARCHAR, it is further compressed in a second stage by any given compression algorithm, such as Huffman coding, the LZW algorithm, etc. [24]

Dictionary Encoding

This type of encoding technique uses a special type of data structure called a dictionary. It is very effective in circumstances where the attribute takes a limited set of values that repeat many times [14]. The dictionary encoding algorithm first calculates the number of bits, X, needed to encode a single attribute of the column (which can be calculated directly from the number of unique values of the attribute). It then calculates how many of these X-bit encoded values can fit in 1, 2, 3, or 4 bytes. For example, if an attribute has 32 values, it can be encoded in 5 bits, so 1 of these values can fit in 1 byte, 3 in 2 bytes, 4 in 3 bytes, or 6 in 4 bytes.
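The sizing calculation can be sketched as follows (the function names are ours):

import math

# Sketch: the dictionary-encoding bit-width calculation described above.
def bits_needed(num_distinct):
    # X: bits needed to encode one attribute value of the column.
    return max(1, math.ceil(math.log2(num_distinct)))

def values_per_bytes(x_bits):
    # How many X-bit encoded values fit in 1, 2, 3 or 4 bytes.
    return {nbytes: (nbytes * 8) // x_bits for nbytes in (1, 2, 3, 4)}

x = bits_needed(32)            # 32 distinct values -> 5 bits
print(x, values_per_bytes(x))  # 5 {1: 1, 2: 3, 3: 4, 4: 6}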

2.8.2 Attribute Compression

All these compression techniques are designed especially for data warehouses, where a huge amount of data is stored, usually composed of a large number of textual attributes with low cardinality. In this section, however, we demonstrate techniques which can also be used in conventional databases such as MySQL, SQL Server, etc. [5]

The main objective of this technique is to allow encoding to reduce the space occupied by dimension tables with a large number of rows, reducing the total space occupied and leading to consequent gains in performance.

Under this class we discuss two compression techniques: compression of categories and compression of comments.


Compression of Categories

Categories are textual attributes with low cardinality. Examples of category attributes are: city, country, type of product, etc.

Category coding is done through the following steps:

1. The data in the attribute is analysed and a frequency histogram is built.

2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed.

3. The codes table and necessary metadata are written to the database.

4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

    2.9 Layout of Compressed Tuples

    Figure 2.4 shows the overall layout of a compressed tuple [7]. The figure shows that a

    tuple can be composed of up to five parts:

    1. The first part of a tuple keeps the (compressed) values of all fields that are

    compressed using dictionary-based compression or any other fixed length

    compression technique. [5-7]

    2. The second part keeps the encoded length information of all fields compressed

    using a variable-length compression technique such as the numerical

    compression techniques described above.

    3. The third part contains the values of (uncompressed) fields of fixed length;

    e.g., integers, doubles, CHARs, but not VARCHARs or CHARs that were

    turned into VARCHARs as a result of compression.


    4. The fourth part contains the compressed values of fields that were compressed

    using a variable-length compression technique; for example, compressed

    integers, doubles, or dates. The fourth part would also contain the compressed

    value of the size of a VARCHAR field if this value was chosen to be

    compressed. (If the size information of a VARCHAR field is not compressed,

    then it is stored in the third part of a tuple as a fixed-length, uncompressed

    integer value.)

    5. The fifth part of a tuple, finally, contains the string values (compressed or not

    compressed) of VARCHAR fields.

While all this sounds quite complicated, the separation into five different parts is very natural. First of all, it makes sense to separate fixed-sized and variable-sized parts of tuples, and this separation is standard in most database systems today. The first three parts of a tuple are fixed-sized, which means that they have the same size for every tuple of a table. As a result, compression information and/or the value of a field can be retrieved directly from these parts without further address calculations [24]. In particular, uncompressed integer, double, date, etc. fields can be accessed directly regardless of whether other fields are compressed or not [5]. Furthermore, it makes sense to pack all the length codes of compressed fields together, because we will exploit this bundling in our fast decoding algorithm, as we will see soon.


    Figure 2.4 Layout of Compressed Tuple

Finally, we separate small variable-length (compressed) fields from potentially large variable-length string fields, because the length information of small fields can be encoded in less than a byte, whereas the length information of large fields is encoded in a two-step process. Obviously, not every tuple of the database consists of all five parts [5]. For example, tuples that have no compressed fields consist only of the third and, maybe, the fifth part. Furthermore, keep in mind that all tuples of the same table have the same layout and consist of the same number of parts, because all the tuples of a table are compressed using the same techniques.


    Chapter 3

    Methodology

3.1 Introduction

Applying the compression techniques discussed in Chapter 2, queries are executed on a platform in which query rewriting and data decompression are done when necessary. The rewriting overhead is very small, and the approach produces much better results when compared with uncompressed queries on the same platform. This chapter demonstrates the different compression methods that are applied to the tables and then compares the results graphically as well as in tabular form.

It must be noted that only queries with a WHERE clause need to be rewritten, because plain selection and projection operations don't require searching for a particular tuple of a particular attribute.

Although the capacity of data storage has increased greatly, a similar improvement in disk access speed has not happened. On the other hand, the speed of RAM and of CPUs has improved. This technological trend led to the use of data


    compression, trading some execution overhead (to compress and decompress data) for

    the reduction of space occupied by data.

Compression techniques work both statically and dynamically, i.e., data may be compressed when it is stored on disk or handled in compressed form while queries are executed. In databases, and particularly in warehouses, the reduction in the size of the data obtained by compression normally gains speed, as the extra cost in execution time (to compress and decompress the data) is compensated by the reduction in the size of the data that has to be read from/stored on the disks. [1]

3.2 Reasons for Data Compression

Data compression in data warehouses is particularly interesting for two main reasons:

1) The quantity of data in a warehouse is huge, and hence compression is more suitable and preferred than in normal databases.

2) Data warehouses are used for querying only (i.e., only read accesses, as data warehouse updates are done offline), which means that compression overhead is not relevant.

Furthermore, if data is compressed using techniques that allow searching over the compressed data, then the gains in performance can be quite significant, as the decompression operation is done only when strictly necessary.

    In spite of the potential advantages of compression in databases, most of the

    commercial relational database management systems (DBMS) either do not have

    compression or just provide data compression at the physical layer (i.e., database

    blocks), which is not flexible enough to become a real advantage. Flexibility in

    database compression is essential, as the data that could be advantageously

    compressed is frequently mixed in the same table with data whose compression is not


    particularly helpful. Nonetheless, recent work on attribute-level compression methods

    has shown that compression can improve the performance of database systems in

    read-intensive environments such as data warehouses. [18]

    Data compression and data coding techniques transform a given set of data into a new

    set of data containing the same information, but occupying less space than the original

    data (ideally, the minimum space possible). Data compression is heavily used in data

    transmission and data storage. In fact, reducing the amount of data to be transmitted

    (or stored) is equivalent to the increase of the bandwidth of the transmission channel

    (or the size of the storage device).

The first data compression proposals appeared in the late 1940s and early 1950s, notably the coding scheme proposed by D. Huffman, and these earlier proposals have evolved dramatically since then [7]. The main emphasis of previous work has been on the compression of numerical attributes, where coding techniques have been employed to reduce the length of integers, floating point numbers, and dates. However, string attributes (i.e., attributes of type CHAR(n) or VARCHAR(n) in SQL) often comprise a large portion of database records and thus have a significant impact on query performance.

The compression of data in databases offers two main advantages: 1. less space occupied by data, and 2. potentially better query response time.

If the benefit in terms of storage is easily understandable, the gain in performance is not so obvious. This gain is due to the fact that less data has to be read from storage, which is clearly the most time-consuming operation during query processing. The most interesting use of data compression and codification techniques in databases is surely in data warehouses, given the huge amount of data normally involved and their clear orientation towards query processing. As in data warehouses all the insertions


and updates are done during the update window, when the data warehouse is not available to users, off-line compression algorithms are more adequate, as the gain in query response time usually compensates for the extra cost of codifying the data before it is loaded into the data warehouse. In fact, off-line compression algorithms optimize the decompression time, which normally implies more cost in the compression process. The technique presented in this report follows these ideas, as it takes advantage of the specific features of data warehouses to optimize the use of traditional text compression techniques.

In addition to the observations regarding when to use each of the various compression schemes, our results also illustrate the following important points:

1. Physical database design should be aware of the compression subsystem. Performance is improved by compression schemes that take advantage of data locality. Queries on columns in projections with secondary and tertiary sort orders perform well, and it is generally beneficial to have low-cardinality columns serve as the leftmost sort orders in the projection (to increase the average run-lengths of columns to the right). The more order and locality in a column, the better the database performs. It is a good idea to operate directly on compressed data.

2. The optimizer needs to be aware of the performance implications of operating directly on compressed data in its cost models. Further, cost models that only take into account I/O costs will likely perform poorly in the context of column-oriented systems, since CPU cost is often the dominant factor.

3.3 Compression Scheme

Compression is done through the following steps:

1. The attributes are analyzed and a frequency histogram is built.

2. The table of codes is built based on the frequency histogram: the most frequent values are encoded with a one-byte code; the least frequent values are coded using a two-byte code. In principle, two bytes are enough, but a third byte could be used if needed. [5]

3. The codes table and necessary metadata are written to the database.

4. The attribute is updated, replacing the original values by the corresponding codes (the compressed values).

The example of an employee table below illustrates the compression technique:

Table 3.1 Employee table with type and cardinality

Attribute name    Attribute Type    Cardinality
SSN               TEXT              1000000
EMP_NAME          VARCHAR(20)       500
EMP_ADD           TEXT              200
EMP_SEX           CHAR              2
EMP_SAL           INTEGER           5000
EMP_DOB           DATE              50
EMP_CITY          TEXT              95000
EMP_REMARKS       TEXT              600

Table 3.1 presents an example of typical attributes of a client dimension in a data warehouse, which may be a large dimension in many businesses (e.g., e-business). We can find several attributes that are candidates for coding, such as: EMP_NAME, EMP_ADD, EMP_SEX, EMP_SAL, EMP_DOB, EMP_CITY, and EMP_REMARKS.


Table 3.2 Code Table Example

City name      City Postal Code    Code
DELHI          011                 00000010
MUMBAI         022                 00000100
KOLKATA        033                 00000110
CHENNAI        044                 00001000
BANGALORE      080                 00001000 00001000
JAIPUR         0141                00000110 00000110
COIMBATORE     0422                00001000 00001000 00001000
COCHIN         0484                00010000 00010000 00010000

Assuming that we want to code the EMP_CITY attribute, a possible resulting codes table is shown in Table 3.2. The codes are represented in binary to better convey the idea. As the attribute has more than 256 distinct values, we have one-byte codes to represent the 256 most frequent values (e.g., Delhi and Mumbai) and two-byte codes to represent the less frequent values (e.g., Jaipur and Bangalore). The values shown in Table 3.2 (represented in binary) would be the ones stored in the database, instead of the larger original values. For example, instead of storing "Jaipur", which corresponds to 6 ASCII characters, we just store the two-byte binary code 00000110 00000110.
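Steps 1-4 of the scheme can be sketched as follows (the data and the one-byte threshold are scaled down for illustration; a real implementation must also keep one-byte and two-byte codes distinguishable when scanning):

from collections import Counter

# Sketch: building a frequency-based code table for a category attribute.
def build_code_table(values, one_byte_slots=256):
    freq = Counter(values)                    # step 1: frequency histogram
    ranked = [v for v, _ in freq.most_common()]
    table = {}
    for rank, v in enumerate(ranked):         # step 2: assign codes
        if rank < one_byte_slots:
            table[v] = bytes([rank])          # most frequent: one-byte code
        else:
            hi, lo = divmod(rank - one_byte_slots, 256)
            table[v] = bytes([hi, lo])        # less frequent: two-byte code
    return table                              # step 3 would persist this table

cities = ["DELHI"] * 5 + ["MUMBAI"] * 3 + ["JAIPUR"]
codes = build_code_table(cities, one_byte_slots=2)
compressed = [codes[c] for c in cities]       # step 4: replace the values
print(codes)  # {'DELHI': b'\x00', 'MUMBAI': b'\x01', 'JAIPUR': b'\x00\x00'}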

    3.4 Query Execution


Query rewriting is necessary for queries where the coded attributes are used in the WHERE clause for filtering. In these queries the values used to filter the result must be replaced by the corresponding coded values. Below is a simple example of the type of query rewriting needed: the value JAIPUR is replaced by the corresponding code, fetched from the codes table shown in Table 3.2.

Table 3.3 Query execution

Original Query:
Select EMP_NAME From EMPLOYEE Where EMP_CITY = JAIPUR

Modified Query:
Select EMP_NAME From EMPLOYEE Where EMP_CITY = 00000110 00000110
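The rewriting itself amounts to a lookup in the codes table followed by a literal substitution; a naive sketch (the helper function and its string-based substitution are our own simplification):

# Sketch: rewriting a WHERE-clause literal on a coded attribute.
code_table = {"JAIPUR": "00000110 00000110"}  # from Table 3.2

def rewrite(query, attribute, literal):
    # Replace the plain value with its code fetched from the codes table.
    return query.replace(f"{attribute} = {literal}",
                         f"{attribute} = {code_table[literal]}")

q = "Select EMP_NAME From EMPLOYEE Where EMP_CITY = JAIPUR"
print(rewrite(q, "EMP_CITY", "JAIPUR"))
# Select EMP_NAME From EMPLOYEE Where EMP_CITY = 00000110 00000110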

    3.5 Decompression

The decompression of the attributes is only performed when the coded attributes appear in the query select list. In these cases the query is executed, and afterwards the result set is processed in order to decompress the attributes that contain compressed values. As typical data warehousing queries return small result sets, the decompression time represents a very small fraction of the total query execution time.

    3.6 Prerequisites

The goal of the experiments performed is to measure the gains in storage and performance obtained using the proposed technique.

The experiments were divided into two phases. In the first phase only categories compression was used. In the second phase we used categories compression in conjunction with descriptions compression.


    Chapter 4

    Results & Discussions

    4.1 Introduction

Over the last decades, improvements in CPU speed have outpaced improvements in disk access rates by orders of magnitude, inspiring new data compression techniques in database systems that trade reduced disk I/O against additional CPU overhead for the compression and decompression of data.

Following the development of the compression technique in Chapter 3, I propose a compression algorithm which integrates domain and attribute compression, based on dictionary-based anonymization, and implements global recoding generalization.

In this chapter, I demonstrate how to compress data so as to achieve better performance than conventional database systems. We address the following two issues.

First, we implement a newly proposed N-Anonymization technique embedded with global recoding generalization. After evaluation, the report presents the algorithm for data compression and finally demonstrates that our approach gives comparable results to the existing algorithms.


Second, we use the binary-encoded pairing of attributes for data compression discussed in the previous chapter for string compression in the database, and modify it so that it intelligently selects the most effective compression method for string-valued attributes.

Moreover, we also use the concepts of data hiding and equivalent sets before compressing the data, so that the private information of users is not revealed publicly.

    4.2 Anonymization

Warehouses contain a lot of data, and hence any leak or illegal publication of information risks individuals' privacy. N-Anonymity is a major technique to de-identify a data set. The idea behind the technique is to choose a value n and ensure that every tuple is identical to at least n-1 other tuples (or at least to as many other tuples as possible) on the potentially identifying attributes.

The intensity of protection increases as n increases. One way to produce n identical tuples within the identifiable attributes is to generalize values within the attributes, for example, removing city and street information in an address attribute [6].
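This property can be checked mechanically. The following is a minimal sketch, assuming the table is a list of dictionaries and the identifying attributes are known; it simply verifies that every equivalent set has at least n members.

    from collections import Counter

    def is_n_anonymous(table, identifier_attributes, n):
        """True if every combination of identifier values occurs at least n times."""
        groups = Counter(tuple(row[a] for a in identifier_attributes) for row in table)
        return all(count >= n for count in groups.values())

    table = [
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    ]
    print(is_n_anonymous(table, ["Gender", "Age", "Postcode"], 2))  # True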

There are many ways in which data de-identification can be done, and one of the most appropriate approaches is generalization. Generalization techniques include global recoding generalization, multidimensional recoding generalization, and local recoding generalization [15].

Global recoding generalization maps the current domain of an attribute to a more general domain. For example, ages are mapped from years to 10-year intervals.


Multidimensional recoding generalization maps a set of values to another set of values, some or all of which are more general than the corresponding pre-mapping values. For example, {male, 32, divorced} is mapped to {male, [30, 40), unknown}. Local recoding generalization modifies some values in one or more attributes to values in more general domains [6].
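As a small illustration of global recoding, the following is a sketch under the simplifying assumption of fixed hierarchies (it is not the generalization engine itself):

    def generalize_age(age):
        low = (age // 10) * 10
        return "%d-%d" % (low, low + 9)      # e.g. 32 -> "30-39"

    def generalize_postcode(postcode, keep):
        # Digits beyond `keep` are masked, approximating the hierarchy
        # {302033, 3020*, 30**, 3***, *} used later in the text.
        return postcode[:keep] + "*" * (len(postcode) - keep)

    # Global recoding replaces EVERY occurrence in the column, so the whole
    # domain moves to the more general one at once.
    print([generalize_age(a) for a in [10, 20, 32, 50, 70]])
    print(generalize_postcode("302033", 4))   # "3020**"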

    4.2.1 Problem definition and Contribution

From the very beginning we have made clear that our objective is to make every tuple of a published table identical to at least n-1 other tuples. Identity-related attributes are those which potentially identify individuals in a table. For example, the record of an old male in a rural area with the postcode 302033 is unique in Table 4.1, and hence his problem of asthma may be revealed if the table is published. To preserve his privacy, we may generalize the Gender and Postcode attribute values such that each value combination in the attribute set {Gender, Age, Postcode} has at least two occurrences.

Table 4.1 Published Table

No.  Gender  Age    Postcode  Problem
01   Male    Young  302020    Heart
02   Male    Old    302033    Asthma
03   Female  Young  302015    Obesity
04   Female  Young  302015    Obesity

    A view after this generalization is given in Table 4.2. Since various countries use

    different postcode schemes, we adopt a simplified postcode scheme, where its

    hierarchy {302033, 3020*, 30**, 3***, *} corresponds to {rural, city, region, state,

    unknown}, respectively.


Table 4.2 View of published table by global recoding

No.  Gender  Age    Postcode  Problem
01   *       Young  3020*     Heart
02   *       Old    3020*     Asthma
03   *       Young  3020*     Obesity
04   *       Young  3020*     Obesity

Identifier attribute set: A set of attributes that potentially identifies the individuals in a table is an identifier attribute set. For example, the attribute set {Gender, Age, Postcode} in Table 4.1 is an identifier attribute set.

Equivalent set: An equivalent set of a table with respect to an attribute set is the set of all tuples in the table containing identical values for that attribute set. For example, tuples 03 and 04 of Table 4.1 form an equivalent set with respect to the attribute set {Gender, Age, Postcode, Problem}. Table 4.2 is the 2-anonymity view of Table 4.1, since each tuple is meant to be indistinguishable from at least one other tuple on the identifier attributes.

    4.2.2 Quality measure of Anonymization

After this study we can conclude that the larger the equivalent sets, the easier the compression; the cost of anonymization is therefore a function of equivalent set size. On the basis of this observation, we measure the quality of an anonymization by the average equivalent set size:

    C_AVG = (total number of records) / (number of equivalent sets)        (4.1)
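Under the assumption that Eq. (4.1) denotes this average equivalent set size, it can be computed as follows (a sketch; the table contents are illustrative):

    from collections import Counter

    def average_equivalent_set_size(table, identifier_attributes):
        groups = Counter(tuple(row[a] for a in identifier_attributes) for row in table)
        return float(len(table)) / len(groups)

    table = [
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Old",   "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
        {"Gender": "*", "Age": "Young", "Postcode": "3020*"},
    ]
    # Two equivalent sets (sizes 3 and 1) over 4 records -> 4 / 2 = 2.0
    print(average_equivalent_set_size(table, ["Gender", "Age", "Postcode"]))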


    4.2.3 Conclusion

Another name for global recoding is domain generalization, because generalization happens at the domain level: a specific domain is replaced by a more general domain. There are no mixed values from different domains in a table generalized by global recoding. When an attribute value is generalized, every occurrence of the value is replaced by the new generalized value. A global recoding method may therefore over-generalize a table. An example of global recoding is given in Table 4.2: the two attributes Gender and Postcode are generalized, and all gender information has been lost. It is not necessary to generalize the Gender and Postcode attributes as a whole, so we say that the global recoding method over-generalizes this table.

    4.3 Domain compression through binary conversion

We integrate two key methods, namely binary encoding of distinct values and pairwise encoding of attributes, to build our compression technique.

    4.3.1 Encoding of Distinct values

This compression technique is based on the assumption that the published table contains attributes with few distinct domain values, and that these values repeat over the huge number of tuples present in the database. Therefore, binary encoding of the distinct values of each attribute, followed by representation of the tuple values in each column of the relation with the corresponding encoded values, transforms the entire relation into bits and thus compresses it [16].

We first find the number of distinct values in each column and encode the data into bits accordingly. For example, consider the instance given below, which represents the two major attributes of a relation Patients.


Table 4.3 An instance of relation Patients

Age  Problem
10   Cough & Cold
20   Cough & Cold
30   Obesity
50   Diabetes
70   Asthma

Now if we adopt the concept of N-Anonymization with global recoding (see Section 4.2), we can map the current domain of the attributes to a more general domain. For example, Age can be mapped into 10-year intervals, as shown in Table 4.4.

To examine the compression benefit achieved by this method, assume that Age is of integer type. If there are 50 patients, the total storage required by the Age attribute will be 50 * sizeof(int) = 50 * 4 = 200 bytes [9].

With our compression technique, suppose the 50-patient table contains 9 distinct values for Age; we then need ceil(log2(9)) = 4 bits to represent each value in the Age field. It is easy to calculate that we would need 50 * 4 bits = 200 bits = 25 bytes, which is considerably less [9].

We call this Stage 1 of our compression, which transforms one column into bits. If we apply this compression to all columns of the table, the result will be significant.


Table 4.4 Representing Stage 1 of the compression technique

Age     Problem
10-20   Cough & Cold
30-40   Obesity
50-60   Diabetes
70-100  Asthma

Table 4.5 Representing Stage 1 with binary compression

Age  Problem
00   Cough & Cold
01   Obesity
10   Diabetes
11   Asthma
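A minimal sketch of Stage 1, assuming each column is dictionary-encoded with ceil(log2(k)) bits for k distinct values:

    from math import ceil, log2

    def encode_column(values):
        """Map each distinct value to a fixed-width binary code."""
        distinct = sorted(set(values), key=str)
        width = max(1, int(ceil(log2(max(len(distinct), 2)))))
        codes = {v: format(i, "0%db" % width) for i, v in enumerate(distinct)}
        return [codes[v] for v in values], codes

    encoded, codebook = encode_column(["10-20", "30-40", "50-60", "70-100"])
    print(codebook)   # {'10-20': '00', '30-40': '01', '50-60': '10', '70-100': '11'}
    print(encoded)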

    4.3.2 Paired Encoding

It can easily be seen from the above example that, besides optimizing the memory requirement of the relation, the encoding technique is also helpful in reducing redundancy (repeated values) in the relation. Moreover, it is likely that there are few distinct values of the pair (column1, column2) taken together, just as there are few distinct values of column1 or column2 alone. We can then represent the two columns together as a single column, with pair values transformed according to the encoding. This constitutes Stage 2 of our compression, in which we use the bit-encoded database from Stage 1 as input and compress it further by coupling columns in pairs of two, applying the distinct-pairs technique outlined above.


To examine the further compression advantage achieved, suppose that we couple the Age and Problem columns. In Table 4.3 there are 5 distinct pairs: (10, Cough & Cold), (20, Cough & Cold), (30, Obesity), (50, Diabetes) and (70, Asthma). After the generalization of Table 4.4 these reduce to 4 distinct pairs, so our upper bound is ceil(log2(4)) = 2 bits. Table 4.6 shows the result of Stage 2 compression.

Table 4.6 Representing Stage 2 compression

Age  Problem
00   00
01   01
10   10
11   11

After compressing the attributes, pairing or coupling of attributes is done. All the columns are coupled in pairs of two in a similar manner. If the database contains an even number of columns this is straightforward; if the number of columns is odd, we can intelligently choose one column to be left unpaired.

Table 4.7 Representing Stage 2 compression coupling

Age-Problem
00
01
10
11


After this compression technique is applied, the space required can easily be calculated:

Before compression: 5*(4) + 4*(4) = 36 bytes
After compression and coupling: 4*2 = 8 bits
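The pairing step itself is a direct extension of the Stage 1 encoding: treat two columns as one column of value pairs and dictionary-encode the distinct pairs. A self-contained sketch:

    from math import ceil, log2

    def encode_pairs(column_a, column_b):
        """Dictionary-encode the distinct (a, b) pairs of two columns."""
        pairs = list(zip(column_a, column_b))
        distinct = sorted(set(pairs), key=str)
        width = max(1, int(ceil(log2(max(len(distinct), 2)))))
        codes = {p: format(i, "0%db" % width) for i, p in enumerate(distinct)}
        return [codes[p] for p in pairs], codes

    ages     = ["10-20", "10-20", "30-40", "50-60", "70-100"]
    problems = ["Cough & Cold", "Cough & Cold", "Obesity", "Diabetes", "Asthma"]
    encoded, codebook = encode_pairs(ages, problems)
    print(len(codebook))   # 4 distinct pairs -> 2-bit codes
    print(encoded)         # ['00', '00', '01', '10', '11']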

    4.4 Add-ons to compression

After performing successful compression over relations and domains, some conclusions were derived by varying the coupling of attributes with each other. Some of these possibilities are discussed in the following points.

    4.4.1 Functional Dependencies

A functional dependency exists between attributes and states that, given a relation R, a set of attributes Y in R is said to be functionally dependent on another set of attributes X if and only if each value of X is associated with at most one value of Y. This implies that the attributes in set X determine the values of the attributes in set Y [15]. By rearranging the attributes we found that coupling columns related in a way similar to functional dependencies gives better compression results.

Table 4.8 shows an example of functional-dependency-based compression.

Table 4.8 Representing functional dependency based coupling

Name     Gender  Age  Problem
Harshit  M       10   Cough & Cold
Naman    M       20   Cough & Cold
Aman     M       30   Obesity
Rajiv    M       50   Diabetes
Rajni    F       70   Asthma

Two different test cases were used to check the level of compression. Test case 1 couples the attributes {(Name, Age), (Gender, Problem)}; the individual and coupled distinct-value counts are shown in Tables 4.9 and 4.10. In test case 2, coupling is done with the attributes {(Name, Gender), (Age, Problem)}, as shown in Table 4.11.

Table 4.9 Number of distinct values in each column

Column name  Distinct values
Name         19
Gender       2
Age          19
Problem      19

Table 4.10 Test case 1

Column name      Distinct values
Name, Age        285
Gender, Problem  35

Table 4.11 Test case 2

Column name    Distinct values
Name, Gender   22
Age, Problem   312
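The test cases above can be reproduced by simply counting distinct pairs; a sketch, with hypothetical rows standing in for the experimental table:

    from math import ceil, log2

    def distinct_pairs(rows, col_a, col_b):
        return len({(row[col_a], row[col_b]) for row in rows})

    def bits_needed(k):
        return max(1, int(ceil(log2(max(k, 2)))))

    rows = [
        {"Name": "Harshit", "Gender": "M", "Age": 10, "Problem": "Cough & Cold"},
        {"Name": "Rajni",   "Gender": "F", "Age": 70, "Problem": "Asthma"},
    ]
    for pair in [("Name", "Age"), ("Gender", "Problem")]:
        k = distinct_pairs(rows, *pair)
        print(pair, "->", k, "distinct pairs,", bits_needed(k), "bits per value")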


    4.4.2 Primary Key

A primary key is an attribute that uniquely identifies a row in a table. The observation regarding the primary key is that coupling the primary key column with a column having a large number of distinct values is advantageous, because each primary key value is itself distinct, and hence the resulting number of distinct tuples of the combination of the two will always be equal to the number of primary key values in the table, whichever partner column is chosen; pairing it with a column that already needs a wide code therefore wastes the least width.

    4.4.3 Few distinct values

Sometimes a database contains columns with very few distinct values. For example, a Gender attribute will always contain either male or female in its domain. Therefore it is recommended that such attributes be coupled with attributes that contain a large number of distinct values. For example, consider 4 attributes {name, gender, age, problem} with distinct counts name = 200, gender = 2, age = 200 and problem = 20. The coupling {gender, name} and {age, problem} gives at most 200*2 + 200*20 = 4400 distinct tuples, whereas the coupling {gender, problem} and {name, age} gives at most 2*20 + 200*200 = 40040 distinct tuples.
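This recommendation can be expressed as a small search over the possible pairings, using the product of per-column distinct counts as an upper bound on the distinct tuples of each pair; a sketch with the counts from the example above:

    distinct = {"name": 200, "gender": 2, "age": 200, "problem": 20}

    # The three ways to split four columns into two pairs.
    candidates = [
        [("name", "gender"), ("age", "problem")],
        [("name", "age"), ("gender", "problem")],
        [("name", "problem"), ("gender", "age")],
    ]

    def estimated_tuples(pairing):
        """Upper bound on distinct tuples: product of distinct counts per pair."""
        return sum(distinct[a] * distinct[b] for a, b in pairing)

    for pairing in candidates:
        print(pairing, "->", estimated_tuples(pairing))
    # {gender, name} with {age, problem} gives 4400, while {name, age} with
    # {gender, problem} gives 40040: the few-distinct column should be
    # paired with a many-distinct one.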

    4.5 Limitations

Two of the most often cited disadvantages of our approach are write operations and tuple construction. Write operations are generally considered problematic for two reasons:

- Inserted tuples have to be broken up into their component attributes, and each attribute must be written separately; and
- densely packed, compressed columns make it expensive to move or update tuples in place.


Tuple construction adds a similar overhead at read time. In addition, the degree of attribute coupling must be determined, i.e. we need to decide the point at which the extra compression achieved is not worth the performance overhead involved.


    Chapter 5

    Conclusion & Future Work

    5.1 Conclusion

In this thesis we studied how compression techniques can be used to improve database performance. After comparing existing methods, we also proposed an algorithm for compressing columnar databases. We studied the following research issues:

Compressing different domains of databases: We studied how different domains of a database, such as varchar, int and NULL values, can be dealt with while compressing a database. Compared to existing compression methods, our approach considers the heterogeneous nature of string attributes, and uses a comprehensive strategy to choose the most effective encoding level for each string attribute. Our experimental results show that using HDE methods achieves a better compression ratio than using any single existing method, and that HDE also achieves the best balance between I/O saving and decompression overhead.


Compression-aware query optimization: We observed that deciding when to decompress string attributes is a crucial issue for query performance. A traditional optimizer, even when enhanced with a cost model that takes both the I/O benefits of compression and the CPU overhead of decompression into account, does not necessarily produce good plans. My experiments show that the combination of effective compression methods and compression-aware query optimization is crucial for query performance; the use of our compression methods and optimization algorithms achieves up to an order of magnitude improvement in query performance over existing techniques. This significant gain suggests that a compressed database system should have its query optimizer modified for better performance.

Compressing query results: We proposed how to use domain knowledge about the query to improve the effect of compression on query results. Our approach uses a combination of compression methods, and we represented such combinations using an algebraic framework.

    5.2 Future Work

There are several interesting future directions for this research work.

    Compression-aware query optimization: First, it would be interesting to study how

    caching of intermediate (decompressed) results can reduce the overhead of transient

    decompression. Second, we plan to study how our compression techniques can handle

    updates. Third, we will study the impact of hash join on our query optimization work.


Result compression: We plan to explore the joint optimization problem of query plans and compression plans. Currently, the compression optimization is based on the query plan returned by the query optimizer. However, the overall cost of a combination of a query plan and a compression plan is different from the cost of the query plan alone. For instance, a more expensive query plan may sort the result in an order such that the sorted-normalization method can be applied and the overall cost will be lower.


    APPENDIX I

    Infobright

    I.1 Introduction

The demand for business analytics and intelligence has grown dramatically across all industries. This demand is outpacing the availability of the technical expertise and budgets needed to implement it successfully. Infobright helps solve these problems by providing a solution that implements and manages a scalable analytic database.

    Infobright offers two versions of their software: Infobright Community Edition (ICE)

    and Infobright Enterprise Edition (IEE). ICE is an open source product that can be

    freely downloaded. IEE is the commercial version of the software. It offers enhanced

    features that are often necessary for production and operational support.

The Infobright database is designed as an analytic database. It can handle business-driven, ad-hoc queries in a fraction of the time the same queries would take on a transactional database. Infobright achieves its high analytic performance by organizing the data in columns instead of rows.


Infobright combines a columnar database with its Knowledge Grid architecture to

    deliver a self-managing, self-tuning database optimized for analytics. Infobright

    eliminates the need to create indexes, partition data, or do any manual tuning to

    achieve fast response for queries and reports.

    The Infobright database resolves complex analytic queries without the need for

    traditional indexes, data partitioning, projections, manual tuning or specific schemas.

    Instead, the Knowledge Grid architecture automatically creates and stores the

    information needed to quickly resolve these queries. Infobright organizes the data into

two layers: the compressed data itself, stored in segments called Data Packs, and

    information about the data which comprises the components of the Knowledge Grid.

    For each query, the Infobright Granular Engine uses the information in the

    Knowledge Grid to determine which Data Packs are relevant to the query before

    decompressing any data.

    Infobright technology is based on the following concepts:

    Column orientation

    Data Packs

    Knowledge Grid

    The Granular Computing Engine

    I.2 Infobright Architecture

    Column Orientation

Infobright is, at its core, a highly compressed column-oriented database. This means that instead of the data being stored row-by-row, it is stored column-by-column. There are many advantages to column orientation, including the ability to do


more efficient data compression, because each column stores a single data type (as opposed to rows, which typically contain several data types), allowing compression to be optimized for each particular data type. Infobright, which organizes each column into Data Packs (as described below), achieves greater compression than other column-oriented databases, as it applies a compression algorithm based on the content of each Data Pack, not just the column.

    Most queries only involve a subset of the columns of the tables and so a column-

    oriented database focuses on retrieving only the data that is required.

    Data Packs and the Knowledge Grid

Data is stored in Data Packs, each holding 65K (65,536) values of a column. Data Pack Nodes contain a set of statistics about the

    data that is stored and compressed in each of the Data Packs. Knowledge Nodes

    provide a further set of metadata related to Data Packs or column relationships.

    Together, Data Pack Nodes and Knowledge Nodes form the Knowledge Grid. Unlike

    traditional database indexes, they are not manually created, and require no ongoing

    "care and feeding". Instead, they are created and managed automatically by the

    system. In essence, they create a high level view of the entire content of the database.

    This is what makes Infobright so well-suited for ad hoc analytics, unlike other

    databases that require pre-work such as indexes, projections, partitioning or aggregate

    tables in order to deliver fast query performance.

    Granular Computing Engine

The Granular Engine uses the Knowledge Grid information to optimize query processing. The goal is to eliminate or significantly reduce the amount of data that needs to be decompressed and accessed to answer a query. IEE can often resolve a query from the Knowledge Grid alone, without decompressing the underlying Data Packs.


I.3 Key Benefits

Infobright is compatible with major Business Intelligence tools such as Jaspersoft, Actuate/BIRT, Cognos, Business Objects, Microstrategy, Pentaho and others.

    High performance and scalability

    Infobright loads data extremely fast - up to 280GB/hour.

    Infobright's columnar approach results in fast response times for

    complex analytic queries.

As your database grows, query and load performance remain constant.

    Infobright scales up to 50TB of data.

    Low Cost

    The cost of Infobright is very low compared to closed source,

    proprietary solutions.

    Using Infobright eliminates the need for complex hardware

    infrastructure.

    Infobright runs on low cost, industry standard servers. A single server

    can scale to support 50TB of data.

    Infobright's industry-leading data compression (10:1 up to 40:1)

    significantly reduces the amount of storage required.

    I.4 MySQL Integration

    MySQL is the world's most popular open source database software, with over 11

    million active installations. Infobright brings scalable analytics to MySQL users

    through its integration as a MySQL storage engine. If your MySQL database is

    growing and query performance is suffering, Infobright is the ideal choice.


    Many users of MySQL turn to Infobright as their data volumes and analytic needs

    grow since Infobright offers exceptional query performance for analytic applications

against large amounts of data. Migrating from MySQL's MyISAM storage engine, or

    other MySQL storage engines, to the Infobright column-oriented analytic database

    is quite straightforward.

    Infobright contains a bundled version of MySQL and installing Infobright installs a

    new instance of MySQL along with Infobright's Optimizer, Knowledge Grid, the

    Infobright Loader and the underlying columnar storage architecture. This installation

also includes MySQL's MyISAM storage engine. Unlike other storage engines that

    work with MySQL, it is not necessary to have an existing MySQL installation nor can

    Infobright be added to an existing MySQL Server installation. When installing

    Infobright, the assumption is that any previously existing MySQL or MyISAM

    database will exist in a separate installation of MySQL, installed in a different

    directory with a unique data path, configuration files, socket and port values.

    In the data warehouse marketplace, the database must integrate with a variety of tools.

    By integrating with MySQL, Infobright leverages the extensive tool connectivity

    provided by MySQL connectors (C, JDBC, ODBC, .NET, Perl, etc.).

    It also enables MySQL users to leverage the mature, tested BI tools with which

    they're already familiar. You'll also benefit from MySQL's legendary ease of use and

    low maintenance requirements.

    Infobright-MySQL integration includes the following features:

    Industry standard interfaces that include ODBC, JDBC, C API, PHP,

    Visual Basic, Ruby, Perl and Python;


    Comprehensive management services and utilities;

    Robust connectivity with BI tools such as Actuate/BIRT, Business

    Objects, Cognos, Microstrategy, Pentaho, Jaspersoft and SAS.

    I.5 Practical Implementation

    Infobright neither needs nor allows the manual creation of performance structures

    with duplicated data such as indexes or table partitioning based on expected usage

patterns of the data. When preparing the MySQL schema definition for execution in Infobright, the first thing to do is simplify the schema. This means removing all references to indexes and other constraints expressed as indexes, including PRIMARY and FOREIGN KEYs, and UNIQUE and CHECK constraints.

In addition, due to Infobright's extremely high query performance levels on large

    volumes of data, one should consider removing all aggregate, reporting and summary

    tables that may be in the data model as they are unnecessary.

I have done some practical work with an existing airline database whose tables have many columns. Basic SQL queries were executed to check the performance of the database; these are ad-hoc queries, i.e. any column can be accessed by them.

The airline database was then tested with two existing database management systems, Infobright and MySQL. I created a table with a large number of columns (around 50) of different data types, and then filled the columns using LOAD DATA INFILE instead of individual INSERT statements.
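The load step can be scripted; the following is a hedged sketch using the mysql-connector-python driver against the table defined below, where the host, port, credentials and file path are illustrative placeholders (Infobright's bundled MySQL instance listens on its own port rather than the standard 3306).

    import mysql.connector

    # Connection parameters are placeholders, not the actual test setup.
    conn = mysql.connector.connect(host="127.0.0.1", port=5029,
                                   user="root", database="airline")
    cursor = conn.cursor()
    cursor.execute(
        "LOAD DATA INFILE '/tmp/airline.csv' "
        "INTO TABLE airline_info "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n'"
    )
    conn.commit()
    cursor.close()
    conn.close()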


    Creating table airline_info

    CREATE TABLE `airline_info` (

    `Year` year(4) DEFAULT NULL,

    `Quarter` tinyint(4) DEFAULT NULL,

    `Month` tinyint(4) DEFAULT NULL,

    `DayofMonth` tinyint(4) DEFAULT NULL,

    `DayOfWeek` tinyint(4) DEFAULT NULL,

    `FlightDate` date DEFAULT NULL,

    `UniqueCarrier` char(7) DEFAULT NULL,

    `AirlineID` int(11) DEFAULT NULL,

    `Carrier` char(2) DEFAULT NULL,

    `TailNum` varchar(50) DEFAULT NULL,

    `FlightNum` varchar(10) DEFAULT NULL,

    `Origin` char(5) DEFAULT NULL,

    `OriginCityName` varchar(100) DEFAULT NULL,

    `OriginState` char(2) DEFAULT NULL,

    `OriginStateFips` varchar(10) DEFAULT NULL,

    `OriginStateName` varchar(100) DEFAULT NULL,

    `OriginWac` int(11) DEFAULT NULL,

    `Dest` char(5) DEFAULT NULL,

    `DestCityName` varchar(100) DEFAULT NULL,

    `DestState` char(2) DEFAULT NULL,

    `DestStateFips` varchar(10) DEFAULT NULL,

    `DestStateName` varchar(100) DEFAULT NULL,

    `DestWac` int(11) DEFAULT NULL,

    `CRSDepTime` int(11) DEFAULT NULL,

    `DepTime` int(11) DEFAULT NULL,

    `DepDelay` int(11) DEFAULT NULL,

    `DepDelayMinutes` int(11) DEFAULT NULL,

    `DepDel15` int(11) DEFAULT NULL,

    `DepartureDelayGroups` int(11) DEFAULT NULL,
