
Exploring Correlated Subspaces for Efficient Query Processing in Sparse Databases


    Abstract

Sparse data is becoming increasingly common in many real-life applications. However, relatively little attention has been paid to modeling sparse data effectively, and existing approaches such as the conventional "horizontal" and "vertical" representations fail to provide satisfactory performance for both storage and query processing, as they are too rigid and generally do not consider dimension correlations. In this project, we propose a new approach, named HoVer, to store and query sparse datasets in an unmodified RDBMS, where HoVer stands for Horizontal representation over Vertically partitioned subspaces. Guided by the dimension correlations of a sparse dataset, a novel mechanism vertically partitions a high-dimensional sparse dataset into multiple lower-dimensional subspaces such that dimensions are highly correlated within each subspace and largely unrelated across subspaces. Original data objects can therefore be represented in the horizontal format within their respective subspaces. With the HoVer representation, users write SQL queries over the original horizontal view, and these queries can easily be rewritten into queries over the subspace tables. Experiments over synthetic and real-life datasets show that our approach is effective in finding correlated subspaces and yields superior performance for the storage and querying of sparse data.


    Introduction

With continuous advances in network and storage technology, there has been dramatic growth in the amount of very high-dimensional sparse data from a variety of new application domains, such as bioinformatics, time series, and, perhaps most importantly, e-commerce, which poses significant challenges to RDBMSs. The main characteristics of these sparse data sets may be summarized as follows. High dimensionality: the dimensionality of feature vectors may be very high, i.e., the number of possible attributes over all objects is huge. For example, in some e-commerce applications, each participant may declare their own idiosyncratic attributes for the products, which results in data sets that have thousands of attributes. Sparsity: each object may have only a small subset of attributes, called its active dimensions, i.e., significant values appear only in a few active dimensions; in addition, different objects may have different active dimensions. For example, an e-commerce data set may have thousands of attributes, most of which are null and only a few of which apply to a particular product. Correlation: since each object has only a few active dimensions, similar objects are likely to share the same or similar active dimensions. For example, in recommendation systems, it is important to find homogeneous groups of users with similar ratings over subsets of the attributes. Therefore, it is possible to find certain subspaces shared by similar objects.

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object misses a particular attribute, the corresponding column in the row for that object is null. Storing a sparse data set using the horizontal format is straightforward and easy to implement. However, the format is not suitable for sparse databases, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications. If the number of columns in a horizontal table exceeds 1,000, a record may not fit in a single disk page, and the resulting page overflow will significantly degrade performance. In the past decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance. Nevertheless, the approach proposed in this project still uniformly outperforms the horizontal representation.

An alternative is known as the vertical format, called the vertical representation in this project. In this format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multi-way self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format.
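To make the contrast concrete, the following sketch shows the two conventional representations and the kind of self-join the vertical format requires; the table names H and V, the dimension names D1 through D8, and the integer value type are illustrative only, not taken from the project.

CREATE TABLE H (OID INT PRIMARY KEY, D1 INT, D2 INT, D3 INT, D4 INT,
                D5 INT, D6 INT, D7 INT, D8 INT);   -- horizontal: mostly NULL in practice

CREATE TABLE V (OID INT, attr VARCHAR(30), val INT); -- vertical: one row per active dimension

-- Returning objects with D1 = 1 and D2 = 2 in horizontal form from the vertical
-- table already needs one self-join per requested attribute:
SELECT v1.OID, v1.val AS D1, v2.val AS D2
FROM   V v1
JOIN   V v2 ON v1.OID = v2.OID
WHERE  v1.attr = 'D1' AND v1.val = 1
  AND  v2.attr = 'D2' AND v2.val = 2;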

From the above discussion, we know that both the horizontal and the vertical representations have advantages and disadvantages: the horizontal representation has many nulls but simple queries, while the vertical representation has no nulls but more complex queries. An optimal representation should therefore retain the advantages and alleviate the drawbacks of both. In this project, we propose a new approach which combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces. The HoVer representation can efficiently find a middle ground between the horizontal and the vertical representations when there are dimension correlations to be exploited. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in the horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task; however, sparse data sets generally have some helpful properties, such as sparsity and correlation.

Therefore, we can design an effective mechanism to split the data space into multiple subspaces. We define the correlated degree between dimensions and cluster highly correlated dimensions into a subspace. After partitioning, the original sparse data set can be transformed into the HoVer format. The horizontal, HoVer, and vertical approaches can all be framed over an unmodified RDBMS: users write SQL queries over the conventional horizontal view; for the HoVer and vertical approaches, these SQL queries are rewritten into queries over the subspace tables and the vertical table stored by the unmodified RDBMS, respectively, and the query results returned by the RDBMS are all in the horizontal format. A comprehensive experimental study demonstrates the superiority of our approach, as it fully utilizes the properties of sparse data.


    Existing system:

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object misses a particular attribute, the corresponding column in the row for that object is null. Fig. 1 shows an example of storing a sparse data set using the horizontal format, which is straightforward and easy to implement. The format is not suitable for sparse databases, however, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications. If the number of columns in a horizontal table exceeds 1,000, a record may not fit in a single disk page, and the resulting page overflow will significantly degrade performance. In the past decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance.


An alternative is known as the vertical format, called the vertical representation in this project. Fig. 2 shows an example of storing a sparse data set using the vertical format; each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multi-way self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format. From the above discussion, we know that both the horizontal and the vertical representations have advantages and disadvantages: the horizontal representation has many nulls but simple queries, and the vertical representation has no nulls but more complex queries. Therefore, an optimal representation should retain the advantages and alleviate the drawbacks of both. In this project, we propose a new approach which combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces.


    Proposed system:

    THE HoVer REPRESENTATION

    As we introduced previously, the pure horizontal or vertical

    representation may yield unsatisfactory performance in sparse databases.

    Therefore, we propose a new representation called HoVer, which can

    effectively exploit the characteristics of sparse data sets, such as sparsity and

    dimension correlation. We aim at achieving good space and time

    performance for storing and querying high-dimensional sparse data sets.

    Although the dimensionality of sparse data sets could be very high, up

    to thousands, a single data object typically has only a few active dimensions,

    and similar objects have a better chance to share similar active dimensions.

    A closer inspection of many e-commerce sparse data sets shows that typical

    e-commerce data sets have a wide variety of items which can be organized

    into categories and the categories themselves are hierarchically grouped;

    items that belong to a common category are likely to have common

    attributes, while those within the same subcategory are likely to have more

    common attributes. The RDF data also shows that the attributes of similar


    subjects tend to be defined together. This motivates us to find certain

    subspaces which are shared by similar data groups, and to split the full space

    into some lower-dimensional subspaces.

There are some previous research works which focus on subspace clustering. In general, subspace clustering is the task of automatically detecting all clusters in the original feature space, either by directly computing the subspace clusters or by selecting interesting subspaces for clustering. However, such approaches are very time-inefficient and hence cannot scale well to high-dimensional spaces. For example, one previously proposed algorithm takes 5 hours on a 30-dimensional data set, jumping to 30 hours on a 50-dimensional data set.


Considering sparse data sets with thousands of dimensions, such approaches are unacceptable in real-life applications. Moreover, our purpose is to split the full space into subspaces that yield superior performance for the storage and querying of sparse data, a scenario for which these clustering approaches are not suitable. Here we introduce how to represent sparse data sets using the novel HoVer representation. First, we design an efficient and effective approach to find correlated dimensions. After that, we partition the original full space into subspaces and store the original sparse data set using multiple tables, where each table corresponds to a certain subspace.

    Correlated Degree Determination

Before subspace selection, we first consider how to measure the correlation between two dimensions. Suppose that the sparse data set is d-dimensional and has N tuples; we generate a table to represent the relations between the dimensions of the data set. We call this table the correlation table for ease of presentation.

Definition 1 (Correlation Table). The correlation table C represents the correlation of dimensions in a sparse data set and is an upper triangular matrix. An entry C[i][j], where i ≤ j, counts the number of tuples in which dimensions i and j are active simultaneously; the diagonal entry C[i][i] counts the number of tuples in which dimension i is active.


Given the sparse data set shown in Fig. 1, we can generate the corresponding correlation table shown in Fig. 5. For example, C[1][1] = 4 means that dimension 1 is active in four tuples, and C[1][3] = 1 means that dimensions 1 and 3 are active simultaneously only once. Algorithm 1 illustrates an efficient way to generate the correlation table for a sparse data set. We first initialize the correlation table C, setting each entry of the upper triangular matrix to 0. After that, the sparse data set is scanned, and the tuples in the data set are processed one by one. Each tuple is converted into an array of length d; in this array, an entry is set to 1 if the corresponding dimension is active and to 0 otherwise. With this array, we can accumulate the correlation information into the correlation table: for each active dimension i, we access the i-th row of the upper triangular matrix, scan the array, and increase C[i][j] by 1 for every dimension j that is also active. The algorithm is very efficient since the sparse data set only needs to be scanned once; it is also time-efficient because no distance computation is involved.
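As a set-based alternative to the single scan of Algorithm 1, the same correlation table can also be derived directly from a vertical table such as the illustrative V(OID, attr, val) above; this is only a sketch and assumes that (OID, attr) pairs are unique in V.

-- Entry (i, j) with i <= j counts the tuples in which dimensions i and j are
-- active simultaneously; the diagonal (i = j) counts how often dimension i is active.
SELECT a.attr AS i, b.attr AS j, COUNT(*) AS cnt
INTO   CorrTable
FROM   V a
JOIN   V b ON a.OID = b.OID AND a.attr <= b.attr
GROUP BY a.attr, b.attr;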


After the correlation table is created, it can be incrementally maintained in the presence of updates: we only need to revise the entries of the correlation table that correspond to pairs of columns of a row which cease to be active simultaneously or begin to be active simultaneously. In the presence of insertions and deletions, the table can be maintained in a similar way.

The information in the correlation table can be utilized to evaluate the correlation between any two dimensions. We first define the correlated degree between two dimensions, which facilitates subspace partitioning of high-dimensional sparse data.


Definition 2 (Correlated Degree). The correlated degree corr(i, j) measures the correlation between two dimensions i and j, where i ≠ j; it is the ratio of the tuples in which i and j are active simultaneously to the tuples in which at least one of the two dimensions is active, i.e., corr(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j]).

A simpler probability-based measure, such as the ratio of tuples in which the two dimensions are active simultaneously over all tuples (with a slight abuse of terminology, this characterizes a ratio rather than a probability), might seem a good choice: according to probability theory, as its value increases, the correlation between the two dimensions increases, and the dimensions are independent at a particular value. However, such a value is highly influenced by the active densities of the two dimensions and cannot accurately measure the correlation in some cases, which makes it ineligible. The variation used in Definition 2 instead takes the ratio only over the tuples in which at least one of the two dimensions is active; in particular, if two dimensions are almost never active in the same tuples, the correlated degree stays close to zero, so the dimensions cannot appear positively correlated. According to our correlation measure criteria and the above analysis, we select the correlated degree defined in Definition 2 to measure the correlation between two dimensions in sparse data sets: it satisfies 0 ≤ corr(i, j) ≤ 1, and as corr(i, j) increases, the correlation between dimensions i and j increases at the same time.
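Continuing the illustrative schema, the correlated degree of every ordered pair of dimensions can be materialized from the correlation table as follows; this is a sketch, and a pair that is never co-active simply receives degree 0.

-- degree(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j])
SELECT di.i AS i, dj.i AS j,
       CAST(ISNULL(p.cnt, 0) AS FLOAT)
         / (di.cnt + dj.cnt - ISNULL(p.cnt, 0)) AS degree
INTO   CorrDegree
FROM   CorrTable di                                  -- diagonal entry of dimension i
JOIN   CorrTable dj                                  -- diagonal entry of dimension j
       ON di.i = di.j AND dj.i = dj.j AND di.i <> dj.i
LEFT JOIN CorrTable p                                -- co-activity count, absent => 0
       ON  p.i = CASE WHEN di.i < dj.i THEN di.i ELSE dj.i END
       AND p.j = CASE WHEN di.i < dj.i THEN dj.i ELSE di.i END;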

    Subspace Selection

An optimal subspace partitioning should enjoy two properties: all dimensions should be highly correlated within a subspace and largely unrelated across subspaces. If the number of subspaces chosen by the user is too small, dimensions which are not highly correlated may be clustered into the same subspace; hence, the subspace tables remain very sparse. On the other hand, if the number of subspaces chosen by the user is too large, dimensions which are highly correlated may be distributed into different subspaces; since highly correlated dimensions are often defined and accessed together, the join operations needed to access dimensions that are spread over different subspaces are rather expensive.

According to the above analysis, the number of subspaces should be determined by the subspace selection algorithm itself, based on the dimension correlations of the sparse data set. Because the underlying storage and query processing details of the RDBMS may influence performance, a perfect subspace clustering typically does not exist. Therefore, the main aim of our subspace selection algorithm is to find the subspaces efficiently while yielding superior performance for the storage and querying of sparse data. First of all, any two dimensions in a subspace should be highly correlated, which ensures that the subspace tables no longer suffer from sparsity. Next, in order to ensure that highly correlated dimensions, which are often defined and accessed together, are clustered into the same subspace, the number of subspaces should be as small as possible.

Therefore, our subspace selection problem can be formally defined as follows: given a correlated degree threshold and a sparse data set with d dimensions, we partition the original full space into m subspaces s1, s2, ..., sm such that every dimension belongs to exactly one subspace, i.e., the subspaces are pairwise disjoint and together cover all dimensions; our objective is that the correlated degree between any two dimensions in a subspace is no less than the threshold and the number of subspaces m is minimized.


Our subspace selection problem can be mapped to the Minimum Clique Partition problem. Given a graph G = (V, E), the Minimum Clique Partition problem partitions V into disjoint subsets V1, V2, ..., Vm; the objective is that, for each Vi, the subgraph induced by Vi is a complete graph, and the number of partitions m is minimized. If we map each dimension in the sparse data set to a node in the graph and add an edge between two nodes whenever the correlated degree between the corresponding dimensions is no less than the correlated degree threshold, then our subspace selection problem is exactly the Minimum Clique Partition problem. Unfortunately, the Minimum Clique Partition problem is NP-complete, which means that we should use a heuristic algorithm that approximates optimal partitions by trying to group correlated dimensions together.

Algorithm 2 presents how to generate subspaces from a given correlation table in a heuristic manner. While there exist unclassified dimensions, i.e., dimensions not yet included in any subspace, we pick the unclassified dimension D with the highest correlation table value, which is the most active dimension left in the sparse data set. Then a new subspace s is generated, and all remaining unclassified dimensions are examined: if the correlated degree between an unclassified dimension d and every dimension d' already in s is not less than the given correlated degree threshold, then d is added to subspace s. The algorithm clearly ensures that the correlated degree between any two dimensions in a subspace is no less than the correlated degree threshold, and it minimizes the number of subspaces in a greedy manner, i.e., it tries to add as many unclassified dimensions as possible to the current subspace.
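The following T-SQL sketch mirrors this greedy procedure over the illustrative CorrTable and CorrDegree tables built above; the threshold value 0.4 and the examination order (most active first) are example choices, not prescribed by Algorithm 2.

DECLARE @delta FLOAT, @s INT, @seed VARCHAR(30), @cand VARCHAR(30);
SET @delta = 0.4;           -- example correlated degree threshold
SET @s = 0;

-- working list of dimensions; subspaceId is filled in below
SELECT i AS dim, cnt AS activeCnt, CAST(NULL AS INT) AS subspaceId
INTO   #assign
FROM   CorrTable
WHERE  i = j;

WHILE EXISTS (SELECT 1 FROM #assign WHERE subspaceId IS NULL)
BEGIN
    SET @s = @s + 1;

    -- seed the new subspace with the most active unclassified dimension
    SELECT TOP 1 @seed = dim
    FROM   #assign
    WHERE  subspaceId IS NULL
    ORDER BY activeCnt DESC;
    UPDATE #assign SET subspaceId = @s WHERE dim = @seed;

    -- examine the remaining unclassified dimensions one by one
    DECLARE cand CURSOR LOCAL STATIC FOR
        SELECT dim FROM #assign WHERE subspaceId IS NULL ORDER BY activeCnt DESC;
    OPEN cand;
    FETCH NEXT FROM cand INTO @cand;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- add the candidate only if its correlated degree with every dimension
        -- already in subspace @s reaches the threshold
        IF NOT EXISTS (SELECT 1
                       FROM   #assign m
                       JOIN   CorrDegree d ON d.i = m.dim AND d.j = @cand
                       WHERE  m.subspaceId = @s AND d.degree < @delta)
            UPDATE #assign SET subspaceId = @s WHERE dim = @cand;
        FETCH NEXT FROM cand INTO @cand;
    END
    CLOSE cand;
    DEALLOCATE cand;
END

SELECT dim, subspaceId FROM #assign ORDER BY subspaceId, dim;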


The correlated degree threshold has a great influence on the subspace generation. With a larger threshold, the non-null density of each subspace will be larger, i.e., the dimensions in each subspace are more highly correlated, but more subspaces will be generated. With a smaller threshold, fewer subspaces will be generated, but the non-null density of each subspace will be smaller. In practice, the optimal correlated degree threshold varies for different data sets.

Given the correlation table shown in Fig. 5, we are able to partition the original 8-dimensional space into multiple subspaces. At the beginning, D2 is selected as the first dimension of subspace s1, since D2 has the maximal correlation table value, i.e., 5. With a moderate threshold, D1, D3, and D4 will subsequently be added to subspace s1, i.e., s1 = {D1, D2, D3, D4}. After that, we can use the same strategy to generate two further subspaces. If the threshold is increased, four subspaces will be generated instead. We can see that as the correlated degree threshold increases, the number of subspaces increases at the same time.

    Vertical Partition

The HoVer representation corresponds to a vertical partition of the original horizontal representation. The OID attribute exists in each subspace table to link the data items that are partitioned across multiple subspaces. The transformation from the horizontal representation to the HoVer representation is lossless, since the candidate key, i.e., OID, is contained in each subspace table. Fig. 6 shows the HoVer representation corresponding to the horizontal representation shown in Fig. 1 under a given correlated degree threshold; as shown in Fig. 7, if we increase the threshold to 0.5, the subspace D1234 will be further split into two subspaces, D12 with objects {1, 2, 3, 4, 6} and D34 with objects {3, 5, 6}. We can see that if none of the dimensions in a subspace is active in a horizontally represented tuple, the tuple will be absent from that subspace table after the vertical partition. It is easy to convert horizontally represented data to the HoVer representation: for each horizontally represented tuple, if at least one dimension (not including OID) of a subspace is active, the OID along with the subspace dimensions is projected and inserted into the subspace table. For example, converting the horizontally represented table H shown in Fig. 1 to the subspace table shown in Fig. 6b can be characterized by a relational algebraic expression that selects the tuples in which at least one of the subspace dimensions is non-null and projects them onto OID and the subspace dimensions.
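In SQL terms, a minimal sketch of this conversion (table and column names are illustrative, following the D1234 subspace of the running example) is:

CREATE TABLE S1234 (OID INT PRIMARY KEY, D1 INT, D2 INT, D3 INT, D4 INT);

INSERT INTO S1234 (OID, D1, D2, D3, D4)
SELECT OID, D1, D2, D3, D4
FROM   H
WHERE  D1 IS NOT NULL OR D2 IS NOT NULL OR D3 IS NOT NULL OR D4 IS NOT NULL;

-- OID list of all objects, maintained for reconstructing the horizontal view later
SELECT OID
INTO   OIDList
FROM   H;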

    Schema Evolution

    When a new column is added, a new subspace which only contains

    the new column will be created, and the correlation table should also be

    updated accordingly. Since the correlation table is incrementally maintained,

    the new column may be merged to a subspace when subspaces are

    reorganized. When a column is deleted, we only need to delete the column

    from the corresponding subspace and update the correlation table

    accordingly.

    QUERY PROCESSING IN HoVer


    In this section, we introduce how the queries over the horizontal

    representation can be processed over the HoVer representation.

    Query Rewriting

    Our ultimate purpose is to define horizontally represented views over

    the HoVer representation. Users typically issue traditional SQL queries over

    the horizontal view, which can be rewritten into queries over the underlying

HoVer representation. Generally, the reconstruction of the horizontal table H from the subspace tables can be characterized by a relational algebraic expression that left-outer-joins the OID list with each subspace table on OID, where the OID list contains all the OIDs in the horizontal table. Hence, we should maintain an OID list during the vertical partition. For example, the reconstruction of the horizontal table H shown in Fig. 1 from the subspace tables shown in Fig. 6 is the left outer join of the OID list with each of the subspace tables on the OID attribute.
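In SQL, a sketch of such a horizontal view, assuming for illustration that the eight dimensions were partitioned into subspace tables S1234(OID, D1, D2, D3, D4), S56(OID, D5, D6), and S78(OID, D7, D8), would be:

CREATE VIEW H_View AS
SELECT l.OID,
       s1.D1, s1.D2, s1.D3, s1.D4,
       s2.D5, s2.D6,
       s3.D7, s3.D8
FROM   OIDList l
LEFT OUTER JOIN S1234 s1 ON l.OID = s1.OID
LEFT OUTER JOIN S56   s2 ON l.OID = s2.OID
LEFT OUTER JOIN S78   s3 ON l.OID = s3.OID;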


In our work, the dimensions in the original sparse data space are clustered into subspaces, and a horizontal table is vertically partitioned into the corresponding subspace tables. In many real-life applications, dimensions with a high correlated degree are likely to characterize similar topics and have a high probability of being accessed together; hence, they should be stored in the same subspace table. We can take advantage of this characteristic and access as few subspace tables as possible during query evaluation.
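For instance, a user query over the horizontal view that only touches dimensions of one subspace can be rewritten to read a single subspace table; a sketch using the illustrative tables above:

-- query as issued by the user, over the horizontal view
SELECT OID, D1, D2
FROM   H_View
WHERE  D1 = 1 AND D2 = 2;

-- rewritten query over the HoVer representation: only one subspace table is read
SELECT OID, D1, D2
FROM   S1234
WHERE  D1 = 1 AND D2 = 2;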


    FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. The three key considerations involved in the feasibility analysis are economical feasibility, technical feasibility, and operational feasibility.

    ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

    OPERATIONAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the users. It includes the process of training the users to use the system efficiently. The users must not feel threatened by the system; instead, they must accept it as a necessity. The level of acceptance by the users depends solely on the methods employed to educate the users about the system and to make them familiar with it. Their level of confidence must be raised so that they can also offer constructive criticism, which is welcomed, since they are the final users of the system.

    TECHNICAL FEASIBILITY

Technical feasibility is carried out to check the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client.


    SYSTEM SPECIFICATION

    S/W REQUIREMENTS

    Windows XP

    MS-SQL server

    MS Visual Studio 2005

    H/W REQUIREMENTS

    Processor : Dual Core

    CPU Clock Speed : 651 MHz

    External memory : 512 MB (min)

    Hard Disk Drive : 40 GB (min)

Mouse : Logitech mouse

Keyboard : Logitech keyboard (104 keys)

Monitor : 15.6" LCD monitor


    SOFTWARE SPECIFICATION

    FRONT END

.NET FRAMEWORK

    .NET is a "Software Platform". It is a language-neutral environment

    for developing rich .NET experiences and building applications that can

    easily and securely operate within it. When developed applications are

    deployed, those applications will target .NET and will execute wherever

    .NET is implemented instead of targeting a particular Hardware/OS

    combination. The components that make up the .NET platform are

    collectively called the .NET Framework.

The .NET Framework is a managed, type-safe environment for developing and executing applications. The .NET Framework manages all aspects of program execution, such as allocating memory for the storage of data and instructions, granting and denying permissions to the application, managing execution of the application, and reallocating memory for resources that are no longer needed.

The .NET Framework is designed for cross-language compatibility. Cross-language compatibility means that an application written in Visual Basic .NET may reference a DLL file written in C# (C-Sharp), and a Visual Basic .NET class might be derived from a C# class or vice versa.

    The .NET Framework consists of two main components:

    Common Language Runtime (CLR)

    Class Libraries

    COMMON LANGUAGE RUNTIME (CLR)

The CLR is described as the "execution engine" of .NET. It provides the environment within which programs run. It is the CLR that manages the execution of programs and provides core services, such as code compilation, memory allocation, thread management, and garbage collection. Through the Common Type System (CTS), it enforces strict type safety, and it ensures that code is executed in a safe environment by enforcing code access security. The software version of .NET is actually the CLR version.

    WORKING OF THE CLR


When a .NET program is compiled, the output of the compiler is not a native executable file but a file that contains a special type of code called Microsoft Intermediate Language (MSIL), which is a low-level set of instructions understood by the common language runtime. MSIL defines a set of portable instructions that are independent of any specific CPU. It is the job of the CLR to translate this intermediate code into executable code when the program is executed, which allows the program to run in any environment for which the CLR is implemented; that is how the .NET Framework achieves portability. The MSIL is turned into executable code using a JIT (Just-In-Time) compiler. The process works as follows: when a .NET program is executed, the CLR activates the JIT compiler, which converts MSIL into native code on demand as each part of the program is needed. Thus the program executes as native code, running as fast as it would if it had been compiled directly to native code, while still achieving the portability benefits of MSIL.

    CLASS LIBRARIES


The class library is the second major entity of the .NET Framework, and it is designed to integrate with the common language runtime. This library gives programs access to the runtime environment. The class library consists of a lot of prewritten code that all the applications created in VB .NET and Visual Studio .NET will use; the code for all the elements like forms, controls, and the rest in VB .NET applications actually comes from the class library.
    BACK END - SQL

SQL stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Some common relational database management systems that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, and Ingres. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to accomplish almost everything that one needs to do with a database. The following subsections cover the basics of each of these commands.

    CREATE A TABLE

To create a new table, enter the keywords create table followed by the table name, followed by an open parenthesis, followed by the first column name, followed by the data type for that column, followed by any optional constraints, and followed by a closing parenthesis. It is important to use an open parenthesis before the first column definition and a closing parenthesis after the end of the last column definition, and to separate each column definition with a comma. All SQL statements should end with a ";".

Table and column names must start with a letter and can be followed by letters, numbers, or underscores, not exceeding a total of 30 characters in length. Do not use any SQL reserved keywords as names for tables or columns (such as "select", "create", "insert", etc.).

Data types specify what type of data a particular column can hold. If a column called "Last_Name" is to be used to hold names, then that particular column should have a "varchar" (variable-length character) data type.
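For example (the table and column names are purely illustrative):

create table customers
(First_Name   varchar(30),
 Last_Name    varchar(30),
 Age          int,
 City         varchar(30));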

    INSERTING INTO A TABLE

    The insert statement is used to insert or add a row of data into the table.

To insert records into a table, enter the keywords insert into followed by the table name, followed by an open parenthesis, followed by a list of column names separated by commas, followed by a closing parenthesis, followed by the keyword values, followed by the list of values enclosed in parentheses. The values that you enter will be held in the rows, and they will match up with the column names that you specify. Strings should be enclosed in single quotes, and numbers should not.

    insert into "tablename" (first_column,...last_column) values

    (first_value,...last_value);

    UPDATING RECORDS

    The update statement is used to update or change records that match a

    specified criteria. This is accomplished by carefully constructing a where

    clause.


    update "tablename"set "columnname" = "newvalue" [,"nextcolumn" =

    "newvalue2"...]

    where "columnname" OPERATOR "value" [and|or "column"

    OPERATOR "value"];

    DELETING RECORDS

    The delete statement is used to delete records or rows from the table.

delete from "tablename"
where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];

    DROP A TABLE

    The drop table command is used to delete a table and all rows in the table.

    To delete an entire table including all of its rows, issue the drop table

    command followed by the tablename. drop table is different from deleting

    all of the records in the table. Deleting all of the records in the table leaves

    the table including column and constraint information. Dropping the table

    removes the table definition as well as all of its rows.

    drop table "tablename".


    List of Modules

    Data entry:

Get the details from the user. To exercise the database, we need many records to show the efficiency of our HoVer method. In this module, we collect the data from the user for processing.

    Horizontal Representation:

The horizontal format is straightforward and can be easily implemented. However, the format is not suitable for sparse databases, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications.

    Vertical representation:

In the vertical format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over the format is cumbersome and error-prone, and an expensive multi-way self-join needs to be conducted if the objects in the query result need to be returned in the conventional horizontal format.

    HoVer representation:

HoVer stands for Horizontal representation over Vertically partitioned subspaces. The HoVer representation can efficiently find a middle ground between the horizontal representation and the vertical representation if there are dimension correlations to be exploited. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task.


    DFD:

    ARCHITECTURE:

    SYSTEM FLOW DIAGRAM:


    INPUT DESIGN

Input design is one of the most important phases of system design. Input design is the process in which the inputs received by the system are planned and designed so as to get the necessary information from the user, eliminating information that is not required. The aim of input design is to ensure the maximum possible level of accuracy and to ensure that the input is accessible to and understood by the user.

Input design is the part of the overall system design that requires very careful attention. If the data going into the system is incorrect, then the processing and output will magnify the errors.

The objectives considered during input design are:

Nature of input processing.

Flexibility and thoroughness of validation rules.

Handling of properties within the input documents.

Screen design to ensure accuracy and efficiency of the input, and its relationship with files.

Careful design of the input, including attention to error handling, controls, batching, and validation procedures.

Input design features can ensure the reliability of the system and produce results from accurate data, or they can result in the production of erroneous information.


    OUTPUT DESIGN

The term output applies to information produced by an information system, whether printed or displayed. While designing the output, we should identify the specific output that is needed to meet the information requirements, select a method to present the information, and create a document, report, or other format that contains the information produced by the system.

TYPES OF OUTPUT

Whether the output is a formatted report or a simple listing of the contents of a file, a computer process will produce the output:

A document

A message

Retrieval from a data store

Transmission from a process or system activity

Directly from an output source

The output of our project is the result of queries over the sparse database, returned in the conventional horizontal format.


    SOFTWARE TESTING FUNDAMENTALS

Testing presents an interesting task for software engineers. Earlier in the software process, the engineer attempts to build software from an abstract concept into a tangible implementation; in testing, the engineer creates a series of test cases that are intended to demolish the software that has been built.

To test any program, we need a description of its expected behavior and a method of determining whether the observed behavior conforms to the expected behavior; for this we need a test oracle. A test oracle is a mechanism, different from the program itself, that can be used to check the correctness of the program's output for the test cases. A human oracle is a person who mostly computes by hand what the output of the program should be; since humans can make mistakes, a test oracle is defined in the tool to automate testing and avoid mistakes.

Testing principles

All tests should be traceable to requirements.

Tests should be planned long before testing begins; that is, test planning can begin as soon as the requirements model is complete.

Testing should begin in the small and progress towards testing in the large. The first tests planned and executed generally focus on individual program modules. As testing progresses, testing shifts focus and attempts to find errors in integrated clusters of modules and ultimately in the entire system.

    UNIT TESTING


In unit testing, the programs making up the system are tested; for this reason it is sometimes called program testing. The software units in a system are the module routines that are assembled and integrated to perform a specific function. Unit testing focuses on each module independently of the others, to locate the errors in coding and logic that are contained within that module alone. Setting breakpoints in the code makes it easy to find the error location when the input is given. Unit testing is always white-box oriented. Since each module in the system receives input and generates output, test cases are needed to cover the expected range. The system was divided into its modules, each module was tested separately, and the unit tests were successful.

    ACCEPTANCE TESTING

This is the final stage in the testing process before the system is accepted for operational use. The system is tested with data supplied by the system procurer rather than with simulated test data. Acceptance testing may reveal errors and omissions in the system requirements definition, because the real data exercise the system in different ways from the test data. Acceptance testing may also reveal requirements problems where the system's facilities do not really meet the user's needs or the system performance is unacceptable; however, this system met all the requirements of the user and performed well.

    INTEGRATION TESTING


Integration-level testing focuses on the transfer of data and control across a program's internal and external interfaces. External interfaces are those with other software, system hardware, and the users, and can be described as communications links.

    PERFORMANCE TESTING

Performance testing helps ensure that a product performs its functions at the required speed. Planning for performance testing starts at the beginning of the project, when product goals and requirements are defined. Performance testing is a part of the product's initial engineering plan.

    SYSTEM TESTING

System-level testing demonstrates that all specified functionality exists and that the software product is trustworthy. This testing verifies the as-built program's functionality and performance with respect to the requirements for the software product as exhibited on the specified operating platform(s). System-level software testing addresses functional concerns and the following elements of a device's software that are related to the intended use(s):

Performance issues (e.g., response times, reliability measurements), including response to stress conditions such as behavior under maximum load and continuous use.

Operation of internal and external security features.

Effectiveness of recovery procedures, including disaster recovery.

Usability.

Compatibility with other software products.

Behavior in each of the defined hardware configurations.

Accuracy of documentation.

    Test Plan

Before testing, the type of testing must first be decided; for this system, unit testing is carried out. Before testing begins, the following considerations are taken into account:

To ensure that information properly flows in and out of the program.

To find out whether the local data structures maintain their integrity during all steps of an algorithm's execution.

To ensure that the module operates properly at the boundaries established to limit or restrict processing.

To find out whether all statements in the module have been executed at least once.

To find out whether error-handling paths are working correctly.


    TEST CASES

A test case is a set of conditions or variables under which a tester determines whether a requirement or use case of an application is partially or fully satisfied. It may take many test cases to determine that a requirement is fully satisfied. In order to fully test that all the requirements of an application are met, there must be at least one test case for each requirement, unless a requirement has sub-requirements; in that situation, each sub-requirement must have at least one test case. A written test case has a known input and an expected output, which are worked out before the test is executed: the known input should test a precondition, and the expected output should test a postcondition. Test cases uncover errors in the following categories:

Erroneous initialization or default values and inconsistent data types.

Incorrect (misspelled or truncated) variable names.

Underflow, overflow, and addressing exceptions.


    SYSTEM IMPLEMENTATION

Implementation is the most crucial stage in achieving a successful system and giving the users confidence that the new system is workable and effective. This type of conversion is relatively easy to handle, provided there are no major changes in the system.

Each program was tested individually at the time of development using test data, and it was verified that the programs link together in the way specified in the program specifications; the computer system and its environment were tested to the satisfaction of the user. The system that has been developed has been accepted and proved to be satisfactory for the user, and so the system is going to be implemented very soon. A simple operating procedure is included so that the user can understand the different functions clearly and quickly.

Initially, as a first step, the executable form of the application is created and loaded on a common server machine accessible to all users, and the server is connected to a network. The final stage is to document the entire system, covering its components and the operating procedures of the system.

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the users confidence that the new system will work and be effective. The file is downloaded from the server, which takes minimal time for retrieval.


    Conclusion:

In this project, we have addressed the problem of efficient query processing over sparse databases. To alleviate the problems caused by the sparsity and high dimensionality of sparse data, we proposed a new approach named HoVer. According to the characteristics of sparse data sets, we vertically partition the high-dimensional sparse data into multiple lower-dimensional subspaces, such that all the dimensions within each subspace are highly correlated. The experimental results show that our proposed scheme can find correlated subspaces effectively and yields superior storage and query performance for conducting queries in sparse databases.

