Date post: | 07-Apr-2018 |
Category: |
Documents |
Upload: | mohit-gupta |
8/6/2019 Exploring Correlated Subspaces for Efficient
Exploring Correlated Subspaces for Efficient
Query Processing in Sparse Databases
Abstract
Sparse data is becoming increasingly common and available in many
real-life applications. However, relatively little attention has been paid to
effectively modeling sparse data, and existing approaches such as the
conventional "horizontal" and "vertical" representations fail to provide
satisfactory performance for both storage and query processing, as such
approaches are too rigid and generally do not consider the dimension
correlations. In this project, we propose a new approach, named HoVer, to
store and query sparse datasets in an unmodified RDBMS,
where HoVer stands for Horizontal representation over Vertically
partitioned subspaces. According to the dimension correlations of sparse
datasets, a novel mechanism has been developed to vertically partition a
high-dimensional sparse dataset into multiple lower-dimensional subspaces,
such that the dimensions are highly correlated within each subspace and
highly unrelated across subspaces. Original data objects can then be
represented in the horizontal format in their respective subspaces. With the
novel HoVer representation, users can write SQL queries over the original
horizontal view, which can be easily rewritten into queries over the subspace
tables. Experiments over synthetic and real-life datasets show that our
approach is effective in finding correlated subspaces and yields superior
performance for the storage and query of sparse data.
Introduction
With continuous advances in network and storage technology, there is
dramatic growth in the amount of very high-dimensional sparse data from a
variety of new application domains, such as bioinformatics, time series, and,
perhaps most importantly, e-commerce, which poses significant challenges to
RDBMSs. The main characteristics of these sparse data sets may be
summarized as follows:
High dimensionality: The dimensionality of feature vectors may be very
high, i.e., the number of possible attributes for all objects is huge. For
example, in some e-commerce applications, each participant may declare
their own idiosyncratic attributes for the products, which results in data
sets that have thousands of attributes.
Sparsity: Each object may have only a small subset of attributes, which are
called active dimensions, i.e., significant values appear only in a few active
dimensions; in addition, different objects may have different active
dimensions. For example, an e-commerce data set may have thousands of
attributes, most of which are null, and only a few of which apply to a
particular product.
Correlation: Since each object may have only a few active dimensions,
similar objects are more likely to share the same or similar active
dimensions. For example, in recommendation systems, it is important to find
homogeneous groups of users with similar ratings in subsets of the
attributes. Therefore, it is possible to find certain subspaces shared by
similar objects.
In existing RDBMSs, objects are conventionally stored using a
horizontal format, called the horizontal representation in this project. In
this format, one column corresponds to an attribute, and if an object lacks a
particular attribute, the corresponding column in the row for the object will
be null. Storing a sparse data set using the horizontal format is
straightforward and can be easily implemented. However, the format is not
suitable for sparse databases, for it suffers from sparsity and frequent
schema evolution; hence, the space and time performance may not be
satisfactory. In addition, the number of columns in a horizontal table is
typically limited to 1,000 in general commercial DBMSs, which is not
enough for many real-life applications. If the number of columns in a
horizontal table is more than 1,000, a record may not reside in a single disk
page, and the page overflow will significantly degrade the performance. In
recent decades, commercial RDBMSs, such as DB2, SQL Server, and
Oracle, have improved their null storage and handling capabilities, which
results in smaller horizontal tables and better query performance. Even so,
the approach proposed in this project still uniformly outperforms the
horizontal representation.
An alternative is known as the vertical format, and is called the
vertical representation in this project. When a sparse data set is stored
using the vertical format, each active dimension of an object is represented
by the object identifier, the attribute name, and the value. The vertical format
can scale to thousands of attributes, avoid storage of null values, and support
evolving schemas; however, writing queries over the format is cumbersome
and error-prone, and an expensive multiway self-join needs to be conducted
if the objects in the query result need to be returned in the conventional
horizontal format.
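The trade-off between the two formats can be sketched with a small in-memory example. This is only an illustration under stated assumptions: SQLite (through Python's sqlite3 module) stands in for the RDBMS, and the table names h and v and the columns d1 to d3 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Horizontal format: one column per attribute, NULL where an attribute is missing.
cur.execute("CREATE TABLE h (oid INTEGER PRIMARY KEY, d1 TEXT, d2 TEXT, d3 TEXT)")
cur.executemany(
    "INSERT INTO h VALUES (?, ?, ?, ?)",
    [(1, "a", "b", None), (2, None, "c", None), (3, "d", None, "e")],
)

# Vertical format: one (oid, attribute name, value) row per active dimension.
cur.execute("CREATE TABLE v (oid INTEGER, attr TEXT, val TEXT)")
for oid, d1, d2, d3 in cur.execute("SELECT oid, d1, d2, d3 FROM h").fetchall():
    for attr, val in (("d1", d1), ("d2", d2), ("d3", d3)):
        if val is not None:
            cur.execute("INSERT INTO v VALUES (?, ?, ?)", (oid, attr, val))

vertical_rows = cur.execute("SELECT COUNT(*) FROM v").fetchone()[0]

# Returning horizontal rows from the vertical table needs one self-join per attribute.
rebuilt = cur.execute(
    """
    SELECT o.oid, v1.val, v2.val, v3.val
    FROM (SELECT DISTINCT oid FROM v) AS o
    LEFT JOIN v AS v1 ON v1.oid = o.oid AND v1.attr = 'd1'
    LEFT JOIN v AS v2 ON v2.oid = o.oid AND v2.attr = 'd2'
    LEFT JOIN v AS v3 ON v3.oid = o.oid AND v3.attr = 'd3'
    ORDER BY o.oid
    """
).fetchall()
original = cur.execute("SELECT oid, d1, d2, d3 FROM h ORDER BY oid").fetchall()
```

The vertical table stores no nulls, but the query that rebuilds the horizontal rows grows by one join for every attribute, which is the cost described above.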
From the above introduction, we know that both the horizontal and the
vertical representations have advantages and disadvantages. The horizontal
representation has lots of nulls but simple queries, and the vertical
representation has no nulls but more complex queries. Therefore, an optimal
representation should benefit from the advantages of both and alleviate
their drawbacks. In this project, we propose a new approach which combines
the horizontal and the vertical representations, and can store and query
sparse data sets in an unmodified RDBMS. This novel representation is
named HoVer, which stands for Horizontal representation over Vertically
partitioned subspaces. The HoVer representation can efficiently find a better
intermediate ground between the horizontal representation and the vertical
representation if there are dimension correlations to be exploited. In HoVer,
we first vertically partition the data set into multiple lower-dimensional
subspaces, and objects are represented in horizontal format in the subspace
tables. Partitioning the sparse data space into meaningful subspaces is a
nontrivial task; however, sparse data sets generally have some useful
properties by nature, such as sparsity and correlation.
Therefore, we can design an effective mechanism to split the data
space into multiple subspaces. We define the correlated degree between
dimensions and cluster highly correlated dimensions into a subspace. After
partitioning, the original sparse data set can be transformed into the HoVer
format. The horizontal, HoVer, and vertical approaches can all be built on an
unmodified RDBMS. Users write SQL queries over the conventional
horizontal view; for the HoVer and vertical approaches, the SQL queries are
rewritten into queries over the subspace tables and the vertical table
stored by the unmodified RDBMS, respectively, and the query results
returned by the RDBMS are all in the horizontal format. A comprehensive
experimental study demonstrates the superiority of our approach, as it
fully utilizes the properties of the sparse data.
Existing system:
In existing RDBMSs, objects are conventionally stored using a
horizontal format, called the horizontal representation in this project. In
this format, one column corresponds to an attribute, and if an object lacks a
particular attribute, the corresponding column in the row for the object will
be null. Fig. 1 shows an example of storing a sparse data set using the
horizontal format, which is straightforward and can be easily implemented.
The format is not suitable for sparse databases, for it suffers from sparsity
and frequent schema evolution; hence, the space and time performance may
not be satisfactory. In addition, the number of columns in a horizontal table
is typically limited to 1,000 in general commercial DBMSs, which is not
enough for many real-life applications. If the number of columns in a
horizontal table is more than 1,000, a record may not reside in a single disk
page, and the page overflow will significantly degrade the performance. In
recent decades, commercial RDBMSs, such as DB2, SQL Server, and
Oracle, have improved their null storage and handling capabilities, which
results in smaller horizontal tables and better query performance.
An alternative is known as the vertical format and is called the vertical
representation in this project. Fig. 2 shows an example of storing a sparse
data set using the vertical format; each active dimension of an object is
represented by the object identifier, the attribute name, and the value. The
vertical format can scale to thousands of attributes, avoid storage of null
values, and support evolving schemas; however, writing queries over the
format is cumbersome and error-prone, and an expensive multiway self-join
needs to be conducted if the objects in the query result need to be returned in
the conventional horizontal format. From the above introduction, we know
that both the horizontal and the vertical representations have advantages and
disadvantages. The horizontal representation has lots of nulls but simple
queries, and the vertical representation has no nulls but more complex
queries. Therefore, an optimal representation should benefit from the
advantages of both and alleviate their drawbacks. In this project, we propose
a new approach which combines the horizontal and the vertical
representations, and can store and query sparse data sets in an unmodified
RDBMS. This novel representation is named HoVer, which stands for
Horizontal representation over Vertically partitioned subspaces.
Proposed system:
THE HoVer REPRESENTATION
As we introduced previously, the pure horizontal or vertical
representation may yield unsatisfactory performance in sparse databases.
Therefore, we propose a new representation called HoVer, which can
effectively exploit the characteristics of sparse data sets, such as sparsity and
dimension correlation. We aim at achieving good space and time
performance for storing and querying high-dimensional sparse data sets.
Although the dimensionality of sparse data sets could be very high, up
to thousands, a single data object typically has only a few active dimensions,
and similar objects have a better chance to share similar active dimensions.
A closer inspection of many e-commerce sparse data sets shows that typical
e-commerce data sets have a wide variety of items which can be organized
into categories and the categories themselves are hierarchically grouped;
items that belong to a common category are likely to have common
attributes, while those within the same subcategory are likely to have more
common attributes. The RDF data also shows that the attributes of similar
subjects tend to be defined together. This motivates us to find certain
subspaces which are shared by similar data groups, and to split the full space
into some lower-dimensional subspaces.
Some previous research works focus on subspace
clustering. In general, subspace clustering is the task of automatically
detecting all clusters in the original feature space, either by directly
computing the subspace clusters or by selecting interesting subspaces for
clustering. However, such approaches are very time-inefficient, and hence
cannot scale well to high-dimensional spaces. For example, one proposed
algorithm takes 5 hours for a 30-dimensional data set, and jumps to 30
hours for a 50-dimensional data set.
For sparse data sets with thousands of dimensions, such
approaches are unacceptable in real-life applications. Moreover, our
purpose is to split the full space into subspaces which can yield superior
performance for the storage and query of sparse data, and these approaches
are not suitable for this scenario. Here we introduce how to represent sparse
data sets using the novel HoVer representation. First, we design an efficient
and effective approach to find correlated dimensions. After that, we partition
the original full space into subspaces and store the original sparse data set
using multiple tables, where each table corresponds to a certain subspace.
Correlated Degree Determination
Before subspace selection, we first consider how to measure the
correlation between two dimensions. Given a high-dimensional sparse data
set with N tuples, we generate a table to represent the pairwise relations
between the dimensions of the data set. We call this table the correlation
table for ease of presentation.
Definition 1 (Correlation Table). The correlation table represents the
correlation of dimensions in a sparse data set, and is an upper triangular
matrix. The entry in row i and column j, where i ≤ j, counts the number of
times that dimensions i and j are active simultaneously.
Given the sparse data set as shown in Fig. 1, we can generate the
corresponding correlation table as shown in Fig. 5. For example, the entry in
row 1 and column 1 is 4, which means that dimension 1 is active in four
tuples; the entry in row 1 and column 3 is 1, which means
that dimensions 1 and 3 are active simultaneously once. Algorithm 1
illustrates an efficient way to generate the correlation table for a sparse data
set. We first initialize the correlation table, where each entry in the upper
triangular matrix is set to 0. After that, the sparse data set is scanned, and
tuples in the data set are processed one by one. For each tuple, we convert it
into an array with one entry per dimension; in this array, the value of an
entry is set to 1 if the corresponding dimension is active; otherwise, it is set
to 0. With this array, we can accumulate the correlation information into the
correlation table. For each active dimension i, we access the ith row of the
upper triangular matrix, scan the array, and increase the entry in column j
by 1 if dimensions i and j are both active. The algorithm is very efficient
since the sparse data set only needs to be scanned once. In addition, it is also
time-efficient because no distance computation is involved.
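The single-scan construction described above can be sketched as follows. This is a minimal illustration, assuming that a tuple is a Python sequence in which None marks an inactive dimension and that dimensions are indexed from 0; the function name build_correlation_table is hypothetical.

```python
def build_correlation_table(tuples, num_dims):
    """Count, for every pair i <= j, how often dimensions i and j are both active."""
    # Only the upper triangle (i <= j) is filled; the lower triangle stays 0.
    table = [[0] * num_dims for _ in range(num_dims)]
    for row in tuples:
        # 0/1 activity array for the current tuple.
        active = [1 if value is not None else 0 for value in row]
        for i in range(num_dims):
            if not active[i]:
                continue
            for j in range(i, num_dims):
                if active[j]:
                    table[i][j] += 1
    return table


# Hypothetical 4-dimensional data set with four tuples.
data = [
    ("x", "y", None, None),
    ("x", None, None, "z"),
    (None, "y", "w", None),
    ("x", "y", None, None),
]
table = build_correlation_table(data, 4)
```

Each diagonal entry table[i][i] is the number of tuples in which dimension i is active, and the data set is read exactly once.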
After the correlation table is created, it can be incrementally
maintained in the presence of updates: we only need to revise the values
of the entries in the correlation table which correspond to pairs of columns
of the updated row that cease to be active simultaneously or begin to be
active simultaneously. In the presence of insertions and deletions, the table
can be maintained in a similar way.
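Incremental maintenance can be sketched by adding or subtracting the contribution of a single tuple. This is an illustration only; the helper names apply_tuple and update_tuple are hypothetical, and None again marks an inactive dimension.

```python
def apply_tuple(table, row, delta):
    """Add (delta=+1) or remove (delta=-1) one tuple's contribution to the table."""
    active = [i for i, value in enumerate(row) if value is not None]
    for pos, i in enumerate(active):
        for j in active[pos:]:
            table[i][j] += delta


def update_tuple(table, old_row, new_row):
    """Treat an update as removing the old tuple and adding the new one."""
    apply_tuple(table, old_row, -1)
    apply_tuple(table, new_row, +1)


table = [[0] * 3 for _ in range(3)]
apply_tuple(table, ("a", None, "b"), +1)                  # insertion
after_insert = [r[:] for r in table]
update_tuple(table, ("a", None, "b"), ("a", "c", None))   # update
after_update = [r[:] for r in table]
apply_tuple(table, ("a", "c", None), -1)                  # deletion
```

For simplicity this sketch touches every active pair of the old and new tuple; pairs whose joint activity is unchanged simply receive -1 and +1 and keep their value.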
The information in the correlation table can be utilized to evaluate the
correlation between any two dimensions. We first define the correlated
degree between two dimensions which can facilitate subspace partitioning of
high-dimensional sparse data.
Definition 2 (Correlated Degree). The correlated degree measures the
correlation between two dimensions i and j, where i
where means that dimension i is active, seems to be a good choice for
measuring the correlation (with a slight abuse of terminology,
characterizes the ratio instead of the probability that dimension i is
active). According to probability theory, with the increase of the value of
, the correlation between the two dimensions increases at the same time,
and the two dimensions are independent in case of But, the value of
is highly influenced by active densities of the two dimensions, which
makes it not eligible to measure the correlation. A variation of ,in
which is redefined as the ratio of dimension i is active among tuples in
which at least one of the two dimensions is active, i.e.,
seems to be a good choice for measuring the correlation
Since the two dimensions within the rows try not
to be active simultaneously, we can prove that all the time, which
means that the two dimensions cannot be positive-correlated. The problem
of is that it cannot accurately measure the correlation between two
dimensions in some cases. According to our correlation measure criteria and
above analysis, we select the correlated degree defined in Definition 2
to measure the correlation between two dimensions in sparse data sets,
where 0 , and with the increase of the value
of , the correlation between the two dimensions i and j increases at the
same time.
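As an illustration of a measure with these properties, the sketch below computes a Jaccard-style ratio from the correlation table: the number of tuples in which both dimensions are active, divided by the number of tuples in which at least one of them is active. This exact formula and the name jaccard_degree are assumptions made for illustration and are not taken from the text; the ratio lies between 0 and 1 and grows as the two dimensions are active together more often.

```python
def jaccard_degree(table, i, j):
    """Jaccard-style ratio computed from an upper triangular correlation table."""
    lo, hi = min(i, j), max(i, j)
    both = table[lo][hi]                         # tuples where i and j are both active
    either = table[i][i] + table[j][j] - both    # tuples where at least one is active
    return both / either if either else 0.0


# Hypothetical 2-dimensional correlation table.
example_table = [
    [4, 2],
    [0, 5],
]
degree_01 = jaccard_degree(example_table, 0, 1)
```

With this table, dimensions 0 and 1 are active together in 2 tuples out of the 7 in which at least one of them is active.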
Subspace Selection
An optimal subspace partitioning should enjoy two properties, i.e., all
dimensions are highly correlated within each subspace while being highly
unrelated across subspaces. If the number of subspaces determined by the
user is too small, dimensions which are not highly correlated may be
clustered into the same subspace; hence, the subspace tables are still very
sparse. On the other hand, if the number of subspaces determined by the
user is too large, dimensions which are highly correlated may be distributed
into different subspaces; since the highly correlated dimensions are often
defined and accessed together, the join operations for accessing dimensions
that are distributed into different subspaces are rather expensive.
According to the above analysis, the number of subspaces should be
determined by the subspace selection algorithm according to the dimension
correlations of the sparse data set. Because the underlying storage and query
processing details of the RDBMS may have some influence on the
performance, a perfect subspace clustering typically may not exist.
Therefore, the main aim of our subspace selection algorithm is to find the
subspaces in an efficient way, and yield superior performance for the storage
and query of sparse data. First of all, any two dimensions in a subspace
should be highly correlated, which ensures that the subspace tables no
longer suffer from sparsity. Next, in order to ensure that the highly
correlated dimensions, which are often defined and accessed together, are
clustered into the same subspace, the number of subspaces should be as
small as possible.
Therefore, our subspace selection problem can be formally defined as
follows: Given the correlated degree threshold and a sparse data set, we
partition the original full space into m subspaces s1, ..., sm such that the
subspaces together cover all dimensions and any two subspaces si and sj,
where i ≠ j, are disjoint. Our objective is that the correlated degree between
any two dimensions in a subspace is no less than the threshold, and that the
number of subspaces m is minimized.
Our subspace selection problem can be mapped to the Minimum
Clique Partition problem. Given a graph G = (V, E), the Minimum Clique
Partition problem partitions V into disjoint subsets V1, ..., Vm; the objective
is that for each i from 1 to m, the subgraph induced by Vi is a complete
graph, and the number of partitions m is minimized. If we map each
dimension in the sparse data set to a node in the graph, and add an edge
linking two nodes whenever the correlated degree between the two
corresponding dimensions is no less than the correlated degree threshold,
our subspace selection problem is exactly the same as the Minimum Clique
Partition problem. Unfortunately, the Minimum Clique Partition problem is
NP-complete, which means that we should use a heuristic algorithm to
approximate the optimal partition by trying to group together correlated
dimensions.
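The mapping to a graph can be sketched as follows. This is an illustration under assumptions: degree is a hypothetical matrix of pairwise correlated degrees, and the function names threshold_graph and is_clique_partition are not from the text. The second function only checks whether a candidate partition is valid; it does not search for a minimum one.

```python
from itertools import combinations


def threshold_graph(degree, threshold):
    """Edges (i, j), i < j, between dimensions whose correlated degree meets the threshold."""
    n = len(degree)
    return {
        (i, j)
        for i, j in combinations(range(n), 2)
        if degree[i][j] >= threshold
    }


def is_clique_partition(edges, parts, n):
    """True if parts cover nodes 0..n-1 exactly once and every part is a clique."""
    covered = sorted(d for part in parts for d in part)
    if covered != list(range(n)):
        return False
    return all(
        (min(a, b), max(a, b)) in edges
        for part in parts
        for a, b in combinations(part, 2)
    )


# Hypothetical pairwise degrees for three dimensions.
degree = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
edges = threshold_graph(degree, 0.5)
```

Here only dimensions 0 and 1 are linked, so {0, 1} and {2} form a valid partition into two cliques, while placing all three dimensions in one subspace does not.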
Algorithm 2 presents how to generate subspaces from a given
correlation table in a heuristic manner. While there exist unclassified
dimensions, i.e., dimensions not included in any existing subspace, we pick
the unclassified dimension D with the highest correlation table value, which
is the most active dimension left in the sparse data set. Then, a new
subspace s containing D is generated, and all unclassified dimensions are
examined. If the correlated degree between an unclassified dimension d and
each dimension d' in s is not less than the given correlated degree
threshold, then d is added to subspace s. It is apparent that the algorithm
ensures that the correlated degree between any two dimensions in a
subspace is no less than the correlated degree threshold, and the algorithm
tries to minimize the number of subspaces in a greedy manner, i.e., it tries
to add as many unclassified dimensions as possible to the current subspace.
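The greedy procedure can be sketched as follows. This is a minimal illustration, not the exact Algorithm 2: the names select_subspaces and ratio are hypothetical, ties between equally active dimensions are broken by the lowest index, candidates are examined in index order, and ratio is a Jaccard-style stand-in for the correlated degree.

```python
def select_subspaces(table, degree, threshold):
    """Greedily group dimensions so that every pair in a group meets the threshold."""
    unclassified = set(range(len(table)))
    subspaces = []
    while unclassified:
        # Seed with the most active remaining dimension (largest diagonal entry).
        seed = max(sorted(unclassified), key=lambda d: table[d][d])
        subspace = [seed]
        unclassified.remove(seed)
        for d in sorted(unclassified):
            # d joins only if it is correlated with every current member.
            if all(degree(d, member) >= threshold for member in subspace):
                subspace.append(d)
                unclassified.remove(d)
        subspaces.append(subspace)
    return subspaces


# Hypothetical upper triangular correlation table for four dimensions.
TABLE = [
    [4, 4, 0, 0],
    [0, 5, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 0, 2],
]


def ratio(a, b):
    """Stand-in degree: both-active count over at-least-one-active count."""
    both = TABLE[min(a, b)][max(a, b)]
    either = TABLE[a][a] + TABLE[b][b] - both
    return both / either if either else 0.0


low = select_subspaces(TABLE, ratio, 0.5)
high = select_subspaces(TABLE, ratio, 0.9)
```

With the lower threshold the four dimensions fall into two subspaces; with the higher threshold the pair with degree 0.8 is split apart and three subspaces result.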
The correlated degree threshold has great influence on subspace
generation. With a larger threshold, the non-null density of each subspace
will be larger, i.e., the dimensions in the subspace are highly correlated, but
more subspaces will be generated. With a smaller threshold, fewer subspaces
will be generated, but the non-null density of each subspace will be smaller.
In practice, the optimal correlated degree threshold varies across data
sets.
Given the correlation table as shown in Fig. 5, we are able to partition
the original 8-dimensional space into multiple subspaces. At the beginning,
D2 is selected as the first dimension of subspace s1, for D2 has the maximal
correlation table value, i.e., 5. With a sufficiently low threshold, D1, D3,
and D4 will be subsequently added to subspace s1, i.e., s1 = {D1, D2, D3,
D4}. After that, we can use the same strategy to generate the other two
subspaces. With a higher threshold, four subspaces will be generated. We
can see that with the increase of the correlated degree threshold, the number
of subspaces increases at the same time.
Vertical Partition
The HoVer representation is a vertical partition of the
corresponding horizontal representation. The OID attribute exists in each
subspace table for linking the data items which are partitioned into multiple
subspaces. The transformation from the horizontal representation into the
HoVer representation is lossless, since the candidate key, i.e., OID, is
contained in each subspace table. Fig. 6 shows the HoVer representation
corresponding to the horizontal representation shown in Fig. 1 for a given
correlated degree threshold; as shown in Fig. 7, if we increase the threshold
to 0.5, the subspace D1234 will be further split into two subspaces, D12
with objects {1, 2, 3, 4, 6} and D34 with objects {3, 5, 6}. We can see that if
none of the dimensions in a subspace is active in a horizontally represented
tuple, the tuple will be absent from the subspace table after vertical
partition. It is easy to convert horizontally represented data to the HoVer
representation: for each horizontally represented tuple, if at least one
dimension (not including OID) of a subspace is active, OID along with the
subspace dimensions are projected and inserted into the subspace table. For
example, converting the horizontally represented table H shown in Fig. 1 to
the subspace table shown in Fig. 6b can be characterized by a relational
algebraic expression that selects the tuples of H in which at least one
dimension of the subspace is not null and projects them onto OID and the
subspace dimensions.
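The conversion rule, select the tuples with at least one active dimension in the subspace and then project onto OID and the subspace dimensions, can be sketched in a few lines. This is an illustration only: rows are modeled as Python dictionaries, and the function name partition and the subspace names S12 and S3 are hypothetical.

```python
def partition(rows, subspaces):
    """Split horizontal rows into one table per subspace, keeping OID in each."""
    tables = {name: [] for name in subspaces}
    oid_list = []
    for row in rows:
        oid_list.append(row["OID"])
        for name, dims in subspaces.items():
            # Selection: keep the tuple only if some subspace dimension is active.
            if any(row.get(d) is not None for d in dims):
                # Projection: OID plus the dimensions of this subspace.
                tables[name].append({"OID": row["OID"], **{d: row.get(d) for d in dims}})
    return oid_list, tables


rows = [
    {"OID": 1, "D1": "a", "D2": None, "D3": None},
    {"OID": 2, "D1": None, "D2": None, "D3": "c"},
    {"OID": 3, "D1": None, "D2": None, "D3": None},
]
oid_list, tables = partition(rows, {"S12": ["D1", "D2"], "S3": ["D3"]})
```

Object 3 has no active dimension, so it appears in no subspace table and survives only in the OID list.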
Schema Evolution
When a new column is added, a new subspace which contains only
the new column will be created, and the correlation table should also be
updated accordingly. Since the correlation table is incrementally maintained,
the new column may be merged into a subspace when subspaces are
reorganized. When a column is deleted, we only need to delete the column
from the corresponding subspace and update the correlation table
accordingly.
QUERY PROCESSING IN HoVer
In this section, we introduce how the queries over the horizontal
representation can be processed over the HoVer representation.
Query Rewriting
Our ultimate purpose is to define horizontally represented views over
the HoVer representation. Users typically issue traditional SQL queries over
the horizontal view, which can be rewritten into queries over the underlying
HoVer representation. Generally, the reconstruction of the horizontal table H
from the subspace tables can be characterized by a relational algebraic
expression that combines the subspace tables with the OID list, which
contains all the OIDs in the horizontal table. Hence, we should maintain an
OID list during vertical partition. For example, the horizontal table H shown
in Fig. 1 can be reconstructed in this way from the subspace tables shown in
Fig. 6.
In our work, the dimensions in the original sparse data space are
clustered into subspaces, and a horizontal table is
vertically partitioned into subspace tables. In many real-life applications, the
dimensions with a high correlated degree are likely to characterize similar
topics and have a high probability of being accessed together; hence, they
should be stored in the same subspace table. We can take advantage of this
characteristic and access as few subspace tables as possible in query
evaluation.
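One way to realize the horizontal view and the rewriting is sketched below. It is an illustration under assumptions: SQLite stands in for the RDBMS, the view is built with left outer joins from the OID list to each subspace table on OID, and the names oid_list, s12, s3, and h are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE oid_list (oid INTEGER PRIMARY KEY);
    CREATE TABLE s12 (oid INTEGER PRIMARY KEY, d1 TEXT, d2 TEXT);
    CREATE TABLE s3  (oid INTEGER PRIMARY KEY, d3 TEXT);
    INSERT INTO oid_list VALUES (1), (2), (3);
    INSERT INTO s12 VALUES (1, 'a', NULL);
    INSERT INTO s3  VALUES (2, 'c');

    -- Horizontal view: objects missing from a subspace table get NULLs there.
    CREATE VIEW h AS
        SELECT o.oid AS oid, s12.d1 AS d1, s12.d2 AS d2, s3.d3 AS d3
        FROM oid_list AS o
        LEFT JOIN s12 ON s12.oid = o.oid
        LEFT JOIN s3  ON s3.oid  = o.oid;
    """
)

view_rows = conn.execute("SELECT oid, d1, d2, d3 FROM h ORDER BY oid").fetchall()

# A query over the view that only touches d1 ...
via_view = conn.execute("SELECT oid, d1 FROM h WHERE d1 = 'a'").fetchall()
# ... can be rewritten to access a single subspace table.
rewritten = conn.execute("SELECT oid, d1 FROM s12 WHERE d1 = 'a'").fetchall()
```

The rewritten query returns the same rows as the query over the view while reading only the subspace table that holds d1, which is the saving the paragraph above describes.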
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is put
forth with a very general plan for the project and some cost estimates. During system
analysis, the feasibility study of the proposed system is carried out. This is to
ensure that the proposed system is not a burden to the company. For feasibility analysis,
some understanding of the major requirements for the system is essential. Three key
considerations are involved in the feasibility analysis:
1. Economical feasibility
2. Technical feasibility
3. Operational feasibility
ECONOMICAL FEASIBILITY
The study is carried out to check the economic impact that the system will have
on the organization. The amount of funds that the company can pour into the research and
development of the system is limited. The expenditures must be justified. The
developed system is well within the budget, and this was achieved because most of the
technologies used are freely available. Only customized products had to be purchased.
OPERATIONAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the user.
This includes the process of training the user to use the system efficiently. The user must
not feel threatened by the system, but must instead accept it as a necessity. The level of
acceptance by the users solely depends on the methods that are employed to educate the
user about the system and to make the user familiar with it. The user's level of confidence
must be raised so that the user is also able to make some constructive criticism, which is
welcomed, as the user is the final user of the system.
TECHNICAL FEASIBILITY
The technical feasibility study is carried out to check the technical requirements of
the system. Any system developed must not place a high demand on the available
technical resources, as this would lead to high demands being placed on the client.
SYSTEM SPECIFICATION
S/W REQUIREMENTS
Windows XP
MS-SQL server
MS Visual Studio 2005
H/W REQUIREMENTS
Processor : Dual Core
CPU Clock Speed : 651 MHz
External memory : 512 MB (min)
Hard Disk Drive : 40 GB (min)
Mouse : Logitech Mouse
Keyboard : Logitech of 104 Keys
Monitor : 15.6 LCD Monitor
SOFTWARE SPECIFICATION
FRONT END
.NET FRAMEWORK
.NET is a "Software Platform". It is a language-neutral environment
for developing rich .NET experiences and building applications that can
easily and securely operate within it. When developed applications are
deployed, those applications will target .NET and will execute wherever
.NET is implemented instead of targeting a particular Hardware/OS
combination. The components that make up the .NET platform are
collectively called the .NET Framework.
The .NET Framework is a managed, type-safe environment for
developing and executing applications. The .NET Framework manages all
aspects of program execution, such as allocation of memory for the storage of
data and instructions, granting and denying permissions to the application,
managing execution of the application, and reclaiming memory from
resources that are no longer needed.
The .NET Framework is designed for cross-language compatibility.
Cross-language compatibility means that an application written in Visual
Basic .NET may reference a DLL file written in C# (C-Sharp). A Visual
Basic .NET class might be derived from a C# class or vice versa.
The .NET Framework consists of two main components:
Common Language Runtime (CLR)
Class Libraries
COMMON LANGUAGE RUNTIME (CLR)
The CLR is described as the "execution engine" of .NET. It provides
the environment within which the programs run. It is the CLR that manages
the execution of programs and provides core services, such as code
compilation, memory allocation, thread management, and garbage
collection. Through the Common Type System (CTS), it enforces strict type
safety, and it ensures that the code is executed in a safe environment by
enforcing code access security. The software version of .NET is actually the
CLR version.
WORKING OF THE CLR
When a .NET program is compiled, the output of the compiler is not an
executable file but a file that contains a special type of code called the
Microsoft Intermediate Language (MSIL), which is a low-level set of
instructions understood by the common language runtime. MSIL
defines a set of portable instructions that are independent of any specific
CPU. It is the job of the CLR to translate this intermediate code into
executable code when the program is executed, allowing the program to run
in any environment for which the CLR is implemented. That is how the
.NET Framework achieves portability. MSIL is turned into executable
code using a JIT (Just In Time) compiler. The process goes like this:
when a .NET program is executed, the CLR activates the JIT compiler. The
JIT compiler converts MSIL into native code on a demand basis as each part
of the program is needed. Thus the program executes as native code even
though it is compiled into MSIL, allowing the program to run as fast as it
would if it were compiled to native code while keeping the portability
benefits of MSIL.
CLASS LIBRARIES
The class library is the second major entity of the .NET Framework,
and it is designed to integrate with the common language runtime. This
library gives the program access to the runtime environment. The class
library consists of lots of prewritten code that all the applications created in
VB .NET and Visual Studio .NET will use. The code for all the elements like
forms, controls, and the rest in VB .NET applications actually comes from
the class library.
BACK END - SQL
SQL stands for Structured Query Language. SQL is used to
communicate with a database. According to ANSI (American National
Standards Institute), it is the standard language for relational database
management systems. SQL statements are used to perform tasks such as
updating data in a database or retrieving data from a database. Some
common relational database management systems that use SQL are Oracle,
Sybase, Microsoft SQL Server, Access, and Ingres. Although most database
systems use SQL, most of them also have their own proprietary extensions
that are usually used only on that system. However, the standard SQL
commands such as "Select", "Insert", "Update", "Delete", "Create", and
"Drop" can be used to accomplish almost everything that one needs to do
with a database. The sections below outline the basics of several of these
commands.
CREATE A TABLE
To create a new table, enter the keywords create table followed by
the table name, followed by an open parenthesis, followed by the first
column name, the data type for that column, and any optional constraints.
Further column definitions follow in the same form, and the list ends with a
closing parenthesis. Make sure there is an open parenthesis before the first
column definition and a closing parenthesis after the last column
definition, and separate each column definition with a comma. All SQL
statements should end with a ";".
The table and column names must start with a letter and can be
followed by letters, numbers, or underscores - not to exceed a total of 30
characters in length. Do not use any SQL reserved keywords as names for
tables or column names (such as "select", "create", "insert", etc).
Data types specify what type of data a particular column can hold. If
a column called "Last_Name" is to be used to hold names, then that
column should have a "varchar" (variable-length character) data type.
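The create table rules above can be tried out with a minimal sketch such as
the following. It assumes Python's built-in sqlite3 module as a convenient
stand-in for the project's actual SQL back end, and the table name
"employee" and its columns are purely illustrative.

```python
import sqlite3

# In-memory SQLite database, used here only as a convenient SQL engine.
conn = sqlite3.connect(":memory:")

# Illustrative table: each column definition is name, data type, then any
# optional constraints; definitions are separated by commas.
conn.execute(
    """
    create table employee (
        id         integer primary key,   -- optional constraint
        first_name varchar(15),
        last_name  varchar(20) not null,  -- optional constraint
        age        integer
    );
    """
)

# Ask SQLite which columns it recorded for the new table.
columns = [row[1] for row in conn.execute("pragma table_info(employee)")]
print(columns)
```

The pragma call is specific to SQLite; other systems expose the same
information through their own catalog views.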
INSERTING INTO A TABLE
The insert statement is used to insert or add a row of data into the table.
To insert records into a table, enter the keywords insert into followed
by the table name, followed by an open parenthesis, followed by a list of
column names separated by commas, followed by a closing parenthesis,
followed by the keyword values, followed by the list of values enclosed in
parentheses. The values that you enter are stored in the row, matched in
order with the column names that you specify. Strings should be enclosed
in single quotes; numbers should not.
insert into "tablename" (first_column,...last_column) values
(first_value,...last_value);
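As a small illustration of the insert syntax, the sketch below again
assumes Python's sqlite3 module and a hypothetical "employee" table. It
shows the literal form described above and, as a common alternative, a
parameterized form in which the driver takes care of quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table employee (first_name varchar(15), last_name varchar(20), age integer);"
)

# Literal form: strings in single quotes, numbers unquoted.
conn.execute(
    "insert into employee (first_name, last_name, age) values ('Luke', 'Duke', 45);"
)

# Parameterized form: values are passed separately from the SQL text.
conn.execute(
    "insert into employee (first_name, last_name, age) values (?, ?, ?);",
    ("Mary", "Ann", 32),
)

rows = conn.execute(
    "select first_name, last_name, age from employee order by age;"
).fetchall()
print(rows)
```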
UPDATING RECORDS
The update statement is used to update or change records that match
specified criteria. The records to change are selected by a carefully
constructed where clause.
update "tablename"
set "columnname" = "newvalue" [,"nextcolumn" = "newvalue2"...]
where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];
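The update template can be exercised with a short sketch like the one
below, assuming Python's sqlite3 module and an illustrative "employee"
table. Only the row matching both conditions of the where clause is
changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table employee (last_name varchar(20), age integer, title varchar(20));"
)
conn.executemany(
    "insert into employee (last_name, age, title) values (?, ?, ?);",
    [("Duke", 45, "Clerk"), ("Ann", 32, "Clerk"), ("Lee", 58, "Manager")],
)

# Two columns are set at once; the where clause combines two conditions.
cur = conn.execute(
    "update employee set title = 'Senior Clerk', age = age + 1 "
    "where title = 'Clerk' and age >= 40;"
)
updated = cur.rowcount  # number of rows the statement changed

rows = conn.execute(
    "select last_name, age, title from employee order by last_name;"
).fetchall()
print(updated, rows)
```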
DELETING RECORDS
The delete statement is used to delete records or rows from the table.
delete from "tablename"
where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];
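A minimal sketch of the delete template follows, assuming Python's sqlite3
module and an illustrative "employee" table. Rows that satisfy either
condition of the where clause are removed; the rest are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (last_name varchar(20), age integer);")
conn.executemany(
    "insert into employee (last_name, age) values (?, ?);",
    [("Duke", 45), ("Ann", 32), ("Lee", 58)],
)

# Delete every row that is under 40 or belongs to 'Lee'.
cur = conn.execute("delete from employee where age < 40 or last_name = 'Lee';")
deleted = cur.rowcount  # number of rows removed

rows = conn.execute("select last_name, age from employee;").fetchall()
print(deleted, rows)
```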
DROP A TABLE
The drop table command is used to delete a table and all rows in the table.
To delete an entire table including all of its rows, issue the drop table
command followed by the table name. Dropping a table is different from
deleting all of the records in the table. Deleting all of the records leaves
the table in place, including its column and constraint information.
Dropping the table removes the table definition as well as all of its rows.
drop table "tablename";
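The difference between deleting all rows and dropping the table can be
checked with the sketch below. It assumes Python's sqlite3 module; the
helper table_exists is a hypothetical convenience that looks the table up
in SQLite's own catalog (sqlite_master), which other systems expose under
different names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employee (last_name varchar(20), age integer);")
conn.execute("insert into employee (last_name, age) values ('Duke', 45);")


def table_exists(name):
    """Return True if SQLite's catalog still lists a table with this name."""
    row = conn.execute(
        "select count(*) from sqlite_master where type = 'table' and name = ?;",
        (name,),
    ).fetchone()
    return row[0] == 1


# Deleting every row keeps the table definition.
conn.execute("delete from employee;")
exists_after_delete = table_exists("employee")
remaining = conn.execute("select count(*) from employee;").fetchone()[0]

# Dropping the table removes the definition as well.
conn.execute("drop table employee;")
exists_after_drop = table_exists("employee")

print(exists_after_delete, remaining, exists_after_drop)
```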
List of Modules
Data entry:
Get the details of user. To process the database we need many records
to show the efficiency of our Hover method. In this module we get the datas
from the user to process the data.
Horizontal Representation:
The horizontal format is straightforward and can be easily
implemented. However, it is not suitable for sparse databases, because it
suffers from sparsity and frequent schema evolution, so its space and time
performance may not be satisfactory. In addition, the number of columns in
a horizontal table is typically limited to about 1,000 in general commercial
DBMSs, which is not enough for many real-life applications.
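To make the sparsity problem concrete, the sketch below stores four
objects in a single horizontal table and measures how many cells are null.
It assumes Python's sqlite3 module; the table "item_h", its attributes,
and the data are invented for illustration and are not taken from the
project's data sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One column per attribute; every object gets a row, whether or not the
# attribute applies to it.
conn.execute(
    "create table item_h (oid integer primary key, megapixels integer, "
    "zoom integer, author varchar(20), pages integer);"
)
conn.executemany(
    "insert into item_h values (?, ?, ?, ?, ?);",
    [
        (1, 12, 3, None, None),        # a camera: no book attributes
        (2, None, None, "Codd", 300),  # a book: no camera attributes
        (3, 20, 5, None, None),
        (4, None, None, "Date", 450),
    ],
)

attrs = ["megapixels", "zoom", "author", "pages"]
row_count = conn.execute("select count(*) from item_h;").fetchone()[0]
nulls = sum(
    conn.execute(f"select count(*) from item_h where {a} is null;").fetchone()[0]
    for a in attrs
)
null_fraction = nulls / (row_count * len(attrs))
print(null_fraction)
```

Even in this tiny example half of the attribute cells are null; with
thousands of attributes and only a few active per object, the fraction
approaches one.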
Vertical representation:
In the vertical format, each active dimension of an object is
represented by a tuple of the object identifier, the attribute name, and the
value. The vertical format can scale to thousands of attributes, avoids
storing null values, and supports evolving schemas. However, writing
queries over this format is cumbersome and error-prone, and an expensive
multiway self-join is needed if the objects in the query result must be
returned in the conventional horizontal format.
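The following sketch illustrates the vertical format and the self-join it
forces on queries. It assumes Python's sqlite3 module; the table "item_v"
and its data are invented for illustration. Returning just two attributes
in horizontal form already needs one copy of the table per attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per (object, active attribute); null values are simply absent.
conn.execute(
    "create table item_v (oid integer, attr varchar(20), val varchar(20));"
)
conn.executemany(
    "insert into item_v (oid, attr, val) values (?, ?, ?);",
    [
        (1, "megapixels", "12"), (1, "zoom", "3"),
        (2, "author", "Codd"), (2, "pages", "300"),
        (3, "megapixels", "20"), (3, "zoom", "5"),
    ],
)

# Rebuilding horizontal rows takes a self-join: one alias per attribute.
cameras = conn.execute(
    """
    select m.oid, m.val, z.val
    from item_v as m
    join item_v as z on z.oid = m.oid
    where m.attr = 'megapixels' and z.attr = 'zoom'
    order by m.oid;
    """
).fetchall()
print(cameras)
```

Note that all values share one column type here (text), which is another
practical cost of the vertical layout.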
HoVer representation:
HoVer stands for Horizontal representation over Vertically
partitioned subspaces. The HoVer representation can find a better
intermediate ground between the horizontal representation and the vertical
representation if there are dimension correlations to be exploited. In
HoVer, we first vertically partition the data set into multiple
lower-dimensional subspaces, and objects are then represented in horizontal
format in the subspace tables. Partitioning the sparse data space into
meaningful subspaces is a nontrivial task.
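A rough sketch of the idea, under several assumptions: Python's sqlite3
module is used as the SQL engine, the two subspaces ("sub_camera" and
"sub_book") are chosen by hand rather than derived from measured dimension
correlations, and the view "item" is only one possible way to expose the
original horizontal schema. It is not necessarily how the project rewrites
queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hand-picked subspaces: attributes that tend to occur together.
    create table sub_camera (oid integer primary key, megapixels integer, zoom integer);
    create table sub_book   (oid integer primary key, author varchar(20), pages integer);

    insert into sub_camera values (1, 12, 3);
    insert into sub_camera values (3, 20, 5);
    insert into sub_book   values (2, 'Codd', 300);
    insert into sub_book   values (4, 'Date', 450);

    -- Horizontal view over the subspace tables.
    create view item as
    select o.oid, c.megapixels, c.zoom, b.author, b.pages
    from (select oid from sub_camera union select oid from sub_book) as o
    left join sub_camera as c on c.oid = o.oid
    left join sub_book   as b on b.oid = o.oid;
    """
)

# A query written against the horizontal view ...
via_view = conn.execute(
    "select oid, megapixels from item where zoom >= 5 order by oid;"
).fetchall()

# ... and the same query rewritten to touch only the relevant subspace.
rewritten = conn.execute(
    "select oid, megapixels from sub_camera where zoom >= 5 order by oid;"
).fetchall()

print(via_view, rewritten)
```

Because the query only mentions attributes from one subspace, the
rewritten form needs no join at all, and neither subspace table stores
nulls for this data.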
DFD:
ARCHITECTURE:
SYSTEM FLOW DIAGRAM:
INPUT DESIGN
Input design is one of the most important phases of system design. It
is the process in which the inputs received by the system are planned and
designed so as to obtain the necessary information from the user while
eliminating information that is not required. The aim of input design is to
ensure the highest possible level of accuracy and to ensure that the input is
accessible to and understood by the user.
Input design is the part of overall system design that requires very
careful attention. If the data going into the system is incorrect, the
processing and output will magnify the errors.
The objectives considered during input design are:
Nature of input processing.
Flexibility and thoroughness of validation rules.
Handling of properties within the input documents.
Screen design to ensure accuracy and efficiency of the
input relationship with files.
Careful design of the input also involves attention to
error handling, controls, batching and validation
procedures.
Input design features can ensure the reliability of the system and
produce results from accurate data, or they can result in the production of
erroneous information.
OUTPUT DESIGN
The term output applies to information produced by an information
system, whether printed or displayed. When designing the output, we should
identify the specific output needed to meet the information requirements,
select a method to present the information, and create a document, report,
or other format that contains the information produced by the system.
TYPES OF OUTPUT
Whether the output is a formatted report or a simple listing of the
contents of a file, a computer process will produce the output.
A Document
A Message
Retrieval from a data store
Transmission from a process or system activity
Directly from an output source
The output of our project will be an object detection with separate part
detection.
SOFTWARE TESTING FUNDAMENTALS
Testing presents an interesting task for software engineers. Earlier in
the software process, the engineer attempts to build software from an
abstract concept to a tangible implementation. During testing, the engineer
creates a series of test cases that are intended to demolish the software
that has been built.
To test any program we need a description of its expected behavior
and a method of determining whether the observed behavior conforms to
the expected behavior. For this we need a test oracle. A test oracle is a
mechanism, different from the program itself, that can be used to check
the correctness of the program's output for the test cases. A human oracle
is a person who computes, mostly by hand, what the output of the program
should be. Human oracles can make mistakes, so a test oracle is defined in
the tool to automate testing and avoid such mistakes.
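A minimal sketch of an automated test oracle follows, written in Python.
Both functions are hypothetical and are not part of the project: the
"program under test" computes the fraction of null cells in a table, and
the oracle computes the same quantity independently in a deliberately
naive way so that the two can be compared on every test case.

```python
def null_fraction(rows):
    """Program under test: fraction of None cells in a list of rows."""
    cells = [cell for row in rows for cell in row]
    return sum(cell is None for cell in cells) / len(cells)


def oracle_null_fraction(rows):
    """Test oracle: an independent, step-by-step computation of the same value."""
    nulls = 0
    total = 0
    for row in rows:
        for cell in row:
            total += 1
            if cell is None:
                nulls += 1
    return nulls / total


# Each test case is an input table; the oracle supplies the expected output.
test_cases = [
    [(1, None), (None, None)],
    [(1, 2, 3)],
    [(None,), (None,)],
]

verdicts = [null_fraction(case) == oracle_null_fraction(case) for case in test_cases]
print(verdicts)
```

The oracle is only useful if it is simpler or more trustworthy than the
program it checks; sharing code between the two would defeat its purpose.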
Testing principles
All tests should be traceable to requirements.
Tests should be planned long before testing begins; that is, test
planning can begin as soon as the requirement model is complete.
Testing should begin in the small and progress towards testing in the
large. The first tests planned and executed generally focus on individual
program modules. As testing progresses, the focus shifts to finding errors
in integrated clusters of modules and ultimately in the entire system.
UNIT TESTING
In unit testing, the programs that make up the system are tested; for
this reason it is sometimes called program testing. The software units in a
system are the modules and routines that are assembled and integrated to
perform a specific function. Unit testing focuses on each module
independently of the others, to locate the errors in coding and logic that
are contained within that module alone. It is carried out by setting
breakpoints in the code so that the location of an error is easy to find when
the input is given. The unit test is always white-box oriented. Since each
module in the system receives input and generates output, test cases are
needed to cover the expected range of values. This system is divided into a
sales module and a purchase module; both modules were tested separately,
and the unit tests were successful.
ACCEPTANCE TESTING
This is the final stage in the testing process before the system is
accepted for operational use. The system is tested with data supplied by
the system procurer rather than with simulated test data. Acceptance testing
may reveal errors and omissions in the system requirements definition,
because the real data exercise the system in different ways from the test
data. Acceptance testing may also reveal requirements problems where the
system's facilities do not really meet the users' needs or the system
performance is unacceptable. This system, however, met all the
requirements of the user and performed well.
INTEGRATION TESTING
Integration-level testing focuses on the transfer of data and control
across a program's internal and external interfaces. External interfaces are
those with other software, system hardware, and the users, and can be
described as communication links.
PERFORMANCE TESTING
Performance testing helps ensure that a product performs its functions
at the required speed. Planning for performance testing starts at the
beginning of the project, when product goals and requirements are defined.
Performance testing is a part of the product's initial engineering plan.
SYSTEM TESTING
System-level testing demonstrates that all specified functionality
exists and that the software product is trustworthy. This testing verifies the
as-built program's functionality and performance with respect to the
requirements for the software product as exhibited on the specified operating
platform(s). System-level software testing addresses functional concerns and
the following elements of a device's software that are related to the intended
use(s):
Performance issues (e.g. response times, reliability measurements).
Response to stress conditions, e.g. behavior under maximum load and
continuous use.
Operation of internal and external security features.
Effectiveness of recovery procedures, including disaster recovery.
Usability.
Compatibility with other software products.
Behavior in each of the defined hardware configurations.
Accuracy of documentation.
Test Plan
Before testing begins, the type of testing is decided. For this
system, unit testing is carried out. The following checks are taken into
consideration:
To ensure that information flows properly into and out of the program.
To find out whether the local data structures maintain their integrity
during all steps of an algorithm's execution.
To ensure that the module operates properly at the boundaries established
to limit or restrict processing.
To find out whether all statements in the module have been executed
at least once.
To find out whether error-handling paths are working correctly.
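The boundary and error-handling checks in the plan above can be sketched
as unit tests. The example below uses Python's unittest module; the
function check_column_count and the limit MAX_COLUMNS are hypothetical
and stand in for whichever module of the system is under test.

```python
import unittest

MAX_COLUMNS = 1000  # assumed limit, used only for this illustration


def check_column_count(n):
    """Hypothetical unit under test: validate a requested number of columns."""
    if not isinstance(n, int):
        raise TypeError("column count must be an integer")
    if n < 1 or n > MAX_COLUMNS:
        raise ValueError("column count out of range")
    return n


class CheckColumnCountTest(unittest.TestCase):
    def test_values_on_the_boundaries_are_accepted(self):
        self.assertEqual(check_column_count(1), 1)
        self.assertEqual(check_column_count(MAX_COLUMNS), MAX_COLUMNS)

    def test_values_just_outside_the_boundaries_are_rejected(self):
        for bad in (0, MAX_COLUMNS + 1):
            with self.assertRaises(ValueError):
                check_column_count(bad)

    def test_error_handling_path_for_wrong_type(self):
        with self.assertRaises(TypeError):
            check_column_count("12")


# Run the suite programmatically and keep the result for inspection.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CheckColumnCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

Together the three tests execute every statement of the function at least
once, which also addresses the statement-coverage item in the plan.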
TEST CASES
A test case is a set of conditions or variables under which a tester will
determine whether a requirement or use case of an application is partially
or fully satisfied. It may take many test cases to determine that a
requirement is fully satisfied. To fully test that all the requirements of an
application are met, there must be at least one test case for each
requirement, unless a requirement has sub-requirements. In that situation,
each sub-requirement must have at least one test case. What characterizes a
written test case is that it has a known input and an expected output, which
is worked out before the test is executed. The known input should test a
precondition, and the expected output should test a postcondition. Test
cases uncover the following categories of errors:
Erroneous initialization or default values and inconsistent data types
Incorrect (misspelled or truncated) variable names
Underflow, overflow, and addressing exceptions
SYSTEM IMPLEMENTATION
Implementation is the most crucial stage in achieving a successful
system and giving the users confidence that the new system is workable and
effective. This type of conversion is relatively easy to handle, provided
there are no major changes in the system.
Each program is tested individually at the time of development using
test data, and it is verified that the programs link together in the way
specified in the program specification. The computer system and its
environment are then tested to the satisfaction of the user. The system that
has been developed has been accepted and proved satisfactory for the user,
so it is going to be implemented very soon. A simple operating procedure is
included so that the user can understand the different functions clearly and
quickly.
As a first step, the executable form of the application is created and
loaded on the common server machine, which is accessible to all users, and
the server is connected to a network. The final stage is to document the
entire system, describing its components and operating procedures.
Implementation is the stage of the project when the theoretical design
is turned into a working system. It can therefore be considered the most
critical stage in achieving a successful new system and in giving the user
confidence that the new system will work and be effective. The file is
downloaded from the server, which takes minimal time for retrieval.
Conclusion:
In this project, we have addressed the problem of efficient query
processing over sparse databases. To alleviate the problems caused by the
sparsity and high dimensionality of sparse data, we proposed a new
approach named HoVer. According to the characteristics of sparse data sets,
we vertically partition the high-dimensional sparse data into multiple
lower-dimensional subspaces, such that the dimensions within each subspace
are highly correlated. The experimental results show that our proposed
scheme finds correlated subspaces effectively and yields superior storage
and query performance for sparse databases.