
Exploring Correlated Subspaces for Efficient Query Processing in Sparse Databases


    Abstract

Sparse data is becoming increasingly common in many real-life applications. However, relatively little attention has been paid to modeling sparse data effectively, and existing approaches such as the conventional "horizontal" and "vertical" representations fail to provide satisfactory performance for both storage and query processing, as they are too rigid and generally do not consider dimension correlations. In this project, we propose a new approach, named HoVer, to store and query sparse datasets in an unmodified RDBMS, where HoVer stands for Horizontal representation over Vertically partitioned subspaces. Guided by the dimension correlations of a sparse dataset, a novel mechanism vertically partitions a high-dimensional sparse dataset into multiple lower-dimensional subspaces such that dimensions are highly correlated within each subspace and largely unrelated across subspaces. Original data objects can therefore be represented in the horizontal format within their respective subspaces. With the HoVer representation, users write SQL queries over the original horizontal view, and these queries can easily be rewritten into queries over the subspace tables. Experiments over synthetic and real-life datasets show that our approach is effective in finding correlated subspaces and yields superior performance for the storage and querying of sparse data.


    Introduction

With continuous advances in network and storage technology, there has been dramatic growth in the amount of very high-dimensional sparse data from a variety of new application domains, such as bioinformatics, time series, and, perhaps most importantly, e-commerce, which poses significant challenges to RDBMSs. The main characteristics of these sparse data sets may be summarized as follows. High dimensionality: the dimensionality of feature vectors may be very high, i.e., the number of possible attributes over all objects is huge. For example, in some e-commerce applications, each participant may declare their own idiosyncratic attributes for the products, which results in data sets that have thousands of attributes. Sparsity: each object may have only a small subset of attributes, called its active dimensions, i.e., significant values appear only in a few active dimensions; in addition, different objects may have different active dimensions. For example, an e-commerce data set may have thousands of attributes, most of which are null and only a few of which apply to a particular product. Correlation: since each object has only a few active dimensions, similar objects are likely to share the same or similar active dimensions. For example, in recommendation systems, it is important to find homogeneous groups of users with similar ratings over subsets of the attributes. Therefore, it is possible to find certain subspaces shared by similar objects.

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object misses a particular attribute, the corresponding column in the row for that object is null. Storing a sparse data set using the horizontal format is straightforward and easy to implement. However, the format is not suitable for sparse databases, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications. If the number of columns in a horizontal table exceeds 1,000, a record may not fit in a single disk page, and the resulting page overflow will significantly degrade performance. In the past decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance. Nevertheless, the approach proposed in this project still uniformly outperforms the horizontal representation.

An alternative is known as the vertical format, called the vertical representation in this project. In this format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multi-way self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format.
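To make the contrast concrete, the following sketch shows the two conventional representations and the kind of self-join the vertical format requires; the table names H and V, the dimension names D1 through D8, and the integer value type are illustrative only, not taken from the project.

CREATE TABLE H (OID INT PRIMARY KEY, D1 INT, D2 INT, D3 INT, D4 INT,
                D5 INT, D6 INT, D7 INT, D8 INT);   -- horizontal: mostly NULL in practice

CREATE TABLE V (OID INT, attr VARCHAR(30), val INT); -- vertical: one row per active dimension

-- Returning objects with D1 = 1 and D2 = 2 in horizontal form from the vertical
-- table already needs one self-join per requested attribute:
SELECT v1.OID, v1.val AS D1, v2.val AS D2
FROM   V v1
JOIN   V v2 ON v1.OID = v2.OID
WHERE  v1.attr = 'D1' AND v1.val = 1
  AND  v2.attr = 'D2' AND v2.val = 2;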

From the above discussion, we know that both the horizontal and the vertical representations have advantages and disadvantages: the horizontal representation has many nulls but simple queries, while the vertical representation has no nulls but more complex queries. An optimal representation should therefore retain the advantages and alleviate the drawbacks of both. In this project, we propose a new approach which combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces. The HoVer representation can efficiently find a middle ground between the horizontal and the vertical representations when there are dimension correlations to be exploited. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in the horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task; however, sparse data sets generally have some helpful properties, such as sparsity and correlation.

Therefore, we can design an effective mechanism to split the data space into multiple subspaces. We define the correlated degree between dimensions and cluster highly correlated dimensions into a subspace. After partitioning, the original sparse data set can be transformed into the HoVer format. The horizontal, HoVer, and vertical approaches can all be framed over an unmodified RDBMS: users write SQL queries over the conventional horizontal view; for the HoVer and vertical approaches, these SQL queries are rewritten into queries over the subspace tables and the vertical table stored by the unmodified RDBMS, respectively, and the query results returned by the RDBMS are all in the horizontal format. A comprehensive experimental study demonstrates the superiority of our approach, as it fully utilizes the properties of sparse data.


    Existing system:

In existing RDBMSs, objects are conventionally stored using a horizontal format, called the horizontal representation in this project. In this format, one column corresponds to an attribute, and if an object misses a particular attribute, the corresponding column in the row for that object is null. Fig. 1 shows an example of storing a sparse data set using the horizontal format, which is straightforward and easy to implement. The format is not suitable for sparse databases, however, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications. If the number of columns in a horizontal table exceeds 1,000, a record may not fit in a single disk page, and the resulting page overflow will significantly degrade performance. In the past decade, commercial RDBMSs such as DB2, SQL Server, and Oracle have improved their null storage and handling capabilities, which results in smaller horizontal tables and better query performance.


An alternative is known as the vertical format, called the vertical representation in this project. Fig. 2 shows an example of storing a sparse data set using the vertical format; each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over this format is cumbersome and error-prone, and an expensive multi-way self-join must be conducted if the objects in the query result need to be returned in the conventional horizontal format. From the above discussion, we know that both the horizontal and the vertical representations have advantages and disadvantages: the horizontal representation has many nulls but simple queries, and the vertical representation has no nulls but more complex queries. Therefore, an optimal representation should retain the advantages and alleviate the drawbacks of both. In this project, we propose a new approach which combines the horizontal and the vertical representations and can store and query sparse data sets in an unmodified RDBMS. This novel representation is named HoVer, which stands for Horizontal representation over vertically partitioned subspaces.


    Proposed system:

    THE HoVer REPRESENTATION

    As we introduced previously, the pure horizontal or vertical

    representation may yield unsatisfactory performance in sparse databases.

    Therefore, we propose a new representation called HoVer, which can

    effectively exploit the characteristics of sparse data sets, such as sparsity and

    dimension correlation. We aim at achieving good space and time

    performance for storing and querying high-dimensional sparse data sets.

    Although the dimensionality of sparse data sets could be very high, up

    to thousands, a single data object typically has only a few active dimensions,

    and similar objects have a better chance to share similar active dimensions.

    A closer inspection of many e-commerce sparse data sets shows that typical

    e-commerce data sets have a wide variety of items which can be organized

    into categories and the categories themselves are hierarchically grouped;

    items that belong to a common category are likely to have common

    attributes, while those within the same subcategory are likely to have more

    common attributes. The RDF data also shows that the attributes of similar


    subjects tend to be defined together. This motivates us to find certain

    subspaces which are shared by similar data groups, and to split the full space

    into some lower-dimensional subspaces.

There are some previous research works which focus on subspace clustering. In general, subspace clustering is the task of automatically detecting all clusters in the original feature space, either by directly computing the subspace clusters or by selecting interesting subspaces for clustering. However, such approaches are very time-inefficient and hence cannot scale well to high-dimensional spaces. For example, one previously proposed algorithm takes 5 hours on a 30-dimensional data set, jumping to 30 hours on a 50-dimensional data set.


Considering sparse data sets with thousands of dimensions, such approaches are unacceptable in real-life applications. Moreover, our purpose is to split the full space into subspaces that yield superior performance for the storage and querying of sparse data, a scenario for which these clustering approaches are not suitable. Here we introduce how to represent sparse data sets using the novel HoVer representation. First, we design an efficient and effective approach to find correlated dimensions. After that, we partition the original full space into subspaces and store the original sparse data set using multiple tables, where each table corresponds to a certain subspace.

    Correlated Degree Determination

Before subspace selection, we first consider how to measure the correlation between two dimensions. Suppose that the sparse data set is d-dimensional and has N tuples; we generate a table to represent the relations between the dimensions of the data set. We call this table the correlation table for ease of presentation.

Definition 1 (Correlation Table). The correlation table C represents the correlation of dimensions in a sparse data set and is an upper triangular matrix. An entry C[i][j], where i ≤ j, counts the number of tuples in which dimensions i and j are active simultaneously; the diagonal entry C[i][i] counts the number of tuples in which dimension i is active.


Given the sparse data set shown in Fig. 1, we can generate the corresponding correlation table shown in Fig. 5. For example, C[1][1] = 4 means that dimension 1 is active in four tuples, and C[1][3] = 1 means that dimensions 1 and 3 are active simultaneously only once. Algorithm 1 illustrates an efficient way to generate the correlation table for a sparse data set. We first initialize the correlation table C, setting each entry of the upper triangular matrix to 0. After that, the sparse data set is scanned, and the tuples in the data set are processed one by one. Each tuple is converted into an array of length d; in this array, an entry is set to 1 if the corresponding dimension is active and to 0 otherwise. With this array, we can accumulate the correlation information into the correlation table: for each active dimension i, we access the i-th row of the upper triangular matrix, scan the array, and increase C[i][j] by 1 for every dimension j that is also active. The algorithm is very efficient since the sparse data set only needs to be scanned once; it is also time-efficient because no distance computation is involved.
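As a set-based alternative to the single scan of Algorithm 1, the same correlation table can also be derived directly from a vertical table such as the illustrative V(OID, attr, val) above; this is only a sketch and assumes that (OID, attr) pairs are unique in V.

-- Entry (i, j) with i <= j counts the tuples in which dimensions i and j are
-- active simultaneously; the diagonal (i = j) counts how often dimension i is active.
SELECT a.attr AS i, b.attr AS j, COUNT(*) AS cnt
INTO   CorrTable
FROM   V a
JOIN   V b ON a.OID = b.OID AND a.attr <= b.attr
GROUP BY a.attr, b.attr;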


After the correlation table is created, it can be incrementally maintained in the presence of updates: we only need to revise the entries of the correlation table that correspond to pairs of columns of a row which cease to be active simultaneously or begin to be active simultaneously. In the presence of insertions and deletions, the table can be maintained in a similar way.

The information in the correlation table can be utilized to evaluate the correlation between any two dimensions. We first define the correlated degree between two dimensions, which facilitates subspace partitioning of high-dimensional sparse data.


Definition 2 (Correlated Degree). The correlated degree corr(i, j) measures the correlation between two dimensions i and j, where i ≠ j; it is the ratio of the tuples in which i and j are active simultaneously to the tuples in which at least one of the two dimensions is active, i.e., corr(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j]).

A simpler probability-based measure, such as the ratio of tuples in which the two dimensions are active simultaneously over all tuples (with a slight abuse of terminology, this characterizes a ratio rather than a probability), might seem a good choice: according to probability theory, as its value increases, the correlation between the two dimensions increases, and the dimensions are independent at a particular value. However, such a value is highly influenced by the active densities of the two dimensions and cannot accurately measure the correlation in some cases, which makes it ineligible. The variation used in Definition 2 instead takes the ratio only over the tuples in which at least one of the two dimensions is active; in particular, if two dimensions are almost never active in the same tuples, the correlated degree stays close to zero, so the dimensions cannot appear positively correlated. According to our correlation measure criteria and the above analysis, we select the correlated degree defined in Definition 2 to measure the correlation between two dimensions in sparse data sets: it satisfies 0 ≤ corr(i, j) ≤ 1, and as corr(i, j) increases, the correlation between dimensions i and j increases at the same time.
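Continuing the illustrative schema, the correlated degree of every ordered pair of dimensions can be materialized from the correlation table as follows; this is a sketch, and a pair that is never co-active simply receives degree 0.

-- degree(i, j) = C[i][j] / (C[i][i] + C[j][j] - C[i][j])
SELECT di.i AS i, dj.i AS j,
       CAST(ISNULL(p.cnt, 0) AS FLOAT)
         / (di.cnt + dj.cnt - ISNULL(p.cnt, 0)) AS degree
INTO   CorrDegree
FROM   CorrTable di                                  -- diagonal entry of dimension i
JOIN   CorrTable dj                                  -- diagonal entry of dimension j
       ON di.i = di.j AND dj.i = dj.j AND di.i <> dj.i
LEFT JOIN CorrTable p                                -- co-activity count, absent => 0
       ON  p.i = CASE WHEN di.i < dj.i THEN di.i ELSE dj.i END
       AND p.j = CASE WHEN di.i < dj.i THEN dj.i ELSE di.i END;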

    Subspace Selection

An optimal subspace partitioning should enjoy two properties: all dimensions should be highly correlated within a subspace and largely unrelated across subspaces. If the number of subspaces chosen by the user is too small, dimensions which are not highly correlated may be clustered into the same subspace; hence, the subspace tables remain very sparse. On the other hand, if the number of subspaces chosen by the user is too large, dimensions which are highly correlated may be distributed into different subspaces; since highly correlated dimensions are often defined and accessed together, the join operations needed to access dimensions that are spread over different subspaces are rather expensive.

According to the above analysis, the number of subspaces should be determined by the subspace selection algorithm itself, based on the dimension correlations of the sparse data set. Because the underlying storage and query processing details of the RDBMS may influence performance, a perfect subspace clustering typically does not exist. Therefore, the main aim of our subspace selection algorithm is to find the subspaces efficiently while yielding superior performance for the storage and querying of sparse data. First of all, any two dimensions in a subspace should be highly correlated, which ensures that the subspace tables no longer suffer from sparsity. Next, in order to ensure that highly correlated dimensions, which are often defined and accessed together, are clustered into the same subspace, the number of subspaces should be as small as possible.

Therefore, our subspace selection problem can be formally defined as follows: given a correlated degree threshold and a sparse data set with d dimensions, we partition the original full space into m subspaces s1, s2, ..., sm such that every dimension belongs to exactly one subspace, i.e., the subspaces are pairwise disjoint and together cover all dimensions; our objective is that the correlated degree between any two dimensions in a subspace is no less than the threshold and the number of subspaces m is minimized.


Our subspace selection problem can be mapped to the Minimum Clique Partition problem. Given a graph G = (V, E), the Minimum Clique Partition problem partitions V into disjoint subsets V1, V2, ..., Vm; the objective is that, for each Vi, the subgraph induced by Vi is a complete graph, and the number of partitions m is minimized. If we map each dimension in the sparse data set to a node in the graph and add an edge between two nodes whenever the correlated degree between the corresponding dimensions is no less than the correlated degree threshold, then our subspace selection problem is exactly the Minimum Clique Partition problem. Unfortunately, the Minimum Clique Partition problem is NP-complete, which means that we should use a heuristic algorithm that approximates optimal partitions by trying to group correlated dimensions together.

Algorithm 2 presents how to generate subspaces from a given correlation table in a heuristic manner. While there exist unclassified dimensions, i.e., dimensions not yet included in any subspace, we pick the unclassified dimension D with the highest correlation table value, which is the most active dimension left in the sparse data set. Then a new subspace s is generated, and all remaining unclassified dimensions are examined: if the correlated degree between an unclassified dimension d and every dimension d' already in s is not less than the given correlated degree threshold, then d is added to subspace s. The algorithm clearly ensures that the correlated degree between any two dimensions in a subspace is no less than the correlated degree threshold, and it minimizes the number of subspaces in a greedy manner, i.e., it tries to add as many unclassified dimensions as possible to the current subspace.
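The following T-SQL sketch mirrors this greedy procedure over the illustrative CorrTable and CorrDegree tables built above; the threshold value 0.4 and the examination order (most active first) are example choices, not prescribed by Algorithm 2.

DECLARE @delta FLOAT, @s INT, @seed VARCHAR(30), @cand VARCHAR(30);
SET @delta = 0.4;           -- example correlated degree threshold
SET @s = 0;

-- working list of dimensions; subspaceId is filled in below
SELECT i AS dim, cnt AS activeCnt, CAST(NULL AS INT) AS subspaceId
INTO   #assign
FROM   CorrTable
WHERE  i = j;

WHILE EXISTS (SELECT 1 FROM #assign WHERE subspaceId IS NULL)
BEGIN
    SET @s = @s + 1;

    -- seed the new subspace with the most active unclassified dimension
    SELECT TOP 1 @seed = dim
    FROM   #assign
    WHERE  subspaceId IS NULL
    ORDER BY activeCnt DESC;
    UPDATE #assign SET subspaceId = @s WHERE dim = @seed;

    -- examine the remaining unclassified dimensions one by one
    DECLARE cand CURSOR LOCAL STATIC FOR
        SELECT dim FROM #assign WHERE subspaceId IS NULL ORDER BY activeCnt DESC;
    OPEN cand;
    FETCH NEXT FROM cand INTO @cand;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- add the candidate only if its correlated degree with every dimension
        -- already in subspace @s reaches the threshold
        IF NOT EXISTS (SELECT 1
                       FROM   #assign m
                       JOIN   CorrDegree d ON d.i = m.dim AND d.j = @cand
                       WHERE  m.subspaceId = @s AND d.degree < @delta)
            UPDATE #assign SET subspaceId = @s WHERE dim = @cand;
        FETCH NEXT FROM cand INTO @cand;
    END
    CLOSE cand;
    DEALLOCATE cand;
END

SELECT dim, subspaceId FROM #assign ORDER BY subspaceId, dim;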


The correlated degree threshold has a great influence on the subspace generation. With a larger threshold, the non-null density of each subspace will be larger, i.e., the dimensions in each subspace are more highly correlated, but more subspaces will be generated. With a smaller threshold, fewer subspaces will be generated, but the non-null density of each subspace will be smaller. In practice, the optimal correlated degree threshold varies for different data sets.

Given the correlation table shown in Fig. 5, we are able to partition the original 8-dimensional space into multiple subspaces. At the beginning, D2 is selected as the first dimension of subspace s1, since D2 has the maximal correlation table value, i.e., 5. With a moderate threshold, D1, D3, and D4 will subsequently be added to subspace s1, i.e., s1 = {D1, D2, D3, D4}. After that, we can use the same strategy to generate two further subspaces. If the threshold is increased, four subspaces will be generated instead. We can see that as the correlated degree threshold increases, the number of subspaces increases at the same time.

    Vertical Partition

The HoVer representation corresponds to a vertical partition of the original horizontal representation. The OID attribute exists in each subspace table to link the data items that are partitioned across multiple subspaces. The transformation from the horizontal representation to the HoVer representation is lossless, since the candidate key, i.e., OID, is contained in each subspace table. Fig. 6 shows the HoVer representation corresponding to the horizontal representation shown in Fig. 1 under a given correlated degree threshold; as shown in Fig. 7, if we increase the threshold to 0.5, the subspace D1234 will be further split into two subspaces, D12 with objects {1, 2, 3, 4, 6} and D34 with objects {3, 5, 6}. We can see that if none of the dimensions in a subspace is active in a horizontally represented tuple, the tuple will be absent from that subspace table after the vertical partition. It is easy to convert horizontally represented data to the HoVer representation: for each horizontally represented tuple, if at least one dimension (not including OID) of a subspace is active, the OID along with the subspace dimensions is projected and inserted into the subspace table. For example, converting the horizontally represented table H shown in Fig. 1 to the subspace table shown in Fig. 6b can be characterized by a relational algebraic expression that selects the tuples in which at least one of the subspace dimensions is non-null and projects them onto OID and the subspace dimensions.
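In SQL terms, a minimal sketch of this conversion (table and column names are illustrative, following the D1234 subspace of the running example) is:

CREATE TABLE S1234 (OID INT PRIMARY KEY, D1 INT, D2 INT, D3 INT, D4 INT);

INSERT INTO S1234 (OID, D1, D2, D3, D4)
SELECT OID, D1, D2, D3, D4
FROM   H
WHERE  D1 IS NOT NULL OR D2 IS NOT NULL OR D3 IS NOT NULL OR D4 IS NOT NULL;

-- OID list of all objects, maintained for reconstructing the horizontal view later
SELECT OID
INTO   OIDList
FROM   H;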

    Schema Evolution

    When a new column is added, a new subspace which only contains

    the new column will be created, and the correlation table should also be

    updated accordingly. Since the correlation table is incrementally maintained,

    the new column may be merged to a subspace when subspaces are

    reorganized. When a column is deleted, we only need to delete the column

    from the corresponding subspace and update the correlation table

    accordingly.

    QUERY PROCESSING IN HoVer


    In this section, we introduce how the queries over the horizontal

    representation can be processed over the HoVer representation.

    Query Rewriting

    Our ultimate purpose is to define horizontally represented views over

    the HoVer representation. Users typically issue traditional SQL queries over

    the horizontal view, which can be rewritten into queries over the underlying

HoVer representation. Generally, the reconstruction of the horizontal table H from the subspace tables can be characterized by a relational algebraic expression that left-outer-joins the OID list with each subspace table on OID, where the OID list contains all the OIDs in the horizontal table. Hence, we should maintain an OID list during the vertical partition. For example, the reconstruction of the horizontal table H shown in Fig. 1 from the subspace tables shown in Fig. 6 is the left outer join of the OID list with each of the subspace tables on the OID attribute.
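In SQL, a sketch of such a horizontal view, assuming for illustration that the eight dimensions were partitioned into subspace tables S1234(OID, D1, D2, D3, D4), S56(OID, D5, D6), and S78(OID, D7, D8), would be:

CREATE VIEW H_View AS
SELECT l.OID,
       s1.D1, s1.D2, s1.D3, s1.D4,
       s2.D5, s2.D6,
       s3.D7, s3.D8
FROM   OIDList l
LEFT OUTER JOIN S1234 s1 ON l.OID = s1.OID
LEFT OUTER JOIN S56   s2 ON l.OID = s2.OID
LEFT OUTER JOIN S78   s3 ON l.OID = s3.OID;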


In our work, the dimensions in the original sparse data space are clustered into subspaces, and a horizontal table is vertically partitioned into the corresponding subspace tables. In many real-life applications, dimensions with a high correlated degree are likely to characterize similar topics and have a high probability of being accessed together; hence, they should be stored in the same subspace table. We can take advantage of this characteristic and access as few subspace tables as possible during query evaluation.
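For instance, a user query over the horizontal view that only touches dimensions of one subspace can be rewritten to read a single subspace table; a sketch using the illustrative tables above:

-- query as issued by the user, over the horizontal view
SELECT OID, D1, D2
FROM   H_View
WHERE  D1 = 1 AND D2 = 2;

-- rewritten query over the HoVer representation: only one subspace table is read
SELECT OID, D1, D2
FROM   S1234
WHERE  D1 = 1 AND D2 = 2;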


    FEASIBILITY STUDY

The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a very general plan for the project and some cost estimates. During system analysis, the feasibility study of the proposed system is carried out to ensure that the proposed system is not a burden to the company. For feasibility analysis, some understanding of the major requirements for the system is essential. The three key considerations involved in the feasibility analysis are economical feasibility, technical feasibility, and operational feasibility.

    ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will have on the organization. The amount of funds that the company can pour into the research and development of the system is limited, and the expenditures must be justified. The developed system is well within the budget, which was achieved because most of the technologies used are freely available; only the customized products had to be purchased.

    OPERATIONAL FEASIBILITY

This aspect of the study checks the level of acceptance of the system by the users. It includes the process of training the users to use the system efficiently. The users must not feel threatened by the system; instead, they must accept it as a necessity. The level of acceptance by the users depends solely on the methods employed to educate the users about the system and to make them familiar with it. Their level of confidence must be raised so that they can also offer constructive criticism, which is welcomed, since they are the final users of the system.

    TECHNICAL FEASIBILITY

Technical feasibility is carried out to check the technical requirements of the system. Any system developed must not place a high demand on the available technical resources, as this would in turn place high demands on the client.


    SYSTEM SPECIFICATION

    S/W REQUIREMENTS

    Windows XP

    MS-SQL server

    MS Visual Studio 2005

    H/W REQUIREMENTS

    Processor : Dual Core

    CPU Clock Speed : 651 MHz

    External memory : 512 MB (min)

    Hard Disk Drive : 40 GB (min)

Mouse : Logitech mouse

Keyboard : Logitech keyboard (104 keys)

Monitor : 15.6" LCD monitor


    SOFTWARE SPECIFICATION

    FRONT END

.NET FRAMEWORK

    .NET is a "Software Platform". It is a language-neutral environment

    for developing rich .NET experiences and building applications that can

    easily and securely operate within it. When developed applications are

    deployed, those applications will target .NET and will execute wherever

    .NET is implemented instead of targeting a particular Hardware/OS

    combination. The components that make up the .NET platform are

    collectively called the .NET Framework.

The .NET Framework is a managed, type-safe environment for developing and executing applications. The .NET Framework manages all aspects of program execution, such as allocating memory for the storage of data and instructions, granting and denying permissions to the application, managing execution of the application, and reallocating memory for resources that are no longer needed.

The .NET Framework is designed for cross-language compatibility. Cross-language compatibility means that an application written in Visual Basic .NET may reference a DLL file written in C# (C-Sharp), and a Visual Basic .NET class might be derived from a C# class or vice versa.

    The .NET Framework consists of two main components:

    Common Language Runtime (CLR)

    Class Libraries

    COMMON LANGUAGE RUNTIME (CLR)

The CLR is described as the "execution engine" of .NET. It provides the environment within which programs run. It is the CLR that manages the execution of programs and provides core services, such as code compilation, memory allocation, thread management, and garbage collection. Through the Common Type System (CTS), it enforces strict type safety, and it ensures that code is executed in a safe environment by enforcing code access security. The software version of .NET is actually the CLR version.

    WORKING OF THE CLR


When a .NET program is compiled, the output of the compiler is not a native executable file but a file that contains a special type of code called Microsoft Intermediate Language (MSIL), which is a low-level set of instructions understood by the common language runtime. MSIL defines a set of portable instructions that are independent of any specific CPU. It is the job of the CLR to translate this intermediate code into executable code when the program is executed, which allows the program to run in any environment for which the CLR is implemented; that is how the .NET Framework achieves portability. The MSIL is turned into executable code using a JIT (Just-In-Time) compiler. The process works as follows: when a .NET program is executed, the CLR activates the JIT compiler, which converts MSIL into native code on demand as each part of the program is needed. Thus the program executes as native code, running as fast as it would if it had been compiled directly to native code, while still achieving the portability benefits of MSIL.

    CLASS LIBRARIES


The class library is the second major entity of the .NET Framework, and it is designed to integrate with the common language runtime. This library gives programs access to the runtime environment. The class library consists of a lot of prewritten code that all the applications created in VB .NET and Visual Studio .NET will use; the code for all the elements like forms, controls, and the rest in VB .NET applications actually comes from the class library.
    BACK END - SQL

SQL stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Some common relational database management systems that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, and Ingres. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as "Select", "Insert", "Update", "Delete", "Create", and "Drop" can be used to accomplish almost everything that one needs to do with a database. The following subsections cover the basics of each of these commands.

    CREATE A TABLE

To create a new table, enter the keywords create table followed by the table name, followed by an open parenthesis, followed by the first column name, followed by the data type for that column, followed by any optional constraints, and followed by a closing parenthesis. It is important to use an open parenthesis before the first column definition and a closing parenthesis after the end of the last column definition, and to separate each column definition with a comma. All SQL statements should end with a ";".

Table and column names must start with a letter and can be followed by letters, numbers, or underscores, not exceeding a total of 30 characters in length. Do not use any SQL reserved keywords as names for tables or columns (such as "select", "create", "insert", etc.).

Data types specify what type of data a particular column can hold. If a column called "Last_Name" is to be used to hold names, then that particular column should have a "varchar" (variable-length character) data type.
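For example (the table and column names are purely illustrative):

create table customers
(First_Name   varchar(30),
 Last_Name    varchar(30),
 Age          int,
 City         varchar(30));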

    INSERTING INTO A TABLE

    The insert statement is used to insert or add a row of data into the table.

To insert records into a table, enter the keywords insert into followed by the table name, followed by an open parenthesis, followed by a list of column names separated by commas, followed by a closing parenthesis, followed by the keyword values, followed by the list of values enclosed in parentheses. The values that you enter will be held in the rows, and they will match up with the column names that you specify. Strings should be enclosed in single quotes, and numbers should not.

    insert into "tablename" (first_column,...last_column) values

    (first_value,...last_value);

    UPDATING RECORDS

    The update statement is used to update or change records that match a

    specified criteria. This is accomplished by carefully constructing a where

    clause.


    update "tablename"set "columnname" = "newvalue" [,"nextcolumn" =

    "newvalue2"...]

    where "columnname" OPERATOR "value" [and|or "column"

    OPERATOR "value"];

    DELETING RECORDS

    The delete statement is used to delete records or rows from the table.

delete from "tablename"
where "columnname" OPERATOR "value" [and|or "column" OPERATOR "value"];

    DROP A TABLE

    The drop table command is used to delete a table and all rows in the table.

    To delete an entire table including all of its rows, issue the drop table

    command followed by the tablename. drop table is different from deleting

    all of the records in the table. Deleting all of the records in the table leaves

    the table including column and constraint information. Dropping the table

    removes the table definition as well as all of its rows.

    drop table "tablename".


    List of Modules

    Data entry:

Get the details from the user. To exercise the database, we need many records to show the efficiency of our HoVer method. In this module, we collect the data from the user for processing.

    Horizontal Representation:

The horizontal format is straightforward and can be easily implemented. However, the format is not suitable for sparse databases, for it may suffer from sparsity and frequent schema evolution; hence, the space and time performance may not be satisfactory. In addition, the number of columns in a horizontal table is typically limited to 1,000 in general commercial DBMSs, which is not enough for many real-life applications.

    Vertical representation:

In the vertical format, each active dimension of an object is represented by the object identifier, the attribute name, and the value. The vertical format can scale to thousands of attributes, avoids storing null values, and supports evolving schemas; however, writing queries over the format is cumbersome and error-prone, and an expensive multi-way self-join needs to be conducted if the objects in the query result need to be returned in the conventional horizontal format.

    HoVer representation:

HoVer stands for Horizontal representation over Vertically partitioned subspaces. The HoVer representation can efficiently find a middle ground between the horizontal representation and the vertical representation if there are dimension correlations to be exploited. In HoVer, we first vertically partition the data set into multiple lower-dimensional subspaces, and objects are represented in horizontal format in the subspace tables. Partitioning the sparse data space into meaningful subspaces is a nontrivial task.


    DFD:

    ARCHITECTURE:

    SYSTEM FLOW DIAGRAM:


    INPUT DESIGN

Input design is one of the most important phases of system design. Input design is the process in which the inputs received by the system are planned and designed so as to get the necessary information from the user, eliminating information that is not required. The aim of input design is to ensure the maximum possible level of accuracy and to ensure that the input is accessible to and understood by the user.

Input design is the part of the overall system design that requires very careful attention. If the data going into the system is incorrect, then the processing and output will magnify the errors.

The objectives considered during input design are:

Nature of input processing.

Flexibility and thoroughness of validation rules.

Handling of properties within the input documents.

Screen design to ensure accuracy and efficiency of the input, and its relationship with files.

Careful design of the input, including attention to error handling, controls, batching, and validation procedures.

Input design features can ensure the reliability of the system and produce results from accurate data, or they can result in the production of erroneous information.


    OUTPUT DESIGN

The term output applies to information produced by an information system, whether printed or displayed. While designing the output, we should identify the specific output that is needed to meet the information requirements, select a method to present the information, and create a document, report, or other format that contains the information produced by the system.

TYPES OF OUTPUT

Whether the output is a formatted report or a simple listing of the contents of a file, a computer process will produce the output:

A document

A message

Retrieval from a data store

Transmission from a process or system activity

Directly from an output source

The output of our project is the result of queries over the sparse database, returned in the conventional horizontal format.


    SOFTWARE TESTING FUNDAMENTALS

Testing presents an interesting task for software engineers. Earlier in the software process, the engineer attempts to build software from an abstract concept into a tangible implementation; in testing, the engineer creates a series of test cases that are intended to demolish the software that has been built.

To test any program, we need a description of its expected behavior and a method of determining whether the observed behavior conforms to the expected behavior; for this we need a test oracle. A test oracle is a mechanism, different from the program itself, that can be used to check the correctness of the program's output for the test cases. A human oracle is a person who mostly computes by hand what the output of the program should be; since humans can make mistakes, a test oracle is defined in the tool to automate testing and avoid mistakes.

Testing principles

All tests should be traceable to requirements.

Tests should be planned long before testing begins; that is, test planning can begin as soon as the requirements model is complete.

Testing should begin in the small and progress towards testing in the large. The first tests planned and executed generally focus on individual program modules. As testing progresses, testing shifts focus and attempts to find errors in integrated clusters of modules and ultimately in the entire system.

    UNIT TESTING


In unit testing, the programs making up the system are tested; for this reason it is sometimes called program testing. The software units in a system are the module routines that are assembled and integrated to perform a specific function. Unit testing focuses on each module independently of the others, to locate the errors in coding and logic that are contained within that module alone. Setting breakpoints in the code makes it easy to find the error location when the input is given. Unit testing is always white-box oriented. Since each module in the system receives input and generates output, test cases are needed to cover the expected range. The system was divided into its modules, each module was tested separately, and the unit tests were successful.

    ACCEPTANCE TESTING

This is the final stage in the testing process before the system is accepted for operational use. The system is tested with data supplied by the system procurer rather than with simulated test data. Acceptance testing may reveal errors and omissions in the system requirements definition, because the real data exercise the system in different ways from the test data. Acceptance testing may also reveal requirements problems where the system's facilities do not really meet the user's needs or the system performance is unacceptable; however, this system met all the requirements of the user and performed well.

    INTEGRATION TESTING


Integration-level testing focuses on the transfer of data and control across a program's internal and external interfaces. External interfaces are those with other software, system hardware, and the users, and can be described as communications links.

    PERFORMANCE TESTING

Performance testing helps ensure that a product performs its functions at the required speed. Planning for performance testing starts at the beginning of the project, when product goals and requirements are defined. Performance testing is a part of the product's initial engineering plan.

    SYSTEM TESTING

System-level testing demonstrates that all specified functionality exists and that the software product is trustworthy. This testing verifies the as-built program's functionality and performance with respect to the requirements for the software product as exhibited on the specified operating platform(s). System-level software testing addresses functional concerns and the following elements of a device's software that are related to the intended use(s):

Performance issues (e.g., response times, reliability measurements), including response to stress conditions such as behavior under maximum load and continuous use.

Operation of internal and external security features.

Effectiveness of recovery procedures, including disaster recovery.

Usability.

Compatibility with other software products.

Behavior in each of the defined hardware configurations.

Accuracy of documentation.

    Test Plan

Before testing, the type of testing must first be decided; for this system, unit testing is carried out. Before testing begins, the following considerations are taken into account:

To ensure that information properly flows in and out of the program.

To find out whether the local data structures maintain their integrity during all steps of an algorithm's execution.

To ensure that the module operates properly at the boundaries established to limit or restrict processing.

To find out whether all statements in the module have been executed at least once.

To find out whether error-handling paths are working correctly.


    TEST CASES

A test case is a set of conditions or variables under which a tester determines whether a requirement or use case of an application is partially or fully satisfied. It may take many test cases to determine that a requirement is fully satisfied. In order to fully test that all the requirements of an application are met, there must be at least one test case for each requirement, unless a requirement has sub-requirements; in that situation, each sub-requirement must have at least one test case. A written test case has a known input and an expected output, which are worked out before the test is executed: the known input should test a precondition, and the expected output should test a postcondition. Test cases uncover errors in the following categories:

Erroneous initialization or default values and inconsistent data types.

Incorrect (misspelled or truncated) variable names.

Underflow, overflow, and addressing exceptions.


    SYSTEM IMPLEMENTATION

Implementation is the most crucial stage in achieving a successful system and giving the users confidence that the new system is workable and effective. This type of conversion is relatively easy to handle, provided there are no major changes in the system.

Each program was tested individually at the time of development using test data, and it was verified that the programs link together in the way specified in the program specifications; the computer system and its environment were tested to the satisfaction of the user. The system that has been developed has been accepted and proved to be satisfactory for the user, and so the system is going to be implemented very soon. A simple operating procedure is included so that the user can understand the different functions clearly and quickly.

Initially, as a first step, the executable form of the application is created and loaded on a common server machine accessible to all users, and the server is connected to a network. The final stage is to document the entire system, covering its components and the operating procedures of the system.

Implementation is the stage of the project when the theoretical design is turned into a working system. Thus it can be considered the most critical stage in achieving a successful new system and in giving the users confidence that the new system will work and be effective. The file is downloaded from the server, which takes minimal time for retrieval.


    Conclusion:

In this project, we have addressed the problem of efficient query processing over sparse databases. To alleviate the problems caused by the sparsity and high dimensionality of sparse data, we proposed a new approach named HoVer. According to the characteristics of sparse data sets, we vertically partition the high-dimensional sparse data into multiple lower-dimensional subspaces, such that all the dimensions within each subspace are highly correlated. The experimental results show that our proposed scheme can find correlated subspaces effectively and yields superior storage and query performance for conducting queries in sparse databases.

