
MDSN Report

Date post: 05-Apr-2018
Upload: athar1988
  • 7/31/2019 MDSN Report


    Chapter 1

    Introduction

    Currently, the World Wide Web is the largest source of information. A huge amount of data

    is present on the Web, and more is added constantly. Users search for the required

    information by using particular keywords. We deal specifically with news. As the amount

    of news available on the Web grows, so does the need to provide high-quality summaries

    that allow the user to quickly locate the desired information, i.e. to get a summary of news

    from a variety of newspapers about the same topic, as per the query specification.

    Summarization is the process of condensing a source text into a shorter version preserving its

    information content. It can serve several goals from survey analysis of a scientific field to quick

    indicative notes on the general topic of a text. In other words summarization is the process of

    automatically creating a compressed version of a given text that provides useful information for the

    user. The information content of a summary depends on the user's needs. Topic-oriented summaries

    focus on a user's topic of interest, and extract the information in the text that is related to the

    specified topic. Indicative summaries, which can be used to quickly decide whether a text is worth

    reading, are naturally easier to produce.

    Query-oriented summarization (QS) tries to extract a summary for a given query. It is a

    common task in many text mining applications. For example, a user submits a query to a search

    engine and the search engine usually returns a lot of result documents. To click-and-view each of

    the returned documents is obviously tedious and infeasible in many cases. One challenging issue is

    how to help the user digest the returned documents. Typically, the documents talk about different

    perspectives of the query. An ideal solution might be that the system automatically generates a

    concise and informative summary for each perspective of the query. Much work has been done for

    document(s) summarization. Generally, document(s) summarization can be classified into three

    categories:

    1. Single document summarization (SDS)

    2. Multi-document summarization (MDS)

    3. Query oriented summarization (QS)

    SDS extracts a summary from a single document, while MDS extracts a summary from

    multiple documents. Both tasks have been investigated intensively and many methods have been proposed.

    Multi-Document Extractive Summarization for News Page 1 of 59

    The methods for document(s) summarization can be further categorized into two groups:

    unsupervised and supervised. The unsupervised method is mainly based on scoring sentences in the

    documents by combining a set of predefined features.
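
As an illustration of the unsupervised approach, the sketch below scores sentences by combining two predefined features. The choice of features (normalized term frequency and sentence position), the weights, and the name `score_sentences` are our assumptions for illustration; they are not taken from any of the surveyed systems.

```python
import re
from collections import Counter


def score_sentences(text, position_weight=0.5, frequency_weight=1.0):
    """Score each sentence by combining two hand-picked features.

    Returns a list of (score, sentence) pairs in document order.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    max_freq = max(freq.values()) if freq else 1
    scored = []
    for index, sentence in enumerate(sentences):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        # Feature 1: mean normalized term frequency of the sentence's words.
        tf = sum(freq[t] / max_freq for t in tokens) / len(tokens) if tokens else 0.0
        # Feature 2: earlier sentences receive a higher position score.
        position = 1.0 - index / len(sentences)
        scored.append((frequency_weight * tf + position_weight * position, sentence))
    return scored
```

The highest-scoring sentences can then be picked as the summary; a supervised method would instead learn the feature weights from training examples.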

    In the supervised method, summarization is treated as a classification or a sequential

    labeling problem and the task is formalized as identifying whether a sentence should be included in

    the summary or not. However, the method requires training examples. Query-oriented

    summarization (QS) is different from the SDS and the MDS tasks. The document cluster denotes

    the information source and the query denotes the information need. A document cluster is a subset

    of the entire document collection. A compelling application of document summarization is the

    snippets generated by Web search engines for each query result, which assist users in further

    exploring individual results. The Information Retrieval (IR) community has largely viewed text

    documents as linear sequences of words for the purpose of summarization. Although this model

    has proven quite successful in efficiently answering keyword queries, it is clearly not optimal since

    it ignores the inherent structure in documents.

    Furthermore, most summarization techniques are query-independent and follow one of the

    following two extreme approaches:

    1. They simply extract relevant passages, viewing the document as an unstructured set of

    passages.

    2. They employ Natural Language Processing techniques.

    The former approach ignores the structural information of documents while the latter is too

    expensive for large datasets (e.g., the Web) and sensitive to the writing style of the documents.

    Here, a method is discussed that adds structure, in the form of a graph, to text documents in

    order to allow effective query-specific summarization. That is, a document is viewed as a set of

    interconnected text fragments.

    The main focus is on keyword queries since keyword search is the most popular information

    discovery method on documents, because of its power and ease of use. This technique has the

    following key steps: First, at the preprocessing stage, a structure is added to every document,

    which can then be viewed as a labeled, weighted graph, called the document graph. Then, at query

    time, given a set of keywords, a keyword proximity search is performed on the document graphs to

    discover how the keywords are associated in the document graphs. For each document its summary

    is the minimum spanning tree on the corresponding document graph that contains all the keywords


    (or equivalent based on a thesaurus). So data from the minimum spanning tree nodes is collected

    and presented as a summary of the document.

    Automatic summarization is the creation, by a computer program, of a shortened version of a

    text that retains the important information of the original documents.

    1.1 History:

    1. In the 1950s: first systems, surface-level approaches

    Term frequency (Luhn, Rath)

    2. In the 1960s: first entity-level approaches

    Syntactic analysis

    Surface level: location features (Edmundson 1969)

    3. In the 1970s:

    Surface level: cue phrases (Pollock and Zamora)

    Entity level

    First discourse level: story grammars

    4. In the 1980s:

    Entity level (AI): use of scripts, logic and production rules, semantic networks (DeJong 1982,

    Fum et al. 1985)

    Hybrid (Aretoulaki 1994)

    5. From the 1990s onward: explosion of all approaches

    1.2 Literature survey:

    1.2.1 Aim:

    Our aim is to achieve multi-document news summarization.

    1. We parse the HTML document(s) and extract the text from them. As we are dealing with

    text only, we have chosen the nearest neighbor algorithm for clustering, since it is less

    complex and sufficient for text.

    2. We produce an extractive summary according to the query specification.

    1.2.2 Extractive and Abstractive Summarization

    Extractive Summarization:


    Produces a summary by selecting indicative sentences, passages or paragraphs from an original

    document according to a predefined target summarization ratio.
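
The selection step described above can be sketched as follows. The sketch assumes that each sentence already has a relevance score (how the score is computed is left open), and the function name `select_by_ratio` and its default ratio are hypothetical.

```python
import math


def select_by_ratio(sentences, scores, ratio=0.3):
    """Keep the top-scoring sentences, in original order, up to the target ratio."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not sentences:
        return []
    # Always keep at least one sentence, even for very small ratios.
    keep = max(1, math.ceil(len(sentences) * ratio))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the extract reads naturally.
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]
```

For example, with four sentences and a ratio of 0.5, the two best-scoring sentences are returned in the order in which they appear in the document.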

    Abstractive Summarization:

    Provides a fluent and concise abstract of a certain length that reflects the key concept of

    the document. This requires highly sophisticated techniques, including semantic representation

    and inference, as well as natural language generation. In recent years, researchers across the

    spectrum of text summarization research have tended to focus on extractive summarization.

    1.2.3 A System for Query-Specific Document Summarization

    Mr. Ramakrishna Varadarajan & Vangelis Hristidis presented a method to create query-specific

    summaries by identifying the most query-relevant fragments and combining them using the

    semantic associations within the document. In particular, structure is added to the documents in the

    preprocessing stage, converting them to document graphs. Then, the best summaries are

    computed by calculating the top spanning trees on the document graphs. This paper presents and

    experimentally evaluates efficient algorithms that support computing summaries in interactive

    time. Furthermore, the quality of the summarization method is compared to current approaches

    using a user survey.

    In this work a structure-based technique is presented to create query-specific summaries

    for text documents. In particular, the document graph of a document is created to represent the

    hidden semantic structure of the document, and keyword proximity search is then performed on

    this graph. The paper shows with a user survey that the approach performs better than other

    state-of-the-art approaches. Furthermore, the feasibility of the approach is shown with a

    performance evaluation.

    In that approach the document graph was built and processing was done on text documents. We

    are implementing a somewhat similar methodology, but in addition an HTML-to-text parser is added,

    i.e. we process HTML files.

    1.2.4 An Incremental Summary Generation System


    Mr. C Ravindranath Chowdary & P Sreenivasa Kumar presented an algorithm that finds a

    pair of sentences, one from the current summary and the other from the new document, that can be

    swapped to improve the quality of the summary. For a given query, the quality of a summary is

    determined by its informativeness, coherence and completeness. A scoring function that captures

    these features to calculate the quality of a summary is proposed. The process of

    updating/improving summary is continued iteratively till the improvement in quality measure

    becomes negligible. Experimental results, both qualitative and quantitative, show that performance

    of the proposed approach for incremental summary generation is quite encouraging.
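
The swap-based update loop can be sketched as below. This is only an illustration under strong assumptions: the `quality` function is a hypothetical stand-in that counts the query terms covered by the summary, whereas the paper's scoring function captures informativeness, coherence and completeness. The names `quality`, `update_summary` and `min_gain` are ours.

```python
def quality(summary, query_terms):
    """Stand-in score: number of distinct query terms the summary covers."""
    covered = set()
    for sentence in summary:
        covered |= query_terms & set(sentence.lower().split())
    return len(covered)


def update_summary(summary, new_sentences, query_terms, min_gain=1):
    """Apply the best single swap repeatedly until the gain drops below min_gain."""
    summary = list(summary)
    candidates = list(new_sentences)
    while True:
        base = quality(summary, query_terms)
        best_gain, best_swap = 0, None
        for i in range(len(summary)):
            for j, candidate in enumerate(candidates):
                # Try replacing summary sentence i with candidate j.
                trial = summary[:i] + [candidate] + summary[i + 1:]
                gain = quality(trial, query_terms) - base
                if gain > best_gain:
                    best_gain, best_swap = gain, (i, j)
        if best_swap is None or best_gain < min_gain:
            return summary
        i, j = best_swap
        summary[i], candidates[j] = candidates[j], summary[i]
```

The summary length stays fixed, and only the current summary and the new document are consulted, which mirrors the setting in which the initial documents are no longer accessible.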

    This paper deals with updating the available extractive summary in the scenario where the

    initial documents used for summarization are not accessible. The proposed algorithm updates the

    available summary as and when a new document is made available to the system.

    In that approach extractive summarization is used, but the original documents are not

    accessible. We are also dealing with extractive summarization, but the original documents are

    accessible; moreover, a highlighting feature is added for the convenience of the user.

    1.2.5 Automatic Text Summarization

    Mr. Mohamed Abdel Fattah & Fuji Ren investigated the effect of each sentence feature on the

    summarization task. They then used the score function of all features to train genetic algorithm (GA)

    and mathematical regression (MR) models to obtain a suitable combination of feature weights. The

    performance of the proposed approach is measured at several compression rates on a data corpus

    composed of 100 English religious articles. The results of the proposed approach are promising.

    This paper investigates the use of genetic algorithm (GA) and mathematical regression (MR) models

    for the automatic text summarization task. The new approach is applied to a sample of 100 English

    religious articles, and its results outperform those of the baseline approach. The approach

    uses feature extraction criteria, which give researchers the opportunity to use many

    varieties of these features depending on the language and the text type.

    In that approach the algorithm used for summarization was a genetic algorithm, while we are

    using the nearest neighbor algorithm; moreover, it is an automatic summarization strategy, while we

    are using an extractive summarization strategy.

    1.2.6 Multi-topic based Query-oriented Summarization


    Mr. Jie Tang, Limin Yao & Dewei Chen try to break the limitations of the existing methods and

    study a new setup of the problem of multi-topic based query-oriented summarization. More

    specifically, the paper proposes two strategies to incorporate the query information into a

    probabilistic model. Experimental results on two different genres of data show that the proposed

    approach can effectively extract a multi-topic summary from a document collection and the

    summarization performance is better than baseline methods. The approach is quite general and can

    be applied to many other mining tasks, for example product opinion analysis and question

    answering.

    This paper investigates the problem of multi-topic based query-oriented summarization.

    The paper formalizes the major tasks and proposes a probabilistic approach to solve the tasks. Two

    strategies are studied for simultaneously modeling document contents and the query information.

    We are also dealing with query-oriented multi-document summarization, and we have

    specialized it for news.

    1.2.7 Proposed modules:

    1. HTML to text parser

    2. Processing the input text file and creating the document graph

    3. Adding weighted edges to document graph

    4. Document Clustering

    5. Create clustered document graph

    6. Adding weight to nodes in clustered document graph

    7. Generate closure graph and find minimal clusters

    8. Result

    1.2.8 Clustering:

    Clustering is one of the most important unsupervised learning processes; it organizes

    objects into groups whose members are similar in some way. Clustering finds structure in a

    collection of unlabeled data. A cluster is a collection of objects that are similar to one another

    and dissimilar to the objects belonging to other clusters.

    1.2.8.1 Uses of Clustering:

    If a collection is well clustered, we can search only the cluster that contains the relevant

    documents.

    Searching a smaller collection should improve effectiveness and efficiency.

    1.2.8.2 Nearest Neighbour Algorithm

    1. The Nearest Neighbor Algorithm is an agglomerative (bottom-up) approach.

    2. It starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each

    step, and stops when the desired number of clusters is reached.

    Step 1: Nearest Neighbor, Level 2, k = 7 clusters.

    Step 2: Nearest Neighbor, Level 3, k = 6 clusters.

    Step 3: Nearest Neighbor, Level 4, k = 5 clusters.

    Step 4: Nearest Neighbor, Level 5, k = 4 clusters.

    Step 5: Nearest Neighbor, Level 6, k = 3 clusters.

    Step 6: Nearest Neighbor, Level 7, k = 2 clusters.

    Step 7: Nearest Neighbor, Level 8, k = 1 cluster.
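
The merging procedure illustrated by the steps above can be sketched as a single-link agglomerative clustering routine. This is a minimal sketch, not the project's implementation: the function name `single_link_clusters` and the caller-supplied `distance` function are assumptions, and the naive search over all cluster pairs is kept for clarity rather than speed.

```python
def single_link_clusters(points, k, distance):
    """Merge the two closest clusters until only k clusters remain."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Start with every point in its own cluster (n clusters).
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Nearest-neighbour distance: the closest pair of members.
                d = min(distance(x, y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters
```

With n = 8 points, calling the function with k = 7, 6, ..., 1 reproduces the sequence of levels listed above, one merge per level.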


    1.3 Advantages:

    1. Because of multi-document news summarization, there is no need to go through all the

    newspapers.

    2. As we are dealing with query-specific summarization, the user can easily get a news summary

    according to his/her interest.

    3. The accuracy of the result depends on the initial Edge Threshold and Cluster Threshold as

    well as the Result Accuracy percentage, so the user can control the relevance.

    1.4 Limitations:

    1. Processing takes too long for text files larger than 50 KB or with more than 200

    paragraphs, due to heavy computational loops.

    2. The text in the images or in the Flash contents cannot be parsed.

    3. The text accessed through web services cannot be parsed.

    Chapter 2


    System Requirement and Specification

    2.1 Scope of the Project

    This project creates a query-dependent summary using a clustering algorithm.

    Here we have considered the nearest neighbor clustering algorithm. Since every file format can be

    converted into a text file, this algorithm can be applied to text files. Nodes in the text file, i.e. the

    contents of each line, are clustered and a query-dependent summary is generated.

    2.2 Requirement Specifications

    Requirements are the desired characteristics of the software being developed. The first

    activity in most projects is the identification and documentation of the requirements. Requirements

    cover both requirements engineering (identification, analysis and capture) and requirements

    management (managing change, creating and maintaining agreement with customers, traceability

    and metrics).

    The development of large, complex systems presents many challenges to systems

    engineers. Foremost among these is the ability to ensure that the final system satisfies the needs of

    users and provides for easy maintenance and enhancement of these systems during their deployed

    lifetime. These systems often change and evolve throughout their life cycle. This makes it difficult

    to track the implemented system against the original and evolving user requirements.

    Requirements establish an understanding of users' needs and also provide the final yardstick

    against which implementation success is measured. Various studies have shown that roughly half

    of application errors can be traced to requirement errors and deficiencies. Thorough

    documentation and proper management of requirements are the keys to developing quality

    applications. By allowing project teams to define and document requirement data, including

    user-defined attributes, priority, status, acceptance criteria and traceability, missing,

    contradictory or inadequately defined requirements can be detected and corrected.

    The following requirements and constraints were considered during the requirement analysis

    phase. For clustering, the input must be a text file. If the input is in any other file format, it is

    first converted into text format. Then the nodes in the text document are clustered and

    query-dependent summarization is done.

    2.2.1 Product performance requirements

    The input file must be a text file. As the size of the text document changes, the performance of the

    algorithms also changes, so larger files call for better hardware.

    2.2.2 Hardware Requirements


    Processor : Pentium IV or higher.

    RAM : Minimum 256 MB.

    Hard Disk : 40 GB.

    Input device : Standard Keyboard and Mouse.

    Output device : VGA and High Resolution Monitor.

    2.2.3 Software Requirements

    Software components required for building the Project are:

    Operating System : Windows XP or above

    Development tool : Microsoft Visual Studio 2010 (.NET Framework 3.5)

    Browser : Internet Explorer

    2.3 Functional requirements

    The project includes the following modules:

    HTML to text parser: The input HTML files are processed by parsing the HTML contents and

    extracting the text lines.

    Uploading and processing: In this module a text file is uploaded and processed. Every

    single line is considered a node, and the data within that node is displayed.

    Building document graph: The weight between every node and every other node is

    calculated.

    Clustering and making clustered graph: The nearest neighbor method, an agglomerative

    hierarchical clustering technique, is used to cluster the document graph from the previous

    step, and the clustered graph is prepared.

    Query firing and getting minimal cluster: Here the query is fired and the minimal cluster is

    found. The minimal cluster is the cluster that contains part of the fired query.

    The summary is obtained as the result.

    2.4 Feasibility Study

    Not everything imaginable is feasible; therefore it is necessary to evaluate the feasibility of

    the project at the earliest stage.

    The software feasibility has 3 solid dimensions:


    2.4.1 Technology:

    Technical feasibility is the study of functions, performance, and constraints that may affect the

    ability to achieve an acceptable system. This project is technically feasible to implement. The user

    does not require any extra hardware or any higher-end technology. The software can execute on a

    single client machine operating on a WINDOWS XP or a higher version of Operating System.

    2.4.2 Finance:

    Financial feasibility is the evaluation of the development cost weighed against the ultimate

    income or benefits derived from the developed system. The resources that are required for the

    system are easily available. The system is developed basically for study purposes, so economic

    feasibility is not a major issue.

    This project is financially feasible because the software does not require any extra hardware or any

    additional supporting technology, and so adds no extra cost. The only cost is that of

    development.

    2.4.3 Resources:

    The organization that wishes to implement this system requires only one or more

    machines. No additional resources are required to implement the system, so the software is

    also resource feasible.

    2.5 .NET Framework :

    .NET Framework is designed for cross-language compatibility. Cross-language

    compatibility means .NET components can interact with each other irrespective of the languages

    they are written in. An application written in VB .NET can reference a DLL file written in C# or a

    C# application can refer to a resource written in VC++, etc. This language interoperability extends

    to Object-Oriented inheritance. This cross-language compatibility is possible due to common

    language runtime.

    2.5.1 .NET Framework Advantages:

    The .NET Framework offers a number of advantages to developers.

    Different programming languages have different approaches for doing a task. For example,

    accessing data with a VB 6.0 application and a VC++ application is totally different. When

    different programming languages are used to do a task, a disparity exists among the approaches

    developers use to perform the task. The difference in techniques comes from how different

    languages interact with the underlying system that applications rely on. With .NET, for example,

    accessing data with a VB .NET application and a C# application looks very similar apart from slight

    syntactical differences. Both programs need to import the System.Data namespace, both

    establish a connection with the database, and both run a query and display the data on a data grid.

    .NET v/s Java :

    Java is one of the greatest programming languages created by humans. Java does not have a

    visual interface and requires us to write heaps of code to develop applications. With .NET, on

    the other hand, the Framework supports around 20 different programming languages, which lets

    developers focus only on business logic, leaving all other aspects to the Framework.

    Visual Studio .NET comes with a rich visual interface and supports drag and drop. Many

    applications were developed, tested and maintained to compare .NET and

    Java, and the end result was that a particular application developed using .NET requires fewer lines

    of code, less time to develop and lower deployment costs, along with other important benefits.

    Personally, I don't mean to say that Java is gone or that .NET-based applications are going to dominate

    the Internet, but I think .NET definitely has an extra edge as it is packed with features that simplify

    application development.

    2.5.2 Main features of C#:

    C# was developed as a language that would combine the best features of previously

    existing Web and Windows programming languages. Many of the features of the C# language

    already existed in various languages such as C++, Java, Pascal, and Visual Basic.

    Main features:

    1. C# is a simple, modern, object-oriented language derived from C++ and Java.

    2. It combines the high productivity of Visual Basic and the raw power of C++.

    3. It is a part of Microsoft Visual Studio 7.0.

    4. Visual Studio supports VB, VC++, C++, VBScript, and JScript. All of these languages

    provide access to the Microsoft .NET platform.

    5. .NET includes a Common Execution engine and a rich class library.

    6. Microsoft's equivalent of the JVM is the Common Language Runtime (CLR).

    7. The CLR accommodates more than one language, such as C#, VB.NET, JScript, and

    C++.

    8. Source code is compiled to Intermediate Language code.

    9. The classes and data types are common to all of the .NET languages.

    10. We may develop Console applications, Windows applications, and Web applications

    using C#.

    11. In C#, Microsoft has taken care of C++ problems such as memory management,

    pointers, etc.

    12. It supports garbage collection, automatic memory management, and more.

    Here is a list of some of the primary characteristics of C# language.

    Modern and Object Oriented

    Simple and Flexible

    Type safety

    Interoperability

    Scalable and Updateable

    2.6 Risk Management

    The software development process is inherently subject to risks, the consequences of which

    are manifested as financial failures (time scale overrun, budget overrun) and technical failures

    (failures to meet required functionality, reliability or maintainability). The objectives of risk

    management are to identify, analyze and give priorities to risk items before they become either

    threats to successful operation or major sources of expensive software rework, to establish a

    balanced and integrated strategy for eliminating or reducing the various sources of risk, and to

    monitor and control the execution of the strategy.

    2.7 Data Flow Diagrams

    Data Flow Diagrams serve two purposes:

    1. To provide an indication of how data are transformed as they move through the system.

    2. To depict the functions that transform the data flow.


    The DFD provides additional information that is used during the analysis of the

    information domain and serves as a basis for the modeling of function. A description for each

    function presented in the DFD is contained in a process specification.

    As information moves through software, it is modified by a series of transformations. A

    data flow diagram is a graphical representation of information flow and transforms that are applied

    as data moves from input to output. The basic form of data flow diagram is also known as data

    flow graph or bubble chart.

    The data flow diagram may be used to represent a system or software at any level of abstraction. In

    fact, DFDs may be partitioned into levels that represent increasing information flow and functional

    detail. Therefore, the DFD provides a mechanism for functional modeling as well as information

    flow modeling.

    Figure 2.1 DFD (Level 0)

    The data flow diagram (Level 1) provides more detail than the Level 0 data flow diagram. It represents

    information flow and transforms that are applied as data moves from input to output.

    [Figure 2.1 labels: SYSTEM; HTML File; Query; Summary of HTML file]

    [Figure 2.2 labels: Input HTML File(s); HTML To Text Converter; Uploading & processing i/p file; Building document graph; Clustering algorithm; Threshold; I/P Query]

    Figure 2.2 DFD (Level 1)

    Chapter 3

    Design and analysis

    3.1 Design Overview

    [Figure 2.2 labels, continued: Creating weighted graph; Clustering and building clustered graph; Generating Summary; Highlighting Text in the HTML Document as Result]

    A specialist has to check the dataflow and manually trace where the data flows.

    Analysts have found that noting the dataflow takes considerable time even for an experienced

    specialist.

    We accept HTML/text files only. The contents of each line form a node, and hence a

    single cluster. If the file contains no newlines, there will be only one node, and hence only one

    cluster. This degrades the performance of the algorithms, as the cluster size is very large.

    3.2 Software Architecture:

    Figure 3.1 Architecture Diagram

    Figure 3.1 shows the architecture diagram of the system. As shown in the figure, there are five

    main blocks: a block for converting HTML file(s) to text, a block for uploading and processing the

    text and making the document graph, a block for clustering and making the clustered graph, a block

    for making the weighted clustered document graph, and a last block for generating the summary for

    the fired query.

    Block 1: HTML to Text conversion:

    This block accepts the input files in HTML form and converts them into text files.

    After conversion, the text file is passed to the next block as input.
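
A conversion step of this kind can be sketched with the Python standard library's `html.parser` module. The project itself targets .NET, so this is only an illustration; the class name `TextExtractor`, the function `html_to_text`, and the set of tags treated as line breaks are our assumptions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, emitting one line per block-level element."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._buffer = []
        self._skip_depth = 0

    def _flush(self):
        # Join buffered fragments and collapse runs of whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.lines.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in self.BLOCK:
            self._flush()

    def handle_data(self, data):
        # Ignore the contents of <script> and <style> elements.
        if self._skip_depth == 0:
            self._buffer.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Each output line then corresponds to one block of visible text, which matches the later assumption that the contents of each line form one node.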


    Block 2: Processing input file and generating document graph:

    This block accepts text files only. It is responsible for uploading the text file and

    processing it, i.e. forming a node for the contents of each line. It is also responsible for

    generating the weight from each node to every other node.
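
The node and edge construction can be sketched as follows. The report does not state the weight formula, so the Jaccard overlap of word sets used here is an assumption, as are the function name `build_document_graph` and the `edge_threshold` parameter.

```python
import re
from itertools import combinations


def build_document_graph(text, edge_threshold=0.0):
    """Return (nodes, edges); edges maps (i, j) with i < j to a similarity weight."""
    # Every non-empty line of the text file becomes one node.
    nodes = [line.strip() for line in text.splitlines() if line.strip()]
    bags = [set(re.findall(r"[a-z0-9]+", node.lower())) for node in nodes]
    edges = {}
    for i, j in combinations(range(len(nodes)), 2):
        union = bags[i] | bags[j]
        # Assumed weight: Jaccard overlap of the two nodes' word sets.
        weight = len(bags[i] & bags[j]) / len(union) if union else 0.0
        if weight > edge_threshold:
            edges[(i, j)] = weight
    return nodes, edges
```

Because every pair of nodes is compared, the cost grows quadratically with the number of lines, which is consistent with the slowdown on large files noted under Limitations.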

    Block 3: Clustering node and building clustered graph:

    This block is responsible for choosing one of two clustering algorithms. It also accepts the

    threshold, so that it can check the similarity between the clusters up to that level. It is responsible

    for making clusters.

    Block 4: Creating weighted document clustered graph:

    This block is responsible for accepting the fired query. It checks the similarities

    between the query contents and the contents of the clusters. It then builds the weighted clustered

    document graph.

    Block 5: Summary generation:

    This block is responsible for generating the summary from the clusters we formed, as a

    response to the fired query. It generates the minimal clusters and, after finding the weight of the

    nodes for the fired query, returns the topmost summaries.
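
One way to picture the query matching is the sketch below, which keeps only the clusters that contain part of the query and ranks them. The scoring rule (fraction of query terms found in the cluster) and the names `tokenize`, `rank_clusters` and `top_n` are assumptions made for illustration; the report does not give the exact weighting.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_clusters(clusters, query, top_n=1):
    """Return up to top_n clusters that share at least one term with the query."""
    query_terms = tokenize(query)
    scored = []
    for cluster in clusters:
        cluster_terms = set()
        for node in cluster:
            cluster_terms |= tokenize(node)
        matched = len(query_terms & cluster_terms)
        if matched:
            # Assumed score: fraction of query terms present in the cluster.
            scored.append((matched / len(query_terms), cluster))
    # The sort is stable, so ties keep their original cluster order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cluster for _, cluster in scored[:top_n]]
```

The nodes of the returned clusters can then be shown as the summary and highlighted in the original HTML document.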

    3.3 Team Work Graph:

    Team Members: Mr. Athar Nawaz Khan

    Mr. Nikhil Vilasrao Ubale

    Miss. Shraddha B. Ahire.


    Fig. 3.2 Division of work

    3.4 Software Engineering Model used i.e. Incremental Model:

    3.4.1 Communication:

    The software development process starts with communication between customer and

    developer. In this phase we followed the principles of the communication phase. We

    prepared before the communication, i.e. we decided the agenda of the meeting to concentrate on

    news summarization. Our leader directed our team and drew out all the requirements of the user, i.e.

    what they actually need and what the input and output formats of the system are.

    3.4.2 Planning:

    This phase includes complete estimation, scheduling and risk analysis. We planned the estimated release date of the software, the cost estimate and the risks in the project regarding the application. Finally, we estimated the cost of the project, including all software expenditure, and planned the release according to the user's deadline, with the user's participation.

    3.4.3 Modeling:

    This phase includes detailed requirement analysis and project design. A flowchart shows the complete pictorial flow of the program, whereas the algorithm is the step-by-step solution of the problem. We analyzed the requirements of the user and, according to them, drew the block diagrams of the system. These describe the behavioral structure of the system using UML 2.0, i.e. class diagram, Use Case report, component diagram, communication diagram, activity diagram and state machine diagram.


    3.4.4 Construction:

    It includes the coding and testing steps.

    Coding:

    Design details are implemented using an appropriate programming language. For coding we chose the ASP.NET platform.

    Testing:

    Testing is carried out by analyzing the system, i.e. we first develop a prototype of the system and then, step by step, find input and output errors such as interface errors, data structure errors and initialization errors. The black box testing strategy is therefore useful here.

    3.4.5 Deployment:

    It includes software delivery, support and feedback from the customer. If the customer suggests corrections or demands additional features, they are added to the software.

    3.5 Analysis of Work:

    The following table shows the plan of work that we followed during the project period.

    Table 3.1: Analysis of Work

    Sr. No. | Name of Task | Subtask | Period
    --- | --- | --- | ---
    1 | Information Gathering | 1. Problem Definition: collecting detailed information on the system to be implemented. 2. Literature Survey: visiting different websites; studying the existing system and its limitations; going through journals and magazines; studying the reference books. | 17/07/2011 to 28/07/2011
    2 | Analysis | Project Plan: preparing the complete project plan. Requirement Analysis: software requirements; hardware requirements. | 07/08/2011 to 24/08/2011
    3 | Design | Architectural Design: describing relationships between modules and sub-modules. UML documentation: Use Case Diagram, Class Diagram, Sequence Diagram, Activity Diagram, State Machine Diagram, Component Diagram. Form Design: showing relationships among different menus and sub-menus. | 04/09/2011 to 25/09/2011
    4 | GUI | Output screens: preparing detailed output screens. Report Submission: submission of the report of Analysis and Design. | 22/02/2012 to 29/02/2012
    5 | Construction of System | Coding: implementation of design details using the programming language C# .NET. Testing: testing the system for expected results. | 12/03/2012 to 19/03/2012
    6 | Deployment | System Deployment: delivery of project; support; feedback; modification. | 25/03/2012 to 15/04/2012
    7 | Final Document Preparation and Submission | Project Submission: preparing the final project report; submission of the final project report. | 15/04/2012 to 23/04/2012

    3.6 Risk Assessment:

    Risk always involves two characteristics:

    Uncertainty: The risk may or may not occur; there are no 100% probable risks.

    Loss: If the risk becomes a reality, unwanted consequences or losses will occur.

    3.6.1 Risk projection:

    Risk projection, also called risk estimation, attempts to rate each risk in two ways:

    The likelihood or probability that the risk is real.

    The consequences of the problems associated with the risk, should it occur.

    3.6.2 Risk Identification:

    Risk identification is a systematic attempt to specify threats to the project plan.

    Generic risk: These are potential threats to every software project.

    Product Specific: These risks can be identified only by those with a clear understanding of

    the technology and the environment that is specific to the project at hand.

    Following are the Risks involved:

    1. Technology to be built: Risks associated with the complexity of the system to be built and

    the newness of the technology to be packaged by the system.

    2. Development Environment: Risks associated with the availability and quality of the tools to be used to build this system.

    3. Risk related to Time.

    4. Risk related to Functionality of the system.

    3.7 Requirement Analysis:

    Requirement analysis results in the specification of the software's operational characteristics.


    Requirements gathering comprises the following:

    1. Elaboration

    2. Negotiation

    3. Specification

    4. Validation

    A software requirement specification is produced at the culmination of the analysis task. The functions and performance allocated to software as part of system engineering are refined by establishing the following:

    A complete information description

    A detailed functional description

    A representation of system behavior

    An indication of performance requirement and design constraints

    Appropriate validation criteria

    3.8 UML Documentation:

    A UML diagram is a representation of the components or elements of a system or process

    model and, depending on the type of diagram, how those elements are connected or how they

    interact from a particular perspective. We are developing the following UML diagrams to show the

    elements of the system and the connections between them.

    There are two types of UML diagrams:

    1. Structural Diagrams

    2. Behavioral Diagrams

    These are the two major groupings of UML diagrams. We are developing those diagrams that are sufficient to show the flow and elements of the working project. They are listed below:

    Use case Diagram

    Class Diagram

    Sequence Diagram

    Activity Diagram

    State Machine Diagram

    Component Diagram


    Description of the UML diagrams:

    3.8.1 Use case Model:

    A Use Case diagram captures Use Cases and relationships between Actors and the subject

    (system). It describes the functional requirements of the system, the manner in which outside

    things (Actors) interact at the system boundary, and the response of the system.

    Components of Use case diagram:

    1. Actor

    2. Use Cases

    3. Association

    4. Include

    5. Extends

    An Actor is a user of the system; user can mean a human user, a machine, or even another

    system or subsystem in the model. Anything that interacts with the system from the outside or

    system boundary is termed an Actor. Actors are typically associated with Use Cases. In this

    system we use two actors. They are as follows:

    System

    End User

    Use Case Diagram

    [Use case diagram. Actors: End User, System. Use cases: Browse News (HTML file) from internet; Provide News to system; Parsing (includes Extract text files from body tag); Processing text files; Split documents into nodes (paragraph); Find similarity between nodes; Build document graph; Document clustering; Use Nearest Neighbour algorithm; Create clustered document graph; Enter the query; Add weight to nodes in graph; Enter threshold for minimal cluster; Find minimal clusters; Show result.]

    Figure 3.3 Use Case Diagram

    3.8.2 Sequence Model:


    A Sequence diagram is a structured representation of behavior as a series of sequential

    steps over time. It is used to depict work flow, message passing and how elements in general

    cooperate over time to achieve a result.

    Each sequence element is arranged in a horizontal sequence, with messages passing back

    and forward between elements. An Actor element can be used to represent the user initiating the

    flow of events. Stereotyped elements, such as Boundary, Control and Entity, can be used to

    illustrate screens, controllers and database items, respectively. Each element has a dashed stem

    called a lifeline, where that element exists and potentially takes part in the interactions.

    Components of Sequence diagrams:

    1. Actor

    2. Lifeline

    3. Message

    4. Self Message

    5. End point

    An Actor is a user of the system; user can mean a human user, a machine, or even another

    system or subsystem in the model. Anything that interacts with the system from the outside or

    system boundary is termed an Actor. Actors also represent the role of a user in Sequence

    Diagrams. Enterprise Architect supports a stereotyped Actor element for business modeling. A

    Lifeline is an individual participant in an interaction.

    Sequence Diagram

    [Sequence diagram. Lifelines: User, Parser, Splitter, Graph creation, Relation manager, Cluster algorithm, Minimal spanning tree, Display screen. Messages shown: Provide HTML documents(); Edge threshold value(); Extract text file from body tag(); Text document(); Split into nodes(); Provide nodes(); Build document graph(); Provide graph(); Calculate weight of edges(); Display nodes(); Form clusters(); Display clusters(); Ask threshold value(); Ask for the Query(); Provide query(); Show clusters(); Provide clusters(); Calculates weight(); Form weighted cluster graph(); Show weighted cluster graph(); Calculate minimal clusters(); Display Result(). An alt fragment on the input query has the branches [If available] and [If not]. Notes: "Calculates similarity between Query and each Cluster (as node)"; "Clusters relevant to the query (i.e. clusters having maximum weight with query)".]

    Figure 3.4 Sequence Diagram

    3.8.3 Class Model:


    The Class diagram captures the logical structure of the system: the Classes (including Active and Parameterized (template) Classes) and the objects that make up the model. It is a static model, describing what exists and what attributes and behavior it has, rather than how something is done.

    Class diagrams are most useful to illustrate relationships between Classes and Interfaces.

    Generalizations, Aggregations and Associations are all valuable in reflecting inheritance,

    composition or usage, and connections, respectively.

    Components of Class diagram:

    1. Class

    2. Associate

    3. Compose

    4. Realize

    A Class is a representation of objects that reflects their structure and behavior within the

    system. It is a template from which actual running instances are created, although a Class can be

    defined either to control its own execution or as a template or parameterized Class that specifies

    parameters that must be defined by any binding class. A Class can have attributes (data) and

    methods (operations or behavior). Classes can inherit characteristics from parent Classes and

    delegate behavior to other Classes. Class models usually describe the logical structure of the

    system and are the building blocks from which components are built.

    Class Diagram

    class ClassModel

    User
      - Password : int
      - User_id : int
      - User_name : int
      + Authentication() : void
      + Input() : void
      + Input_threshold() : void

    Input
      - File_name : int
      - File_size : int
      + Conversion() : void
      + Fire query() : void
      + Input threshold() : void

    Splitter
      - File_size : int
      - No. of stop words : int
      + Extract() : void
      + Parsing() : void
      + Split() : void

    Relation Manager
      - File_size : int
      - no. of nodes : int
      + Built graph() : void
      + Calculate cluster weight() : void
      + Calculate weight of nodes() : void
      + Create nodes() : void
      + Display nodes() : void
      + Display weight() : void

    Clustering algorithm
      - No. of cluster : int
      - Threshold : int
      + Create cluster graph() : void
      + Display cluster() : void
      + Form cluster() : void

    Input query
      - Keywords : int
      + Assign weight() : void
      + Compare nodes() : v
      + Fire query() : void

    Output
      - Result : int
      + Display cluster() : void
      + Display nodes() : void
      + Display result() : void

    Figure 3.5 Class Diagram

    3.8.4. Activity Diagram


    Activity diagrams are used to model the behaviors of a system, and the way in which these

    behaviors are related in an overall flow of the system. The logical paths a process follows, based

    on various conditions, concurrent processing, data access, interruptions and other logical path

    distinctions, are all used to construct a process, system or procedure.

    [Activity diagram. Initial state. Parsing State: HTML doc. to system; Parsing; Extract text file from body tag; Text file. Split State: Processing text file; Split doc. into nodes; Provide Nodes; Built doc. graph. Clustering: Find similarity between nodes; Use nearest neighbour algorithm; Doc. clustering. Minimal cluster formation: Create doc. cluster graph; Add weights to nodes in graph; Query by user; Calculate similarity between query and each cluster; Enter threshold for cluster; Find minimal cluster. Result: Show result. Final state.]

    Figure 3.6 Activity Diagram

    3.8.5. State machine Diagram


    A State Machine diagram illustrates how an element can move between states, classifying its behavior according to transition triggers and constraining guards.

    [State machine diagram. Elements shown: Initial; Input; HTML input files; Parsing; Text files; Processing text file; Split document into nodes; Assign weight to each node; Weighted graph of nodes; Extract; Clustering; Clustering by nearest neighbour; Create node up to threshold value; Assign weight to cluster; Create graph; Result; User query; Comparison between query and clusters; Display topmost result; threshold value.]

    Figure 3.7 State Machine Diagram

    3.8.6. Component Diagram


    A Component diagram illustrates the pieces of software, embedded controllers and such

    that make up a system, and their organization and dependencies. A Component diagram has a

    higher level of abstraction than a Class diagram; usually a component is implemented by one or

    more Classes (or Objects) at runtime. They are building blocks, built up so that eventually a

    component can encompass a large portion of a system.

    [Component diagram (Component Model). Components: Input file; Parser; Splitter; Graph creation; Relation manager; Clustering algorithm; Minimal spanning tree; Result. Label: Threshold and input file.]

    Figure 3.8 Component Diagram

    Chapter 4


    Implementation

    Implementation is the stage of the project in which the theoretical design is turned into a working system, giving users confidence that the new system will work efficiently and effectively. It involves careful planning, investigation of the current system and its constraints on implementation, design of methods to achieve the changeover, and evaluation of the changeover methods. Apart from planning, the major tasks in preparing for implementation are the education and training of users. The more complex the system being implemented, the more involved the system analysis and design effort required just for implementation.

    An implementation coordination committee, based on the policies of the individual organization, is appointed. The implementation process begins with preparing a plan for the implementation of the system. According to this plan, the activities are carried out, the equipment and resources are discussed, and any additional equipment needed to implement the new system is acquired.

    Implementation is the most critical stage in achieving a successful new system and in giving the users confidence that the new system will work and be effective. After the system is implemented, testing can be done. This method also offers the greatest security, since the old system can take over if errors are found or if the new system is unable to handle certain types of transactions.

    The main functions implemented for the project are listed below, by project module.

    4.1 System implementation

    The input is a text file whose contents are separated by newline characters. Each run of text delimited by a newline is a paragraph and forms the contents of one node. If the file contains no newline, the whole file becomes a single node and hence a single cluster, which degrades the quality of the result.

    The total workflow is divided into following modules:

    Module 1: HTML to text parser

    Processing the input HTML files, parsing the HTML contents and extracting the text lines.
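    As an illustration of this module, the following is a minimal Python sketch (not the project's C# code) that extracts the visible text inside the body tag using only the standard library. The names BodyTextExtractor and html_to_text are hypothetical, and skipping script and style content is an assumption.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collects visible text found inside <body>, one entry per text run."""

    SKIPPED = {"script", "style"}  # assumed: non-visible content is dropped

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and keep only non-empty body text.
        text = " ".join(data.split())
        if self.in_body and self.skip_depth == 0 and text:
            self.lines.append(text)


def html_to_text(html):
    """Return the body text of an HTML string, one text run per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

    Each text run ends up on its own line, so that the next module can treat every line as a candidate node.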

    Module 2: Processing the input text file and creating the document graph

    Functions Used:

    Split ()


    The system accepts an input text file. The file is read and split at each newline; File.ReadAllLines returns the result as a string array. The array contains paragraphs, which are then treated as nodes.

    string[] nodeList = null;

    nodeList = File.ReadAllLines(txtInputFile.Text);

    The next stage is to find the similarity edges between the nodes and to compute their similarity, or weight.

    Each paragraph becomes a node in the document graph.

    The document graph $G(V, E)$ of a document $d$ is defined as follows:

    $d$ is split into a set of non-overlapping nodes $t(v)$, $v \in V$.

    An edge $e(u, v) \in E$ is added between nodes $u, v \in V$ if there is an association between $t(u)$ and $t(v)$ in $d$.

    Hence, we can view $G$ as an equivalent representation of $d$, in which the associations between the text fragments of $d$ are depicted.
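    The definition can be sketched in Python as follows. This is only an illustration, not the project's C# code: the function names are hypothetical, nodes are numbered from 1 in file order, and "association" is assumed here to mean that two fragments share at least one word.

```python
from itertools import combinations


def split_into_nodes(text):
    """Each non-empty line (paragraph) of the text becomes one node."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def build_document_graph(nodes):
    """Return the edges (u, v) between nodes whose fragments share a word.

    Nodes are identified by their 1-based position in the file.
    """
    words = [set(node.lower().split()) for node in nodes]
    edges = []
    for u, v in combinations(range(len(nodes)), 2):
        if words[u] & words[v]:  # assumed association: a common word
            edges.append((u + 1, v + 1))
    return edges
```

    For a file with the three lines "cats chase mice", "dogs chase cats" and "birds sing", only nodes 1 and 2 are connected.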

    Module 3: Adding Weighted Edges to Document Graph

    (Note: Adding weighted edge is query independent)

    A weighted edge is added to the document graph between two nodes if they either correspond to adjacent nodes or are semantically related; the weight of an edge denotes the degree of the relationship. Here two nodes are considered related if they share common words (other than stop words), and the degree of the relationship is calculated by semantic parsing. Since the edge weights are query-independent, they can be pre-computed.

    The following input parameters are required at the pre computation stage to create the

    document graph:

    4.1.1 Threshold for edge weights:

    Only edges whose weight is not below the threshold are created in the document graph. (The threshold is a user-configurable value that controls the formation of edges.)

    Adding weighted edges is the next step after generating the document graph. For each pair of nodes $u, v$ we compute the association degree between them, that is, the score (weight) $EScore(e)$ of the edge $e(u, v)$. If $EScore(e) \ge$ threshold, then $e$ is added to $E$. The score of an edge $e(u, v)$, where nodes $u, v$ have text fragments $t(u), t(v)$ respectively, is:

    $$EScore(e) = \frac{\sum_{w \in t(u) \cap t(v)} \big(tf(t(u), w) + tf(t(v), w)\big) \cdot idf(w)}{size(t(u)) + size(t(v))}$$

    where $tf(d, w)$ is the number of occurrences of $w$ in $d$, $idf(w)$ is the inverse of the number of documents containing $w$, and $size(d)$ is the size of the document (in words). That is, for every word $w$ appearing in both text fragments we add a quantity equal to the tf/idf score of $w$. Notice that stop words are ignored.
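    A small Python sketch of this scoring step is given below. It assumes that EScore sums (tf(t(u), w) + tf(t(v), w)) * idf(w) over the words shared by both fragments and divides by the combined size of the two fragments, with idf(w) taken as 1 / (number of documents containing w). The function names and the doc_freq dictionary are hypothetical, and a positive threshold is assumed.

```python
from collections import Counter


def escore(frag_u, frag_v, doc_freq, stop_words=frozenset()):
    """Edge score between two fragments given as lists of words.

    doc_freq maps a word to the number of documents containing it,
    so idf(w) is taken as 1 / doc_freq[w].
    """
    tf_u, tf_v = Counter(frag_u), Counter(frag_v)
    size = len(frag_u) + len(frag_v)
    if size == 0:
        return 0.0
    shared = (set(tf_u) & set(tf_v)) - set(stop_words)
    total = sum((tf_u[w] + tf_v[w]) / doc_freq[w] for w in shared)
    return total / size


def add_weighted_edges(fragments, doc_freq, threshold, stop_words=frozenset()):
    """Keep only edges whose score is positive and not below the threshold."""
    edges = {}
    for u in range(len(fragments)):
        for v in range(u + 1, len(fragments)):
            score = escore(fragments[u], fragments[v], doc_freq, stop_words)
            if score > 0 and score >= threshold:
                edges[(u + 1, v + 1)] = score  # nodes numbered from 1
    return edges
```

    Because nothing here depends on the query, the resulting edge table can be computed once and reused for every query.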

    Functions Used:

    Remove Common Words ()

    Common words are eliminated from the nodes because they distort the similarity calculated between two nodes and also degrade system performance by increasing the number of computational loops. Examples: a, an, the, he, she, they, as, it, and, are, were, there, etc.
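    The filtering can be sketched in Python as below. The stop-word list contains only the examples named above, so it is illustrative rather than the list used in the project; the function name and the tokenization rule (lower-case runs of letters, digits and apostrophes) are assumptions.

```python
import re

# Illustrative list built from the examples in the text; a real list is longer.
STOP_WORDS = frozenset(
    {"a", "an", "the", "he", "she", "they", "as", "it", "and", "are", "were", "there"}
)


def remove_common_words(node_text, stop_words=STOP_WORDS):
    """Lower-case the node text, tokenize it and drop the stop words."""
    tokens = re.findall(r"[a-z0-9']+", node_text.lower())
    return [token for token in tokens if token not in stop_words]
```

    For example, "The floods were severe, and they are spreading." is reduced to the three words floods, severe and spreading.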

    The filtered two nodes are passed as parameters to the Relation Manager Class for finding

    the similarity between them.

    Relation Manager ()

    The relation manager function takes two nodes as parameters and returns the semantic relation between them in the form of a weight (EScore), computed by the edge weight formula of Section 4.1.1.

    If EScore >= Threshold, the edge is added to the document graph.

    The graph is stored in tabular form, as shown below.

    Table 4.1. Nodes and Node weights

    First Node | Second Node | Edge Weight
    --- | --- | ---
    1 | 2 | 0.5
    1 | 3 | 0.7
    ... | ... | ...
    30 | 31 | 0.8
    30 | 32 | 0.6

    Module 4: Document Clustering

    Clustering is the grouping of similar nodes (nodes whose degree of closeness is greater than or equal to the cluster threshold specified by the user) into a group. The Nearest Neighbor approach to clustering is used.

    Algorithm for Nearest Neighbor Clustering:

    1. Set i = 1 and k = 1. Assign pattern $x_1$ to cluster $C_1$.

    2. Set i = i + 1. Find the nearest neighbor of $x_i$ among the patterns already assigned to clusters. Let $s_m$ denote the closeness (edge weight) between $x_i$ and its nearest neighbor. Suppose the nearest neighbor is in cluster $m$.

    3. If $s_m$ is greater than or equal to $t$, then assign $x_i$ to $C_m$, where $t$ is the threshold specified by the user. Otherwise set k = k + 1 and assign $x_i$ to a new cluster $C_k$.

    4. If every pattern has been considered then stop, else go to step 2.
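    The four steps can be sketched in Python as follows. The sketch takes "nearest" to mean the already-assigned pattern with the highest similarity, which is what the greater-than-or-equal threshold test in step 3 implies; the function name and the similarity callback are hypothetical.

```python
def nearest_neighbor_clusters(patterns, similarity, threshold):
    """Cluster patterns in input order; returns lists of pattern indices.

    similarity(a, b) returns a closeness score (higher means more alike).
    """
    clusters = []   # clusters[k] is the list of pattern indices in cluster k
    assigned = {}   # pattern index -> index of its cluster
    for i, pattern in enumerate(patterns):
        best_j, best_sim = None, None
        # Step 2: find the nearest neighbor among already assigned patterns.
        for j in assigned:
            sim = similarity(pattern, patterns[j])
            if best_sim is None or sim > best_sim:
                best_j, best_sim = j, sim
        # Step 3: join the neighbor's cluster or open a new one.
        if best_j is not None and best_sim >= threshold:
            m = assigned[best_j]
            clusters[m].append(i)
            assigned[i] = m
        else:
            clusters.append([i])
            assigned[i] = len(clusters) - 1
    return clusters
```

    Raising the threshold makes it harder for a pattern to join an existing cluster, so more and smaller clusters are produced.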

    Functions Used:

    FindMaxWeight ()

    FindMaxWeight returns, from the document graph, the pairs of nodes having maximum edge weight, together with their weight. E.g.

    Table 4.2. Nodes and the max weight

    First Node | Second Node | Max Weight
    --- | --- | ---
    1 | 22 | 2.5
    2 | 19 | 1.2
    3 | 31 | 3.5
    ... | ... | ...
    31 | 12 | 2.7

    NearestNeighborCluster ()

    The first pair of nodes in the above table is added to the first cluster, because node 1 has its maximum weight with node 22. Nodes 1 and 22 are closely related and hence form the first cluster: Cluster_1: 1, 22.

    The next node, node 2, shows maximum weight with node 19, but neither node 2 nor node 19 is in a previous cluster, so they form a new cluster: Cluster_2: 2, 19.

    Similarly, nodes 3 and 31 form a new cluster: Cluster_3: 3, 31.

    The next pair (nodes 31 and 12) contains node 31, which is already in Cluster_3; hence node 12 is added to Cluster_3, which now becomes Cluster_3: 3, 31, 12.

    The above procedure is repeated until the end of the node pairs.
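    The walk-through above can be reproduced with the short Python sketch below. The function name is hypothetical, and leaving the clusters unchanged when both nodes of a pair are already placed is an assumption, since the text does not describe that case.

```python
def cluster_from_pairs(pairs):
    """pairs: (first_node, second_node) tuples in the order they are read."""
    clusters = []    # list of clusters, each a list of node numbers
    member_of = {}   # node number -> index of its cluster
    for first, second in pairs:
        if first in member_of and second in member_of:
            continue  # assumed: nothing to do when both are already placed
        if first in member_of:
            idx = member_of[first]
            clusters[idx].append(second)
            member_of[second] = idx
        elif second in member_of:
            idx = member_of[second]
            clusters[idx].append(first)
            member_of[first] = idx
        else:
            clusters.append([first, second])
            member_of[first] = member_of[second] = len(clusters) - 1
    return clusters
```

    Feeding it the four pairs listed in Table 4.2 gives the three clusters described in the text.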

    Module 5: Creating Clustered Document Graph

    After the clusters are formed, either by the Nearest Neighbor or by the agglomerative hierarchical algorithm, the similarity edges between similar clusters are calculated. This is the same as creating the document graph and adding similarity edges between similar nodes. Every cluster is split into its individual nodes, and each group of nodes is passed to the relation manager in order to find the weight between two sets of nodes, i.e. two clusters. [5,7,10]

    Module 6: Adding Weight to Nodes In Clustered Document Graph

    When a query $Q$ arrives, the nodes in $V$ are assigned query-dependent weights according to their relevance to $Q$. In particular, we assign to each node $v$ corresponding to a text fragment $t(v)$ a node score $NScore(v)$ defined by the Okapi formula given below.

    $$NScore(v) = \sum_{t \in Q \cap t(v)} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\, tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}$$

    tf is the term's frequency in the document,

    qtf is the term's frequency in the query,

    N is the total number of documents in the collection,

    df is the number of documents that contain the term,

    dl is the document length (in words),

    avdl is the average document length, and

    k1 (between 1.0 and 2.0), b (usually 0.75) and k3 (between 0 and 1000) are constants.
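    A Python sketch of this node score, assuming the standard Okapi (BM25) form with the quantities listed above, is shown below. The function name and the doc_freq dictionary are hypothetical; the default constants (k1 = 1.2, b = 0.75, k3 = 1000) are assumed values chosen from the stated ranges.

```python
import math
from collections import Counter


def okapi_nscore(query_terms, node_terms, doc_freq, n_docs, avdl,
                 k1=1.2, b=0.75, k3=1000.0):
    """Okapi score of one node (list of words) for a query (list of words)."""
    tf = Counter(node_terms)
    qtf = Counter(query_terms)
    dl = len(node_terms)
    score = 0.0
    for term in set(tf) & set(qtf):
        df = doc_freq[term]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        doc_part = ((k1 + 1) * tf[term]) / (k1 * ((1 - b) + b * dl / avdl) + tf[term])
        query_part = ((k3 + 1) * qtf[term]) / (k3 + qtf[term])
        score += idf * doc_part * query_part
    return score
```

    A node that shares no term with the query receives a score of zero, which is what later separates minimal clusters from the rest.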

    Functions Used:

    CalculateClusterWeight ()

    All the values mentioned above are computed and passed as parameters to the Okapi formula. The returned weight is stored in the table, e.g.

    Table 4.3. Cluster node and weight of cluster

    Cluster No | Nodes | Cluster Weight
    --- | --- | ---
    Cluster_1 | 1, 22, 1, 32 | 2.4
    Cluster_2 | 9, 17, 24 | 2.5
    Cluster_3 | 34, 12, 10 | 0
    Cluster_4 | 4, 14, 23 | 0

    Module 7: Generating Closure Graph and Finding Minimal Clusters

    The closure graph contains the minimal clusters. Minimal clusters are the clusters which show non-zero weight with the input query. In the above example (Table 4.3), only Cluster_1 and Cluster_2 are minimal clusters. The minimal clusters are the clusters which appear in the result.

    Module 8: Result

    After getting the minimal clusters, the result can be displayed in two ways:

    Top 1 Result Summary

    Multi-Result Summary

    In the top 1 result summary, the minimal cluster having the highest weight with the input query is returned; in the multi-result summary, all the minimal clusters are returned as the result. Before a cluster is displayed, it is split into its nodes and the weight of every node with the input query is calculated. The nodes are displayed in decreasing order of their weight with the input query, i.e. the node with the highest weight is displayed at the top and the one with the lowest weight at the bottom.
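    The selection of minimal clusters and the two result modes can be sketched in Python as below. The function names and the dictionaries are hypothetical, and ordering the clusters of the multi-result summary by decreasing weight is an assumption; the text only states that all minimal clusters are returned.

```python
def minimal_clusters(cluster_weights):
    """Keep only the clusters having non-zero weight with the query."""
    return {name: w for name, w in cluster_weights.items() if w > 0}


def summarize(cluster_weights, cluster_nodes, node_weights, top_one=True):
    """Return (cluster name, nodes ordered by decreasing weight) tuples."""
    minimal = minimal_clusters(cluster_weights)
    ranked = sorted(minimal, key=minimal.get, reverse=True)
    if top_one:
        ranked = ranked[:1]  # top 1 result summary
    result = []
    for name in ranked:
        nodes = sorted(cluster_nodes[name],
                       key=lambda n: node_weights[n], reverse=True)
        result.append((name, nodes))
    return result
```

    With the cluster weights of Table 4.3, the top 1 summary returns only Cluster_2, while the multi-result summary returns Cluster_2 followed by Cluster_1.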


    Chapter 5

    Testing

    5.1 Testing strategies

    Testing is an important phase in the Software Development Life Cycle. Testing should be

    planned and conducted systematically.

    Generic aspects of a test strategy


    1. Testing begins at the module level and works outwards.

    2. Different testing techniques are used at different points of time.

    3. Testing is done by the developers and, mainly for larger projects, by an independent test group.

    4. Testing and debugging are two different activities, but debugging should be incorporated

    into any testing strategy.

    5.2 Testing Techniques

    5.2.1 Black box testing

    Black box testing focuses on the functional requirements of the software. That is, black box

    testing enables the software engineer to derive sets of input conditions that will fully exercise all

    functional requirements for a program. Black box testing is not an alternative to white box testing

    techniques; rather it is a complementary approach that is likely to uncover a different class of

    errors than white box methods.

    Black box testing attempts to find errors in the following categories

    1. Incorrect or missing functions.

    2. Interface errors.

    3. Errors in data structures or external database access.

    4. Performance errors.

    5. Initialization and termination errors

    Unlike white box testing, which is performed early in the testing process, black box testing

    tends to be applied during later stages of testing. Because black box testing purposely disregards

    control structure, attention is focused on the information domain.

    5.2.2 White box testing

    Using white box testing methods, the software engineer can derive test cases that can

    1. Guarantee that all independent paths within a module have been exercised at least once.

    2. Exercise all logical decisions on their true and false sides.

    3. Execute all loops at their boundaries and within their operational bounds

    4. Exercise internal data structures to assure their validity.

    The need for white box testing arises for several reasons:

    1. Errors tend to creep into the work when we design and implement functions, conditions or controls that are out of the mainstream.

    2. We often believe that a logical path is not likely to be executed when, in fact, it may be executed on a regular basis.

    3. Typographical errors are random. When a program is translated into programming

    language source code, it is likely that some typing errors will occur.

    I have used both white box testing and black box testing. Black box testing is also called behavioral testing; it focuses on the functional requirements of the software. In this testing the software is tested as a black box, without considering its internal details. The required sets of input were supplied and the desired outputs were obtained.

    5.3 Test Cases:

    Table 5.1. Test cases

    | Test Case ID | Test Case Name | Description | Steps Carried Out | Expected Results | Actual Result |
    |---|---|---|---|---|---|
    | TC1 | HTML file | Validation of the input file | 1. Enter a correct text file containing at least 2 paragraphs. 2. Enter an incorrect folder in which web pages are stored and click "set dataset folder". | Accepted the text file, then created nodes and showed the data in a particular node. Error message: a file in a format other than text will not be uploaded. | Accepted the text file, then created nodes and showed the data in a particular node. Error message: a file in a format other than text will not be uploaded. |
    | TC2 | Changing threshold for clustering | To check how the threshold value affects the size of the clusters and the performance of the algorithm | Select the threshold value for clustering | The size of the clusters increases and the number of clusters decreases. Because the clusters are larger, the looping in the nearest neighbour algorithm increases, so its performance decreases. | The size of the clusters increased and the number of clusters decreased. Because the clusters were larger, the looping in the nearest neighbour algorithm increased, so its performance decreased. |
    | TC3 | Topmost summary | To check whether the first result is the best result | After clustering, get the minimal cluster and find the summary as the result | The cluster containing the best result for the fired query should appear at the top. In that cluster, the node containing the best result should come at the first position. | The cluster containing the best result for the fired query appeared at the top. In that cluster, the node containing the best result came first. |
    | TC4 | Similarity calculation | Checking the similarity between two clusters | Provide two clusters as input | Clustering of similar clusters | Clustering of similar clusters |
    | TC5 | Weight calculation | Calculating the weight between two nodes | Provide two nodes as input | Weight between two nodes | Weight between two nodes |
    | TC6 | Removal of common words | Checking the effect of removing common words | Provide a text file with and without common words | After removing common words, fewer words remain, so it is easier to calculate the document graph and the similar nodes. Hence the performance of the system increases. | After removing common words, fewer words remain, so it is easier to calculate the document graph and the similar nodes. Hence the performance of the system increases. |
    | TC7 | GUI | Alignment of controls | Color of all buttons should be uniform. All textboxes should be aligned in a straight line. | Textboxes should be properly aligned. Button color should be uniform. | Textboxes should be properly aligned. Button color should be uniform. |
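
    To make TC5 and TC6 concrete, the following sketch removes common words from two nodes and computes a weight between them. The stop-word list and the use of Jaccard overlap as the weight are assumptions chosen for illustration; the report does not state the exact word list or weight formula used by the system.

```python
# Assumed, illustrative stop-word list; the report does not list the common
# words that the system removes.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and"}


def content_words(text, remove_common=True):
    """Lower-case the text, split on whitespace and optionally drop stop words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if remove_common:
        words = [w for w in words if w not in STOP_WORDS]
    return words


def node_weight(node_a, node_b, remove_common=True):
    """Assumed edge weight: Jaccard overlap of the two nodes' word sets."""
    a = set(content_words(node_a, remove_common))
    b = set(content_words(node_b, remove_common))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


node_a = "The price of oil is rising in the market"
node_b = "Oil price rising sharply"
with_removal = node_weight(node_a, node_b)            # 3 shared of 5 words
without_removal = node_weight(node_a, node_b, False)  # 3 shared of 9 words
```

    In this example the first node shrinks from 9 words to 4 once common words are removed, and the shared content words account for a larger share of the comparison, which is the effect that TC6 checks.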

    5.4 User Interface (Screenshots)

    The basic user interface consists of at least three windows. The first window is used to input the text file; the user must supply the input as a text file only.

    The second window displays the available clustering techniques. Here the user selects one of the two clustering algorithms and enters the threshold value for clustering.
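
    The role of the clustering threshold can be sketched with a minimal nearest neighbour clustering routine. This is an illustrative version that assumes the threshold is a maximum distance, which matches the behaviour reported in TC2; the toy `gap` distance on numbers stands in for whatever dissimilarity the system computes between nodes.

```python
def nearest_neighbour_clusters(items, distance, threshold):
    """Cluster items with the nearest neighbour rule.

    Each item joins the cluster of its closest already-clustered item when
    that distance is at most `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for item in items:
        best_cluster, best_distance = None, None
        for cluster in clusters:
            for member in cluster:
                d = distance(item, member)
                if best_distance is None or d < best_distance:
                    best_cluster, best_distance = cluster, d
        if best_cluster is not None and best_distance <= threshold:
            best_cluster.append(item)
        else:
            clusters.append([item])
    return clusters


def gap(x, y):
    """Toy distance between two numbers, standing in for node dissimilarity."""
    return abs(x - y)


points = [1, 2, 4, 8, 9, 15]
small = nearest_neighbour_clusters(points, gap, threshold=1)
large = nearest_neighbour_clusters(points, gap, threshold=4)
```

    With threshold 1 the points fall into four clusters, `[[1, 2], [4], [8, 9], [15]]`; with threshold 4 they fall into two, `[[1, 2, 4, 8, 9], [15]]`. Larger clusters also mean that the inner loop compares each new item with more members.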

    The third window displays the query and takes the required percentage of correlation between a cluster and the query.
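
    One simple way to read "percentage of correlation" is the share of query words that a cluster contains. The sketch below uses that reading as an assumption, since the report does not give the formula; `query_correlation` and `rank_clusters` are hypothetical names. Clusters below the requested percentage are dropped and the best match is listed first, which is the ordering that TC3 checks.

```python
def query_correlation(query, cluster_text):
    """Percentage of distinct query words that occur in the cluster text."""
    query_words = set(query.lower().split())
    cluster_words = set(cluster_text.lower().split())
    if not query_words:
        return 0.0
    return 100.0 * len(query_words & cluster_words) / len(query_words)


def rank_clusters(query, clusters, min_percent):
    """Keep clusters that reach `min_percent` and list the best match first."""
    scored = [(query_correlation(query, text), text) for text in clusters]
    kept = [pair for pair in scored if pair[0] >= min_percent]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


clusters = [
    "oil exports fall",
    "oil price rise hits market",
    "football results",
]
ranked = rank_clusters("oil price rise", clusters, min_percent=30)
```

    Here `ranked` holds two clusters, with "oil price rise hits market" (100 %) ahead of "oil exports fall" (about 33 %); raising `min_percent` to 50 leaves only the first.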

    5.4.1 Before uploading the HTML file(s).

    Figure 5.1: Before uploading the HTML file(s).

    5.4.2 Uploading HTML file(s).

    Figure 5.2: Uploading HTML file(s).

    5.4.3 Browsing the HTML file(s).

    Figure 5.3: Browsing the HTML file(s).

    5.4.4 After browsing the HTML file(s).

    Figure 5.4: After browsing the HTML file(s).

    5.4.5 Processing HTML file(s) and displaying node relations.

    Figure 5.5: Processing HTML file(s) and displaying node relations.

    5.4.6 Before clustering of nodes.

    Figure 5.6: Before clustering of nodes.

    5.4.7 Cluster formation and building the clustered graph.

    Figure 5.7: Cluster formation and building the clustered graph.

    5.4.8 Taking the input query and the thresholds for the minimal combination of clusters in %.

    Figure 5.8: Taking the input query and the thresholds for the minimal combination of clusters in %.

    5.4.9 Displaying the minimal cluster as the result, along with a link to the actual web page(s).

    Figure 5.9: Displaying the minimal cluster as the result, along with a link to the actual web page(s).

    5.4.10 Displaying the actual web page(s) currently being dealt with and highlighting the output data.

    Figure 5.10: Displaying the actual web page(s) currently being dealt with and highlighting the output data.

    Chapter 6

    Future Scope


    A sentence ordering module can be used to define an ordering among the topic sentences. Another important aspect is that our system can be tuned to generate a summary of a custom size specified by the user. Our system can also generate summaries for non-English documents, provided some simple resources for the language are available. In future we will use a dictionary to add the synonyms of the query words and of the keywords as extra keywords when searching for relevant information, which should improve the quality of the summary.
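
    The synonym idea can be sketched as a simple query expansion step. The `SYNONYMS` dictionary below is a small hypothetical stand-in for the dictionary mentioned above, and `expand_query` is an illustrative name rather than part of the current system.

```python
# Hypothetical synonym dictionary; a real system could load one from a
# thesaurus or a lexical database instead.
SYNONYMS = {
    "price": ["cost", "rate"],
    "rise": ["increase", "climb"],
}


def expand_query(query, synonyms=SYNONYMS):
    """Return the query words followed by their synonyms, without duplicates."""
    expanded = []
    for word in query.lower().split():
        for candidate in [word] + synonyms.get(word, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

    For the query "oil price rise" this returns `['oil', 'price', 'cost', 'rate', 'rise', 'increase', 'climb']`, so nodes that mention "cost" or "increase" can also be matched.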

    In the news domain, the update summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day or even every hour. A reader who has already read the previous article will not be interested in reading the whole article again; (s)he will want to know only the updated news. With the help of an update summary, the reader can read and track news very easily. We can develop a system that also produces the update summary.
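
    A minimal sketch of an update summary is given below, assuming that a sentence is "already known" when most of its words occur in a single sentence of the article read earlier. The word-overlap rule, the 0.5 cut-off and the function names are assumptions made for illustration; this is not a design of the proposed system.

```python
def split_sentences(text):
    """Very rough sentence splitter: break on full stops."""
    return [s.strip() for s in text.split(".") if s.strip()]


def word_set(sentence):
    """Lower-cased set of the whitespace-separated words in a sentence."""
    return set(sentence.lower().split())


def update_summary(previous_article, new_article, max_overlap=0.5):
    """Return sentences of the new article that are not already covered.

    A sentence counts as covered when more than `max_overlap` of its words
    appear in a single sentence of the previously read article.
    """
    seen = [word_set(s) for s in split_sentences(previous_article)]
    fresh = []
    for sentence in split_sentences(new_article):
        words = word_set(sentence)
        overlap = max((len(words & old) / len(words) for old in seen),
                      default=0.0)
        if overlap <= max_overlap:
            fresh.append(sentence)
    return fresh


previous = "The storm hit the coast on Monday. Thousands lost power."
latest = ("The storm hit the coast on Monday. "
          "Power was restored to most homes on Tuesday. "
          "Schools will reopen next week.")
updates = update_summary(previous, latest)
```

    Here the repeated first sentence is dropped and `updates` keeps only the two new sentences about the power being restored and the schools reopening.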

    Conclusion

    We deal specifically with generating summaries for the news domain, where a summary is a very important and useful concept. On the same news topic, new or updated articles arrive every day or even every hour, so a reader cannot go through all the newspapers and every news article; (s)he will want to know only the summary. With the help of the news summary, the reader can read and track news very easily.

    Because we provide a query-dependent news summary, the user can easily obtain a news summary according to his/her interest.

    We deal directly with HTML pages, so the user can retrieve online news and directly get its summary. In this work we present a graph-based approach for a query-dependent multi-document summarization system, along with the nearest neighbor clustering technique. It works efficiently for news summarization.

    Because the summarization is multi-document, there is no need to go through all the newspapers. Because it is query-specific, the user can easily obtain a news summary according to his/her interest. The accuracy of the result depends on the initial edge threshold, the cluster threshold and the result accuracy percentage, so the user can control the relevance.

