
SAS Text Summarizer Studio: User’s Guide

Contents

About This Book
  Audience
  Prerequisites
  Conventions

1 About SAS Text Summarizer Studio
  1.1 What is SAS Text Summarizer Studio?
  1.2 Benefits to Using SAS Text Summarizer Studio
  1.3 Architecture
    1.3.1 Overview
    1.3.2 Compile Architecture
    1.3.3 RunTime Architecture

2 Installing
  2.1 Overview of Installing
  2.2 Prerequisite System Requirements
  2.3 Contents of the Installation Kits
  2.4 Installing
    2.4.1 Install on UNIX
    2.4.2 Install on Windows

3 Interface
  3.1 Your First Look at the SAS Text Summarizer Studio User Interface
  3.2 SAS Text Summarizer Studio Menus
    3.2.1 About Menus
    3.2.2 File Menu
    3.2.3 Edit Menu
    3.2.4 View Menu
    3.2.5 Build Menu
    3.2.6 Help Menu
    3.2.7 Toolbar
    3.2.8 Status Bar
    3.2.9 Taxonomy Window
  3.3 Important Concepts Window
    3.3.1 About Important Concepts
    3.3.2 Use the Definition Tab
    3.3.3 Use the Test Tab
  3.4 Using the Special Concept Window
    3.4.1 About the Special Concept Window
    3.4.2 Use the Definition Tab
    3.4.3 Test Tab
  3.5 Using the Tokenizer Window
    3.5.1 About the Tokenizer Window
    3.5.2 Use the Tokenizer Window
  3.6 Using the Sentence Tokenizer Window
    3.6.1 About the Sentence Tokenizer Window
    3.6.2 Use the Sentence Tokenizer Window
  3.7 Using the Paragraph Tokenizer Window
    3.7.1 About the Paragraph Tokenizer Window
    3.7.2 Use the Paragraph Tokenizer Window
  3.8 Using the Profiles Window
    3.8.1 About the Profiles Interfaces
    3.8.2 Drop-Down Menu Selections
    3.8.3 Profiles - Definition Tab
    3.8.4 Profile - Test Tab
    3.8.5 Testing Selections that are Available in the Profile - Test Tab
    3.8.6 Initial Section Node
    3.8.7 Other Section Nodes
    3.8.8 Special Concept - Definition Tab
  3.9 Use the Preferences Dialog Box

4 Creating a New Project
  4.1 Overview of Creating a New Project
  4.2 Plan Your Project
  4.3 Create a New Project
  4.4 Set Project-Wide Preferences
  4.5 Save Your Project
  4.6 Build the Summarizer

5 Specifying Concepts
  5.1 Overview of Specifying Concepts
  5.2 Important Concepts
    5.2.1 Understanding the Three Types of Important Concepts
    5.2.2 How Important Concept Matches are Ranked
    5.2.3 Understanding the Two Types of Important Concepts Files
    5.2.4 Import the Important Concept Branch
    5.2.5 Edit Your Important Concepts
  5.3 Defining the Special Concept
    5.3.1 Understanding the Function of the Special Concept
    5.3.2 Understanding Conditionals and the Weights that They Affect
    5.3.3 Write the Definition for the Special Concept
  5.4 Prioritizing Sentence Location
  5.5 Prioritizing Important Concepts

6 Specifying Tokenizers
  6.1 Overview of Specifying Tokenizers
  6.2 Import the Tokenizer
  6.3 Define the Sentence Tokenizer
  6.4 Define the Paragraph Tokenizer

7 Creating Profiles
  7.1 Overview of Creating Profiles
  7.2 Plan Your Profile
  7.3 Creating a New Profile and Sections
    7.3.1 Create a Profile
    7.3.2 About Sections
    7.3.3 Add a Section
    7.3.4 How to Specify the Important Concepts Weight in the Profile Definition Tab

8 Testing
  8.1 Overview of Testing
  8.2 Assembling a Testing Directory
  8.3 Testing Concepts
    8.3.1 Before You Test Concepts
    8.3.2 Test the Important Concepts
    8.3.3 Test the Special Concept
  8.4 Testing Tokenizers
    8.4.1 Test the Sentence Tokenizer
    8.4.2 Test the Paragraph Tokenizer
  8.5 Testing Profiles
    8.5.1 Choose Advanced Testing Selections
    8.5.2 Test a Profile
    8.5.3 Understanding Profile Testing Results
    8.5.4 Export the Test Results

9 Quick Start Guide

Reference Section

A Teragram Regex Syntax
  A.1 What Rules and Restrictions Apply to Teragram Regular Expressions?
  A.2 What Special Characters are Used with Teragram Regular Expressions?
  A.3 Special Cases

B Glossary

Index


About This Book

Audience

SAS Text Summarizer Studio is designed for the following users:

- subject matter experts who define the parameters that SAS Text Summarizer Studio uses to summarize input documents.
- persons who assemble sets of representative documents to use for testing purposes. These documents are representative of the texts that you want to summarize.
- (optional) linguists who develop the Concept rules to identify key terms in summary sentences.

You could be assigned a specific function, or you might develop a project by yourself.

Prerequisites

Here are the prerequisites for using SAS Text Summarizer Studio:

- SAS Text Summarizer Studio loaded on your machine
- access to representative documents that you want to summarize


Conventions

This manual uses the following typographical conventions:

Convention Description

TGM_ROOT The root directory where SAS Text Summarizer Studio is installed, typically the following:

Windows: C:/Program Files/SAS/SAS Text Summarizer Studio
UNIX: /opt/SAS Text Summarizer Studio

.tsa The code examples for the .tsa file are shown in a fixed-width font.

Test button The labels for user interface controls are shown in a bold, sans serif font.

Tokenizer The names of taxonomy nodes appear in fixed-width font.

www.sas.com The hypertext links are shown in light blue, fixed-width font, and are underlined.

The Question Mark button accesses SAS Text Summarizer Studio: User’s Guide in PDF format.


1 About SAS Text Summarizer Studio

- What is SAS Text Summarizer Studio?
- Benefits to Using SAS Text Summarizer Studio
- Architecture

1.1 What is SAS Text Summarizer Studio?

It is increasingly important to obtain summaries of texts in real time. Summaries enable you to quickly and easily grasp the subject matter of an input text in order to obtain subject matter expertise without reading the entire document. You gain the information contained in the documents through existing workflows. These summaries can also be distributed in formats with limited real estate such as smart phones and cell phones.

Create a customized summarizer that applies Teragram’s Advanced Linguistic technologies to identify words, sentences, and paragraphs in input texts. Locate the key Concepts, and condense these texts into readily deployable summaries.

The SAS Text Summarizer Studio program contains the summarizer with an easy-to-use, graphical user interface. Use the taxonomy tree structure in the interface to define and track the developmental progress of your project. SAS Text Summarizer Studio automatically summarizes incoming texts at the document, section, or paragraph levels. To perform these operations, SAS Text Summarizer Studio locates metadata, or Concepts, within the input text using the customized components that you define.


The summarizer contains four tools that are essential to summarization:

Important Concepts
identify the key information in the texts. Important Concepts, for example, personal names, publicly-traded companies, geographical locations and parts-of-speech, are built using SAS Content Categorization Studio. You can also purchase predefined Concepts from SAS.

Special Concept
locate the anchor words for the summary. These strings, related words, and codes that rank matches on these terms, form the definition for the Special Concept.

Tokenizers
break input streams of text into words, sentences, and paragraphs.

Profiles
specify how each document-type is summarized and the ranking of each section-type within this document-type.

In summary, SAS Text Summarizer Studio is a comprehensive tool that you use to identify and locate essential information within your documents that is then returned to the end user as a summary.
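The way tokenizers, Important Concepts, and the Special Concept could cooperate can be illustrated with a small sketch. The Python below is not SAS code and does not use any SAS or Teragram API. The regular-expression tokenizers, the `IMPORTANT_CONCEPTS` and `ANCHOR_TERMS` sets, the weights, and all function names are assumptions made purely for illustration, and Profiles (per-section ranking) are not modeled.

```python
import re

# Hypothetical stand-ins for project components (not SAS definitions).
IMPORTANT_CONCEPTS = {"sas", "teragram"}    # plays the role of Important Concepts
ANCHOR_TERMS = {"summary", "summarizer"}    # plays the role of the Special Concept
CONCEPT_WEIGHT = 1.0                        # assumed weight per Concept match
ANCHOR_WEIGHT = 2.0                         # assumed weight per anchor-word match


def split_sentences(text):
    """Naive sentence tokenizer: split after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_words(sentence):
    """Naive word tokenizer: lowercase runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def score_sentence(sentence):
    """Weighted count of Concept matches and anchor-word matches."""
    words = split_words(sentence)
    concept_hits = sum(1 for w in words if w in IMPORTANT_CONCEPTS)
    anchor_hits = sum(1 for w in words if w in ANCHOR_TERMS)
    return CONCEPT_WEIGHT * concept_hits + ANCHOR_WEIGHT * anchor_hits


def summarize(text, max_sentences=1):
    """Keep the highest-scoring sentences, returned in document order."""
    sentences = split_sentences(text)
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i]),
                    reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


if __name__ == "__main__":
    sample = ("The weather was fine. "
              "SAS builds a summarizer with Teragram technology. "
              "Nothing else happened.")
    print(summarize(sample, max_sentences=1))
```

In this toy version, a sentence that mentions both a Concept and an anchor word outranks a sentence that mentions neither, which is the general idea behind ranking candidate summary sentences.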

1.2 Benefits to Using SAS Text Summarizer Studio

SAS Text Summarizer Studio, using Teragram’s Advanced Linguistic technologies, is a comprehensive solution to the multi-faceted challenges of summarizing input documents. SAS Text Summarizer Studio uses the parameters that you specify to automatically summarize the documents that you select. SAS Text Summarizer Studio combines several key technologies to provide a comprehensive solution to summarization challenges:


Summarization
SAS Text Summarizer Studio summarizes input documents using Teragram’s Advanced Linguistic technologies that include tokenizers and metadata identification in the form of Concept extraction technologies.

A choice of summary selections
choose the settings that determine how summaries are identified. Select the number or percentage of sentences used to define a summary, the sections of the input document where the summary sentences are identified, and how Concept matches are handled.

Intuitive user interface
use the SAS Text Summarizer Studio Windows interface to create a custom project.

Testing and Display
see your summary results and the test results obtained from the various parameters that you specified.

Sample Project
use the sample SAS Text Summarizer Studio project that is included with SAS Text Summarizer Studio.
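To make the "number or percentage of sentences" selection concrete, here is a minimal sketch of how a summary length could be derived from either setting. The function name, its parameters, and the choice to round a percentage up are assumptions for illustration only. The guide does not state how the product rounds or clamps these values.

```python
import math


def sentences_to_keep(total_sentences, number=None, percentage=None):
    """Return how many sentences a summary would contain.

    Exactly one of `number` (an absolute count) or `percentage`
    (0-100 of the input sentences) must be given.
    """
    if (number is None) == (percentage is None):
        raise ValueError("specify exactly one of number or percentage")
    if number is not None:
        wanted = number
    else:
        # Assumption: round up so that a small percentage still yields a sentence.
        wanted = math.ceil(total_sentences * percentage / 100)
    # Never ask for more sentences than the document has, or fewer than zero.
    return max(0, min(total_sentences, wanted))


if __name__ == "__main__":
    print(sentences_to_keep(40, percentage=25))
    print(sentences_to_keep(5, number=8))
```

A fixed number gives summaries of predictable size, while a percentage scales the summary with the length of the input document.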

1.3 Architecture

1.3.1 Overview

Before running the SAS Text Summarizer Studio program you should understand the SAS Text Summarizer Studio architecture and how SAS Text Summarizer Studio works with SAS Content Categorization Studio. For more information, see SAS Content Categorization Studio: User’s Guide.


1.3.2 Compile Architecture

The summarizer uses two types of Important Concepts files:

<language>.concepts file
define this file using the SAS Content Categorization Studio program, or use the predefined Concepts that you purchase from SAS.

<language>.concept.xml file
contains the taxonomy structure for the Concepts.

In either case, these Concepts form the Important Concepts branch of the taxonomy for your SAS Text Summarizer Studio project.
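Because both files are named after the same `<language>` placeholder, a quick check that the pair is present can be scripted before you import them. This is only a sketch: the helper names are hypothetical, and using a value such as `English` for `<language>` is an assumption, since the guide does not list the actual language names here.

```python
from pathlib import Path


def important_concept_files(directory, language):
    """Build the two expected file paths for one language (names per the guide)."""
    base = Path(directory)
    return {
        "concepts": base / f"{language}.concepts",
        "taxonomy": base / f"{language}.concept.xml",
    }


def missing_concept_files(directory, language):
    """Return the names of expected files that are not present on disk."""
    expected = important_concept_files(directory, language)
    return [path.name for path in expected.values() if not path.exists()]


if __name__ == "__main__":
    # "English" is an illustrative value for the <language> placeholder.
    print(missing_concept_files(".", "English"))
```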

Figure 1-1 SAS Text Summarizer Studio compile architecture


1.3.3 RunTime Architecture

After SAS Text Summarizer Studio collects and assembles all of the summaries (summarizer.bin file), this file can be used by the Server to create summaries, or abstracts, from input documents.

Figure 1-2 SAS Text Summarizer Studio runtime architecture

The SAS Text Summarizer Server is available as a Java API.


2 Installing

- Overview of Installing
- Prerequisite System Requirements
- Contents of the Installation Kits
- Installing

2.1 Overview of Installing

Use this chapter to gain an understanding of the hardware requirements and the installation process for SAS Text Summarizer Studio.

2.2 Prerequisite System Requirements

Configure the local machine where you install SAS Text Summarizer Studio according to the recommended system configuration:

CPU
x86 with 1 GHz or higher required. 2+ CPUs of 2 GHz or higher, each, are recommended.

RAM
1 GB or higher is recommended, but this base number depends on the size of the project that you have loaded.


Use the table below to learn about the supported operating systems and the platforms required to run SAS Text Summarizer Studio:

Operating System                                                  Platform
Linux (Red Hat 7.x, 8, 9, Fedora 1-3, RHEL 2.1 and higher), SUSE  x86, x86-64
IBM AIX                                                           PPC
FreeBSD                                                           x86
HP-UX 32                                                          PA-RISC
Sun Solaris (32-bit)                                              x86
Sun Solaris (32-bit)                                              SPARC
Sun Solaris (64-bit)                                              UltraSPARC
Tru64 UNIX                                                        HP Alpha
Windows                                                           x86, x86-64

2.3 Contents of the Installation Kits

The SAS Text Summarizer Studio software installation kit contains all the components that are required to install and use SAS Text Summarizer Studio. This kit is distributed for Windows as a zipped file that you copy and unzip into the directory that you select. SAS Text Summarizer Studio is distributed for UNIX systems as a tar archive. In either case, the kit contains the SAS Text Summarizer Studio program, a Teragram Tokenizer, and a sample program.

Note: If you choose not to build your Concepts using SAS Content Categorization Studio, you can purchase predefined Concepts. In this case, the Concept files are included in the SAS Text Summarizer Studio program folder.


2.4 Installing

2.4.1 Install on UNIX

SAS Text Summarizer Studio is distributed on UNIX systems as a tar archive. To install the software, use the following UNIX commands:

gzip -d installKit.tar.gz
tar -xvpf installKit.tar

The -d switch in the gzip command line decompresses the distribution file (it has been compressed to save space) in preparation for the expansion of the archive tar file. The switches on the tar command are used to extract the contents from the specified tar file and to preserve the file and directory permissions of the contents.

Note: The actual name of your tar file might vary from that shown in the example.

Additional information about using the gzip and tar commands is available in the UNIX man pages.
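If you prefer to script the installation, the same two steps (decompress, then extract) can be approximated with Python's standard `tarfile` module. This is a sketch and not part of the SAS kit: the function name and the destination path are illustrative, and the archive name should be replaced with the actual name of your tar file. Only extract archives that you trust.

```python
import tarfile
from pathlib import Path


def unpack_install_kit(archive, destination):
    """Decompress and extract a gzip-compressed tar archive into `destination`.

    Returns the member names found in the archive, similar to the listing
    that the -v switch of tar prints.
    """
    destination = Path(destination)
    destination.mkdir(parents=True, exist_ok=True)
    # "r:gz" performs the gzip decompression and the tar read in one step.
    with tarfile.open(archive, "r:gz") as kit:
        names = kit.getnames()
        # Recreates the archived files and directories under `destination`.
        kit.extractall(destination)
    return names


if __name__ == "__main__":
    # Hypothetical paths; adjust both to your environment.
    for name in unpack_install_kit("installKit.tar.gz", "install_dir"):
        print(name)
```

Python applies the file modes recorded in the archive when it extracts, although newer Python versions may filter some permission bits by default, so the result can differ slightly from `tar -p`.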

2.4.2 Install on Windows

To install SAS Text Summarizer Studio on a Windows machine, complete these steps:


1. Double-click SAS_TextSummarizer_Studio_Setup.exe and the installation wizard appears.

2. Click Next in the Welcome page that appears.


The Choose Install Location page appears.

3. (Optional) Click Browse and the Browse For Folder dialog box appears. Use this dialog box to select another installation location.

4. Compare the Space required number to the Space available number to see if there is enough space to install the program on your hard drive.


5. Click Install and the Installation Complete page appears.

6. Click Close to begin working in SAS Text Summarizer Studio.


3 Interface

- Your First Look at the SAS Text Summarizer Studio User Interface
- SAS Text Summarizer Studio Menus
- Important Concepts Window
- Using the Special Concept Window
- Using the Tokenizer Window
- Using the Sentence Tokenizer Window
- Using the Paragraph Tokenizer Window
- Using the Profiles Window
- Use the Preferences Dialog Box

3.1 Your First Look at the SAS Text Summarizer Studio User Interface

To open the SAS Text Summarizer Studio user interface, go to Start —> Programs —> SAS Text Summarizer Studio —> SAS Text Summarizer Studio.

Display 3-1 SAS Text Summarizer Studio interface


The components of the main window are listed below from top to bottom:

Program and Project title bar
specifies the name of the program that you are running and the title of the project that you are working on. (The title only appears in parentheses [()] after you create a new project.)

Menu bar
contains the menus (File, Edit, View, Build, and Help) for project tasks. For more information, see Section 3.2 SAS Text Summarizer Studio Menus.

Toolbar
contains icons that correspond to some of the operations that are available in the drop-down menus in the Menu bar. For more information, see Section 3.2.7 Toolbar.

Taxonomy window
forms the left-hand section of the user interface and is used to display the taxonomy for your SAS Text Summarizer Studio project. Some operations are available only by clicking in this interface, for example, adding a new profile and expanding or collapsing Important Concepts.


3.2 SAS Text Summarizer Studio Menus

3.2.1 About Menus

The menus contain operations that apply across the entire project. For example, use menus to create a new project, open an existing project, or to build the summarizer.

3.2.2 File Menu

Here are the operations that are available in the File menu:

New
open the New Project dialog box where you name your new project and set the path to its directory.

Open
open the Open dialog box that enables you to locate and open an existing project.

Import
open the Open dialog box that enables you to locate an existing project to import.

Save
save your project.

Close
close the current project.

Recent Files
see a drop-down menu of the five most recently opened SAS Text Summarizer Studio files (.tsa). The first time you open this window the term (empty) appears to the right of Recent Files. You can change the number of recent project files to display using the Preferences dialog box.

Exit
leave the SAS Text Summarizer Studio program.


3.2.3 Edit Menu

Standard Windows operations and the Preferences operation are available in the Edit menu:

Preferences
open the Preferences dialog box that enables you to set your project-wide display settings for the number of project files (.tsa) in the Recent Files list and the colors for the results displays. For more information, see Section 3.9 Use the Preferences Dialog Box.

3.2.4 View Menu

The following two commands are located in the View menu:

Toolbar
see the expanded Toolbar. By default, these buttons include the New, Open, Save, Copy, Cut, Paste, Undo, Redo, and Find operations. (If you deselect the default Toolbar, all of the buttons but New, Open, and Save are displayed.) For more information, see Section 3.2.7 Toolbar.

Status bar
see the operation that you are performing displayed beneath the taxonomy window in SAS Text Summarizer Studio. Deselect this default selection and your operations are not displayed.

3.2.5 Build Menu

The following command is located in the Build menu:

Build Summarizer
build the summarizer containing the Concepts, tokenization, and relevancy requirements for your project before you test.


3.2.6 Help Menu

The following command is located in the Help menu:

About
displays licensing, versioning, and dating information for the version of SAS Text Summarizer Studio that you are running.

3.2.7 Toolbar

Access a number of operations using the standard toolbar that is located below the menu bar. These standard toolbar icons are shortcuts to some, but not all, of the commands that are available in the menu bar.

Figure 3-1 Standard toolbar

In order to hide, or to show, the standard toolbar, select View —> Toolbar.

Table 3-1: Standard Toolbar Icons

Icon Command

Click the New button and the New Project dialog box appears where you can name and choose a location for your project.

Click the Open button and the Open Project dialog box appears where you can select a SAS Text Summarizer Studio project file (*.tsa) to open.

Click the Save button to save your project.

Click the Copy button to copy what you selected.


Click the Cut button to cut the selected item.

Click the Paste button to perform a paste operation.

Click the Undo button to reverse your last action.

Click the Redo button to redo the last action.

Click the Find button and the Find dialog box appears where you can search text.

Table 3-1: Standard Toolbar Icons (Continued)

Icon Command

20 SAS Text Summarizer Studio: User’s Guide

Page 27: Contents€¦ · In summary, SAS Text Summarizer Studio is a comprehensive tool that you use to identify and locate essential information within your documents that is then returned

3.2.8 Status Bar

Use the Status Bar located on the bottom left side of the SAS Text Summarizer Studio User Interface to see the current operation.

Display 3-2 Status bar

3.2.9 Taxonomy Window

The taxonomy window provides a visual representation of the taxonomy of your SAS Text Summarizer Studio project in a hierarchical layout. Hierarchical means that some nodes have child nodes, or subnodes, beneath their parent nodes. Use this interface to navigate through your summarizer components as you build your project. As you navigate through these components, the interfaces on the right side of the taxonomy window change.


Display 3-3 Taxonomy window

The nodes that are displayed in the taxonomy window include those described below:

Summarizer - Project name (the first node in the taxonomy tree)

the name of the project, for example, Demo, is always displayed to the right of the Summarizer node.

Important Concepts

when you import the <language>.concepts and <language>.concept.xml files that you either created using SAS Content Categorization Studio or purchased from SAS, a taxonomy of Concepts and their definitions is displayed. These nodes form a branch of the taxonomy.

Special Concept

specify the terms that serve as anchor words in your summary sentences.

Tokenizer

import the Teragram Tokenizer file that breaks streams of input text into words.

Sentence Tokenizer

specify the delimiters that identify sentences in the input text. Delimiters are the characters that signal the end of a sentence or paragraph.

Paragraph Tokenizer

specify the delimiters that identify paragraphs in the input text.

Profiles

this taxonomy branch specifies how the information in the nodes above is applied to input documents and the sentences within the various sections of each text.

Default

specify the ways to handle the various document sources, their sections, and the summary sentences in these sections. This information includes weights and rankings.

Initial

every profile contains an Initial node that enables you to specify a definition for a section of an input document.

Special Concept

specify the anchor terms that are applied to the parent profile only. Alternatively, override the Special Concepts set for the project using the terms specified in this definition.


3.3 Important Concepts Window

3.3.1 About Important Concepts

Important Concepts form the branch of Concepts that is developed in a SAS Content Categorization Studio project. For this reason, Important Concepts can be any of the Concept types that are available in SAS Content Categorization Studio: Classifier, Grammar, and Regex (Teragram Regular Expression) Concepts.

These predefined Concepts are imported from the SAS Content Categorization Studio project directory where they were created. Alternatively, you can purchase these files from SAS. Each of these two files is used by SAS Text Summarizer Studio differently:

<language>.concepts file

contains the Concepts and their definitions.

<language>.concept.xml file

contains the taxonomy structure for the Concepts that you are loading.

Caution: If you move the <language>.concepts file or the <language>.concept.xml file, the definitions cannot be imported into the Important Concepts branch.


3.3.2 Use the Definition Tab

Use the Definition tab to load the predefined Concepts. This tab appears when you click the Important Concepts node.

Display 3-4 Important Concepts - Definition tab

You have two choices when you import your Important Concepts. Choose to use the Concepts and Concepts XML fields or select Using concept project directory:

- To use the Concepts and Concepts XML fields, complete these steps:

a. Click Browse to open the Select a Concepts File dialog box. Select the <language>.concepts file that you want to import.

b. Click Browse to open the Open File dialog box. Select the <language>.concept.xml file that contains the taxonomy for the Concepts. This file, like <language>.concepts above, is either output from SAS Content Categorization Studio or purchased directly from SAS.


- To use the Using concept project directory check box, complete these steps:

a. Select Using concept project directory and the field beneath this box is activated.

b. Click Browse to open the Browse For Folder dialog box where you can locate the SAS Content Categorization Studio directory that contains both the <language>.concepts file and the <language>.concept.xml file.

When you use this selection, the Concepts and Concepts XML fields are automatically populated.

3.3.3 Use the Test Tab

Use the Test tab to test the definitions of your Important Concepts. This tab enables you to identify anchor terms that you might want to use for your Special Concepts and any weaknesses in a definition that require correction.

Notes: The Test tab is only available for the Important Concepts node. It is not available for any child nodes.

The testing operation can only be performed after you have loaded the Teragram Tokenizer.


To open and use the components of the Test tab, complete these steps:

1. Click on the Important Concepts node and click the Test tab.

2. Click Load and the Select Test Files dialog box opens.


3. Use the Look in field to locate the directory of your testing files, for example, central repository.

4. Select a testing document, for example, BaseballTeam.txt.

5. Click OK.

6. Click Test in the Test tab. The document title appears in the Test tab to the left of the Load button, for example, BaseballTeam.txt.

The text of the tested document is displayed in the pane below the Load and Test buttons.

See any of the following results under the Results heading:

- The matching terms and the Concepts that they match appear here.

- Any necessary steps that you missed. For example, if you forget to add the word tokenizer to the Tokenizer node, a SAS Text Summarizer Studio status screen appears.

- If the summarizer needs to be rebuilt, a message to this effect appears in the Results pane or a SAS Text Summarizer Studio status screen appears.

For more information, see Section 8.3.2 Test the Important Concepts on page 104.


3.4 Using the Special Concept Window

3.4.1 About the Special Concept Window

Define a Special Concept for the strings that you do not catch with the Important Concepts. These strings are anchor terms. When a Special Concept is located in a sentence, this sentence can be ranked higher than sentences that contain one or more Important Concepts but no Special Concept. This is because you can specify conditionals that rank each term in the Special Concept definition. For more information, see Section 5.3.2 Understanding Conditionals and the Weights that They Affect on page 74.

3.4.2 Use the Definition Tab

To open and use the Definition tab, complete these steps:

1. Click on the Special Concept node in your taxonomy window and the Definition tab appears.

2. Type in the definition for your Special Concept. For more information, see Section 5.3 Defining the Special Concept on page 74.


3. Click Syntax Check to validate the syntax for this definition and a SAS Text Summarizer Studio status screen appears with a Syntax OK message.

If the SAS Text Summarizer Studio status screen appears and states that there is an error, a screen appears at the bottom of the user interface that you can use to locate the syntax error.

4. (Optional) Click the LineNo heading to reorder the lines that contain errors.

5. Review the entry under the Error heading to see the term specified for the selected line.

6. See the explanation under the Message heading to correct your syntax.

Click anywhere in the message line and the rule is highlighted in the pane below the Definition tab.

7. Click Hide Message to close the message window and to see the Show Message button.


3.4.3 Test Tab

The Special Concept - Test tab appears when you click Test. This testing tab works like the Important Concepts - Test tab. For more information, see Section 3.3.3 Use the Test Tab on page 26 and Section 8.3.3 Test the Special Concept on page 106.

3.5 Using the Tokenizer Window

3.5.1 About the Tokenizer Window

Use the Tokenizer node to import the Teragram Tokenizer that breaks streams of input text into words. By default, this file is located in the following location:

C:\Program Files\SAS_Institute\SAS Text Summarizer Studio

Notes: The Build Summarizer command in the Build menu is only operable after you add the Teragram Tokenizer.

There is no test operation for the Teragram Tokenizer.


3.5.2 Use the Tokenizer Window

To open and use the tokenizer interface, complete these steps:

1. Click on the Tokenizer icon in the taxonomy window and the Tokenizer field appears on the right side of the interface.

2. Click Browse to open the Select a Tokenizer File dialog box where you can select the tokenizer file for your project.


3. Click to the right of the Look in field to navigate to the directory that contains the tokenizer file, for example, tkzo.bin.

4. Select the tkzo.bin file.

5. Click Open.

6. Select Build —> Build Summarizer and a SAS Text Summarizer Studio confirmation screen appears.

3.6 Using the Sentence Tokenizer Window

3.6.1 About the Sentence Tokenizer Window

Write a definition for the sentence tokenizer that breaks streams of words into sentences. The Concepts that you specify are located in sentences that are ranked in order to return the sentences with the highest priority as the document summary.

To delimit sentences, you specify the characters that represent the end of a sentence. For example, type in ., ?, and !. However, it is also necessary to enter the characters that follow abbreviations or other terms that should not identify the end of a sentence. For example, SAS Text Summarizer Studio should not delimit the following sentence using the period (.) that follows the Assn. abbreviation:

The Bolt Assn. met this week.
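The idea of end-of-sentence delimiters combined with exceptions can be sketched in Python. This is an illustration only and is not the definition syntax that SAS Text Summarizer Studio uses (see Section 6.3 for that). The function name split_sentences and its exception list are hypothetical.

```python
def split_sentences(text, delimiters=".?!", exceptions=("Assn.",)):
    """Split text at delimiter characters, skipping listed exceptions.

    Hypothetical helper for illustration; it is not the product's tokenizer.
    """
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in delimiters:
            continue
        # Look at the word that ends at this delimiter, for example "Assn."
        word_start = text.rfind(" ", start, i) + 1
        if text[word_start:i + 1] in exceptions:
            continue  # an abbreviation, not the end of a sentence
        sentence = text[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    # Keep any trailing text that has no closing delimiter.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


sample = "The Bolt Assn. met this week. Did anyone attend? Yes!"
print(split_sentences(sample))
```

With the exception in place, the first sentence stays whole. Without it, the period after Assn. would split that sentence in two.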


3.6.2 Use the Sentence Tokenizer Window

To open and use the components of the Sentence Tokenizer interface, complete these steps:

1. Click on the Sentence Tokenizer node in the taxonomy window and the Definition tab appears.

2. Type your sentence delimiters into the Definition tab. For more information, see Section 6.3 Define the Sentence Tokenizer on page 85.

3. Click Syntax Check to validate the definition for your sentence tokenizer.


4. If the SAS Text Summarizer Studio status screen reports an error, a message window appears at the bottom of the interface.

5. Follow Step 4 through Step 7 in Section 3.4.2 Use the Definition Tab on page 30.


6. Click Test to open the Test tab where you test your sentence tokenizer.

For more information, see Section 3.3.3 Use the Test Tab on page 26 and Section 8.4.1 Test the Sentence Tokenizer on page 107.


3.7 Using the Paragraph Tokenizer Window

3.7.1 About the Paragraph Tokenizer Window

Write a definition for the paragraph tokenizer that identifies paragraphs within input documents. These paragraphs are grouped into sections when you add section nodes to this branch of the taxonomy.

To delimit paragraphs, you specify the characters that represent the end of a paragraph. For example, specify a line break using \n.

You can also type in the characters that should not identify the end of a paragraph.
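As a rough illustration of delimiting paragraphs on a line break, the following Python sketch splits text at each \n and numbers the resulting paragraphs. The helper name split_paragraphs is hypothetical, and the sketch does not reproduce the definition syntax described in Section 6.4.

```python
def split_paragraphs(text, delimiter="\n"):
    """Split text at each delimiter and drop empty pieces.

    Hypothetical helper for illustration; it is not the product's tokenizer.
    """
    pieces = (piece.strip() for piece in text.split(delimiter))
    return [piece for piece in pieces if piece]


document = "First paragraph.\nSecond paragraph.\n\nThird paragraph."
for number, paragraph in enumerate(split_paragraphs(document), start=1):
    print(number, paragraph)
```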

3.7.2 Use the Paragraph Tokenizer Window

To open and use the components of the Paragraph Tokenizer interface, complete these steps:

1. Click on the Paragraph Tokenizer node in the taxonomy window and the Definition tab appears.

2. Type your paragraph delimiters into the Definition tab that appears. For more information, see Section 6.4 Define the Paragraph Tokenizer on page 87.


3. Click Syntax Check to validate the definition for your paragraph tokenizer.

4. If the SAS Text Summarizer Studio status screen reports an error, a message window appears at the bottom of the interface.

5. Follow Step 4 through Step 7 in Section 3.4.2 Use the Definition Tab on page 30.


6. Click Test to open the Test interface where you test your paragraph tokenizer.

The Results pane displays the numbered paragraphs and the section for each paragraph. In this case, because only the default section is listed under the Profiles node, each paragraph is part of the Default section. (The entire document is one section.)

For more information, see Section 3.3.3 Use the Test Tab on page 26 and Section 8.4.1 Test the Sentence Tokenizer on page 107.


3.8 Using the Profiles Window

3.8.1 About the Profiles Interfaces

The profile interfaces bring together all of the technologies that you applied in the Concepts and tokenizer interfaces. Use this taxonomy branch to specify the construction of a sentence, and to weight and locate matches. Then test your profiles and their sections and make any necessary edits.

3.8.2 Drop-Down Menu Selections

When you right-click on the following nodes in the profiles branch, these drop-down menu selections are available:

Profiles node: Select Add a Profile and a new profile node and Initial section node are added to the taxonomy.

Default node: Choose from the following menu selections:

- Add a Section: Add a new section node to the taxonomy.
- Rename: Change the name of the selected node.
- Delete: Remove the selected node from the taxonomy.
- Add Special Concept: Add a Special Concept node to the taxonomy.

Special Concept node: Select Delete to remove this node from the taxonomy.


3.8.3 Profiles - Definition Tab

The Definition tab appears when you select a child profile of the Profiles node, for example, Default. Use the Definition tab to define the parameters for sentences in the specified document type.

Display 3-5 Profiles - Default tab


Set the following seven selections in the Definition tab:

Weight

set the relative weight for the sentences that appear in the location that you select using the radio buttons below the Type heading. Use a higher number for a sentence that should be included in the summary, or a lower number for a sentence of conditional value. For example, assign a weight of 100.

Gradient

create a value slope where each sentence that follows the most valued sentence loses the specified percentage of value assigned to the most valued sentence. Continuing with the example above, specify 10% to assign the next sentence a weight of 90, the next sentence a value of 80, and so forth.

Type

specifies the sentence location in the document that is assigned the highest ranking. In other words, this selection enables you to boost the relevance of sentences that are located within the specified location over those located in other parts of the document:

- Top: Assign the first sentence in the document the highest ranking.

- Middle: Assign the highest value to the sentence that is centrally located in an input document. Sentences that are located at the top and bottom of this document are the least important. All of the sentences that appear above and below the middle sentence lose value according to the percentage specified in the Gradient field. In this case, the Gradient example above applies to two sentences, one above and one below the middle sentence.

- Bottom: Assign the highest weight to the last sentence in a document. In this case, the sentences lose value, according to the Gradient setting as they go from the bottom to the top of the section.

Note: The section level settings also affect these weights. For more information, see Section 7.3.2 About Sections on page 94.


Ignore sentences that appear within quotes

click this check box to enable SAS Text Summarizer Studio to ignore any sentences that are direct quotes.

Minimal length of a sentence to qualify (as a sentence)

click either arrow to change the number of characters that are necessary to qualify as a sentence. The default setting is 1.

Important Concepts weight

specify the weight of matching Concepts when they occur in a sentence. This is the maximum weight to be added to a sentence for each Important Concept that is identified within this sentence.

Input Format

choose from the following selections in the drop-down menu: Plain Text, HTML, XML, or SGML.

Note: When you set these parameters you must be careful to take the values of the section settings into consideration. Furthermore, you might also want to consider the definitions that you create for other Profiles in your SAS Text Summarizer Studio project.
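The arithmetic behind the Weight, Gradient, and Type settings can be sketched as follows. This is a minimal Python illustration built from the 100, 90, 80 example given for the Gradient setting. The linear formula, the floor at zero, and the choice of middle sentence for an even count are assumptions, and the function name position_weights is hypothetical. The sketch ignores section-level settings and Concept weights.

```python
def position_weights(n_sentences, weight=100.0, gradient_pct=10.0, rank_type="top"):
    """Return one positional weight per sentence, in document order.

    Sketch only: the linear formula and the floor at zero are assumptions
    drawn from the 100, 90, 80 example, not documented product behavior.
    """
    if rank_type == "top":
        anchor = 0
    elif rank_type == "bottom":
        anchor = n_sentences - 1
    elif rank_type == "middle":
        anchor = (n_sentences - 1) // 2  # assumed tie-break for even counts
    else:
        raise ValueError(f"unknown rank type: {rank_type}")
    # Each step away from the anchor sentence loses gradient_pct of the top weight.
    step = weight * gradient_pct / 100.0
    return [max(weight - step * abs(i - anchor), 0.0) for i in range(n_sentences)]


print(position_weights(4, rank_type="top"))     # [100.0, 90.0, 80.0, 70.0]
print(position_weights(5, rank_type="middle"))  # [80.0, 90.0, 100.0, 90.0, 80.0]
print(position_weights(4, rank_type="bottom"))  # [70.0, 80.0, 90.0, 100.0]
```

In the Middle case, the same loss applies on both sides of the central sentence, which matches the description of the gradient applying to one sentence above and one below.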


3.8.4 Profile - Test Tab

Use the Profiles - Test tab to test your profiles against a selected document.

Display 3-6 Profiles - Test tab

The Test tab components are listed below (without the advanced options).

Doc

the title and file extension of the loaded document are displayed here.

Load button

load a document into the Test tab.

Compute button

obtain the document summaries.

Advanced button

display the advanced summary options.

Document window

displays the input document text.

Previous button

view the last document that was loaded before the present document.

Next button

use this button after you click Previous one or more times to view the text that was loaded after the document in the present view.

Results pane

specify results types.

Summary window

use this window to view the document, paragraph, or section summary based on the results that you specified in the Results pane.

Export Result button

export your summary results as a text (.txt) file.

Copy to Clipboard button

copy your summary results to the clipboard of your computer.

For more information, see Section 8.5 Testing Profiles on page 111.


3.8.5 Testing Selections that are Available in the Profile - Test Tab

To choose the testing selections, click Advanced.

Display 3-7 Profiles - Test tab with Advanced settings


The advanced testing operations are listed below:

Number

click either arrow to change the number of sentences used to construct the summary. By default this selection is set to 5.

Summary Level

SAS Text Summarizer Studio extracts the specified number or percentage of sentences and constructs a summary based on the selection that you make in this field:

- Sentences (default selection): Construct a summary of the entire document.
- Paragraph: Summarize each paragraph in the document.
- Section: Summarize each section in the document.

Percentage

click either arrow to specify a percentage of sentences, instead of an absolute number, that is used to construct the document summary. When you choose this selection, the summary is composed of the specified number of sentences with the additional sentences listed below the summary. By default this selection is set to 5%.

Relevancy Cutoff (optional)

return all the sentences, paragraphs, or sections (depending on your Summary Level setting) for which the sum total of the Important Concept, Special Concept, section, and paragraph ranking weights meets or exceeds the Relevancy Cutoff setting. Those that fall below this number are excluded.

Select From Each Paragraph (by default this operation is not selected)

use this operation only with the Sentences selection in the Summary Level field. The specified number of sentences is extracted from each paragraph and returned as the document summary.

For more information, see Section 8.5.1 Choose Advanced Testing Selections on page 111.
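The Number, Percentage, and Relevancy Cutoff selections described above can be pictured with a short Python sketch. It assumes that each sentence already carries a total weight. The function name select_sentences, the rounding rule for percentages, the precedence of the cutoff over the other two options, and the return of results in document order are all assumptions made for illustration.

```python
import math


def select_sentences(scored, number=5, percentage=None, cutoff=None):
    """Pick summary sentences from (text, total_weight) pairs.

    Sketch only: the rounding rule, the option precedence, and the
    document-order output are assumptions, not documented product behavior.
    """
    candidates = list(enumerate(scored))
    if cutoff is not None:
        # Keep every sentence whose total weight meets or exceeds the cutoff.
        chosen = [c for c in candidates if c[1][1] >= cutoff]
    else:
        if percentage is not None:
            limit = max(1, math.ceil(len(scored) * percentage / 100))
        else:
            limit = number
        # Rank by weight, highest first, and keep the top entries.
        ranked = sorted(candidates, key=lambda c: c[1][1], reverse=True)
        chosen = ranked[:limit]
    # Return the chosen sentences in their original document order.
    return [pair[0] for _, pair in sorted(chosen, key=lambda c: c[0])]


scored = [("A", 120), ("B", 40), ("C", 95), ("D", 10)]
print(select_sentences(scored, number=2))       # ['A', 'C']
print(select_sentences(scored, percentage=50))  # ['A', 'C']
print(select_sentences(scored, cutoff=40))      # ['A', 'B', 'C']
```

The cutoff differs from the other two selections in that it does not fix the size of the summary: every sentence at or above the threshold is returned.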

3.8.6 Initial Section Node

Use this node, and any other section nodes that you add to the taxonomy, to specify the weights, gradient value, and ranking type for this section.

Display 3-8 Initial - Definition tab


There are four components of the Initial section Definition window and each of these components is used to specify how SAS Text Summarizer Studio will treat the sentences in this section:

Section Weight

sets the relative weight for all of the sentences in this section, from a higher number (heaviest) to a lower number.

Rank Weight

enables you to boost the relative importance of each sentence in this section in relation to sentences located in other sections.

Rank Gradient

create a value slope where each sentence that follows the most valued sentence loses the specified percentage of value assigned to the most valued sentence.

Rank Type

specifies the sentence location that is assigned the highest ranking:

- Top: Assign the first sentence in the document the highest ranking.

- Middle: Assign the highest value to the sentence that is centrally located in an input document. Sentences that are located at the top and bottom of this document are the least important. All of the sentences that appear above and below the middle sentence lose value according to the percentage specified in the Gradient field. In this case, the Gradient example above applies to two sentences, one above and one below the middle sentence.

- Bottom: Assign the highest weight to the last sentence in a document. In this case, the sentences lose value, according to the Gradient setting as they go from the bottom to the top of the section.

Note: Consider the selections that you make for the Profiles node when you make these selections.


3.8.7 Other Section Nodes

When you add additional sections to a profile, the Definition tab includes the Marker field. Use this field to specify the name of the section. Alternatively, specify a Teragram Regular expression that enables SAS Text Summarizer Studio to locate multiple instances of this section type.

3.8.8 Special Concept - Definition Tab

Use the operations that are available in the Special Concept window for profiles in the same way that you use these operations for the Special Concept node. For more information, see Section 3.4 Using the Special Concept Window on page 29. There is one additional operation that is available in the Definition tab for this interface. By default, the Policy to Project-level Special Concept is set to Overwrite. Select Extend to use the definition that you type into the pane below this operation together with the definition that you wrote for the entire project.


Display 3-9 Special Concept - Definition tab
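The difference between the Overwrite and Extend policies can be pictured with a small Python sketch. Here plain lists of anchor terms stand in for full Special Concept definitions. The function name effective_special_terms and the sample terms are hypothetical, and the removal of duplicate terms under Extend is an assumption.

```python
def effective_special_terms(project_terms, profile_terms, policy="overwrite"):
    """Combine project-level and profile-level Special Concept terms.

    Hypothetical helper; the term lists stand in for full definitions.
    """
    if policy == "overwrite":
        # The profile-level definition replaces the project-level one.
        return list(profile_terms)
    if policy == "extend":
        # The profile-level definition is used together with the project-level one.
        merged = list(project_terms)
        for term in profile_terms:
            if term not in merged:  # assumed: duplicates are listed once
                merged.append(term)
        return merged
    raise ValueError(f"unknown policy: {policy}")


project = ["merger", "acquisition"]
profile = ["lawsuit", "merger"]
print(effective_special_terms(project, profile, policy="overwrite"))
print(effective_special_terms(project, profile, policy="extend"))
```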

3.9 Use the Preferences Dialog Box

Use the Preferences dialog box to set your optional display preferences for the entire project.


To open and use the Preferences dialog box, complete these steps:

1. Use the General tab to perform the following operations:

a. Click either arrow in the Display items in recent files list to select the number of recent project files that are listed.

b. Click Clear to remove the list of project files from the File menu. When you select this operation, the term (Empty) appears to the right of Recent Files.

2. Click the Summarization tab and you can perform the following operations:

a. To change the color of the Final Results, select the colored rectangle to the right of this title. The default color is blue.

b. To change the color of the Results from each paragraph, select the colored rectangle to the right of this title. The default color is green.


When you click either of these color boxes, the Select color dialog box appears.

You can make the following choices:

- Select a new color.
- Create a custom color.

c. Click OK.

d. Click OK in the Summarization tab.

3. Select Build —> Build Summarizer.

4. Select File —> Save.


4 Creating a New Project

- Overview of Creating a New Project
- Plan Your Project
- Create a New Project
- Set Project-Wide Preferences
- Save Your Project
- Build the Summarizer

4.1 Overview of Creating a New Project

Before you create a new SAS Text Summarizer Studio project it is important to understand the major building blocks that you use. This chapter provides a step-by-step set of instructions on how to create a new project. After you have created a new project you can use the following chapters to add the required project information:

- Chapter 5 to import your Important Concepts and to define your Special Concept

- Chapter 6 to import your word tokenizer and to define your sentence and paragraph tokenizers

- Chapter 7 to create the profiles that specify how SAS Text Summarizer Studio applies the Concepts and tokenizers to the sentences in the three sections of an input document

- Chapter 8 to test your project

4.2 Plan Your Project

Before you create a project, consider the types of summaries that you want to make available from incoming documents and the many possible ways to extract this information. To plan your project, complete these steps:

1. Clearly define the summaries that you want to return to the end user. Consider the source of your documents and the information that you want to mine from these texts.

2. Determine what types of documents you plan to summarize, for example, .html, .txt, or .xml.

3. Analyze how Concepts and summaries are related to each other. Concepts extract the key ideas from a document, while summaries contain the sentences where these Concepts are matched. Depending on how the matching sentences are weighted, they are included in, or excluded from, the summary.

In other words, what words do you expect to find in the sentences returned in a summary? Are there terms that should never be included in a summary sentence? Are there related terms that make both sentences important to the summary? Specify these words in your Concept definitions. List the key, or anchor, terms in the definition of your Special Concept.

4. Specify tokenization: Import the Teragram Tokenizer file that tells SAS Text Summarizer Studio how to break streams of text into words. Specify the characters that delimit sentences and those that delimit the paragraphs that form the sections of your input document.

5. Create profiles: Determine how SAS Text Summarizer Studio weighs the sentences located in the various sections of input documents when constructing the summary.

6. Test your Special Concept, your sentence and paragraph tokenizers, and your profiles and their sections: Use the Test tabs to import testing documents and to test the project or some of its components. For example, test sentence and paragraph tokenizers and the Concepts before you test the profiles.

4.3 Create a New Project

To create a new project, complete these steps:

1. Select File > New and the New Project dialog box appears.

2. Type the name of the new project into the Project Name field, for example, New_Project.

3. Click and the Browse For Folder dialog box appears where you can select a folder. The path to the new project appears in the Project Path field, after you select the folder location.

4. Click to select a Language, if you have purchased more than one language. By default, English (or one of the languages that you purchased) appears in this field.

5. Click OK. The Teragram Summarization Administrator user interface displays the name of the new project and the initial taxonomy.

6. Select File —> Save.

4.4 Set Project-Wide Preferences

The Preferences dialog box contains project-wide, optional operations. These selections enable you to change the number of projects listed in the File menu and the colors that are used to display the summary results.

To set these optional, project-wide settings, complete these steps:

1. Select Edit > Preferences and the Preferences dialog box appears.

2. Use the settings on the General tab to select the number of recent projects to display:

a. Use the controls in the Display items in recent files list field to select the number of recent project files. To see this number of recent projects, select File > Recent Files and view the drop-down menu.

b. Click Clear to remove the existing list of recent files from the File drop-down menu. When you click this button, and reselect File —> Recent Files, (Empty) appears.

3. Click Summarization and the contents of this tab are displayed. Use the following selections to change the colors in your summarization results:

a. To change the color of the Final Results, select the colored rectangle to the right of this title. The default color is blue.

b. To change the color of the Results from each paragraph, select the colored rectangle to the right of this title. The default color is green.

When you click either of these color boxes, the Select color dialog box appears.

You can make the following choices:

- Select a new color.

- Create a custom color.

c. Click OK.

d. Click OK in the Summarization tab.

4. Select File —> Save.

4.5 Save Your Project

Select File —> Save to save your project. Use this operation frequently to create the .tsa file.

4.6 Build the Summarizer

Build the summarizer to ensure that your components conform to the requirements for SAS Text Summarizer Studio. You build the summarizer after adding the Teragram Tokenizer and then again after each major addition or edit to your project.

To build your summarizer, complete this step:

Select Build > Build Summarizer and a SAS Text Summarizer Studio confirmation screen appears.

5 Specifying Concepts

- Overview of Specifying Concepts

- Important Concepts

- Defining the Special Concept

- Prioritizing Sentence Location

- Prioritizing Important Concepts

5.1 Overview of Specifying Concepts

A Concept is defined as an autonomous piece of data, for example, the name of a person, country, or business. Concepts identify key information in input documents that enables SAS Text Summarizer Studio to identify the sentences that summarize your input document.

After you create a new SAS Text Summarizer Studio project, the next step is to begin defining the Concept sections of your taxonomy. The two basic types of Concepts that are used to create a SAS Text Summarizer Studio project are:

Important Concepts

These predefined Concepts are imported into the SAS Text Summarizer Studio taxonomy structure. Define the Important Concepts branch of your taxonomy using the <language>.concepts and the <language>.concept.xml files that you import from SAS Content Categorization Studio or purchase from SAS. Unlike the Special Concept that is defined within the SAS Text Summarizer Studio project, Important Concepts cannot be edited within your SAS Text Summarizer Studio project. You export the Important Concepts back into SAS Content Categorization Studio to make any changes.

Caution: Do not move these files, but keep them within their project folder, or the Concept definitions are not imported when the taxonomy is loaded.

Special Concept

Use the Special Concept definition to specify a list of strings that are the anchor terms for the summary sentences. This list of entities defines the key terms that are not included in the <language>.concepts, or the <language>.concept.xml file that you import to specify the Important Concepts. Unlike Important Concepts, the Special Concept is a single Concept with a definition that is a list of terms. This definition, unlike the definitions of the Important Concepts, can be edited within the SAS Text Summarizer Studio project. You can also use conditionals, or codes, to rank matching sentences and specify related terms.

Note: The Build —> Build Summarizer operation works only after you add the Teragram Tokenizer.

The following sections explain how to import, define, and use Concepts within your SAS Text Summarizer Studio project to build an effective summarizer for incoming documents.

5.2 Important Concepts

5.2.1 Understanding the Three Types of Important Concepts

There are three types of Important Concepts that are defined within the SAS Content Categorization Studio project:

- Classifier Concepts: These Concepts are based on lists of terms to match. For example, specify the names of government officials, titles, and countries.

Display 5-1 Classifier Concept definition example

You can also import existing thesauri as Classifier Concepts.

Display 5-2 Imported thesauri example

Note: Imported thesauri are specified within the <language>.concepts and <language>.concept.xml file to be brought into the Concepts taxonomy.

- Regex Concepts: These Concepts are defined using Teragram Regular Expressions. Examples of Concepts that are best defined using Regular Expressions include e-mail lists, URLs, telephone numbers, section codes, and regulations. Use Regex Concepts to locate terms that follow a known pattern. Regex Concept definitions begin with the line __REGEX__.

Display 5-3 Regex Concept example

- Grammar Concepts: These Concepts rely on the syntactic pattern of language to identify matches, for example, noun phrases and parts of speech. Grammar Concept definitions begin with an asterisk (*) followed by the name of the Concept. Alternatively, they can define intermediate Concepts within their definitions using the line #ROOT =. If the Grammar Concept refers to an intermediate Concept (see Display 5-4), an asterisk (*) followed by the name of the intermediate Concept follows the string #ROOT =.

Display 5-4 Grammar Concept example

These three types of Important Concepts can also be defined as one of the following two types:

- Simple Concepts: Classifier and Regex Concepts are also defined as Simple Concepts. This is because Classifier and Regex Concepts name a single entity.

- Relational Concepts: Relational Concepts can define the relationship between entities and Concepts that otherwise might appear to be unrelated. Use only Grammar Concepts to define Relational Concepts. For example, identify George W. Bush and the position that he held as President.
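As a rough illustration of the kind of pattern-based matching that Regex Concepts perform, the sketch below uses Python's standard re module. It does not use Teragram Regular Expression syntax, which is not shown in this guide. The two simplified patterns, their names, and the function find_pattern_matches are assumptions made for this example only.

```python
import re

# Simplified, hypothetical patterns; these are NOT Teragram Regular Expressions.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def find_pattern_matches(text):
    """Return a dict mapping each pattern name to the list of strings it matched."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

# Made-up sample sentence for illustration.
sample = "Contact jane.doe@example.com or call 919-555-0100 for details."
matches = find_pattern_matches(sample)
```

A list of terms (as in a Classifier Concept) could not enumerate every possible address or number, which is why a pattern is the better fit for this kind of entity.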

5.2.2 How Important Concept Matches are Ranked

The Important Concept that is most frequently matched receives the highest ranking value. Frequency-based ranking is the sum of the matching instances.
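Frequency-based ranking can be pictured with a short Python sketch. It assumes that each matching instance has already been reduced to the name of the Concept that matched. The Concept names and the function rank_concepts are hypothetical and are not part of the product.

```python
from collections import Counter

def rank_concepts(matched_names):
    """Rank Concept names by their number of matching instances, highest first."""
    counts = Counter(matched_names)
    # most_common() returns (name, count) pairs sorted by descending count.
    return counts.most_common()

# Hypothetical input: one entry per matching instance found in a document.
observed = ["COUNTRY", "PERSON", "COUNTRY", "TITLE", "COUNTRY", "PERSON"]
ranking = rank_concepts(observed)
```

In this made-up input, COUNTRY matches three times and therefore receives the highest rank.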

5.2.3 Understanding the Two Types of Important Concepts Files

When you import your Important Concepts, you not only import predefined Concepts, but also a pre-built Concept taxonomy. The Important Concepts are imported into the SAS Text Summarizer Studio taxonomy as a branch that cannot be edited or changed, unless this branch of Concepts is imported back into SAS Content Categorization Studio where it was created:

<language>.concepts file

This is the binary file that contains the Concepts and their definitions. When this file is purchased from SAS, the Concept file is specific to the language that you are using, for example:

English.concepts

The <language>.concepts file includes all of the predefined Concepts for your project.

<language>.concept.xml file

The <language>.concept.xml file contains the taxonomy for the Concepts that you defined using SAS Content Categorization Studio. If you purchased predefined Concepts from SAS, this file is specified as English.concept.xml.

5.2.4 Import the Important Concept Branch

You import the Important Concepts for your project into SAS Text Summarizer Studio from SAS Content Categorization Studio. When you import these Concepts they are moved as a branch and stored in your SAS Text Summarizer Studio project as a taxonomy.

Caution: If these files do not remain in the SAS Content Categorization Studio folder where they were created, the definitions for these Concepts cannot be imported.

To import the predefined Concepts and their prebuilt taxonomy into your SAS Text Summarizer Studio project, complete these steps:

1. Right-click on the Important Concepts node and the Definition tab appears.

2. You have two choices when you import your Important Concepts. You can choose to use the Concepts and Concepts XML fields, or select Using concept project directory:

- To use the Concepts and Concepts XML fields, complete these steps:

i. Click Browse to open the Select a Concepts File dialog box. Select the <language>.concepts file that you want to import.

ii. Click Browse to open the Open File dialog box. Select the <language>.concept.xml file that contains the taxonomy for the Concepts.

- To use the Using concept project directory field, complete these steps:

i. Select Using concept project directory and the field beneath this box is activated.

ii. Click Browse to open the Browse For Folder dialog box where you can locate the SAS Content Categorization Studio directory that contains both the <language>.concepts file and the <language>.concept.xml file.

After you import your taxonomy, the taxonomy pane appears similar to the example displayed below:

5.2.5 Edit Your Important Concepts

You import the .concepts file back into SAS Content Categorization Studio to edit your Concept definitions or to change their taxonomy structure. To edit your Concepts, complete these steps:

1. To collapse the entire Concepts taxonomy, click on the plus (+) sign that is found to the left of the Top node in the taxonomy window.

2. Delete the path, or paths, to these files in the Definition tab. The Important Concepts branch disappears.

3. After the Concepts taxonomy is removed from your project you can import the .concepts file into the SAS Content Categorization Studio project. Edit the Concept names, their definitions, or the taxonomy structure.

4. Reopen your SAS Text Summarizer Studio project and use Step 1 on page 72 through Step 3 above to reload your edited Important Concepts.

5.3 Defining the Special Concept

5.3.1 Understanding the Function of the Special Concept

The Special Concept, or list of anchor words, is defined within your SAS Text Summarizer Studio project. The Special Concept definition is a variation of the Classifier Concept definition that it closely resembles. Like the Classifier Concept, the Special Concept specifies a list of entities that serve as anchors for the summary sentences. These anchor words, or significant terms, are not specified in the definitions for your Important Concepts. You can specify conditionals that tell SAS Text Summarizer Studio how to weight the specified matched term in an input document in relation to the matches on Important Concepts. Also use conditionals to define the links between related sentences when you specify related terms. You cannot specify conditionals or related terms with Classifier Concepts.

5.3.2 Understanding Conditionals and the Weights that They Affect

Conditionals are the codes that SAS Text Summarizer Studio uses to determine how a match on a sentence is prioritized, or ranked, for the summary. For this reason, the conditionals that you specify for the matched terms in your Special Concept also affect the matches returned for Important Concepts. Use the following 10 conditionals with your Special Concept:

must appear

matching sentences are weighted to exceed the value of any other match:

- CODE=6: the anchor word appears at the beginning of a sentence.

- CODE=7: the anchor word appears anywhere in the sentence.

should appear

matching sentences receive the same weight as the Important Concepts Weight that is set in the Profile - Definition tab:

- CODE=1: the anchor word should appear at the beginning of the sentence.

- CODE=10: the anchor word can appear anywhere in the sentence.

must not appear

matching sentences never appear in the summary:

- CODE=8: if the anchor word appears at the beginning of the sentence, this sentence is not a summary sentence.

- CODE=9: if the anchor word appears anywhere in the sentence, this sentence is not a summary sentence.

should not appear

the relevancy for matching sentences is decreased:

- CODE=5: matching sentences have the weight specified in the Important Concepts Weight field of the Profile - Definition tab subtracted from their value.

can appear

a match depends on the proximity of the match to other words or sentences. For this reason you specify a related term when you enter this code:

- CODE=2: matching sentences have a higher likelihood of appearing, if the other specified term also occurs. For example, specify evidence,2:adhere.

- CODE=3: a match on the first word in a sentence is definitely included in the summary, if the previous sentence is also included in the summary. For example, The evidence in this report was validated by the regulating authority. Adhere to these protocols.

- CODE=4: extends CODE=3 above, to allow the conditional to appear anywhere in the sentence. For example, The findings in this report were validated by the regulating authority. You should adhere to these protocols.

Although there is no weight setting for the Special Concept, the weight that you specify for the Important Concepts affects all occurrences of the Special Concept:

- The number specified for the Important Concepts also represents the value of any Special Concept that is coded 1 or 10. In other words, this number is added to the weight of the sentence for each instance of a Special Concept that appears in a sentence.

- The weight specified for Important Concepts is subtracted from the weight of the sentence where an instance of a Special Concept occurs with a code of 5.

- A match on a Special Concept that is coded 6 or 7 means that this sentence has the weight set for the Important Concept added to its sum. On the other hand, if the matching Special Concept is coded 8 or 9, this sentence does not appear in the summary.
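The three rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the position checks (beginning of the sentence or anywhere) are assumed to have been done already, the proximity codes 2, 3, and 4 are ignored, and the statement that must appear matches exceed the value of any other match is not modeled beyond the addition described in the last rule. The function name adjust_sentence_weight and the notion of a base weight are hypothetical.

```python
EXCLUDE_CODES = {8, 9}       # must not appear: sentence is dropped from the summary
ADD_CODES = {1, 10, 6, 7}    # should appear / must appear: weight is added
SUBTRACT_CODES = {5}         # should not appear: weight is subtracted

def adjust_sentence_weight(base_weight, matched_codes, important_weight):
    """Return the adjusted sentence weight, or None if the sentence is excluded.

    matched_codes holds one code per Special Concept instance in the sentence.
    important_weight stands for the Important Concepts Weight setting.
    """
    if any(code in EXCLUDE_CODES for code in matched_codes):
        return None
    weight = base_weight
    for code in matched_codes:
        if code in ADD_CODES:
            weight += important_weight
        elif code in SUBTRACT_CODES:
            weight -= important_weight
        # Codes 2, 3, and 4 depend on related terms and are not modeled here.
    return weight
```

For example, with an Important Concepts Weight of 3.00, a sentence with a base weight of 1.00 and two anchor terms coded 1 and 10 would end at 7.00, while the same sentence with a single term coded 5 would end at -2.00.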

5.3.3 Write the Definition for the Special Concept

To write the definition for the Special Concept, complete these steps:

1. Review the definitions for the Important Concepts and make a list of the key words that do not appear in these definitions. These terms are your anchor words.

2. Click on the Special Concept icon in the taxonomy window. The Definition tab appears.

3. Type in a list of entities, each followed by a comma (,).

4. To the right of each comma, type in a conditional. For more information, see Section 5.3.2 Understanding Conditionals and the Weights that They Affect on page 74.

5. (Optional) If you want to enter one or more related terms, type a colon (:) before each related word entry.

6. Select File —> Save.
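The entry format described in these steps (an entity, a comma, a conditional code, and optional related terms that are each preceded by a colon) can be illustrated with a small Python parser. The only entry taken from this guide is evidence,2:adhere. The function name parse_special_concept_line, the second sample entry, and the assumption that an entity does not itself contain a comma are hypothetical.

```python
def parse_special_concept_line(line):
    """Split one definition entry into (term, code, related_terms).

    Expected form: term,code[:related[:related...]]
    Assumes the term itself contains no comma.
    """
    term, _, rest = line.strip().partition(",")
    term = term.strip()
    code_text, *related = rest.split(":")
    code = int(code_text)  # raises ValueError if the code is missing or not numeric
    if not term or code not in range(1, 11):
        raise ValueError(f"invalid definition entry: {line!r}")
    return term, code, [r.strip() for r in related if r.strip()]

# The first entry comes from the guide; the second is a made-up example.
parsed = [parse_special_concept_line(entry)
          for entry in ("evidence,2:adhere", "protocol,7")]
```

Here the first entry yields the term evidence with code 2 and the related term adhere, and the second yields the term protocol with code 7 and no related terms.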

5.4 Prioritizing Sentence Location

If you want to prioritize the location of your sentences when summarizing, follow these directions:

- Leave the default setting for the Important Concepts Weight in the Profile - Definition tab at 0.00. In this case no additional weight is added when an Important Concept is located, and no weight is added to or subtracted from sentences containing a Special Concept.

- Do not use codes 6 or 7 (must appear), or codes 8 or 9 (must not appear), with the Special Concept definition.

- Change the number specified in the Relevancy Cutoff field of the Profile - Test tab.

5.5 Prioritizing Important Concepts

You might want to return only those sentences that contain one or more Important Concepts. To prioritize matches on Important Concepts, choose from the following selections:

- Eliminate the Special Concepts definition from your SAS Text Summarizer Studio project.

- Keep the Special Concepts definition, but do not specify these codes:

  - either codes 6 or 7. In this case, matching sentences appear in the summary, whether or not they contain a match on an Important Concept.

  - a code of 5, because the weight set for the Important Concepts is subtracted from the overall sentence weight. If you set a Special Concept definition code to 5, the effect on the sentence containing this Special Concept could be the same as with a sentence containing an 8 or 9 code. In other words, the sentence might be eliminated from the summary.

  - a code of 1 or 10. In this case there would be no practical differentiation between this sentence and a sentence containing an Important Concept. They are assigned the same weight.

For more information, see Section 5.3.2 Understanding Conditionals and the Weights that They Affect on page 74.

- Do not change the default settings for any of the Definition tabs, with the possible exception of the Important Concept Weight setting in the Profile - Definition tab.

- Increase the Important Concept Weight in the Profile - Definition tab above its default setting of 3.00 and return all the sentences containing an Important Concept to the summary.

- If you want to return only sentences containing, for example, two or more instances of Important Concepts you can set the Relevancy Cutoff to 2.00 in the Profile - Test tab.

6 Specifying Tokenizers

- Overview of Specifying Tokenizers

- Import the Tokenizer

- Define the Sentence Tokenizer

- Define the Paragraph Tokenizer

6.1 Overview of Specifying Tokenizers

It is necessary to extract words from input streams of text before you summarize your documents. SAS Text Summarizer Studio then breaks your documents into sentences and paragraphs so that they can be summarized according to the matches on key Concepts. The process of delimiting words, sentences, and paragraphs is referred to as tokenization.

Until the words are defined from the input stream of text, sentences cannot be identified. Similarly, SAS Text Summarizer Studio uses paragraphs to define the three major sections of your input documents.

The process of implementing tokenization technologies consists of the following steps that are discussed in detail in the remaining sections of this chapter:

1. Import the Teragram Tokenizer: This tokenizer is specific to the language that you purchased from SAS. By default the tokenizer is located in the following location:

C:/Program Files/SAS_Institute/SAS Text Summarizer Studio/tkzo.bin

For more information, see Section 6.2 Import the Tokenizer on page 82.

2. Define the Sentence Tokenizer: Specify the characters, or terms, that delimit sentences. You can also choose to specify characters that might be used to delimit sentences, but should not. For more information, see Section 6.3 Define the Sentence Tokenizer on page 85.

3. Define the Paragraph Tokenizer: Specify the characters that delimit paragraphs. You can also choose to specify characters that could delimit paragraphs, but should not. For example, you choose to exclude the HTML tag </title>. For more information, see Section 6.4 Define the Paragraph Tokenizer on page 87.

6.2 Import the Tokenizer

Import the Teragram Tokenizer that breaks input streams of text into words, for example, tkzo.bin. If you purchased SAS Text Summarizer Studio with several languages, import the tokenizer file that is specific to the language of the project that you are creating.

To import the tokenizer, complete these steps:

1. Click on the Tokenizer icon in the taxonomy window and the Tokenizer field appears on the right side of the interface.

2. Click Browse to open the Select a Tokenizer File dialog box where you can select the tokenizer file for your project.

3. Click to the right of the Look in field to navigate to the directory that contains the tokenizer file, for example, tkzo.bin.

4. Select the file.

5. Click Open.

The path to the tokenizer file is automatically entered into the Tokenizer field.

6. Select Build —> Build Summarizer and a SAS Text Summarizer Studio confirmation screen appears.

7. Select File —> Save.

6.3 Define the Sentence Tokenizer

After you add the word tokenizer you can specify the sentence tokenizer. The sentence tokenizer enables SAS Text Summarizer Studio to extract the sentences that are used to develop a summary from the incoming text. Appropriate sentence extraction is key to accurately summarizing an input document. For this reason, you might also choose to define characters that should not delimit sentences. For example, choose to specify that periods (.) that follow an abbreviation do not delimit a sentence.

To define the sentence tokenizer, complete these steps:

1. Click on the Sentence Tokenizer node in the taxonomy window and the Definition tab appears.

2. Type in the list of terms that SAS Text Summarizer Studio uses to identify sentences. The display above provides one example of this syntax and the following display provides a second example.

- Sentence delimiters are followed by the comma character (,) and code 1. The numeral 1 tells SAS Text Summarizer Studio to mark the end of a sentence.

- Special characters are used to indicate a newline (\n) and a carriage return (\r).

- Abbreviations, for example, are followed by the comma character (,) and code 2. The numeral 2 tells SAS Text Summarizer Studio not to mark the preceding term as the end of a sentence.

3. Click Syntax Check to check the accuracy of your sentence tokenizer.

If the syntax is correct, a SAS Text Summarizer Studio confirmation screen appears.

4. Click OK.

If the syntax is incorrect, the SAS Text Summarizer Studio status screen displays this information.

5. A screen also appears at the bottom of the SAS Text Summarizer Studio user interface providing the information necessary to change the syntax.

6. (Optional) Click LineNo to reorder the information about errors from the last line in the definition to the first line.

7. Select Build —> Build Summarizer.

8. Select File —> Save.
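The code 1 and code 2 entries described in step 2 can be pictured with a small Python sketch. It is not the SAS implementation: the `term,code` line layout follows the description above, but the function names, the sample definition, and the matching strategy (a code 2 term that ends at a delimiter suppresses the split) are assumptions made only for illustration.

```python
def parse_definition(lines):
    """Read hypothetical 'term,code' lines into two lists."""
    enders, exceptions = [], []
    for line in lines:
        term, _, code = line.rpartition(",")  # split on the last comma
        if not term:
            continue                          # ignore malformed lines
        if code.strip() == "1":
            enders.append(term)               # code 1: marks the end of a sentence
        elif code.strip() == "2":
            exceptions.append(term)           # code 2: must not end a sentence
    return enders, exceptions


def split_sentences(text, enders, exceptions):
    """Split at code 1 terms unless a code 2 term ends at the same position."""
    sentences, start, i = [], 0, 0
    while i < len(text):
        ender = next((e for e in enders if text.startswith(e, i)), None)
        if ender is None:
            i += 1
            continue
        end = i + len(ender)
        # Assumed rule: "Corp." ending here protects the period from splitting.
        protected = any(text.endswith(x, start, end) for x in exceptions)
        if not protected:
            sentence = text[start:end].strip()
            if sentence:
                sentences.append(sentence)
            start = end
        i = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


definition = [".,1", "?,1", "Corp.,2"]  # hypothetical sample definition
enders, exceptions = parse_definition(definition)
result = split_sentences("Acme Corp. posted gains. Will it last?", enders, exceptions)
print(result)  # ['Acme Corp. posted gains.', 'Will it last?']
```

Without the `Corp.,2` entry, the same input would be split after "Corp.", which is the kind of error that the Syntax Check and the later tokenizer tests are meant to help you catch.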

6.4 Define the Paragraph Tokenizer

After you have added the sentence tokenizer, you can specify the paragraph tokenizer that the summarizer uses to locate paragraphs. The paragraph tokenizer delimits paragraphs within the incoming document. This feature enables SAS Text Summarizer Studio to extract summary sentences from each paragraph.

To define the paragraph tokenizer, complete these steps:

1. Click on the Paragraph Tokenizer icon in the taxonomy window and the Definition tab appears. See the example below.

2. Type in the list of paragraph delimiters. These delimiters are followed by a comma character (,) and either Code 1 or Code 2. The numeral 1 marks the end of a paragraph that can be tokenized, while the numeral 2 indicates that the preceding term does not delimit a paragraph.

3. Click Syntax Check.

4. The SAS Text Summarizer Studio confirmation screen appears.

5. Click OK.

6. (Optional) If the SAS Text Summarizer Studio confirmation screen states that the syntax is incorrect, a window appears at the bottom of the user interface. This window provides the information necessary to change the tokenizer syntax.

7. Select Build —> Build Summarizer.

8. Select File —> Save.

7 Creating Profiles

- Overview of Creating Profiles
- Plan Your Profile
- Creating a New Profile and Sections

7.1 Overview of Creating Profiles

Profiles determine how SAS Text Summarizer Studio handles various input document-types. You specify one document-type for each profile. In other words, if you want to summarize HTML, XML, and text documents, you create a profile for each document-type. Profiles also enable you to specify the sections for each document-type where Concept matches take priority. For example, a news article contains key information in the first paragraph, while a clinical study might place this information at the end of the document. You can also choose how SAS Text Summarizer Studio treats the sentences within the paragraphs that comprise each section.

You can summarize the document in its entirety (that is, without prioritizing any sections). Alternatively, you can summarize a document by section or by paragraph.

7.2 Plan Your Profile

Before you create a profile, consider the structure of the texts that SAS Text Summarizer Studio will summarize. To plan your profile, complete the following steps:

1. Consider the documents that you plan to summarize:

- Determine the document types.

- Specify the sections in each document-type where the key content sentences are located.

- Specify the Important Concepts.

2. Understand the basic ideas of weighting, ranking, and gradients:

- Determine the weight that specifies the relative value of this type of document compared to other document-types, and the sections that group the paragraphs in this document-type.

- Add value to key sentences by specifying weights.

- Determine how SAS Text Summarizer Studio depreciates the value of sentences that follow the key sentence.

3. Consider how to treat the various types of parameters that you specify:

- Include, or exclude, sentences that appear inside of quotation marks.

- Specify the minimum length of a sentence by the number of characters.

- Determine the weight of Important Concept matches.

7.3 Creating a New Profile and Sections

7.3.1 Create a Profile

This section explains all of the information that is necessary to create a new profile. You can use this section with the Profiles, Profile, and Initial nodes.

To create a new profile, complete these steps:

1. Right-click on the Profiles node in the taxonomy window and select Add a Profile from the drop-down menu that appears.

2. A new Profile node with a new Initial child node is automatically added to the taxonomy.

3. Right-click on the Profile1 node and select Rename from the drop-down menu that appears.

4. A SAS Text Summarizer Studio dialog box appears. Enter the name of the new type of document into the Enter new name field, for example, News_Articles.

5. Click OK. The new name, for example, News_Articles, appears.

6. Define the new profile using the parameters in the Definition tab. For more information, see Section 3.8.3 Profiles - Definition Tab on page 41.

7. Select Build —> Build Summarizer.

8. Select File —> Save.

7.3.2 About Sections

There are two types of Sections that you need to define for your Profile:

Initial section

the Initial node is automatically added to your project when you create a new project. This node refers to the first, or only, document section (if no other section is defined). Unless your document only has one section, you can create definitions for the other sections in the document type specified by this profile.

All other Section nodes

each section node that you add has the same Definition tab as the Initial node. The Definition tabs for the added sections have one additional field, the Marker field. Alternatively, you can choose to use a regular expression as the internal marker for this section. The regular expression enables SAS Text Summarizer Studio to match multiple instances of the same section type in the selected document. For more information, see Appendix A: Teragram Regex Syntax.
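To illustrate how a regular-expression marker can match several instances of the same section type, here is a minimal sketch. It uses Python's `re` module rather than the Teragram regex engine, and the marker pattern, the sample document, and the `find_sections` helper are all hypothetical. The pattern uses only constructs that also appear in Appendix A (`\d` and `+`).

```python
import re


def find_sections(text, marker_pattern):
    """Return (marker_text, body) pairs, one per match of the marker."""
    matches = list(re.finditer(marker_pattern, text))
    sections = []
    for k, m in enumerate(matches):
        # A section body runs from the end of its marker to the next marker.
        body_end = matches[k + 1].start() if k + 1 < len(matches) else len(text)
        sections.append((m.group(0), text[m.end():body_end].strip()))
    return sections


document = (
    "Intro text before any marker.\n"
    "Finding 1\nDose A was tolerated.\n"
    "Finding 2\nDose B caused nausea.\n"
)
sections = find_sections(document, r"Finding \d+")
print(sections)
# [('Finding 1', 'Dose A was tolerated.'), ('Finding 2', 'Dose B caused nausea.')]
```

In this sketch, the text that precedes the first marker is left out of the result. That text plays the role of the Initial section, which has no marker of its own.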

7.3.3 Add a Section

To add a section to your profile, complete these steps:

1. Right-click on a profile node, for example, Default.

2. Select Add A Section from the drop-down menu that appears.

3. Right-click on the newly-added section and select Rename from the drop-down menu that appears.

4. Type in the new name of the section in the SAS Text Summarizer Studio dialog box that appears.

A new section node with the name End is automatically added to the taxonomy.

5. Type the name of the selected section into the Marker field.

(Optional) If you choose to specify a regular expression, click the Syntax Check button to validate the syntax of this expression. See the example below:

6. Type in the number that sets the relative weight for the sentences in this section into the Section Weight field.

7. (Optional) If you want to boost the relative importance of the summary sentences that are located in this section, type this number into the Rank Weight field.

8. Select the Rank Type:

- Top: Assign the first sentence in the document the highest ranking.

- Middle: Assign the highest value to the sentence that is centrally located in an input document.

- Bottom: Assign the highest weight to the last sentence in a document.

9. After you define this section you should define the settings for the Initial node. For more information, see Section 7.3.3 Add a Section on page 95.

Note: Remember that you cannot change the name of the Initial node, and you cannot delete this node.

10. Select Build —> Build Summarizer.

11. Select File —> Save.
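The Rank Type choices in step 8 can be pictured as a position-based weight that peaks at the first, middle, or last sentence. The following Python sketch is only an illustration: this guide does not state the gradient formula, so the linear decay, the function name, and the sample values are assumptions.

```python
def rank_weights(n_sentences, rank_type, rank_weight):
    """Give the anchor sentence the full rank_weight and decay linearly (assumed)."""
    if n_sentences <= 0:
        return []
    anchor = {
        "Top": 0,                          # first sentence ranks highest
        "Middle": (n_sentences - 1) // 2,  # central sentence ranks highest
        "Bottom": n_sentences - 1,         # last sentence ranks highest
    }[rank_type]
    # Largest distance from the anchor; at least 1 to avoid dividing by zero.
    span = max(anchor, n_sentences - 1 - anchor, 1)
    return [rank_weight * (1 - abs(i - anchor) / span) for i in range(n_sentences)]


top = rank_weights(5, "Top", 10)
middle = rank_weights(5, "Middle", 10)
bottom = rank_weights(5, "Bottom", 10)
print(top)     # [10.0, 7.5, 5.0, 2.5, 0.0]
print(middle)  # [0.0, 5.0, 10.0, 5.0, 0.0]
print(bottom)  # [0.0, 2.5, 5.0, 7.5, 10.0]
```

A news-style profile would typically use Top, while a profile for documents that conclude with their key findings would use Bottom.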

7.3.4 How to Specify the Important Concepts Weight in the Profile Definition Tab

If you choose to use the Relevancy Cutoff setting in the Profiles - Test tab for a profile, choose your Important Concepts Weight carefully. See the example below:

Figure 7-1 Relevancy Cutoff and Important Concepts Weight settings

You can set this number to any number. When you test a document, the unit selected by the Summary Level setting (sentence, paragraph, or section) with the highest weight is included in the summary. Sentences, paragraphs, or sections that fall below the cutoff are excluded from the summary.
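The interaction between the Important Concepts Weight and the Relevancy Cutoff can be sketched as simple arithmetic. The model below is an assumption (relevancy taken as the number of Important Concept matches multiplied by the weight, with section and Special Concept weights left out), and the names are hypothetical. It only shows why the two numbers should be chosen together.

```python
def passes_cutoff(match_count, important_weight, cutoff):
    """Assumed model: relevancy = matches * weight; kept when it meets the cutoff."""
    return match_count * important_weight >= cutoff


CUTOFF = 10  # hypothetical Relevancy Cutoff
kept_by_weight = {
    weight: [n for n in range(6) if passes_cutoff(n, weight, CUTOFF)]
    for weight in (1, 5, 10)
}
print(kept_by_weight)
# {1: [], 5: [2, 3, 4, 5], 10: [1, 2, 3, 4, 5]}
```

With a weight of 1, no sentence with five or fewer matches reaches a cutoff of 10, while a weight of 10 lets a single match through.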

8 Testing

- Overview of Testing
- Assembling a Testing Directory
- Testing Concepts
- Testing Tokenizers
- Testing Profiles

8.1 Overview of Testing

The testing operations for SAS Text Summarizer Studio are accessible when you select a taxonomy node and click the Test tab that appears on the right side of the user interface. You can test the following components against a single document:

Important Concepts

test at the Important Concepts node level, only.

Special Concept

check the syntax of your definition and test the application of your definition to an input testing document.

Sentence tokenizer

test the delimiters that specify the end of a sentence, and any characters that you specified as sentence delimiters.

Paragraph tokenizer

test the delimiters that specify the end of a paragraph, and any characters that you specified as paragraph delimiters.

Profiles

apply all of the technologies that you specified to an input document-type, for example, HTML, XML, or text. The parameters that you specify for each document section are tested at the profile level.

Before testing your profiles, test your Concepts and tokenizers to ensure that they are working as planned. Test each profile against one, or more, input texts. These testing documents should be texts that you are familiar with so that you can gauge the accuracy of the summaries that they return. When the summaries that are returned are accurate, you can anticipate similar testing results for other documents of this type.

8.2 Assembling a Testing Directory

Before you begin testing your documents and working with the Test tabs, assemble approximately 10 documents that you are familiar with into a testing directory on your local machine. These documents should be of the same type as the profiles that you create. For example, if you create a profile for text documents, assemble ten .txt documents.

8.3 Testing Concepts

8.3.1 Before You Test Concepts

Before you test either the Important Concepts branch of the taxonomy, or the Special Concept definition, complete these steps:

1. Import the Teragram Tokenizer into the project using the Tokenizer node.

2. Select File —> Save.

3. Select Build —> Build Summarizer.

The Important Concepts that you import into your project identify the key terms in the summary sentences. These sentences are automatically given the highest weighting unless you perform one of the following operations:

- You assign a code of 10 to a term in your Special Concept definition. In this case, the matched Special Concept is assigned a weight that is equivalent to the Important Concept that appears the greatest number of times in a sentence.

- You enter the code 6 or 7 for a Special Concept. In this case, the weight of the sentence where this term is matched exceeds the sum of all of the other summary sentences.

Use the Test tab that is accessible when you select each of these taxonomy nodes to test known documents for these terms. If the expected results are not returned, you can redefine and retest your Concepts iteratively until the desired results are returned.

8.3.2 Test the Important Concepts

After you import your Important Concepts and Teragram Tokenizer, you can test the Important Concepts against some of your testing documents. To test the Important Concepts, complete these steps:

1. Select the Important Concepts node in the taxonomy window.

2. Click Test to open the Test tab.

3. Click Load.

4. The tsa window appears. Use the Look in field to locate the testing folder.

5. Select a test file.

6. Click OK.

7. Click Test in the Test tab.

See these testing results in the Results pane:

- A list of the matched terms appears in black font on the left side of this pane.

- The matched Concepts appear in red font and are enclosed in parentheses (()) to the right of these terms.

8. Analyze the testing results shown in the Results pane to see if these results meet your expectations. For example, see if the tokenizer matched an abbreviation such as Corp.

9. (Optional) To change these definitions, names, or to add, or delete any of these Concepts, import this taxonomy branch into SAS Content Categorization Studio. For more information, see Section 5.2.5 Edit Your Important Concepts on page 72.

8.3.3 Test the Special Concept

After you test the Important Concepts, test the Special Concept. To test the Special Concept, complete these steps:

1. Click on the Special Concept node in the taxonomy window.

2. Follow Step 2 through Step 7 of Section 8.3.2 Test the Important Concepts. The test results for the Special Concept appear under the Results heading.

See the testing results in the Results pane:

- A list of the matched terms appears in black font on the left side of this pane.

- The matched Concepts appear in red font and are enclosed in parentheses (()) to the right of these terms.

3. Analyze the testing results shown in the Results pane to see if these results meet your expectations. For example, see if the tokenizer matched an abbreviation such as Corp.

4. (Optional) If you do not see the expected results, click the Definition tab and edit your Concept.

8.4 Testing Tokenizers

8.4.1 Test the Sentence Tokenizer

To test the sentence tokenizer, complete these steps:

1. Select the Sentence Tokenizer node in the taxonomy window.

2. Follow Step 2 through Step 7 of Section 8.3.2 Test the Important Concepts. The test results for the sentence tokenizer that you defined appear under the Results heading.

See the testing results in the Results pane:

- A number appears in bold font enclosed in square brackets ([]) above each sentence.

- The summary sentence appears below the number.

3. Analyze the testing results shown in the Results pane to see if these results meet your expectations. For example, see if the tokenizer matched an abbreviation such as Assn.

4. (Optional) If you do not see the results that you expect, click the Definition tab and edit the definition for this tokenizer.

8.4.2 Test the Paragraph Tokenizer

To test the paragraph tokenizer, complete these steps:

1. Select the Paragraph Tokenizer node in the taxonomy window.

2. Follow Step 2 through Step 7 of Section 8.3.2 Test the Important Concepts. The test results for the paragraph tokenizer that you defined appear under the Results heading.

See the testing results in the Results pane:

- Paragraph numbers appear in bold font enclosed in square brackets ([]).

- Sections (SEC) are identified by their name.

- Paragraphs (PAR: 0) are numbered according to the section that they appear within.

3. Analyze the testing results shown in the Results pane to see if these results meet your expectations.

4. (Optional) If you do not see the results that you expect, click the Definition tab and edit the definition for this tokenizer.

8.5 Testing Profiles

8.5.1 Choose Advanced Testing Selections

Before you test a profile, specify the testing operations in the Advanced section of the Test window.

To specify your advanced testing selections, complete these steps:

1. Click on a profile node in the taxonomy window.

2. Select Number or Percentage:

- If you select Number, change the number of sentences that comprise your summary. The default setting is 5.

- If you select Percentage, change the percentage of the document’s sentences that comprise its summary. The default setting is 5%.

3. Select Summary Level. This setting determines where the summary sentences are drawn from.

- If you select Sentences, sentences from the entire input document comprise the summary displayed in the Results pane.

- If you select Paragraph, sentences from each of the document’s paragraphs comprise the summary.

- If you select Section, sentences from each section comprise the summary.

4. (Optional) Specify the Relevancy Cutoff. This setting is used to return all of the sentences that meet, or exceed, this threshold. SAS Text Summarizer Studio automatically adds the following weights to determine the relevancy of a sentence:

- The weights of all instances of Important Concepts that appear.

- The Special Concept, whether the matched Special Concept is coded to ensure that it:

  - must always be part of the summary (codes 6 or 7)
  - can never be part of the summary (codes 8 or 9)
  - receives a weight equal to the Important Concept weight (code 1 or 10) for each occurrence of this Concept
  - has the Important Concept weight (code 5) deducted from the overall weighting of the sentence containing one or more Special Concepts matching this code.

5. Set the profile section weights and rankings in each Definition tab.

6. Click Select From Each Paragraph and SAS Text Summarizer Studio extracts sentences from each of the paragraphs in the document.
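The weighting rules in step 4 can be sketched in Python as follows. This is an illustrative model, not the product's algorithm: the function names, the order in which the forcing codes are checked (8 or 9 before 6 or 7), and the sample sentences are assumptions. The code meanings themselves are taken from the list in step 4.

```python
ALWAYS, NEVER = "always", "never"  # hypothetical markers for forcing codes


def sentence_relevancy(important_matches, special_codes, important_weight,
                       section_weight=0):
    """Return a relevancy number, or ALWAYS / NEVER for forcing codes."""
    if any(code in (8, 9) for code in special_codes):
        return NEVER                    # can never be part of the summary
    if any(code in (6, 7) for code in special_codes):
        return ALWAYS                   # must always be part of the summary
    score = section_weight + important_matches * important_weight
    # Codes 1 and 10 add the Important Concept weight for each occurrence.
    score += important_weight * sum(1 for code in special_codes if code in (1, 10))
    # Code 5 deducts the Important Concept weight once per sentence.
    if 5 in special_codes:
        score -= important_weight
    return score


def apply_cutoff(scored, cutoff):
    """Keep forced sentences plus those that meet or exceed the cutoff."""
    return [name for name, rel in scored
            if rel == ALWAYS or (rel != NEVER and rel >= cutoff)]


scored = [
    ("A", sentence_relevancy(2, [], 3)),       # 2 matches * 3 = 6
    ("B", sentence_relevancy(0, [6], 3)),      # forced in by code 6
    ("C", sentence_relevancy(5, [8], 3)),      # forced out by code 8
    ("D", sentence_relevancy(1, [10, 5], 3)),  # 3 + 3 - 3 = 3
]
summary = apply_cutoff(scored, cutoff=5)
print(summary)  # ['A', 'B']
```

Sentence C is excluded even though it has the most Important Concept matches, which shows how a single Special Concept code can override the accumulated weights.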

8.5.2 Test a Profile

After you set your test selections using the Advanced button in the Test tab, you can test a profile.

To test a profile, complete these steps:

1. Click Load. The tsa dialog box appears, where you can locate a testing file.

2. Click Compute.

3. See your results in both the document and summary panes of this interface.

4. Use the components of the Test tab to see the summary and the related results.

8.5.3 Understanding Profile Testing Results

After you test your profile, use the results displayed in the Test tab to see if the summary is what you expected. If not, you can make changes in either the Test tab, or to the appropriate components of your project.

The sentences highlighted in blue are the final summary results. The sentences highlighted in green are the summary sentences selected from each paragraph.

Below the document window, and to the right of the Next button, there is a list of results for the summary sentences highlighted in blue. The appearance of these codes depends on the selection that you make in the Results pane.

REL

the relevancy score for the first sentence that is highlighted in blue.

ID

the sentence identification number.

IMP_WT

the Important Concepts weight for this sentence.

PAR

the paragraph where the sentence appears.

POS

the position of the sentence in the text.

REL

the relevancy number is the sum of all instances of the Important and Special Concepts that occur within a specific sentence. The weight of each instance is added to the other instance weights and to the weight of the section, and all of the document sentences are ranked accordingly.

SENT

the number of the sentence in the text.

SEC_WT

the weight for this section.

The summary is displayed in the summary window below these results. Each of these sentences is followed by its REL, ID, and POS values, as explained above.

Use the Results pane to specify the type of results that you want SAS Text Summarizer Studio to return. It is not necessary to click Compute when you click one of the following selections:

Result

see the comprehensive summary.

Section

see the testing results according to the selected document section.

Paragraph

see the document by numbered paragraphs.

Sentence

see the relevancy for each sentence in the tested document.

Section Weight

see the section weight for each sentence in a document.

Section Rank Weight

rank sentences in your results according to the settings that you specified in the section Definition tabs for this document-type.

Doc Rank Weight

the score that measures the confidence of these matches.

Important Weight

see the matching Important Concepts and the relevancy for each, as well as the combined relevancy score.

Special Weight

see the matching terms in the Special Concept that you created. The weight for the Special Concept appears here, if you have defined this Concept.

The testing analyses that are performed depend on the Results selections that you click.

Table 8-1: Results Selections and Testing Analyses

Testing Option SEC PAR SENT SEC_WT D_RNK_WT IMP_WT SPL_WT REL

Result X

Section (SEC) X

Paragraph (PAR) X X

Sentence (SENT) X X X X

Section Weight (SEC_WT) X X X X X

Section Rank Weight X X X X X

Doc Rank Weight (D_RNK_WT) X X X X X X

Important Weight (IMP_WT) X X X X X X*

Special Weight (SPL_WT) X X X X X X X**

*: Relevancy (REL) is the sum of the Important Weights.

**: Relevancy (REL) is the sum of the Important Weights plus Special Weights.

8.5.4 Export the Test Results

To export your testing results, complete these steps:

1. Click Export Result and the Choose a filename to save under dialog box appears.

2. Click in the Save in field to locate a folder for this file.

3. Type the name of the new text file that you are exporting into the File name field.

4. Click Save to save these results as a .txt file.

To copy your results onto the clipboard of your computer, click Copy to Clipboard. You can copy these results to another program on your local machine, for example, Notepad. Alternatively, use this operation to see your results in another file format.

9 Quick Start Guide

To create a SAS Text Summarizer Studio project, complete these steps:

1. Familiarize yourself with the SAS Text Summarizer Studio architecture. For more information, see Section 1.3 Architecture on page 5.

2. Understand how to use this interface. For more information, see Chapter 3.

3. Check your hardware requirements. For more information, see Section 2.2 Prerequisite System Requirements on page 9.

4. Install the SAS Text Summarizer Studio program. For more information, see Section 2.4 Installing on page 11.

5. Plan your SAS Text Summarizer Studio project. For more information, see Section 4.2 Plan Your Project on page 56.

6. Create a new project using the following steps:

a. Set your project-wide settings. For more information, see Section 3.9 Use the Preferences Dialog Box on page 51.

b. Import your Important Concepts from SAS Content Categorization Studio or use the predefined Concepts binary files purchased from SAS. For more information, see Section 5.2 Important Concepts on page 64.

c. Define the Special Concept. For more information, see Section 5.3 Defining the Special Concept on page 74.

d. Specify the Teragram Tokenizer that breaks streams of input text into words. For more information, see Section 6.2 Import the Tokenizer on page 82.

e. Define the sentence tokenizer. For more information, see Section 6.3 Define the Sentence Tokenizer on page 85.

f. Define the paragraph tokenizer by specifying paragraph delimiters. For more information, see Section 6.4 Define the Paragraph Tokenizer on page 87.

7. Define your profiles to handle the types of documents that you summarize. For more information, see Chapter 7.

Note: If you plan to set the optional Relevancy Cutoff, do not set relevancy or weights for the sections that comprise this profile.

8. Test the summarizer that you developed against your profiles using advanced selections. For more information, see Chapter 8.

9. Edit the settings for any of the SAS Text Summarizer Studio components until you obtain the summary results that you require.

Reference Section

- Appendix A: Teragram Regex Syntax
- Appendix B: Glossary

A Teragram Regex Syntax

- What Rules and Restrictions Apply to Teragram Regular Expressions?
- What Special Characters are Used with Teragram Regular Expressions?
- Special Cases

A.1 What Rules and Restrictions Apply to Teragram Regular Expressions?

The following rules and restrictions apply to Teragram regular expressions:

- Any single character a (ASCII 1 through 252, subject to escaping restrictions in 14 below) is a regular expression, and it matches precisely that character.

- If a and b are regular expressions, then so is ab that matches whatever a matches followed by whatever b matches (concatenation).

- If a and b are regular expressions, then so is a|b that matches either whatever a matches or whatever b matches.

- If a is a regular expression, then so is (?:a) that simply serves as a grouping mechanism without remembering what it was grouping. For example, (?:ababb)|b matches either ababb or b. This regular expression is difficult to express without the grouping mechanism.

- A character class is a regular expression. One or more characters inside square braces ([]), for example, [abc] matches any of the characters inside. A range inside a character class, a-z, for example, matches any ASCII character whose value is between a and z, inclusive. Any character can appear in a character class. However, \ (backslash), - (hyphen), and ] (close brace) must be preceded by a backslash, and ^

121

Page 128: Contents€¦ · In summary, SAS Text Summarizer Studio is a comprehensive tool that you use to identify and locate essential information within your documents that is then returned

(carat) must be preceded by a backslash if it is the first character in the character class.

- A negated character class is a regular expression. It consists of one or more characters inside square brackets, with ^ (caret) as the first character to indicate negation. For example, [^abc] matches any character except a, b, or c.

- If a is a regular expression, then so is a*, which matches 0 or more occurrences of whatever a matches.

- If a is a regular expression, then so is a+, which matches 1 or more occurrences of whatever a matches.

- If a is a regular expression, then so is a?, which matches 0 or 1 occurrences of whatever a matches.

- If a is a regular expression, then so is a{n,m}, which matches at least n but no more than m concatenated occurrences of whatever a matches.

- If a is a regular expression, then so is a{n,}, which matches at least n concatenated occurrences of whatever a matches.

- If a is a regular expression, then so is a{n}, which matches exactly n concatenated occurrences of whatever a matches.

- If filename is the name of a file containing the binary representation of a sub-expression a, then the syntax (?$filename) inserts that sub-expression into the current regular expression. To create such a file, use the _treg utility as follows:

_treg -to_fso 'a' >filename
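Most of the constructs listed above also exist in other Perl-style regular expression engines. The following minimal sketch uses Python's re module as a stand-in, not the Teragram engine, so it only illustrates how the constructs behave in general. The EXAMPLES table and the check helper are hypothetical names introduced for this illustration, and re.fullmatch is used on the assumption that a pattern must match the whole input.

```python
import re

# Each entry: (pattern, strings expected to match, strings expected not to match).
# Python's re module stands in for the Teragram engine (an assumption).
EXAMPLES = [
    ("ab",       ["ab"],            ["a", "abb"]),    # concatenation
    ("a|b",      ["a", "b"],        ["ab"]),          # alternation
    ("(?:ab)+c", ["abc", "ababc"],  ["c", "abab"]),   # non-remembering group
    ("[abc]",    ["a", "c"],        ["d"]),           # character class
    ("[^abc]",   ["d"],             ["a"]),           # negated character class
    ("a*",       ["", "aaa"],       ["b"]),           # 0 or more
    ("a+",       ["a", "aa"],       [""]),            # 1 or more
    ("a?",       ["", "a"],         ["aa"]),          # 0 or 1
    ("a{2,3}",   ["aa", "aaa"],     ["a", "aaaa"]),   # between n and m
    ("a{2,}",    ["aa", "aaaaa"],   ["a"]),           # at least n
]


def check(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


for pattern, good, bad in EXAMPLES:
    assert all(check(pattern, s) for s in good), pattern
    assert not any(check(pattern, s) for s in bad), pattern
```

Running the script raises no assertion error if every pattern behaves as the comments describe.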


A.2 What Special Characters are Used with Teragram Regular Expressions?

Table A-1 below lists the special characters that can be used in Teragram regular expressions and gives the meaning of each.

Table A-1: Special Characters in Teragram Regular Expressions

Character   Meaning
\a          Alarm (beep)
\n          Newline
\r          Carriage return
\t          Tab
\f          Form feed
\e          Escape
\d          Digit (same as [0-9])
\D          Not a digit (same as [^0-9])
\w          Word character (same as [a-zA-Z_0-9])
\W          Non-word character (same as [^a-zA-Z_0-9])
\s          Whitespace character (same as [ \t\n\r\f])
\S          Non-whitespace character (same as [^ \t\n\r\f])
.           Wildcard (matches any character)
\xh         Hexadecimal number, where h is a hexadecimal digit
\xhh        Hexadecimal number, where each h is a hexadecimal digit
0o          Octal number, where o is an octal digit
0oo         Octal number, where each o is an octal digit
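The shorthand classes in Table A-1 are defined in terms of explicit character classes, which makes them easy to compare against another engine. The sketch below does this with Python's re module in ASCII mode, as an assumption-laden illustration and not a test of the Teragram engine. EQUIVALENTS and disagreements are hypothetical names. Note that Python's \s also matches the vertical tab (ASCII 11), so Python departs from the table's definition of \s and \S at that one character.

```python
import re

# Shorthand classes paired with the explicit classes given in Table A-1.
EQUIVALENTS = {
    r"\d": r"[0-9]",
    r"\D": r"[^0-9]",
    r"\w": r"[a-zA-Z_0-9]",
    r"\W": r"[^a-zA-Z_0-9]",
    r"\s": r"[ \t\n\r\f]",
    r"\S": r"[^ \t\n\r\f]",
}


def disagreements(shorthand, explicit):
    """List the character codes (1 through 252) on which two patterns differ."""
    differing = []
    for code in range(1, 253):
        ch = chr(code)
        a = re.fullmatch(shorthand, ch, flags=re.ASCII) is not None
        b = re.fullmatch(explicit, ch, flags=re.ASCII) is not None
        if a != b:
            differing.append(code)
    return differing


if __name__ == "__main__":
    for shorthand, explicit in EQUIVALENTS.items():
        print(shorthand, explicit, disagreements(shorthand, explicit))
```

An empty list means that Python's shorthand and the explicit class from the table agree on every character in the permitted range.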


A.3 Special Cases

Four special cases apply to Teragram regular expressions:

1. Metacharacters, for example [, ], (, ), ?, *, +, ., \, and |, must be escaped with a backslash (\) to have their literal meaning. Inside a character class, however, only the characters that A.1 names explicitly need escaping.

2. No support is provided for backward references or () as a remembering grouping mechanism.

3. No support is provided for ^ as the beginning-of-line zero-width assertion or for $ as the end-of-line zero-width assertion. Unlike in Perl regular expressions, both of these anchors are implicitly assumed.

4. ASCII values 0, 253, 254, and 255 are reserved characters that cannot be used in Teragram regular expressions. Regular expressions work only on single-byte characters.
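The special cases above can be approximated in a general-purpose language when preparing patterns for review. The following sketch is written in Python and relies on its re module, not on the Teragram engine. The helper names is_usable, literal, and matches_whole are hypothetical, and treating implicit anchoring as a whole-string match is one reading of special case 3, not a confirmed description of the engine.

```python
import re

# Values that special case 4 lists as reserved.
RESERVED = {0, 253, 254, 255}

# Metacharacters named in special case 1.
METACHARACTERS = "[]()?*+.\\|"


def is_usable(pattern_text):
    """Check that every character is single-byte and not a reserved value."""
    return all(ord(ch) <= 255 and ord(ch) not in RESERVED for ch in pattern_text)


def literal(text):
    """Backslash-escape the metacharacters so that they match literally."""
    return "".join("\\" + ch if ch in METACHARACTERS else ch for ch in text)


def matches_whole(pattern, text):
    """Emulate implicit anchoring at both ends by matching the whole text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(literal("1+1"))                         # escaped form of the text
    print(matches_whole(literal("1+1"), "1+1"))   # literal match
    print(matches_whole("ab", "xabx"))            # no match under anchoring
    print(is_usable("\u20ac"))                    # multi-byte character
```

Escaping before matching keeps a literal string such as 1+1 from being read as a quantified expression.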


B Glossary

Anchor words

these terms form the definition for the Special Concept. Also see Special Concept.

Branch

Important Concepts are imported into the SAS Text Summarizer Studio project as a pre-built taxonomy. They appear in a hierarchical taxonomy under the Important Concepts node in the taxonomy window.

Classifier Concept

this definition is comprised of strings to match in an input document.

Concept

specifies the terms to be matched in an input document.

Concepts file

build your own custom Concepts using SAS Content Categorization Studio <language>.concepts and <language>.concept.xml files, or you can choose to purchase these files from SAS.

Conditionals

these are codes written into the definition that defines a Special Concept. These numerals determine whether or not a sentence that contains a matched anchor word is included in the summary and the relative or absolute ranking of this sentence.

Delimiters

the characters that signal the end of a sentence or paragraph.

Entity

this is an autonomous piece of information, or a string.

Grammar Concept

defines Concepts by identifying parts of speech in an input document. Grammar Concepts can also refer to other Concepts and begin with the line #ROOT = *CODE.


Hierarchical taxonomy

contains both parent and child, or nodes and subnodes, in the taxonomy structure.

Important Concept

a branch that is imported as a hierarchical taxonomy from SAS Content Categorization Studio. Use the <language>.concepts and <language>.concept.xml files that you build in SAS Content Categorization Studio, or purchase from SAS. Also see Classifier Concept, Grammar Concept, and Regex Concepts.

Profile

determines how SAS Text Summarizer Studio uses the input documents, their sections, and sentences to create a summary.

Ranking

the relative hierarchical value of sentences that could be returned for the summary.

Regex Concepts

these Concepts are built using Teragram Regular Expressions.

Relational Concepts

identify a relationship between two Concepts that would otherwise appear to be autonomous pieces of information. For example, the person of George W. Bush and the position that he held, president, together can be identified as a relational Concept.

Simple Concepts

name a single entity, and are defined using either Classifier or Regex Concepts.

Special Concept

the terms specified in the definition for this Concept are written into the SAS Text Summarizer Studio application. These terms are located in the sentences that are used to summarize the text. Also see Anchor words and Classifier Concept.

Summary

statements covering the key points discussed in the input document.


Tokenization

these technologies break streams of text into words, sentences, and paragraphs.

Weight

the relative value of a section or sentence to the summary.


Index

.concept file
  defined ... 24
.concept.xml
  usage ... 69
.concept.xml file
  Concept taxonomy ... 24
  defined ... 6, 24
  location ... 26, 71
.concepts file
  contents ... 25, 70
  defined ... 6
  import ... 73
  location ... 26, 71
  usage ... 69
.tsa file
  create ... 61

A

abbreviations
  sentence tokenizer ... 86
Add a Profile
  drop-down menu ... 93
anchor words
  Special Concept ... 74
architecture ... 5
  compile ... 6
  runtime ... 7

B

Bottom
  sentence location ... 42, 49, 97
build
  summarizer ... 61


C

Choose a filename to save under
  usage ... 116
Classifier Concepts
  defined ... 65
  Definition tab ... 65
  Special Concept ... 74
  thesauri ... 65
Close
  File menu ... 17
CODE ... 75
Concept
  defined ... 63
Concept taxonomy
  .concept.xml file ... 24
Concepts
  defined ... 4
  summaries ... 56
Concepts file
  Predefined Concepts ... 24
Concepts taxonomy
  collapse ... 72
Copy button
  usage ... 19
Copy to Clipboard button
  usage ... 116
CPU ... 9
Cut button
  usage ... 20

D

D_RNK_WT
  Doc Rank Weight ... 115
Definition tab
  Classifier Concepts ... 65
  components ... 49
  Important Concepts ... 70
  Initial section ... 49
  Paragraph Tokenizer ... 88


  paragraph tokenizer ... 37
  Profiles ... 41
  Rank Gradient ... 49
  Rank Type ... 49, 97
  Rank Weight ... 49, 97
  Section Rank Weight ... 114
  Section Weight ... 49, 97
  Sentence Tokenizer ... 85
  Special Concept ... 29, 77
  usage ... 25, 94
directory
  permissions ... 11
Doc Rank Weight
  D_RNK_WT ... 115
  Results pane ... 114
  summary window ... 115

E

Edit
  Preferences ... 18
Enter new name
  usage ... 94
Exit
  File menu ... 17

F

File
  New ... 57
file
  distribution ... 11
File menu
  Close ... 17
  Exit ... 17
  Import ... 17
  New ... 17
  Open ... 17
  Save ... 17
  Save Project As ... 17


filename ... 122
Find button
  usage ... 20

G

Gradient
  sentence location ... 42, 49
  sentence weight ... 42
Grammar Concepts
  defined ... 67
gzip command ... 11

H

hardware
  requirements ... 9

I

ID
  defined ... 113
Ignore sentences option
  usage ... 43
IMP_WT
  defined ... 113
  Important Weight ... 116
Import
  File menu ... 17
Important Concept
  test ... 104
Important Concepts ... 25, 64
  branch ... 6, 68, 73
  defined ... 63
  Definition tab ... 70
  Important Weight ... 115
  ranking ... 29
  Special Concept ... 29, 63
  Special Weight ... 115


  taxonomy window ... 22
  Test tab ... 26, 70
Important Concepts weight
  usage ... 43
Important Weight
  IMP_WT ... 116
  Important Concepts ... 115
  Results pane ... 115
  summary window ... 116
imported thesauri
  Concepts files ... 66
Initial node
  name change ... 98
  taxonomy window ... 23
Initial section
  define ... 94, 98
  Definition tab ... 49
installation kit contents ... 10
interface
  view ... 15

L

language
  specify ... 57
LineNo
  usage ... 87

M

main window
  components ... 16
Marker field
  defined ... 96
Menu bar
  defined ... 16
Middle
  sentence location ... 42, 49, 97
Minimal sentence length
  usage ... 43


N

New
  File ... 57
  File menu ... 17
New button
  usage ... 19
New Project dialog box
  open ... 57
Next button
  results ... 113

O

Open
  File menu ... 17
Open button
  usage ... 19
operating systems
  supported ... 10

P

PAR
  defined ... 113
  Paragraph ... 115
  usage ... 110
Paragraph
  PAR ... 115
  Results pane ... 114
  summary window ... 115
Paragraph Tokenizer
  define ... 82
  Definition tab ... 88
  taxonomy window ... 23
paragraph tokenizer
  define ... 87, 118
  Definition tab ... 37
  specify ... 37


Paste button
  usage ... 20
POS
  defined ... 113, 114
Predefined Concepts
  .concepts file ... 24
predefined Concepts
  import ... 70
Preferences
  Edit ... 18
Profiles
  User interface ... 40
profiles
  create ... 56, 92, 93
  define ... 118
  defined ... 4, 91
  Rename option ... 93
  taxonomy window ... 23
  test ... 56
Profiles node
  taxonomy window ... 93
Program and Project title bar
  defined ... 16
project name
  specify ... 57
  TSA interface ... 58
project path
  specify ... 57

R

Rank Gradient
  defined ... 49
  Definition tab ... 49
Rank Type
  defined ... 49, 97
  Definition tab ... 49, 97
Rank Weight
  defined ... 49, 97
  Definition tab ... 49, 97


ranking
  Important Concepts ... 29
  Special Concept ... 29
Redo button
  usage ... 20
Regex Concepts
  defined ... 66
  Regular Expressions ... 66
Regular Expressions
  Regex Concepts ... 66
REL
  defined ... 113
Relational Concepts
  defined ... 68
  Grammar Concepts ... 68
relevancy
  determinants ... 116
Relevancy Cutoff ... 47, 99, 112
Rename
  profiles ... 93
Result
  Results pane ... 114
  summary window ... 115
Results pane
  Doc Rank Weight ... 114
  Important Weight ... 115
  Paragraph ... 114
  Result ... 114
  Section ... 114
  Section Rank Weight ... 114
  Section Weight ... 114
  sentence ... 114
  Special Concept ... 106, 107, 109
  Special Weight ... 115
  test results ... 114


S

Save
  File menu ... 17
Save button
  usage ... 19
Save Project As
  File menu ... 17
SEC
  Section ... 115
  usage ... 109
SEC_WT
  defined ... 114
  Section Weight ... 115
Section
  define ... 98
  Results pane ... 114
  SEC ... 115
  summary window ... 115
section nodes
  Definition tab ... 94
Section Rank Weight
  Definition window ... 114
  Results pane ... 114
  summary window ... 115
Section Weight
  defined ... 49, 97
  Definition tab ... 49, 97
  Results pane ... 114
  SEC_WT ... 115
  summary window ... 115
Select a Tokenizer File window ... 32, 83
SENT
  defined ... 114
  sentence ... 115
sentence
  Gradient field ... 42, 49
  Results pane ... 114
  SENT ... 115
  summary window ... 115


Sentence Tokenizer
    define ..... 81
    sentence delimiters ..... 85
    taxonomy window ..... 22
sentence tokenizer
    abbreviations ..... 86
    define ..... 85, 117
Simple Concepts
    defined ..... 68
Special Concept
    anchor words ..... 74
    Classifier Concepts ..... 74
    codes ..... 75
    defined ..... 64, 74
    Definition tab ..... 29
    Definition window ..... 77
    Important Concepts ..... 63
    ranking ..... 29
    Results Pane ..... 107
    taxonomy node ..... 29
    taxonomy window ..... 22
    Test tab ..... 26
    usage ..... 29
Special concept
    Results pane ..... 106, 109
Special Concepts
    matching terms ..... 115
Special Weight
    Important Concepts ..... 115
    Results pane ..... 115
    Special Concept terms ..... 115
    SPL_WT ..... 116
    summary window ..... 116
SPL_WT
    Special Weight ..... 116
Standard toolbar ..... 19
standard toolbar
    usage ..... 19
Start
    menu ..... 15


Status bar
    location ..... 21
    usage ..... 21
    View ..... 18
summaries
    Concepts ..... 56
    types ..... 56
Summarizer
    taxonomy window ..... 22
summarizer
    taxonomy window ..... 22
summarizer.bin file ..... 7
summary window
    Doc Rank Weight ..... 115
    Important Weight ..... 116
    Paragraph ..... 115
    Result ..... 115
    Section ..... 115
    Section Rank Weight ..... 115
    Section Weight ..... 115
    sentence ..... 115
    Special Weight ..... 116
Syntax Check button
    usage ..... 30
system configuration
    specified ..... 9

T

tar archive
    UNIX ..... 10
tar command ..... 11
tar file ..... 11
    archive ..... 11
Taxonomy window
    location ..... 16
taxonomy window
    Important Concepts ..... 22
    Initial node ..... 23
    Paragraph Tokenizer ..... 23
    profiles ..... 23


    Profiles node ..... 93
    Sentence Tokenizer ..... 22
    Special Concepts ..... 22
    Summarizer ..... 22
    summarizer ..... 22
    Tokenizer ..... 22
    Tokenizer node ..... 32
    usage ..... 21
Teragram Tokenizer
    import ..... 81
    path ..... 32, 83
test
    Concepts ..... 101
    options ..... 101
    Profiles ..... 101
    profiles ..... 56
    Sections ..... 101
    Tokenizers ..... 101
test results
    Results pane ..... 114
Test tab
    Important Concepts ..... 26, 70
    Special Concept ..... 26
    usage ..... 26
testing directory
    assemble ..... 102
testing options ..... 115
thesauri
    Classifier Concepts ..... 65
tokenization
    defined ..... 4, 56, 81
    specify ..... 56
Tokenizer
    specify ..... 117
    taxonomy window ..... 22
Tokenizer node
    taxonomy window ..... 32
toolbar
    defined ..... 16
Toolbar option
    View menu ..... 19


Top
    sentence location ..... 42, 49, 97
tsa window
    open ..... 104
Type
    sentence location ..... 42

U

Undo button
    usage ..... 20
UNIX
    tar archive ..... 10

V

View
    Status bar ..... 18
View menu
    Toolbar option ..... 19

W

weight
    defined ..... 92
    sentence location ..... 42


Your Turn

We welcome your feedback.

- If you have comments about this book, please send them to [email protected]. Include the full title and page numbers (if applicable).

- If you have comments about the software, please send them to [email protected].
