
Content Collector 3 Install Guide

Date post: 08-Nov-2014
Upload: lambert-leonard
IBM Content Collector Version 3.0 Administrator's Guide SH12-6980-00

IBM Content Collector Version 3.0

Administrator's Guide

SH12-6980-00


Note: Before using this information and the product it supports, read the information in "Notices" on page 749.

This edition applies to version 3.0 of IBM Content Collector (product number 5724-V57) and to all subsequent releases and modifications until otherwise indicated in new editions. This edition replaces SH12-6914-01.

Copyright IBM Corporation 2008, 2012. US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

ibm.com and related resources
  How to send your comments
  Contacting IBM

Part 1. Solution overview

Content Collector overview
  What's new in Content Collector Version 3.0?
  New email management features
  New source connector features
  New target connector features
  New indexing features in IBM Content Manager
  New indexing features in IBM FileNet P8
  Further enhancements

Content Collector architecture overview
  Definition of the email storage data model
  IBM Content Manager data model
  IBM FileNet P8 data model

Document archiving scenarios
  Scenario: Document archiving for storage purposes
  Scenario: Archiving journal email
  Scenario: Document retention and disposition
  Scenario: Preparing the email repository for email analytics

Part 2. Installing

Installing Content Collector
  Prerequisites for the installation
  Hardware prerequisites
  Software prerequisites
  Additional prerequisites and restrictions
  Configuration worksheets
  Configuration worksheets for the Content Collector source systems
  Configuration worksheets for Content Collector repository systems
  Configuration worksheets for the Content Collector configuration database
  Configuration worksheets for the Content Collector connectors
  Configuration worksheets for the Content Collector general settings
  Upgrading to version 3.0 of IBM Content Collector
  Upgrading specific FileNet P8 task routes for email archiving
  Additional steps for upgrading IBM Content Collector for Microsoft SharePoint
  Installing Content Collector
  Installing Content Collector for use with one or more source systems and Content Manager
  Installing Content Collector for use with one or more source systems and FileNet P8
  Installing Content Collector on several servers: scale out
  Installing individual components
  Installing or upgrading IBM Content Collector for Microsoft SharePoint
  Installing Content Collector Notes Client Extension
  Installing Content Collector Server
  Performing the initial configuration
  Verifying and adjusting the initial configuration settings
  Setting the Content Collector environment variables
  Installing Content Collector on several servers
  Configuring the web application server
  Replacing the Lotus Notes mail template in all mailboxes
  Installing Content Collector Outlook Extension
  Enabling offline repositories to allow access to archived content without network access
  Installing and configuring Content Collector Outlook Web App (formerly Outlook Web Access) support

Removing Content Collector

Part 3. Migrating

Migrating to Content Collector
  Moving from CommonStore to Content Collector
  Restubbing documents archived using IBM CommonStore for Lotus Domino
  Restubbing documents archived using IBM CommonStore for Exchange Server
  Moving from FileNet Email Manager or FileNet Records Crawler to Content Collector

Part 4. Configuring

Configuring Content Collector
  The Configuration Manager
  Enabling security in the Configuration Manager
  Signaling changes to the configuration database
  Adding, changing, or deleting configuration objects in the Configuration Manager
  Keyboard commands for Content Collector
  Setting up a configuration database
  Adding or editing data store connections
  Deleting a data store connection
  Exporting or importing a configuration database
  Starting the Task Routing Engine
  Configuring the task route service
  Checking if Content Collector is running
  Configuring the settings for LDAP lookups during task route processing
  Content Collector services
  Content Collector processes
  Providing connections for collecting and archiving documents
  Configuring connectors
  Source connectors
  Target connectors
  Utility connectors
  Configuring general settings
  Configuring Content Collector for CommonStore for Exchange Server legacy support
  Modifying the Configuration Web Service settings
  Modifying the information center settings
  Modifying the settings for the Web Application
  Modifying client configuration settings
  Configuring the access to archived data
  Modifying the settings for Content Search Services Support
  Modifying the settings for the Metadata Web Application
  Selecting the metadata form template
  Configuring the metadata form definition
  Configuring metadata and lists
  Metadata and lists
  Adding, editing and sorting lists
  Adding and editing user-defined metadata
  System metadata
  Configuring task routes
  Task routes
  Building task routes
  Sample task route templates
  Task route traits and considerations
  Working with the Expression Editor
  Using extended processing functions
  Collecting documents for archiving or processing
  Configuring tasks
  Using the setup tools
  Configuring an IBM Content Manager repository
  Configuring the Domino environment for Content Collector
  Enabling a Domino template for Content Collector
  Enabling an IBM Content Manager repository for processing by the indexer for text search
  Configuring an IBM FileNet P8 repository
  Enabling the access to archived data
  About collections
  Enabling search for email documents
  Enabling searching for documents archived by IBM CommonStore for Lotus Domino
  Enabling searching for messages archived by IBM CommonStore for Exchange Server
  Customizing search and result fields
  Setting a default date range for the Email Search page
  Changing the preview mode for Outlook
  Enabling access to IBM Connections documents
  Enabling access to File System or Microsoft SharePoint documents
  Handling erroneous documents
  Blacklist
  Enabling Microsoft Outlook links
  Securing Content Collector communications
  Replacing certificates for the embedded web application server
  Client communication
  URL protection

Part 5. Tutorials

Content Collector file system tutorials
  Archiving file system documents to FileNet P8
  Moving documents off the network into IBM FileNet P8
  Detecting and processing duplicates, searching for archived and stubbed documents, and declaring documents as records
  Defining metadata to be used to process files for archiving

Part 6. Developing

Developing with the Content Collector APIs
  Creating requests for interactive archiving
  Document states
  Developing with the Content Collector Web Application services APIs
  RestoreAPI
  ViewingAPI
  Enabling security for the Web Application services APIs
  Developing with the Document Viewer
  The Document Viewer configuration files
  Document Viewer requests
  Configuring Workplace or Workplace XT for the use of the Document Viewer

Part 7. Monitoring

Monitoring Content Collector system performance
  Using the system dashboard
  Information monitored in the system dashboard
  Using performance reporting
  Performance reporting database tables
  Using performance counters
  Performance counters
  Tracking system log files
  What logs to track
  File format and naming conventions for system log messages in Content Collector
  Log levels
  Using audit logs
  Using event logs
  Interpreting event logs
  Deleting event logs
  Event IDs

Part 8. Troubleshooting and support

Troubleshooting Content Collector
  Retrieving version information
  Collecting troubleshooting data on Windows
  Troubleshooting installation
  Troubleshooting scale out mode
  The installation of the web applications failed
  The installation, upgrade, or removal of Content Collector for Microsoft SharePoint failed
  Creating the Content Collector configuration database on remote server fails
  The connection to the configuration database fails
  The connection to the Oracle database fails
  Memory issues when running the initial configuration or the set-up tools
  IBM FileNet P8 validation fails using HTTPS connection in Initial Configuration/Setup Tools
  The CommonStore server and the CSLD tasks fail to start
  Troubleshooting configuration
  Troubleshooting source systems
  Troubleshooting target repositories
  Troubleshooting components
  Troubleshooting task routes
  Identifying document processing errors
  Relevant event logs
  Checking event logs
  Checking if documents were collected
  SharePoint farm or web application collection fails for some site collections
  Checking whether the source system connector started
  Checking whether the source system collector started
  Checking whether documents were submitted to the IBM Content Collector Task Routing Engine service
  Checking whether documents were received by the IBM Content Collector Task Routing Engine service
  Checking whether the expected task route was assigned
  Checking the IBM Content Collector deployment
  Checking whether any tasks failed
  Identifying whether metadata is missing
  Checking whether the connector stopped
  Analyzing task connector logs

Part 9. Appendixes

Notices
Index

ibm.com and related resources

Product support and documentation are available from ibm.com.

Support and assistance

Product support is available on the web. Click Support from the appropriate product website.

IBM Content Collector
    http://www-01.ibm.com/software/data/content-management/contentcollector/
IBM Email Archive and eDiscovery Solution
    http://pic.dhe.ibm.com/infocenter/email/v3r0m0/index.jsp
IBM CommonStore for Exchange Server
    http://www.ibm.com/software/data/commonstore/exchange/
IBM CommonStore for Lotus Domino
    http://www.ibm.com/software/data/commonstore/lotus/
IBM Content Manager
    http://www.ibm.com/software/data/cm/cmgr/mp/
IBM FileNet P8
    http://www.ibm.com/software/data/content-management/filenet-p8platform/
IBM Enterprise Records
    http://www.ibm.com/software/data/content-management/filenet-recordsmanager/
IBM Records Manager
    http://www.ibm.com/software/data/cm/cmgr/rm/
IBM WebSphere Application Server
    http://www.ibm.com/software/webservers/appserv/was/
Lotus Notes and Domino
    http://www.ibm.com/software/lotus/notesanddomino/

Information center

You can view the IBM Content Collector product documentation in an Eclipse-based information center. See the information center at http://pic.dhe.ibm.com/infocenter/email/v3r0m0/index.jsp.

PDF publications

You can view a PDF version of the IBM Content Collector installation and configuration guide by using the Adobe Acrobat Reader for your operating system. The guide is available from the IBM Publications Center. If you do not have the Acrobat Reader installed, you can download it from the Adobe website at http://www.adobe.com.


How to send your comments

Your feedback is important in helping to provide the most accurate and highest quality information. Send your comments by using the online reader comment form at https://www14.software.ibm.com/webapp/iwm/web/signup.do?lang=en_US&source=swg-rcf.

Contacting IBM

To contact IBM customer service in the United States or Canada, call 1-800-IBM-SERV (1-800-426-7378).

To learn about available service options, call one of the following numbers:
- In the United States: 1-888-426-4343
- In Canada: 1-800-465-9600

For more information about how to contact IBM, see the Contact IBM website at http://www.ibm.com/contact/us/.


Part 1. Solution overview


Content Collector overview

IBM Content Collector archives email and other digitized content in an external, central repository. Additional functions enable users to reduce the size of their mailboxes, reclaim space on their hard drives and Microsoft SharePoint servers, search for email in the repository, and restore archived email to their original locations.

Archiving

You can archive content from various sources. These include:
- Mailboxes on Lotus Domino or Microsoft Exchange servers
- Email that is received through the Simple Mail Transfer Protocol (SMTP)
- Microsoft Exchange public folders and PST files
- Lotus Domino applications and local NSF archives
- Microsoft SharePoint sites
- IBM Connections content
- Documents in NTFS, DFS, and Novell file systems

Archiving means that the content of these documents is processed and then stored in a central repository.

Terminology: IBM Content Collector uses documents as a generic term for email, messages, Microsoft SharePoint items, IBM Connections items, and file system documents.

The central repository provides a single access point for all business-relevant documents, which means that sensitive data can be better controlled. Various security features are in place for the protection of business documents.

Archiving methods include automatic and interactive archiving.
- Automatic archiving means that an administrator centrally sets up an archiving schedule and selects the sources from which to archive content, such as email servers, applications, user groups, Microsoft SharePoint sites, IBM Connections deployments, or storage systems.
- Interactive archiving on the client side enables Notes and Outlook client users to flag documents for archiving. Documents flagged by email client users are selected for archiving the next time the scheduled archiving process runs. Users can also specify additional archiving information before the documents are archived.

When archiving email documents, IBM Content Collector always archives the entire email content, including the attachments. You can configure which parts are removed from the original document after it is archived, and when this happens. You can select documents from all connected mail clients, or from just a subset, according to predefined criteria, such as the size of mail databases, the age of documents, and so on.

You can copy or move documents, including their attachments, from multiple Microsoft SharePoint sites, a single site, or selected libraries and lists. You can filter the archive collection based on content types or through additional task route filters, and map your custom site columns to corresponding metadata in your repository.

Content from multiple IBM Connections applications, from one or several deployments, can be copied to a repository. The collections of content can be filtered by users.

File system documents can be processed depending on metadata and stored in a specific repository folder structure to facilitate search and retrieval.

Accessing Content

The preview and restore functions allow your client users to view and restore archived documents from the central repository, especially in cases where the archived content has been removed from the original documents.

Client users can access the archived material in different ways depending on their source system. Email documents can be previewed or restored through links and hot spots provided in stub documents or through a web-based search interface, while documents from the file system or Microsoft SharePoint can be previewed through direct links.

In IBM Content Collector, access to archived content is restricted. For email, access to a link is provided by the security of the user's mailbox, meaning the user will see only what the mailbox allows. For file system and Microsoft SharePoint, access to a link is determined by the user's access to the document's location within the file system or SharePoint list. Access to document content is also possible when using a repository client, either custom-built or out-of-the-box, where the credentials of a repository user are applied against a document's security to determine access.

In IBM Content Collector, file system or SharePoint links can also be defined as secure links. Clicking a secure link prompts the user for specific user permissions to view the document content.

Restriction: Archived SharePoint items cannot be restored from secure links.

To remove the content of restored email in IBM Content Collector, you can define a schedule. This process is referred to as restubbing.
Search (email)

Installing and configuring the search functionality adds a search interface to the connected Lotus Notes or Outlook clients. From this interface, users can start full-text queries to search for archived content. The content of archived attachments is included in the search.

For security reasons, the search capability is limited. Archive users can only search for content that was archived from their mailbox. They cannot search or restore content that is owned by other users. However, they can search content that was archived from mailboxes to which they have delegate access. For example, if an assistant has been given delegate access to a manager's mailbox, the assistant can search for content that was archived from this mailbox. Similarly, users can search the content that was archived from any Microsoft Exchange PST file or Notes Storage Facility (NSF) file that was assigned to them before it was archived.

Users can also search for email metadata. This is information that resides in fields of the original email, such as the sender, recipient, or subject field. The information in these fields is extracted during the archiving operation and stored in corresponding fields in the repository. You can customize the list of email fields that you want to extract metadata from. Note, however, that metadata searches require the user to have a deeper understanding of the data in these fields.

There is also a preview function. If a document looks promising in the result list, a user can select it to display its content in a web browser window. The search text is highlighted. If the document shows the desired content, users can click a Restore button to copy the content to an email document in their mailboxes.

Search (Microsoft SharePoint, IBM Connections, and file systems)

Microsoft SharePoint and file system document stubs contain all of the metadata related to a document. Users can perform a metadata search to locate documents. For all documents that were archived using the File System Connector, users can view the content by clicking the stub links.

To search by content for Microsoft SharePoint, IBM Connections, and file system documents, users can use their target repository clients. For Microsoft SharePoint, if the target repository is FileNet P8, they can use IBM FileNet Connector for Microsoft SharePoint Web Parts to search by content. For searching by metadata for documents in a file system repository, users can apply the standard search tools provided by Windows.

Document life cycle

IBM Content Collector enables you to implement a range of document retention strategies, from simple deletion after processing to a formal declaration of documents as records in IBM Enterprise Records.

You can remove parts of archived email documents or Notes application documents step by step from the original document until, finally, the entire content is deleted. The removal of document content frees up space in the users' mailboxes or databases, and on the servers of your content management system.

In Microsoft SharePoint source systems, you can replace entire documents with links to the archived document in the target repository. You can later update outdated links and remove orphan links from target repositories.
To configure the email document life cycle, you can define a stubbing life cycle. Stubbing means converting a document to a stub. A stub is a document from which parts of the content have been removed. For example, your stubbing life cycle might instruct IBM Content Collector to remove email attachments one week after the mail content has been archived. A second instruction in the schedule of the stubbing life cycle removes the main text or email body after four weeks so that just an empty shell of the original email remains. Finally, the stubbing schedule can be set up to delete the entire mail.

The stubbing function can insert links in these email stub documents after archiving, which enable users to view the archived content with a mouse click. In addition, IBM Content Collector can be configured to insert brief texts in the original email that inform users that a particular piece of content has been archived and removed. A separate task route can be configured to delete orphaned stubs, that is, stub documents for content that has been deleted from the archive.
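The staged removal described above can be pictured as a small schedule that is evaluated against the age of an archived email. The following Python sketch is only an illustration of that idea and is not how IBM Content Collector is configured or implemented. The names StubbingStep, LIFE_CYCLE, and due_actions are hypothetical, and the 52-week delay for the final deletion is an assumption, because the text gives no time for that step.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class StubbingStep:
    """One instruction in a (hypothetical) stubbing life cycle."""
    after: timedelta  # time that must have passed since archiving
    action: str       # illustrative action label, not a product setting


# Schedule mirroring the example in the text. The 52-week value for the
# final deletion is an assumption; the text does not give a delay for it.
LIFE_CYCLE = [
    StubbingStep(timedelta(weeks=1), "remove_attachments"),
    StubbingStep(timedelta(weeks=4), "remove_body"),
    StubbingStep(timedelta(weeks=52), "delete_email"),
]


def due_actions(archived_on: date, today: date, steps=LIFE_CYCLE) -> list[str]:
    """Return the actions whose delay has elapsed, earliest step first."""
    age = today - archived_on
    ordered = sorted(steps, key=lambda step: step.after)
    return [step.action for step in ordered if age >= step.after]


if __name__ == "__main__":
    # Eight days after archiving, only the attachments are due for removal.
    print(due_actions(date(2012, 1, 1), date(2012, 1, 9)))
```

Keeping the schedule as data rather than as branching logic makes it easy to see that each step only ever removes more of the original document than the step before it.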


Related concepts:
- Content Collector architecture overview on page 15
- Scenario: Preparing the email repository for email analytics on page 26
- Scenario: Document archiving for storage purposes on page 23
- Scenario: Archiving journal email on page 24
- Scenario: Document retention and disposition on page 25

Related information:
- IBM Content Collector website

What's new in Content Collector Version 3.0?

IBM Content Collector Version 3.0 provides the following new features. For the most current software requirements, including versions, see the System Requirements technote at http://www.ibm.com/support/docview.wss?uid=swg27024229.

New email management features

IBM Content Collector Version 3.0 provides the following new email management features.

Email management enhancements

Lotus Domino: Configure which icons to use for showing the document state
When you customize the Lotus Domino template for IBM Content Collector, you can choose to use IBM Content Collector icons to represent the document state. The Content Collector icons then overwrite the default icons displayed in the attachment icons column. If you choose not to use the IBM Content Collector icons, the original icons are preserved.

Lotus Domino: Enable the template with basic IBM Content Collector functions only
You can choose to enable the Domino template with basic Content Collector functions only. These functions are required for archiving but are invisible to the user. This means that the client menu does not contain any Content Collector elements that provide functions for searching and restoring documents or for collecting additional archiving information. However, basic functions like automatic archiving or automatic retrieval of documents when they are opened are available.

Lotus Domino: Default setting includes message types in all mailbox management task route templates
You can now specify message types to include, and not only message types to exclude. The default setting in all mailbox management task route templates is now to include all message types.

Microsoft Exchange: Show the archiving status in Outlook
You now have the option to have the archiving status of messages shown in Microsoft Outlook. You can add or remove an additional column that indicates the archiving status in the Outlook folder.

Microsoft Exchange: Ribbon support in Outlook
The ribbon style for IBM Content Collector functions in Microsoft Outlook 2010 is supported.


Legacy restubbing
Documents that were archived using IBM CommonStore and are restored in IBM Content Collector can be restubbed.

SMTP Connector enhancement
The SMTP Connector now supports business process management and content classification scenarios. You can configure the SC Prepare Email for Archiving task to include the attachments of the document in the temporary files if the temporary files are used for business process management or as input for the IBM Content Classification task.

Private items
You can now explicitly exclude private items from being archived or, if private items are archived, you can limit delegate access to archived items that are not marked private.

Cleanup of orphaned stubs
New task route templates enable you to check mailboxes for orphaned document stubs. IBM Content Collector checks whether the document to which the document stub points still exists in the archive. If no associated document is found, the document stub is deleted.

Enhanced blacklist UI
You can now filter the blacklist to display only those entries that meet specified criteria.

Archiving local files in a scale-out environment
IBM Content Collector now supports archiving local files (NSF and PST) in a scale-out environment. PST or NSF files are processed by one dedicated node.

Enhancements to the EC Copy to Mailbox task
The EC Copy to Mailbox task has been renamed to EC File Email in Mailbox Folder and now supports additional use cases. In addition to copying Microsoft Exchange messages from a local archive to the associated mailbox, messages can now also be copied or moved to a configurable folder that can be built from metadata. For example, the folder name can be created from IBM Content Classification metadata. Microsoft Exchange messages are copied; Lotus Domino messages are moved.

Enhanced stubbing options
The EC Create Email Stub task can now be configured to treat embedded attachments as part of the email body so that you can control whether embedded attachments are removed when attachments are removed or when the body of a message is truncated.

New source connector features

IBM Content Collector Version 3.0 provides the following new source connector features.

File System

File re-collection
You can collect new versions of files and have them added to IBM Content Manager or IBM FileNet P8 as new versions.

Cleanup of orphaned stubs
A new stub collector enables you to set up task routes for checking file systems for orphaned stubs. IBM Content Collector checks whether the document to which the stub points still exists in the archive. If no associated document is found, the stub is deleted. Stubs that are created with IBM Content Collector V3.0 contain the ID of the FileNet P8 or IBM Content Manager repository into which the document was archived. This ensures that the correct repository is accessed, thus preventing unintentional deletion of stubs. For stubs that were created with earlier versions, you can set the repository ID manually in the respective tasks.

Metadata file collector
To collect metadata files describing large numbers of content files, you can now configure task routes with a specific metadata file collector. The metadata file collector combines some of the functionality of the FSC Associate Metadata task with the functionality of the file system collector. Working with a metadata file collector reduces memory requirements, makes better use of CPU, and ensures that the status of the metadata file is tied to the status of the documents.

XML-based mapping of properties for files
XML namespaces are now supported.
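The orphaned-stub check described above for file system stubs comes down to a lookup per stub: resolve the repository the stub refers to, then test whether the archived document still exists there. The Python sketch below illustrates that logic under stated assumptions and does not reflect the product's actual task implementation. The Stub class, the find_orphaned_stubs function, and the dictionary that stands in for the repositories are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stub:
    """Illustrative stand-in for a stub file; not the product's stub format."""
    path: str
    document_id: str
    repository_id: Optional[str] = None  # missing in stubs from earlier versions


def find_orphaned_stubs(stubs, archives, default_repository_id=None):
    """Return stubs whose archived document no longer exists.

    archives maps a repository ID to the set of document IDs it holds
    (a hypothetical in-memory substitute for a real repository query).
    default_repository_id plays the role of the manually set repository
    ID for stubs that do not carry one.
    """
    orphaned = []
    for stub in stubs:
        repo_id = stub.repository_id or default_repository_id
        if repo_id is None or repo_id not in archives:
            # Without a known repository the check cannot be made safely,
            # so the stub is skipped rather than reported for deletion.
            continue
        if stub.document_id not in archives[repo_id]:
            orphaned.append(stub)
    return orphaned


if __name__ == "__main__":
    archives = {"P8-A": {"d1"}, "CM-B": {"d2"}}
    stubs = [
        Stub("a.txt", "d1", "P8-A"),  # document still archived
        Stub("b.txt", "d2", "P8-A"),  # d2 is not in P8-A, so this is orphaned
        Stub("c.txt", "d2"),          # no repository ID stored in the stub
    ]
    for stub in find_orphaned_stubs(stubs, archives, default_repository_id="CM-B"):
        print(stub.path)
```

Skipping stubs whose repository cannot be determined reflects the concern stated in the text: checking against the wrong repository could make a valid stub look orphaned.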

Microsoft SharePoint

The following new features have been added to Microsoft SharePoint support:

Collection levels and depth
You can now configure a collection source to begin collection at the site, web application, or farm level. In addition, you can specify how deep you want to delve into a level. These features eliminate the need to create multiple site connections to traverse multiple web applications, sites, and subsites. You can simply begin the collection process at the farm or web application levels and collect to any depth that you choose.

Library or list type filtering
You can now filter the collection process by selecting the library and list types that you want to collect.

User filtering
You can filter the collection process to select only content touched by specific users.

Library and list types
All library and list types are now supported.

Column support enhancement
All column data types are now supported and mapping options have been added.

Re-collection enhancements
It is no longer necessary to add an additional SP Collector when configuring re-collection. In addition, re-collection is enabled automatically during the installation process.

Restore from link
You can now restore a document from the target repository using a check out operation in SharePoint.

Task route template enhancements
The FileNet P8 and Content Manager Version Series templates now include list attachments.
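The collection level and depth settings described above can be thought of as a depth-limited walk over the farm, web application, site, and subsite hierarchy. The following Python sketch illustrates that idea only. The Node class, the collect function, and the sample hierarchy are hypothetical and do not correspond to the SharePoint connector's configuration or API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical element of the hierarchy: farm, web application, site, or subsite."""
    name: str
    children: list["Node"] = field(default_factory=list)


def collect(root: Node, max_depth: int) -> list[str]:
    """Return the names reached from root, descending at most max_depth levels."""
    visited = []
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        visited.append(node.name)
        if depth < max_depth:
            # Push children in reverse so they are visited in their listed order.
            for child in reversed(node.children):
                stack.append((child, depth + 1))
    return visited


if __name__ == "__main__":
    farm = Node("farm", [
        Node("webapp1", [Node("siteA", [Node("subsiteA1")]), Node("siteB")]),
        Node("webapp2"),
    ])
    # Starting at the farm level with depth 2 reaches sites but not subsites.
    print(collect(farm, 2))
```

Starting the same walk at a web application or site node instead of the farm node corresponds to choosing a lower collection level.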


IBM Connections

This is a new source connector. IBM Connections support comprises the following feature:

Application support
You can now capture and archive content from IBM Connections applications: profiles, activities, wikis, blogs, files, bookmarks, and forums.

New target connector features

IBM Content Collector Version 3.0 provides the following new target connector features.

IBM Content Manager

Hierarchical folders
IBM Content Collector now supports the use of hierarchical folders in IBM Content Manager version 8.4.3 or later.

Dynamic ACL support
In addition to selecting from the access control lists (ACLs) that are available on the Content Manager server, you can now select to create a new ACL based on Content Collector ACL metadata or to define an expression to dynamically select an ACL.

Support for additional document model parts
The support for the IBM Content Manager document model has been enhanced to allow more flexibility in part selection.

IBM FileNet P8

Indexing with IBM Content Search Services
New tasks are available that support archiving email into a FileNet P8 repository that is configured to use IBM Content Search Services as its indexer.

Indexing of additional document properties with IBM Legacy Content Search Engine
The IBM Legacy Content Search Engine (formerly known as Verity or Autonomy) style sets were updated so that additional document properties are indexed into a separate zone.

Mime type mappings
You can now configure mime type mappings in Configuration Manager. Mappings that you configured in previous versions of IBM Content Collector are preserved.

Maintenance task
The configuration for the XIT consolidation task is now viewable and editable in Configuration Manager. Additionally, you can now configure a schedule for this task.

New indexing features in IBM Content Manager

Indexing in IBM Content Manager using IBM Content Collector Text Search Support provides the following new features.


Indexing features added in IBM Content Collector V2.2.0.2

Additional processing in afuEnableItemType to support recognizing the TIEFLAG value in IBM CommonStore resource item types
When you run the enable item type tool called afuEnableItemType on IBM CommonStore resource item types, the table of completed tasks is automatically filled with all the items that were already indexed. The IDXRC value that is given to these items correlates with the TIEFLAG value that defines which item parts are text-searchable.

New configuration option added to the indexer process that acknowledges the TIEFLAG value used in IBM CommonStore
A new indexer configuration option has been added to the indexer process that fills the TIEFLAG column in the item type component table in IBM Content Manager.

Changes to the -reindexwarnings command-line argument used with afuIndexer
The -reindexwarnings argument of the afuIndexer tool ignores items that were indexed by the fast indexer or the standard IBM Content Manager indexer with the IBM Text Search user exit and have an imported IDXRC value of between 10 and 19.

New command-line option for afuEnableItemType to change the UDF buffer size
A new command-line option called -udftransferbuffersize was added to afuEnableItemType that you can use if you need to specify a different size of the buffer which the AFUFetchFile UDF uses to load the temporary XML files for access by Net Search Extender.

New indexer configuration option for handling items with severe errors
By setting the configuration option IdxProcessSevereErrors to 1, items that might have caused the indexer worker process to stop unexpectedly are not moved to the table of completed tasks with an IDXRC of 200, but instead will be processed again without the embedded attachments the next time afuIndexer runs.

Performing index validation and repair operations
The indexer for text search index tool called afuIndexTool offers useful index operations that can be applied to an index to check for inconsistencies or can be used to update the index database tables to accommodate the IBM CommonStore TIEFLAG feature.

Reindexing archived Lotus Domino mail documents that were not indexed correctly
To identify those documents that might be affected and might need to be reindexed, use the tool named afuRepairCSN. This tool must be run on all item types containing Lotus Domino mail documents that were archived using an IBM Content Collector Server version between 2.1.1.1 LA006 and 2.1.1.3 LA006.

Searching for encrypted email in the index
A warning search notification message string called IcmFceWarning: IcmDocIsEncrypted is indexed when encrypted email (Exchange, Domino, and SMTP/MIME email) is processed by the indexer. The content of encrypted email cannot be indexed. Using the search message string, you can search for all encrypted email, decrypt the email, and reindex the email if you want to index the email content.


Indexing embedded MSG files
The textual content of embedded MSG files, even recursively embedded MSG files, can now be indexed. This means that the notification string IcmFceWarning: IcmUnhandledEmbeddedMsg is no longer used and cannot be searched. After you have applied the fix pack, search for all items in the index that have the notification string IcmFceWarning: IcmUnhandledEmbeddedMsg and reindex these items.

Indexing IBM CommonStore Content Manager document item types
You can index items in IBM CommonStore item types for the IBM Content Manager document model GENERIC_MULTIDOC and GENERIC_MULTIPART and the archiving type entire and attachment. Before you can use a CommonStore document item type in IBM Content Collector, the item type must be enabled for use in Content Collector.

Indexing features added in IBM Content Collector V3.0

New indexer command-line arguments for reindexing items that were indexed with search strings
Reindexes only those documents that were indexed during a previous indexing run and where the specified string was indexed with the document.

New indexing mode that processes items with severe errors only
You can run afuIndexer in a special mode in which only those items that were processed in an earlier indexing run and resulted in a severe error are reprocessed. This mode uses configuration settings that are optimized for handling error situations, not tuned for performance and high throughput.

Additional support for item types containing IBM Connections documents
The indexer for text search supports processing and indexing of IBM Connections documents.

Support for Microsoft SharePoint item types that are created using the data model with embedded attachments
The indexer for text search supports processing and indexing of Microsoft SharePoint documents in item types created using a new data model that supports handling embedded attachments.

Index validation runs in parallel mode
To increase performance, the index validation tool afuIndexTool runs index validation operations in parallel.

New indexing features in IBM FileNet P8

IBM Content Collector supports indexing in IBM FileNet P8 using both IBM Legacy Content Search Engine and IBM Content Search Services. The following new features have been added:

Indexing using the IBM Content Search Services indexing engine

IBM Content Collector P8 Content Search Services Support
IBM Content Collector P8 Content Search Services Support is an optional document constructor plug-in in IBM Content Search Services for custom preprocessing of all documents archived by using IBM Content Collector other than file system documents.


Further enhancements

IBM Content Collector Version 3.0 provides the following additional features.

Configuration Manager

Enhanced resilience
When the connection to the configuration database becomes invalid, Configuration Manager automatically reconnects to the database after connectivity is restored.

Select more than one task route in the Explorer view
You can now perform actions on more than one task route simultaneously.

Email clients

The IBM Content Collector email clients now provide access to Content Collector client help documentation. For Microsoft Outlook, Outlook Web App, Lotus Notes, and Lotus iNotes the help documentation is available online. For Microsoft Outlook and Lotus Notes, the help documentation is available offline as well.

Expiration Manager

Support for the IBM Content Search Services data model in IBM FileNet P8
The Expiration Manager now supports the FileNet P8 data model for IBM Content Search Services.

Improved performance
The performance of the Expiration Manager has been improved. Additional configuration options provide flexible control and allow for multi-thread processing.

Metadata

Consolidate user-defined metadata and file system metadata
The mechanism for specifying property mappings for files has been changed to make the configuration consistent with the configuration of user-defined metadata for email and for Microsoft SharePoint documents. You can now set up file system metadata as user-defined metadata properties. The properties are later mapped within a task route, in the FSC Associate Metadata task.

Lists
You can now import and export list values.

Monitoring

Performance reporting
The new performance reporting component gathers statistical data about the performance of your IBM Content Collector installation. You can use the report viewer to generate a performance report from this data and display it.

Additional performance counters
IBM Content Collector now provides additional performance counters for system monitoring.

Search using IBM Content Collector

Sorting the result list
You can now sort the search result list by any column.


Task route processing

Performance improvement
The IBM Content Collector Task Routing Engine service is much more efficient than in previous versions. This was achieved by reimplementing the thread pool and work queuing mechanism.

Viewing documents in IBM FileNet Workplace or IBM FileNet Workplace XT

You can now configure IBM FileNet Workplace or IBM FileNet Workplace XT for viewing archived documents with the Document Viewer. The following redirections for viewing archived documents in IBM FileNet Workplace or IBM FileNet Workplace XT are no longer supported:

# BRI file view redirect
application/csbundled=/postRedirect?{QUERY_STRING}&redirectUrl=https://:11443/AFUWeb/CsnViewer.do

# CSN file view redirect
application/icccsn=/postRedirect?{QUERY_STRING}&redirectUrl=https://:11443/AFUWeb/CsnViewer.do


Content Collector architecture overview

IBM Content Collector consists of several components, which interact with components of your Microsoft Exchange, Lotus Domino, NTFS, DFS, and Novell file systems, Microsoft SharePoint and IBM Connections environments, and repository servers. See the diagram.

[Diagram labels: target repositories (IBM Content Manager with IBM Information Integrator for Content, IBM FileNet P8 with Content Engine Web Service); IBM Content Manager connector; IBM FileNet P8 connector; metadata form connector; Derby database; Configuration Manager; Task Routing Engine; web application server (search, view, restore, specify additional archiving information); source connector; sources (Microsoft Exchange, Lotus Domino, Microsoft SharePoint, IBM Connections, files, SMTP email); Outlook, Notes, and SharePoint clients.]

Figure 1. Interaction diagram including IBM Content Collector components, email clients, email servers, Microsoft SharePoint, IBM Connections, file systems, and repository servers

Source system
A system that contains documents that you want to collect with IBM Content Collector. This can be Microsoft Exchange, Lotus Domino, SMTP email, NTFS, DFS, and Novell file systems, or Microsoft SharePoint or IBM Connections environments.

Source connector
A source connector provides an interface to a third-party system that contains documents that you want to work with in IBM Content Collector. It is responsible for the communication between email servers, file servers, Microsoft SharePoint, or IBM Connections and IBM Content Collector. Documents that are routed to IBM Content Collector for archiving pass this layer before they are processed and stored in a repository.

Target connector
A target connector provides an interface to the third-party system that serves as the target repository for IBM Content Collector. It is responsible for the communication between an IBM Content Manager repository, an IBM FileNet P8 repository, or a File System repository, and IBM Content Collector. Documents that are routed from IBM Content Collector for archiving pass this layer before they are stored in a repository.

Task Routing Engine
A service that monitors most of the collector services that run in IBM Content Collector.

Configuration Manager
A graphical user interface for the administration of IBM Content Collector.

Web application server
The IBM Content Collector web application server. This can be the embedded web application server or an external web application server.

Metadata Form Connector
A connector to a database where metadata is stored temporarily.

Text Extraction Connector
An interface to the Oracle Outside In Technology filters, which are used to convert binary data, for example from email attachments, into a plain-text representation.

Utility Connector
A container for those tasks that provide the intrinsic functions of IBM Content Collector.

Derby database
A temporary storage for any additional archiving information that a user specified when manually archiving a document.

Related concepts:
• Definition of the email storage data model
• Content Collector overview on page 3
• Scenario: Preparing the email repository for email analytics on page 26
• Scenario: Document archiving for storage purposes on page 23
• Scenario: Archiving journal email on page 24
• Scenario: Document retention and disposition on page 25

Related reference:
• Additional prerequisites and restrictions on page 34

Related information:
• IBM Content Collector website

Definition of the email storage data model

IBM Content Collector uses a prescriptive email storage data model for compliance archiving, space management, and duplicate management. The benefits of such a data model are that it supports ingestion of high volumes of email, enables effective deduplication on email and email attachments across multiple email sources, and that it supports searches across the entire content of the email and electronic discovery by using IBM eDiscovery Manager.


The email data model describes how Content Collector stores email in the repository: the entire content (email body, all attachment text, and metadata), deduplicated instances, and searchable XML. However, in business process management scenarios or in cases where search across the entire email content is not required, you do not have to work with the Content Collector email data model.

IBM FileNet P8 now supports two content search engines: IBM Content Search Services and IBM Legacy Content Search Engine (formerly Autonomy K2 or Verity). Therefore, an additional email data model had to be introduced. Content Collector now offers these email data models for archiving into a FileNet P8 repository:
• FileNet P8 data model for IBM Legacy Content Search Engine (also referred to as IBM Legacy Content Search Engine data model)
• FileNet P8 data model for IBM Content Search Services (also referred to as IBM Content Search Services data model)

The IBM Legacy Content Search Engine data model was enhanced to allow for a different way to create and update the XML Instance Text (XIT) object, which contains the email content to be indexed for text search. These changes improve resilience in the processing of duplicate email documents and of email documents that failed to be processed completely in a previous archiving attempt.

The IBM Content Search Services data model is a simplified data model that not only supports the new FileNet P8 content search engine but also goes without an XIT object, thus saving database and file storage.

There is no formal data model for Microsoft SharePoint, IBM Connections, and File System documents. IBM Content Collector offers a sample repository configuration for each. You can choose not to use the samples at all or choose to use some of the properties from the samples, depending on your business case.

IBM Content Manager data model

In IBM Content Manager, all documents archived using IBM Content Collector are stored in item types. You must have at least one IBM Content Manager item type for each source system that you configure in IBM Content Collector. Deduplication on email, Microsoft SharePoint, and File System documents is only available within one and the same item type and not across item types.

Email is stored in an email item type. The email item type is an IBM Content Manager resource item type containing one or more distinct email instances (DEIs). A DEI is the root item and is the common binary email object in one of these formats:
• Notes binary (CSN) format
• Multipurpose Internet Mail Extensions (MIME) format
• Microsoft Exchange mail document (MSG) format

The root holds all common email data and attributes that are shared across all instances of the email. It contains the hash that is used to ensure that the email is stored only once in the repository. The DEI is the item that is required by an application, for example, in a workflow process, for records management or for viewing purposes.

A DEI has two child components:
• The email instance (EI) child that tracks the references of all duplicates of the same email archived from different mailboxes or the journal. It contains the varying properties, that is, the properties of each email duplicate that are needed to restore each individual copy of the email. For journal archiving, the varying properties contain the additional journal attributes produced during the journal process.
• The attachment instance (AI) child that tracks the references to the email attachments that are archived separately. As an email can have multiple attachments, this reference child can have none, one, or many entries pointing to attachments. Not only are the references to the attachments stored but also additional metadata required for viewing and restoring the email with its attachments, for example, the attachment file name and a correlation key which is used to restore the attachment to the original location in the email.

[Figure: email item type. Labels: DEI (hash, common properties, search result list properties) with email object and text index; two EI children (varying properties, varying properties from journal); AI child.]

When a DEI is removed, all associated objects are removed as well. To prevent accidental deletion of the DEI, for example, by a client user, the expiration date is monitored, and removal is allowed only if the current date is past the expiration date.

Email attachments are stored in an attachment item type. The attachment item type is a resource item type and can contain attachments from different email source system item types. The attachment item type contains one or more distinct attachment instances (DAIs). A DAI represents the attachment object itself and is the master object that controls the deletion of the associated content and objects. A DAI is referenced by one or more AIs from an email instance (a DEI). A DAI can only be removed if no other instances are pointing to it. The only attribute required by a DAI is the hash, which is used to calculate a unique deduplication hash key. This key ensures that only one copy of the attachment is kept in one item type, no matter how many times the same attachment was archived by different users.


[Figure: attachment item type. Labels: AI references pointing to a DAI (hash) with its attachment object.]
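The relationship between AIs and DAIs described above amounts to hash-based deduplication with reference tracking. The following Python sketch illustrates the idea only; it is not Content Collector code. The class AttachmentStore, its method names, and the choice of SHA-256 are assumptions made for this example, because the guide does not name the hash function.

```python
import hashlib


class AttachmentStore:
    """Toy model of an attachment item type: one DAI per distinct content hash."""

    def __init__(self):
        self.dais = {}   # hash key -> attachment bytes (stands in for the DAI)
        self.refs = {}   # hash key -> set of email IDs that reference it (the AIs)

    def archive(self, email_id, content):
        # SHA-256 is an assumption; any stable content hash would serve here.
        key = hashlib.sha256(content).hexdigest()
        self.dais.setdefault(key, content)              # keep the bytes only once
        self.refs.setdefault(key, set()).add(email_id)  # record the reference
        return key

    def release(self, email_id, key):
        """Drop one reference; remove the DAI only when nothing points to it."""
        self.refs[key].discard(email_id)
        if not self.refs[key]:
            del self.refs[key]
            del self.dais[key]
            return True
        return False


store = AttachmentStore()
k1 = store.archive("mail-A", b"quarterly report")
k2 = store.archive("mail-B", b"quarterly report")
```

In this sketch, both emails receive the same key and the attachment bytes are held once. Calling release for the first email leaves the stored attachment in place; only the release of the last reference removes it.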

IBM FileNet P8 data model

In IBM FileNet P8, all documents archived using IBM Content Collector are stored as document objects in an object store. The object store must be dedicated to archiving with IBM Content Collector. The same object store can be used to store email, Microsoft SharePoint, and File System documents, and with object stores that are configured for use of IBM Content Search Services also IBM Connections documents.

FileNet P8 offers two Content Search Engine components, IBM Legacy Content Search Engine and IBM Content Search Services, which you can run in parallel for indexing and search. However, object stores that are used for email archiving with Content Collector must be configured to use either IBM Legacy Content Search Engine or IBM Content Search Services.

FileNet P8 data model for IBM Legacy Content Search Engine

This data model is used for storing objects into a FileNet P8 repository for which IBM Legacy Content Search Engine (formerly Autonomy K2 or Verity) provides the indexing and search capability.

To store an email in FileNet P8 the following objects are used:
• A distinct email instance (DEI) that is the root document object for the email consisting of one or more content elements: The first content element is the email document from different mailboxes or the journal. All subsequent content elements are the attachments. The ID of the DEI document is based on a unique hash that is used to ensure that the email is stored only once in the repository. It also contains the properties that are common to all duplicates of the email.
• An XML Instance Text (XIT) that is an indexable XML file containing the data of the content elements of the DEI that needs to be indexed for text search. It also contains the search result list properties. These properties are intended for use in the search result list only and contain truncated values. Do not use them for other processing. With IBM Legacy Content Search Engine, you cannot index the content of documents with more than one content element. The XIT document provides a workaround for this limitation. All data from the email that must be indexed, including the email body and the content of any attachments, is stored in the single content element of the XIT document. Search tools that are compatible with the Content Collector email data model can locate the additional parts of an email, that is the DEI and the email instances, as soon as they have found the XIT document.
• An email instance (EI) that is a custom object. For each duplicate of an email (mailbox or journal instances of the DEI), one EI tracks the data that is unique to this copy. In addition, an annotation object is created for each duplicate email that is found. The content element of the annotation contains the information that is required to update the XIT object. Annotation objects are deleted as soon as the XIT is updated.

[Figure: email document object for the IBM Legacy Content Search Engine data model. Labels: DEI (hash, common properties) with email object 1 and attachment objects 1 to n; two EIs (varying properties from mailbox, varying properties from journal); temporary annotations; XIT (search result list properties) with text object and text index.]
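The purpose of the XIT, gathering all indexable text of an email into one content element, can be illustrated with a small Python sketch. This is an illustration under stated assumptions, not the actual XIT format: the function build_xit and the element names email, subject, body, and attachment are hypothetical, and the guide does not document the real XML schema.

```python
import xml.etree.ElementTree as ET


def build_xit(subject, body_text, attachment_texts):
    """Combine all indexable text of one email into a single XML document."""
    root = ET.Element("email")                      # element names are hypothetical
    ET.SubElement(root, "subject").text = subject
    ET.SubElement(root, "body").text = body_text
    for name, text in attachment_texts:
        # One child element per attachment, all inside the same document
        att = ET.SubElement(root, "attachment", {"name": name})
        att.text = text
    return ET.tostring(root, encoding="unicode")


xit = build_xit("Budget", "See attached.", [("plan.txt", "Q3 figures")])
```

Because the body and every attachment end up in one string, an indexer that handles only a single content element per document can still cover the whole email.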

Email deduplication is provided by Content Collector, whereas attachment deduplication is managed by FileNet P8 or on the storage device layer. When a DEI is removed, all associated objects are removed as well. An exception to this is if IBM eDiscovery Manager placed a legal hold on the XIT. In this case, the archived email cannot be deleted. Deletion constraints are put on the DEI and XIT so that an accidental deletion of the XIT is prevented. Attempts to delete the XIT result in an error. This ensures that the indexing for the email is not lost. To prevent accidental deletion of the DEI, for example, by a Workplace user, an expiration date can be set on the DEI. When an expiration date is set, an event handler checks this property on deletion, and the DEI can be deleted only if the current date is past the expiration date.

FileNet P8 data model for IBM Content Search Services

IBM Content Search Services can be used with IBM FileNet P8 to index documents and enable search. It is a new approach to full-text indexing optimized for email and compliance solutions. To be able to write and read index information for email stored into FileNet P8 by using IBM Content Search Services, Content Collector requires an email storage data model that is different from the data model that is used with IBM Legacy Content Search Engine. The IBM Content Search Services data model does not require an XML Instance Text (XIT) object to contain the email content to be indexed for text search.


To store an email in FileNet P8 the following objects are used:
• A distinct email instance (DEI) that is the root document object for the email consisting of one or more content elements: The first content element is the email from different mailboxes or the journal. All subsequent content elements are the attachments. The ID of the DEI document is based on a unique hash that is used to ensure that the email is stored only once in the repository. It also contains the properties that are common to all duplicates of the email and the search result list properties. The search result list properties are intended for use in the search result list only and contain truncated values. Therefore, do not use them for other processing. The DEI object is enabled for content based retrieval and is text indexed.
• An email instance (EI) that is a custom object. For each duplicate of an email (mailbox or journal instances of the DEI), one EI tracks the data that is unique to this copy.

Index information for the DEI and EI objects is created or updated when Content Collector creates, updates, or deletes such an object.

[Figure: email document object for the IBM Content Search Services data model. Labels: DEI (hash, common properties, search result list properties) with email object 1, attachment objects 1 to n, and text index; two EIs (varying properties from mailbox, varying properties from journal).]

Email deduplication is provided by Content Collector, whereas attachment deduplication is managed by FileNet P8 or on the storage device layer. When a DEI is removed, all associated objects are removed as well. To prevent accidental deletion of the DEI, for example, by a Workplace user, an expiration date can be set on the DEI. When an expiration date is set, an event handler checks this property on deletion, and the DEI can be deleted only if the current date is past the expiration date. A DEI also cannot be deleted if a legal hold is placed on it by IBM eDiscovery Manager. The source instances can be deleted any time, unless they are also under the control of another application, such as a records management application or IBM eDiscovery Manager.
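The deletion rules for a DEI in this data model can be summarized as a simple guard. The following Python sketch is a reading of the rules stated above, not the FileNet P8 event handler itself. The names ArchivedEmail and may_delete are hypothetical, and treating a DEI without an expiration date as deletable is an assumption drawn from the statement that the check applies when an expiration date is set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ArchivedEmail:
    expiration_date: Optional[date] = None   # None: no expiration date was set
    legal_hold: bool = False                 # hold placed by an eDiscovery tool


def may_delete(doc, today):
    """Return True if the rules described above would allow deletion."""
    if doc.legal_hold:
        return False                         # a legal hold always blocks deletion
    if doc.expiration_date is not None and today <= doc.expiration_date:
        return False                         # not yet past the expiration date
    return True                              # assumption: no date set means no check
```

For example, a document that expires on 1 January 2020 is refused on that day and accepted on the next, unless a legal hold is in place.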


Document archiving scenarios

Document archiving refers to the long-term storage of email and other documents in a central repository, and, in a broader sense, to capabilities of finding, viewing, and restoring archived content.

The document archiving scenarios describe how IBM Content Collector helps companies address issues such as storage problems, regulatory compliance, and internal policy compliance. One scenario focuses on the preparation of a repository that is to be used as a knowledge base for analytics with tools such as IBM eDiscovery Manager and IBM eDiscovery Analyzer.

Scenario: Document archiving for storage purposes

This scenario describes how employees in ExampleCo. Enterprises, a fictitious company, address document storage and performance problems on client workstations and email, Microsoft SharePoint, and NTFS file servers.

ExampleCo. Enterprises decides to implement new processes to archive documents because the performance of the company's servers has degraded considerably. The volume of email and SharePoint documents has nearly doubled in the last two years. Email often contains attachments of more than 2 MB in size, so the mailboxes of most users grow rapidly. SharePoint servers can quickly fill with graphics or video files. Users sometimes wait several minutes when they search for email in their own mailbox or documents on the SharePoint server. The documents occupy a lot of disk space on the users' workstations and, more importantly, on the servers. Increasing the server disk space will not improve, and can degrade, server performance.

So ExampleCo. Enterprises decides to use IBM Content Collector to archive documents to a central repository. After email and documents are copied to a central repository, the original portions of the documents can be removed from the mail system. This method of storing documents significantly reduces the disk space requirements. Less data needs to be read, scanned, and handled, and the performance of the source systems improves.

The managers discuss the archiving requirements with Judy Jameson, an IT administrator for ExampleCo. Enterprises. Judy implements the following rules and processes:
• Automatically archive email with attachments that are larger than 2 MB one week after their creation or receipt, and all other documents after four weeks.
• Retain documents on the source server for three months, to avoid impeding the work of users who are working offline.
• After three months, remove the SharePoint documents, files, and large email attachments and replace them with links called stubs. Users can follow the links in the stubs to view and restore the documents.
• After one year, remove the stubs from the source servers, including NTFS file servers. Users with access to the target repository can search and restore the documents.
• Email users can manually archive documents at any time.

To meet these requirements, Judy decides to use and modify one or more of the task route templates that IBM Content Collector delivers. The templates provide an easy way of setting up the system and do not require in-depth system skills. She can adapt the templates to accommodate future needs, but at present she needs to make only minor adjustments to make the templates fit the document management requirements of ExampleCo. Enterprises.

Related concepts:
• Collecting documents for archiving or processing on page 405
• Content Collector overview on page 3
• Content Collector architecture overview on page 15

Related tasks:
• Creating a task route on page 292
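The time-based rules in this scenario can be expressed as a small decision function. The following Python sketch only restates the scenario's rules; it is not a Content Collector task route. The function name lifecycle_actions and the action labels are hypothetical, 2 MB is taken as 2 x 1024 x 1024 bytes, and "three months" and "one year" are approximated as 90 and 365 days.

```python
from datetime import timedelta

MB = 1024 * 1024  # assumption: binary megabytes


def lifecycle_actions(age, is_email, attachment_bytes=0):
    """Return the actions that are due for a document of the given age."""
    # Email with large attachments is archived after one week, the rest after four
    large_email = is_email and attachment_bytes > 2 * MB
    archive_after = timedelta(weeks=1) if large_email else timedelta(weeks=4)

    actions = []
    if age >= archive_after:
        actions.append("archive")
    if age >= timedelta(days=90):     # "three months", approximated as 90 days
        actions.append("stub")
    if age >= timedelta(days=365):    # "one year", approximated as 365 days
        actions.append("remove stub")
    return actions
```

For instance, an eight-day-old email with a 3 MB attachment is due for archiving, whereas an eight-day-old email with a 1 MB attachment is not yet due for any action.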

Scenario: Archiving journal email

This scenario describes how ExampleCo. Enterprises, a fictitious company, employs IBM Content Collector to archive email that is journaled by the company's email infrastructure.

For compliance purposes and to avoid accidental or intentional deletion of email, ExampleCo. Enterprises keeps a journal of all incoming and outgoing email. Currently, all email is automatically journaled to a journal mailbox on each of the company's email servers. The compliance department of ExampleCo. Enterprises finds it much easier to work with journals that are archived in a central enterprise archive with extended full-text search capabilities than with distinct journals that are located on several email servers at different locations with limited text search capabilities. ExampleCo. Enterprises therefore decides to use IBM Content Collector to create one archive of the journal copies of all email from all email servers, so that they do not need to be retained locally.

The managers ask Judy Jameson, the IT administrator, to investigate the options for archiving journal email. She identifies two possible strategies:
• Configure Content Collector to archive directly from the existing journal mailboxes.
• Configure the email servers to send journal copies of all email to the Content Collector server, so that Content Collector can archive them.

Because the company already uses journal mailboxes, the first option, namely to archive journals from existing journal mailboxes, is straightforward to implement. However, Content Collector must frequently crawl the journal mailboxes on each email server to identify, collect, and process the journal email. As the email servers are decentralized at different locations, they must be accessed over a wide area network (WAN) instead of a local area network (LAN).

For Microsoft Exchange email users, the network delays in a WAN can lead to particularly poor archiving performance when mailboxes are accessed through the MAPI RPC protocol that Content Collector uses to connect to the clients. As a consequence, using the second option, namely sending all journal copies to the Content Collector server, is advisable in a Microsoft Exchange environment. In addition, using the SMTP archiving mechanism results in significant storage savings because it uses the much more efficient email archive file format (EML) in contrast to the less efficient MSG email file format that is used when messages are archived directly from the journal mailboxes.

For IBM Lotus Domino based email systems, the WAN latency impact is not as great and the standard archiving format (CSN) is very efficient. The journal mailbox based archiving approach might even provide advantages due to the high degree of deduplication that is provided if, besides journal archiving, a mailbox management use case is present as well.

To avoid the risk of impaired performance, Judy decides to implement the second option and configure the email servers to send journal copies of all future email to the Content Collector server. Content Collector then collects all journal email that it receives from the different servers and stores it in the same archive.

To receive and process the journal email in Content Collector, Judy configures the SMTP Connector, which receives email through the Simple Mail Transfer Protocol, and sets up a task route to archive the received email. Then she modifies the journaling configuration of each of the company's email servers to deliver the journal email to the Content Collector server through an SMTP connection instead of storing it in a journal mailbox.

Related concepts:
  The SMTP Connector on page 207
  Content Collector overview on page 3
  Content Collector architecture overview on page 15

Related tasks:
  Creating a task route on page 292
  Collecting SMTP documents on page 429
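The delivery path in this scenario (email servers hand journal copies to the Content Collector server over SMTP) can be exercised with a small test sender before the journaling configuration of the production email servers is changed. The following is a minimal sketch in Python, not part of Content Collector: the host name, the port, and the addresses are hypothetical placeholders, and the function names are introduced here for illustration only. It builds a test message, serializes it in the RFC 5322 wire form that EML files use, and can optionally send it over SMTP.

```python
import smtplib
from email import policy
from email.message import EmailMessage


def build_test_message(sender, recipient, subject="SMTP archiving smoke test"):
    """Build a small plain-text message to send to the SMTP receiver."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Test message for the journal archiving task route.")
    return msg


def to_eml_bytes(msg):
    """Serialize the message with CRLF line endings, as it travels over SMTP."""
    return msg.as_bytes(policy=policy.SMTP)


def send_test_message(msg, host, port=25, timeout=10):
    """Send the message to an SMTP listener.

    The host and port are assumptions; use the values that are configured
    for the SMTP Connector in your environment.
    """
    with smtplib.SMTP(host, port, timeout=timeout) as smtp:
        smtp.send_message(msg)


if __name__ == "__main__":
    message = build_test_message("journal-test@example.com", "archive@example.com")
    print(to_eml_bytes(message).decode("ascii"))
    # Uncomment to send; "cc-smtp.example.com" is a hypothetical host name.
    # send_test_message(message, "cc-smtp.example.com")
```

If the message is accepted, it should then appear in the input of the task route that archives the received email. Whether it is archived depends on the task route configuration, which this sketch does not inspect.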

Scenario: Document retention and disposition

This scenario describes how the real estate firm ExampleCo. Enterprises uses IBM Content Collector to retain and remove electronic documents.

To avoid the accidental or intentional deletion of documents, the company currently journals its email and backs up every document from its Microsoft SharePoint and file servers. The method works, but not well, because it requires manual disposition of documents and is certain to overwhelm the source servers, degrading performance and eating up more and more disk space. Retrieval of backed up documents is painful if not impossible.

Before they can implement a better solution, ExampleCo. Enterprises must determine the level of control they need over retention life cycles and outcomes. Their records administrator, Alexandra Jackson, informs the chairperson that they need to declare a significant subset of documents as records. The majority of their electronic documents, however, require only simple retention: keep them for three years, then delete.

The company decides to use IBM Content Collector to retain their email and other documents. The application offers two levels of retention:
• The Calculate Expiration Date task calculates a deletion date that is based on metadata, for example, user, group, or an automatic classification that IBM Content Classification supplies. This date is used with the Create Document task to set an expiration date on each archived document as it is added to the repository.
• The Declare Record task hands more complex retention tasks, such as the application of variable retention periods and disposition options, to IBM Enterprise Records.


Because the company requires the basic retention of some documents and the declaration of other documents as records, they decide to use both options. Alexandra asks Judy Jameson, the IT administrator, to set up IBM Content Collector to declare as records all documents that need to be records, and to archive and retain all other documents for a period of three years, after which time they should be automatically deleted.

Judy uses the task route templates to create two sets of task routes, one set to process the documents that need to be records, and another to process all other documents. To each task route in the first set she adds a Declare Record task that declares each document as a record in IBM Enterprise Records. To each task route in the other set she adds the Calculate Expiration Date task and sets it to calculate the retention date to three years after the document's creation date. She sets the expiration date property on the document in the Create Document task to use this calculated value when the document is created.

Related concepts:
  Collecting documents for archiving or processing on page 405
  Content Collector overview on page 3
  Content Collector architecture overview on page 15

Related tasks:
  Creating a task route on page 292
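The date arithmetic behind the simple retention rule (creation date plus three years) can be illustrated as follows. This is a sketch in Python of the calculation only, not the implementation of the Calculate Expiration Date task. The function name is hypothetical, and the handling of February 29 (falling back to February 28 in a non-leap target year) is an assumption, because the scenario does not say how such dates are treated.

```python
from datetime import date


def calculate_expiration_date(creation_date, retention_years=3):
    """Return the date that lies retention_years after creation_date."""
    target_year = creation_date.year + retention_years
    try:
        return creation_date.replace(year=target_year)
    except ValueError:
        # February 29 does not exist in a non-leap target year.
        # Assumption: expire on February 28 of that year instead.
        return creation_date.replace(year=target_year, day=28)


if __name__ == "__main__":
    print(calculate_expiration_date(date(2012, 3, 15)))  # 2015-03-15
    print(calculate_expiration_date(date(2012, 2, 29)))  # 2015-02-28
```

A retention policy might instead round a leap-day document forward to March 1. Which choice is appropriate is a records management decision, not a technical one.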

Scenario: Preparing the email repository for email analytics

ExampleCo. Enterprises, a fictitious company that builds electronic appliances, must go to court to contest patent claims by other companies. Email, among other evidence, can prove that ExampleCo. Enterprises is the legal owner of their inventions. This scenario describes how employees in ExampleCo. Enterprises prepare their email archiving system to be able to find email that is relevant to a lawsuit.

[Figure: Content Collector, eDiscovery Manager, and eDiscovery Analyzer. Labels: Install and configure; Configure for savings and retention in repository; Establish retention in Records Manager; Retain significant email; Archive significant email; Configure archiving, including automatic versus manual]

Sometimes, competitors copy ExampleCo. Enterprises' inventions illegally. When ExampleCo. Enterprises learns of such a case, it considers a lawsuit against the infringing party or demands compensation. The company needs to prove that it is the legal owner of these innovations. To do that, ExampleCo. Enterprises provides a law firm with blueprints, meeting minutes, product specifications, patents, patent applications, and email that date back to the time when a product was developed. The law firm analyzes the material and, based on the results, tries to negotiate a settlement with the accused party.

The email of the engineers at ExampleCo. Enterprises proves that ideas evolved at ExampleCo. Enterprises before they could have possibly been discussed by the competitor. Of special interest is email of engineers who left ExampleCo. Enterprises to work for the competition. Some of these documents contain hints that a technology was developed when the engineer still worked for ExampleCo. Enterprises and that therefore ExampleCo. Enterprises has the exclusive right to use this technology. In cases like this, the former managers and co-workers of the engineer must be identified so that they can testify if needed.

The information in the email can help the attorneys trace the departments that a person worked for and ensure that they find the right person. For that reason, Chris Marsh, the head of the corporate litigation department, and Alexandra Jackson, the legal case administrator, want to search for department numbers, for unique employee identifiers, and for the managers of employees and departments. To collect and preserve the email, they use a tool such as IBM eDiscovery Manager, and to analyze the email, they use email analytics tools like IBM eDiscovery Analyzer. To provide the appropriate search results, these tools require additional attributes to be set in the repository. Chris and Alexandra ask Judy Jameson, the IT administrator, to set up the repository appropriately.

Judy creates additional attributes in the content management system that is serving as the repository for the email documents. These attributes must contain department numbers, identifiers, and manager names. IBM Content Collector will store the information for each email that is archived in the repository. Because the information cannot be found in the email, it will be extracted from the company's Active Directory when the email is archived.

Judy also adds the new attribute names to the configuration file for the text-search indexer that is provided by IBM Content Collector. This causes an extraction of the attribute values when the index is built, which adds this information to the text-search index. The information in the text-search index is used by IBM eDiscovery Manager.
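The enrichment step in this scenario (looking up department number, employee identifier, and manager for each email at archiving time) can be pictured with the following sketch in Python. It illustrates the idea only: an in-memory dictionary stands in for the company's Active Directory, and the attribute names, the class, and the function are hypothetical and do not reflect Content Collector's actual attribute names or interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    """Fields of interest for one employee (hypothetical structure)."""
    employee_id: str
    department_number: str
    manager_name: str


# Hypothetical stand-in for a directory lookup, keyed by email address.
DIRECTORY = {
    "jdoe@example.com": DirectoryEntry("E1001", "D42", "Pat Manager"),
}


def enrich_email_metadata(sender, directory):
    """Return the additional repository attributes for an email sender.

    An empty dictionary is returned when the sender is unknown, so that
    archiving can continue without the additional attributes.
    """
    entry = directory.get(sender.strip().lower())
    if entry is None:
        return {}
    return {
        "EmployeeId": entry.employee_id,
        "DepartmentNumber": entry.department_number,
        "ManagerName": entry.manager_name,
    }


if __name__ == "__main__":
    print(enrich_email_metadata("JDoe@example.com", DIRECTORY))
```

Storing the values at archiving time, rather than looking them up at search time, preserves the department and manager that applied when the email was written, which matters for employees who have since changed departments or left the company.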


Part 2. Installing



Installing Content Collector

Install IBM Content Collector according to your requirements. Check the prerequisites before you start the installation.

Related information:
  System requirements

Prerequisites for the installation

Read the release notes and check the prerequisites before you install IBM Content Collector.

Hardware prerequisites

Check the hardware that you need for IBM Content Collector, for the source systems that contain the documents to be archived, and for the repositories in which you want to archive the documents.

For the most current hardware requirements, see the System Requirements technote on http://www.ibm.com/support/docview.wss?uid=swg27024229.

In addition, consider the following requirements:
• You need a distinct computer or virtual machine that runs on one of the supported Windows operating systems. This computer or logical machine must be connected by a TCP/IP network to the servers on which your repositories and your source systems are installed. For Microsoft Exchange, this computer must be in the same domain as the Microsoft Exchange server. Microsoft Outlook must also be installed.
• Content Collector uses a multiple-process architecture that can use multiple gigabytes of main memory efficiently. In addition, performance is greatly improved if a sufficient amount of memory is available for the operating system disk cache. The minimum requirement of 4 GB of memory is sufficient for Content Collector servers that are used for basic document collection and archiving. However, more memory (the recommended amount is 8 GB) is required:
  - For production servers that are intended for large workloads, in other words, for servers that process many, potentially large documents in parallel
  - For Content Collector servers that also provide search, viewing, and restore services through web applications, for example, in mailbox management scenarios
  - For the servers that run the SMTP Receiver component of the SMTP Connector
  More than 4 GB can efficiently be addressed on systems using a 64-bit version of the operating system.
• Working directories, like the directory that the Email Connector uses to create and store temporary files or the seedlist directory of the IBM Connections Connector, must be on a separate and fast disk.
• Use a RAID 5 array for your operating system and for the Content Collector components. A RAID 5 array helps avoid system downtime in the case of hard-disk failures. However, do not use a RAID 5 array for the working directory of the email server connector because of the write penalty of RAID 5.
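The memory guidance above reduces to a simple rule: 4 GB for basic collection and archiving, and 8 GB as soon as the server takes on one of the three listed roles. The following sketch in Python expresses that rule. The role names and function names are hypothetical labels chosen for this example; the two memory figures are the ones stated above.

```python
MINIMUM_MEMORY_GB = 4       # basic document collection and archiving
RECOMMENDED_MEMORY_GB = 8   # servers with any of the roles below

# Hypothetical labels for the three cases that call for more memory.
ROLES_NEEDING_MORE_MEMORY = frozenset({
    "large_workload",     # many, potentially large documents in parallel
    "web_applications",   # search, viewing, and restore services
    "smtp_receiver",      # SMTP Receiver component of the SMTP Connector
})


def required_memory_gb(roles):
    """Return the memory target in GB for a server with the given roles."""
    if ROLES_NEEDING_MORE_MEMORY & set(roles):
        return RECOMMENDED_MEMORY_GB
    return MINIMUM_MEMORY_GB


def memory_is_sufficient(installed_gb, roles):
    """Check the installed memory against the target for the given roles."""
    return installed_gb >= required_memory_gb(roles)


if __name__ == "__main__":
    print(required_memory_gb([]))                       # 4
    print(required_memory_gb(["smtp_receiver"]))        # 8
    print(memory_is_sufficient(4, ["large_workload"]))  # False
```

The check covers only the memory figures. Disk layout (separate fast disks for working directories, no RAID 5 for the email connector working directory) and the current values in the System Requirements technote still need to be reviewed separately.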

Software prerequisites

Ensure that you have the necessary software installed at version levels that this release of IBM Content Collector supports.

Check the requirements for the software that you need for IBM Content Collector, for the source systems that contain the documents to be archived, and for the repositories in which you want to archive the documents.

For the most current software requirements, including versions, see the System Requirements technote on http://www.ibm.com/support/docview.wss?uid=swg27024229.

In addition, consider the following requirements:
• To use Lotus Domino as a source system:
  - Install Lotus Domino Server on the IBM Content Collector server and disable the Lotus Domino service.
  - Install Lotus Domino Client on client machines.
• To use Microsoft Exchange as a source system:
  1. Install Microsoft Outlook, including the latest service packs and patches, on the IBM Content Collector server.
  2. Start Microsoft Outlook and verify its connection to the email server: Create a profile and then log on to Microsoft Exchange with the user ID that you intend to use as the user account for the IBM Content Collector Email Connector service.
  3. Make Microsoft Outlook the default email client.
  4. Configure Microsoft Outlook to prompt for a profile every time Outlook is started.
  5. Stop Microsoft Outlook before you install IBM Content Collector Server.
• To use Content Manager as your repository:
  - Install and configure the IBM Information Integrator for Content connector on the server on which Content Collector is to be installed.
  - Ensure that the IBM Information Integrator for Content connector installation on the IBM Content Collector server is always at the same software version level (such as fix packs) as the IBM Content Manager server installation.
  - If you want to search documents that are archived in IBM Content Manager, install IBM Content Collector Text Search Support on the IBM Content Manager server before you install Content Collector.
  - On the Solaris Operating Environment, the text-search component requires the iconv package.
• To use IBM FileNet P8 as your repository:
  - Depending on the version of FileNet P8 that you are using, you need to install different supporting software. See the System Requirements technote.
  - IBM FileNet P8 Content Engine Server must be installed and configured. If you want to support content-based searches, IBM FileNet P8 Content Engine Server must also be configured for content-based retrieval (CBR). For further information, see the section on configuring Content Engine for CBR in the FileNet P8 documentation.
  - IBM FileNet P8 Content Engine .NET Clients must be installed to enable communication between the FileNet P8 Content Engine Server, the IBM Content Collector Configuration Manager, and the IBM Content Collector FileNet P8 Connector service. This installer is integrated in the FileNet P8 Content Engine Server installer's .NET Clients option.
  - Optional: Install IBM FileNet Enterprise Manager on the machine where the Content Engine Server is installed. This is a subitem under the .NET Clients option in the FileNet P8 Content Engine Server installer.
  - FileNet P8 Content Engine Client and the FileNet Java Client API must be installed before IBM Content Collector Server is installed. If the FileNet Java Client API is not installed, a warning icon (a red exclamation mark) is shown in the initial configuration and no FileNet P8 connection can be created. To install the FileNet P8 Content Engine Java clients component, run the FileNet P8 Content Engine Client installer. Select Other Applications on the FileNet P8 Content Engine Client installer. This triggers the installation of the FileNet Java Client API libraries. These are required by the Content Collector Initial Configuration, Content Collector Web Services, and the Configuration Manager General Settings.
  - Ensure that the FileNet P8 Content Engine client installation on the IBM Content Collector server is always at the same software version level (such as fix packs) as the FileNet P8 Content Engine server installation. Note that the Java libraries in CEClient\lib are copied to the Content Collector Web Services deployment during installation. If a FileNet Java Client update is required, the newer versions of these files must be copied to AFUWeb\lib.
  - Install IBM FileNet Content Search Engine on a server other than the IBM FileNet Content Engine server. Install the client application for one or both of the search engine options that are supported by FileNet P8 Content Engine:
      IBM Content Search Services
        Searching and index creation are resource-intensive operations. Depending on the available system resources, the following configurations might apply:
        - Multiple instances of IBM Content Search Services can be collocated on the same server.
        - IBM Content Search Services can be collocated with IBM Legacy Content Search Engine.
        Before collocating, contact your IBM representative for assistance with system sizing.
      IBM Legacy Content Search Engine
        You cannot collocate the Master Administration Server and an Administration Server on the same server. Searching and index creation are resource-intensive operations. As a best practice, do not collocate IBM Legacy Content Search Engine with other FileNet P8 components.
    For further information, see the section on installing Content Search Engine in the FileNet P8 documentation.
• For indexing documents that are archived with IBM Content Collector with IBM Content Search Services, install IBM Content Collector P8 Content Search Services Support on the IBM Content Search Services server.
• Depending on the type of database that you want to use for the configuration data, you must install different clients and supporting software:


  DB2 database
    Install DB2 Runtime Client on Content Collector Server to establish a connection.
  SQL Server database
    Install a JDBC driver.
  Oracle database
    - Install a JDBC driver.
    - Install the Oracle Client tools.
• A web application server is required to provide access to configuration data and archived documents. It hosts the Content Collector web applications. If you do not want to use the embedded web application server to store the configuration data, install another WebSphere Application Server or use an existing one.

The IBM Content Collector web applications support these web browsers: Apple Safari, Microsoft Internet Explorer, and Mozilla Firefox.
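Two of the requirements above ask that a client installation on the Content Collector server stay at the same software version level, including fix packs, as the corresponding server installation. The following sketch in Python shows one way to compare two such levels. It assumes that both levels are available as dotted numeric strings such as 5.1.0.2, which is an assumption made for this example; how the installed level is obtained from each product is not covered here, and the function names are hypothetical.

```python
def parse_version(version):
    """Convert a dotted numeric version string into a tuple of integers."""
    return tuple(int(part) for part in version.strip().split("."))


def _pad(parts, width):
    """Pad a version tuple with trailing zeros to the given width."""
    return parts + (0,) * (width - len(parts))


def same_version_level(client_version, server_version):
    """Return True if client and server are at the same version level.

    Missing trailing components are treated as zero, so "5.1.0" and
    "5.1.0.0" are considered equal.
    """
    client = parse_version(client_version)
    server = parse_version(server_version)
    width = max(len(client), len(server))
    return _pad(client, width) == _pad(server, width)


if __name__ == "__main__":
    print(same_version_level("5.1.0.2", "5.1.0.2"))  # True
    print(same_version_level("5.1.0.1", "5.1.0.2"))  # False
```

A check like this is most useful after a fix pack has been applied to the repository server, because the client on the Content Collector server is updated separately and is easy to overlook.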

Additional prerequisites and restrictions

Check the listed additional prerequisites and restrictions before you install IBM Content Collector and ensure the prerequisites are met.

Considerations for the source system

Table 1. Considerations for the source system

Source system: Lotus Domino
Prerequisites and restrictions:
• If you want to use iNotes (formerly Domino Web Access (DWA)), configure iNotes on Lotus Domino Server.
  Important: For Lotus iNotes in Lotus Domino V8.5.1 and above, specify the Extension Forms File Forms85_x.nsf, which must exist in the iNotes directory on the Lotus Domino server. If the file does not exist, you must create one before you can enable the Content Collector features on Lotus iNotes. For information about how to create an Extension Forms File, see the topic about customizing the look of Lotus iNotes in the IBM Lotus Domino and Notes information center at http://publib.boulder.ibm.com/infocenter/domhelp/v8r0/index.jsp.
• Ensure that the Lotus Domino server that IBM Content Collector archives from is restarted after all enablement for IBM Content Collector has been completed.
• When you use IBM Lotus Domino Attachment and Object Store (DAOS) and want to restore your documents back to Lotus Notes, the attachments of the documents are not restored to DAOS.
• You can make the IBM Content Collector functions available on Citrix on a virtual desktop, as an installed application that is accessed from a server, as an application that is streamed to a server, or as an application that is streamed to a client. For further information, see the topic

Source system: Microsoft Exchange
Prerequisites and restrictions:
• You can make IBM Content Collector Outlook Extension available on Citrix on a virtual desktop, as an installed application that is accessed from a server, as an application that is streamed to a server, or as an application that is streamed to a client. For further information, see the topic
• Microsoft Exchange Server 2010 only: Make sure that the client throttling policies that are turned on by default do not unintentionally restrict archiving operations on the Content Collector server, as this can lead to a considerable throughput reduction. Either adapt the default client throttling policies or create a tailored throttling policy for the Content Collector archiving user accounts. For details, refer to Disabling throttling for a Content Collector service account in Microsoft Exchange Server 2010 on page 37.


Considerations for the target system

Table 2. Considerations for the target system

Target system: Content Manager
Prerequisites and restrictions:
• Use only the characters a-z, A-Z, and 0-9 of the Latin-1 character set in the names of the index directory and the index working directory.
• To use IBM Content Collector Text Search Support to index and search your documents in an IBM Content Manager repository, these considerations apply:
  - Install the text-search component and enable the repository for search before you install Content Collector Server because the server uses the files and functions that are installed by this component. For more information, see the section on enabling an IBM Content Manager repository for search.
  - If IBM Content Manager is installed on more than one server, install Content Collector Text Search Support on the IBM Content Manager machine where the library server and Net Search Extender are installed and not where the resource manager is installed.
  - On Linux, use a shell that uses the .profile script. Otherwise, the RC file of the instance owner user ID is not updated.
  - On Linux and UNIX, the library server name is case-sensitive. If, during the installation of the text-search component, you create a directory in the library server administration directory with the same name as the library server, the name must match with regard to the case.
  - On Windows, install the text-search component on the server on which DB2 is installed.
  - Before you install the text-search component and run any of the indexer tools, you must define the environment variable DB2HOME. This environment variable is used to determine Net Search Extender template configuration settings and must point to the DB2 installation directory, for example on Windows, to C:\Program Files\IBM\sqllib. It is recommended that you define this environment variable permanently on all platforms.
  - If you install the text-search component on an IBM Content Manager machine where the default DB2 administrator ID is not administrator, but db2admin1 for example, the installation might fail because the Net Search Extender service cannot be stopped. Stop the Net Search Extender service manually before you start installing the text-search component.
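Two of the Content Manager considerations above lend themselves to a quick check before the text-search component is installed: the character restriction on index directory names, and the DB2HOME environment variable. The following sketch in Python checks both. The function names are hypothetical, and the name check is applied to a single directory name rather than to a full path, which is an interpretation of the restriction made for this example.

```python
import os
import re

# Only a-z, A-Z, and 0-9 are allowed in the directory names.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9]+")


def is_valid_index_directory_name(name):
    """Return True if the directory name uses only the allowed characters."""
    return _ALLOWED_NAME.fullmatch(name) is not None


def check_db2home(environ=None):
    """Return None if DB2HOME is usable, otherwise a short problem description.

    The environ parameter defaults to the process environment and exists
    so that the check can be tried out with a plain dictionary.
    """
    env = os.environ if environ is None else environ
    db2home = env.get("DB2HOME")
    if not db2home:
        return "DB2HOME is not defined"
    if not os.path.isdir(db2home):
        return "DB2HOME points to a missing directory: " + db2home
    return None


if __name__ == "__main__":
    print(is_valid_index_directory_name("index01"))    # True
    print(is_valid_index_directory_name("index dir"))  # False
    print(check_db2home() or "DB2HOME looks usable")
```

The DB2HOME check confirms only that the variable is set and points to an existing directory. It does not verify that the directory is a DB2 installation or that the variable is defined permanently.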


Table 2. Considerations for the target system (continued)

Target system: IBM FileNet P8
Prerequisites and restrictions:

Without content based retrieval
  A FileNet P8 object store with a file storage area must exist. See the section on creating object stores in the FileNet P8 documentation for further information.
  Important: Set up your target object store with a file storage area as the default content store. A file storage area stores content in a network-accessible directory.

Enabled for content based retrieval
• A FileNet P8 object store with a file storage area must exist. To support content-based searches, the object store must be enabled for content-based retrieval (CBR). See the section on creating object stores in the FileNet P8 documentation for further information.
  Important: Set up your target object store with a file storage area as the default content store. A file storage area stores content in a network-accessible directory.
  To prepare your system for index area creation, each file storage area that will be full-text indexed must be accessible by both FileNet P8 Content Engine and the server that will perform the full-text indexing. The index area is required for retrieving email and other documents by searching their content. For performance reasons it is recommended that the FileNet P8 Content Engine has direct access to the file storage area and that the index servers access this area remotely. Conversely, it is strongly recommended that the index server has direct access to index and temporary directories and that the FileNet P8 Content Engine accesses these remotely.
• With Content Collector Version 3.0, you have the option to

