Email Archiving
Arvind Srinivasan, Gaurav Baone
RSCTC 2008 © 2008 ZL Technologies, Inc.

Imagine this is what happens to your business records at the end of every month …


If this looks absurd …

That’s exactly what we do to email!

Regulators now treat email like hard copy records

Practically every major transaction, project, and contract, is recorded in email

SEC 17a-4

NASD 3010, 3110


FDA 21 CFR 11

DoD 5015.2


Non-compliance fines and legal liabilities are rising . . .

ZipLip, Inc.

And the courts agree (FRCP, Dec 2006)


Just How Much Scalability Does Archiving Require?

25,000 employees averaging 70 mails/day

7 years retention

4.47 billion emails for the archive system to index and search

4.28 billion web pages indexed by Google (source: Google press release, Feb 17, 2004)


Functionality needs to scale to these volumes
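The 4.47 billion figure follows from the other numbers on the slide. A minimal sketch of the arithmetic, assuming mail arrives every day of a 365-day year (that assumption is not stated on the slide):

```python
# Figures from the slide; the 365-day year is an assumption.
EMPLOYEES = 25_000
MAILS_PER_EMPLOYEE_PER_DAY = 70
DAYS_PER_YEAR = 365
RETENTION_YEARS = 7


def archived_emails(employees, mails_per_day, days_per_year, years):
    """Total number of emails held in the archive over the retention period."""
    return employees * mails_per_day * days_per_year * years


total = archived_emails(
    EMPLOYEES, MAILS_PER_EMPLOYEE_PER_DAY, DAYS_PER_YEAR, RETENTION_YEARS
)
print(f"{total:,} emails (~{total / 1e9:.2f} billion)")
```

With these inputs the product is 4,471,250,000, which matches the 4.47 billion quoted above.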

ZL Technologies, Inc.


ZLTI Unified Archival

RSCTC 2008


Email Capture Methods

Business Drivers

Archive Functionality

Retention & Deletion

Surveillance & Compliance

E-Discovery



Email Capture Methods

Active Capture Methods – PRO-ACTIVE Archiving

– Journaling

– Mailbox crawling

– SMTP Gateway Capture

Historical Capture Methods – REACTIVE Archiving

– Restore from backup tapes

– Crawl for PST / NSF files from desktops

– Forensic captures


Journaling – 100% Capture


Mailbox Crawling – Policy Based


Reactive Archiving


Not Just Email


Primary Business Drivers - Regulations and Laws

Investment Advisors Act


Gramm-Leach-Bliley Act

NASD 3010

NASD 3011


SEC 17a-4

Sarbanes-Oxley Act

CA SB1386

Mutual Funds Rule 38a-1

Hedge Funds Rule 203(b)

UK Freedom of Information Act

US Freedom of Information Act

Japan Personal Information Protection Act

Florida Sunshine Law

Basel II



Functional Requirements


Surveillance and Compliance

E-Discovery

Common Theme - Classification


Real-time Categorization of Mail


Content (Subject, body, attachment)

User Input (which folder it was found in, manual tagging)

Retention & Deletion

Conflicting Requirements:

Laws & Regulation => Retain for “x” years.


Company Liability/Risk and Cost

Retention Periods and Policies

Regulation | Type of Record | Retention Period
Age Discrimination in Employment | Hiring documents | One year from date of decision
Fair Labor Standards | Payroll, sales and personal | Three years
Rehabilitation Act | Handicap discrimination | Three years
Civil Rights Act | Records | One year
Occupational Safety and Health Act | Health records | 30 years
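Once a mail has been classified under one or more regulations, the table above can drive the retention date mechanically. The sketch below is illustrative only: the function name is hypothetical, and the rule that the longest applicable period wins when several regulations apply is an assumption, not something the slides state.

```python
from datetime import date

# Retention periods in years, transcribed from the table above.
RETENTION_YEARS = {
    "Age Discrimination in Employment": 1,
    "Fair Labor Standards": 3,
    "Rehabilitation Act": 3,
    "Civil Rights Act": 1,
    "Occupational Safety and Health Act": 30,
}


def retention_expiry(regulations, start):
    """Return the date on which the retention period ends.

    Assumption: when several regulations apply, the longest period wins.
    """
    years = max(RETENTION_YEARS[name] for name in regulations)
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # start is Feb 29 and the target year is not a leap year.
        return start.replace(year=start.year + years, day=28)


expiry = retention_expiry(
    ["Fair Labor Standards", "Civil Rights Act"], date(2008, 3, 1)
)
print(expiry)
```

The start date is whatever event the regulation counts from (for example the date of decision for hiring documents), so the caller has to supply it.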


Retention & Deletion (cont’d)

"A priori" and "a posteriori" based retention.

Event Driven – Deletion of mail from user folder, Reclassification of mail by end user

Legal Hold – Court Orders to retain evidence relating to certain subject matters.
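A legal hold has to override normal expiry: a mail whose retention period has lapsed must still be kept if it relates to a held subject matter. A minimal sketch of that deletion check, assuming each archived mail already carries subject-matter tags from classification (the `ArchivedMail` type and its fields are hypothetical):

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ArchivedMail:
    mail_id: str
    expires_on: date                            # end of the retention period
    subjects: set = field(default_factory=set)  # subject-matter tags


def may_delete(mail, today, held_subjects):
    """A mail may be deleted only if it has expired and no hold covers it."""
    if mail.subjects & held_subjects:
        return False  # legal hold overrides the retention schedule
    return today >= mail.expires_on


mail = ArchivedMail("m-1", date(2008, 1, 1), {"product-recall"})
print(may_delete(mail, date(2008, 6, 1), held_subjects={"product-recall"}))
print(may_delete(mail, date(2008, 6, 1), held_subjects=set()))
```

Event-driven changes, such as a user reclassifying a mail, would update `expires_on` or `subjects`; the check itself stays the same.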

Single Instance Storage

Same Email in Multiple Mailboxes

Same Attachment in Multiple Emails

Significant storage savings.
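One common way to get single instance storage is to key each stored object by a hash of its content, so identical emails or attachments map to the same entry. The sketch below assumes that approach; the slides do not say how the product implements it, and the class name is hypothetical.

```python
import hashlib


class SingleInstanceStore:
    """Content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._blobs = {}  # digest -> content bytes
        self._refs = {}   # digest -> number of references to the content

    def put(self, content: bytes) -> str:
        digest = hashlib.sha256(content).hexdigest()
        if digest not in self._blobs:
            self._blobs[digest] = content
        self._refs[digest] = self._refs.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        return self._blobs[digest]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = SingleInstanceStore()
attachment = b"quarterly-report.pdf contents"
keys = [store.put(attachment) for _ in range(3)]  # same attachment, 3 emails
print(len(set(keys)), store.stored_bytes())
```

Each mailbox keeps only the returned digest; the reference count shows when the last copy has been released and the content itself can be removed.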



Conflicting Requirements:

Regulations require review of documents


Effort spent reviewing the documents.

Examples of Compliance Categories

Category | Content | Action
Adult | Offensive language |
Confidential | SSN numbers, bank account | Pre-review to prevent confidential information from going
Legal Issues | Words like attorney, charge*; phrases like breach* and agreement within 6 words | Post/Pre Review
Compliance Hype | Stocks and sell between 3 words of each other | Pre-review in financial industries

Real-time Flagging of Mail

Lexical Based – Key words, word associations, wild-cards

Policy Based – e.g., the mail is from a newsletter.

Custom Code – Detect vacation responses, read receipts, DSNs
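The three kinds of rule can be combined in a small flagging pipeline. The sketch below is an illustration under stated assumptions: a mail is a plain dict with hypothetical keys `from`, `subject`, `body` and `headers`; newsletters and automatic replies are assumed to be exempt from flagging; and the header checks rely on the standard `Auto-Submitted` header (RFC 3834) and the `multipart/report` content type used by DSNs and read receipts.

```python
import fnmatch
import re


def words(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def lexical_rule(patterns):
    """Flag when any word matches any wildcard pattern, e.g. 'charge*'."""
    def check(mail):
        return any(
            fnmatch.fnmatchcase(word, pattern)
            for word in words(mail["body"])
            for pattern in patterns
        )
    return check


def is_newsletter(mail):
    # Policy-based rule: decide from metadata, not from content.
    return "newsletter" in mail["from"].lower()


def is_auto_generated(mail):
    # Custom code: vacation responses, read receipts, DSNs.
    headers = {k.lower(): v.lower() for k, v in mail.get("headers", {}).items()}
    if headers.get("auto-submitted", "no") != "no":
        return True
    if headers.get("content-type", "").startswith("multipart/report"):
        return True
    return mail["subject"].lower().startswith(("out of office", "automatic reply"))


RULES = {"Legal Issues": lexical_rule(["attorney", "charge*"])}


def flag(mail):
    """Return the names of all categories the mail is flagged for."""
    if is_newsletter(mail) or is_auto_generated(mail):
        return []
    return [name for name, rule in RULES.items() if rule(mail)]


mail = {
    "from": "bob@example.com",
    "subject": "Contract",
    "body": "Our attorney charged a fee.",
    "headers": {},
}
print(flag(mail))
```

Keeping each rule as a small named function keeps the behaviour transparent: a reviewer can see exactly which rule fired and why.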



Real-time flagging is a categorization problem

Current systems suffer from a lot of false positives.

Transparent and deterministic rules are preferred over black boxes.

Disclaimers (internal and external) tend to get flagged because they contain the very terms that we try to flag.

Use reviewer feedback to adapt the rules.
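One deterministic way to reduce disclaimer-driven false positives is to remove known disclaimer text before the flagging rules run. The sketch below assumes the disclaimers are known in advance and appear verbatim apart from whitespace and case; the sample disclaimer and function names are hypothetical, and detecting unknown disclaimers remains the open segmentation problem.

```python
import re

# Hypothetical example; in practice this list would be maintained per company.
KNOWN_DISCLAIMERS = [
    "This message may contain confidential or legally privileged information.",
]


def _normalize(text):
    """Collapse whitespace and lowercase so line wrapping does not matter."""
    return re.sub(r"\s+", " ", text).strip().lower()


def strip_disclaimers(body, disclaimers=KNOWN_DISCLAIMERS):
    """Return the normalized body with every known disclaimer removed."""
    cleaned = _normalize(body)
    for disclaimer in disclaimers:
        cleaned = cleaned.replace(_normalize(disclaimer), " ")
    return _normalize(cleaned)


body = (
    "Lunch at noon?\n\n"
    "This message may contain confidential or legally\n"
    "privileged information."
)
print(strip_disclaimers(body))
```

The flagging rules then see only the remaining text, so a term such as "confidential" in the footer no longer triggers the Confidential category. Reviewer feedback could feed the disclaimer list: text that reviewers repeatedly dismiss is a candidate for it.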



Conflicting Requirements:

Produce electronic documents to satisfy court orders


Risk of providing insufficient, irrelevant, or privileged information

Search Type | Court-dictated Required Search
Full text | "acidosis"
Boolean | "cardiac" OR "respiratory"
Phrase | "in-custody death"
Proximity | "pre-existing" within 10 words of "condition"
Wildcard | "epilep*"
Wildcard proximity | "mental*" within 5 words of "condition"
Dual wildcard proximity | "continu*" within 10 words of "discharg*"
Wildcard sentence-level | "caus*" within same sentence as "death"

Source: Williams v. Taser Int’l, Inc., 2007 WL 1630875 (N.D. Ga. June 4, 2007)
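The proximity and sentence-level searches in the table can be sketched over a simple token list. This is an illustration, not how any particular archive implements search: "within N words" is interpreted here as a difference of at most N token positions, and sentences are split naively on `.`, `!` and `?`, both of which are assumptions.

```python
import fnmatch
import re


def tokens(text):
    # Keep hyphens and apostrophes inside words, e.g. "pre-existing".
    return re.findall(r"[a-z0-9][a-z0-9'-]*", text.lower())


def positions(toks, pattern):
    """Indexes of tokens matching a wildcard pattern such as 'mental*'."""
    return [i for i, t in enumerate(toks) if fnmatch.fnmatchcase(t, pattern.lower())]


def within(text, a, b, max_words):
    """True if a match of `a` lies within `max_words` tokens of a match of `b`."""
    toks = tokens(text)
    pos_a, pos_b = positions(toks, a), positions(toks, b)
    return any(0 < abs(i - j) <= max_words for i in pos_a for j in pos_b)


def same_sentence(text, a, b):
    """True if some sentence contains matches of both patterns."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        toks = tokens(sentence)
        if positions(toks, a) and positions(toks, b):
            return True
    return False


print(within("He had a pre-existing heart condition.", "pre-existing", "condition", 10))
print(same_sentence("The shock may have caused his death.", "caus*", "death"))
```

A production system would answer these queries from a positional index rather than rescanning text, but the matching semantics are the same.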

Discovery Request

Certain number of custodians

Date Range

Pertaining to certain subject matter; usually described by a set of Search terms.
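The three parts of a discovery request map directly onto a filter over the archive. A minimal sketch, assuming plain case-insensitive keyword matching for the search terms; the `Mail` and `DiscoveryRequest` types are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mail:
    custodian: str
    sent: date
    text: str


@dataclass(frozen=True)
class DiscoveryRequest:
    custodians: frozenset
    start: date
    end: date
    terms: tuple  # search terms describing the subject matter

    def matches(self, mail):
        return (
            mail.custodian in self.custodians
            and self.start <= mail.sent <= self.end
            and any(term.lower() in mail.text.lower() for term in self.terms)
        )


request = DiscoveryRequest(
    custodians=frozenset({"alice", "bob"}),
    start=date(2007, 1, 1),
    end=date(2007, 12, 31),
    terms=("acidosis", "cardiac"),
)
archive = [
    Mail("alice", date(2007, 5, 2), "Report mentions cardiac arrest."),
    Mail("carol", date(2007, 5, 2), "Report mentions cardiac arrest."),
    Mail("bob", date(2008, 1, 5), "Cardiac follow-up."),
]
hits = [mail for mail in archive if request.matches(mail)]
print(len(hits))
```

Custodian and date range are cheap metadata filters, so applying them before the term search keeps the expensive part of the query small.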



Landmark case: Zubulake v. UBS Warburg (2003)

Primarily driven by the amendments to the Federal Rules of Civil Procedure (FRCP) that took effect in December 2006.

Litigants are entitled to obtain electronic information from the adverse party.

Voluntary Initial Disclosures need to be made pertaining to each litigant

Today, almost all cases have some sort of electronic documents as evidence.



Parties face sanctions if they do not provide all the relevant documents (numerous precedents, e.g., Metrokane v. Built NY, 2008). Validation occurs when the receiving party can prove the existence of other documents through a hard-copy printout or other means.

Lawyers from both parties routinely negotiate keywords to define search concepts.

Manual review of documents for relevance and privilege. Numerous products cluster similar documents (near-deduplication) so that reviewers see them together, which improves efficiency.

Chain of Custody – To prove that the document has not been tampered with or altered.
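The near-deduplication mentioned above can be sketched with word shingles and Jaccard similarity, a common textbook technique. The slides do not say which method review products actually use, so this is only an illustration; the shingle size, the 0.5 threshold and the greedy clustering are arbitrary assumptions.

```python
import re


def shingles(text, k=3):
    """Set of overlapping k-word sequences from the text."""
    toks = re.findall(r"\w+", text.lower())
    if len(toks) < k:
        return {tuple(toks)}
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def jaccard(a, b):
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(docs, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative is similar."""
    clusters = []  # list of (representative shingles, member indexes)
    for index, doc in enumerate(docs):
        doc_shingles = shingles(doc)
        for representative, members in clusters:
            if jaccard(doc_shingles, representative) >= threshold:
                members.append(index)
                break
        else:
            clusters.append((doc_shingles, [index]))
    return [members for _, members in clusters]


docs = [
    "Please review the attached contract before Friday",
    "Please review the attached contract before Monday",
    "Lunch menu for the cafeteria this week",
]
print(cluster(docs))
```

Here the first two documents share four of six distinct shingles and land in one cluster, so a reviewer could handle them together. Comparing every document against every cluster does not scale to archive volumes; at that size an approximate method such as MinHash would be the usual substitute.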


Palin’s e-mail at $15m per request

NBC's price quote for e-mails sent to Todd Palin: $15 million.

AP's price quote for e-mails between state employees and the campaign headquarters of Sen. John McCain: $15 million.

AP's price quote for e-mails between state employees and the National Park Service: $15 million.

Cost to retrieve e-mail for 1 mailbox

6 hours to assemble e-mail for 1 employee mailbox
2 hours for “security” checks
5 hours to filter by requested keyword or topic
13 total hours per mailbox
$73.87 hourly rate
$960.31 cost to retrieve e-mail for 1 mailbox

Cost to retrieve e-mail for all employees

$960.31 cost to retrieve e-mail for 1 mailbox
16,000 full-time employees
$15.3 million cost to retrieve e-mail for all employees
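The quote can be reproduced from the line items. The sketch below works in whole cents to avoid floating-point rounding; all inputs are the figures listed above.

```python
# Inputs from the quote above; money is kept in integer cents.
HOURS_PER_MAILBOX = {"assemble": 6, "security checks": 2, "filter": 5}
HOURLY_RATE_CENTS = 7_387  # $73.87
EMPLOYEES = 16_000

total_hours = sum(HOURS_PER_MAILBOX.values())
cost_per_mailbox_cents = total_hours * HOURLY_RATE_CENTS
cost_all_cents = cost_per_mailbox_cents * EMPLOYEES

print(f"{total_hours} hours per mailbox")
print(f"${cost_per_mailbox_cents / 100:,.2f} per mailbox")
print(f"${cost_all_cents / 100:,.2f} for all employees")
```

The per-mailbox cost comes to $960.31 and the total to $15,364,960, which the quote reports as $15.3 million.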


Conclusion

Most challenges in archiving can be reduced to a classification problem.

Segmentation Problems: Detect internal and external disclaimers

Detect change in Email behavior through email profile analysis

Understanding mails: Need to develop Analysis techniques to understand the contents

Visualization and Grouping Similar mails – Control the order in which mails and documents are viewed.

Consistent way of defining Subject Matters – Beyond just a set of keywords.

Extract more meta data about attachments such as images, audio and video files.

And all of the above are required in multiple languages – English, Japanese, Spanish, Chinese, and others.