Text Mining and Continuous Assurance Kevin Moffitt - 12th CONTECSI 34th WCARS

Text Mining and Continuous Assurance

Kevin Moffitt


Continuous Assurance

• Allows for the automated and frequent review of business data

• Current focus is on structured data

– General ledgers

– Financial statements

– XBRL

• However, we cannot ignore the information found in unstructured data

– Textual data, for example the narrative portion of financial disclosures

• Up to 85% of the data in financial disclosures is in the form of text


Text Mining

• Many methods for extracting data from text

• One popular method is to use dictionaries/word lists

• E.g., a dictionary to identify positive language in business documents…

SATISFIES

PREEMINENT

REWARDED

BENEFITTING

SOLVING

COLLABORATIONS

BOOST

TREMENDOUS

GREATEST

PERFECTLY

DELIGHTING

COMPLIMENTING

EXCITING

REBOUNDED

CONCLUSIVE

ASSURE

INNOVATED

ENJOYING

CREATIVE

GREATLY
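A minimal sketch of how a word list like the one above might be applied, assuming simple lowercase tokenization and a rate per 1,000 tokens. The function name `positive_rate`, the tokenizer, and the sample sentences are illustrative assumptions, and the set holds only a subset of the listed words.

```python
import re

# Illustrative subset of the positive word list shown above.
POSITIVE_WORDS = {
    "satisfies", "rewarded", "boost", "tremendous", "greatest",
    "exciting", "rebounded", "creative", "greatly",
}

def positive_rate(text, dictionary=POSITIVE_WORDS):
    """Return (hits, total_tokens, hits per 1,000 tokens) for a text."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for token in tokens if token in dictionary)
    rate = 1000.0 * hits / len(tokens) if tokens else 0.0
    return hits, len(tokens), rate

sample = "Revenue rebounded and the new product line gave margins a tremendous boost."
hits, total, rate = positive_rate(sample)
print(hits, total, rate)

# Single-word matching ignores context: a negated word still counts as a hit.
negated_hits, _, _ = positive_rate("Sales were not exciting")
print(negated_hits)
```

The second call shows why single-word matching can mislead: "not exciting" is still scored as positive language.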


Drawbacks of Dictionary Method

• Single words

– Context-free

– Naïve


Lexical Bundles

• Frequent multi-word sequences in a given corpus (e.g. financial reports, history journals, biology journals)

• More context in phrases than in individual words

• Criteria for identifying lexical bundles

– Sequences of four or more words

– Occur in at least 15% of unique documents

– Occur at a rate of at least 20 times per million words

Example Lexical Bundles from Annual Reports

the fair value of

be adversely affected by

as a percentage of

assets and liabilities and
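The three criteria can be sketched as a filter over n-gram counts. This is an illustration under stated assumptions, not the study's implementation: it handles one fixed length (four words) rather than "four or more", uses a simple regex tokenizer, and the name `find_bundles` and the toy corpus are hypothetical. The toy call raises the document-share threshold because three short documents would otherwise pass every 4-gram.

```python
import re
from collections import Counter

def find_bundles(docs, n=4, min_doc_share=0.15, min_per_million=20.0):
    """Return {n-gram: (count, documents containing it, rate per million words)}."""
    total_words = 0
    freq = Counter()      # total occurrences of each n-gram
    doc_freq = Counter()  # number of documents containing each n-gram
    for doc in docs:
        tokens = re.findall(r"[a-z']+", doc.lower())
        total_words += len(tokens)
        grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        freq.update(grams)
        doc_freq.update(set(grams))
    bundles = {}
    for gram, count in freq.items():
        per_million = 1_000_000 * count / total_words
        share = doc_freq[gram] / len(docs)
        if share >= min_doc_share and per_million >= min_per_million:
            bundles[gram] = (count, doc_freq[gram], per_million)
    return bundles

# Hypothetical toy corpus; real thresholds only make sense on a large corpus.
docs = [
    "the fair value of the asset was estimated",
    "changes in the fair value of derivatives",
    "revenue increased as a percentage of sales",
]
bundles = find_bundles(docs, min_doc_share=0.5)
print(list(bundles))
```

Only "the fair value of" appears in at least half of the toy documents, so it is the only sequence returned.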


Lexical Bundles

• Research objective: use lexical bundles to discriminate between fraudulent and non-fraudulent financial reports


Research Questions

• RQ1: What are the most frequently used lexical bundles in the Management Discussion and Analysis (MD&A) sections of fraudulent and non-fraudulent annual reports?

• RQ2: Which lexical bundles are used at a considerably different rate in fraudulent and non-fraudulent MD&As?

• RQ3: Can lexical bundles be used to classify fraudulent and non-fraudulent MD&As at a rate greater than chance?


Sample Selection

• Identified 101 fraudulent annual reports (10-Ks) from a set of SEC investigations

• Analyzed the Management Discussion and Analysis (MD&A) section of the 10-K

– Gives investors a view of the company from management’s perspective

– Contains some of the least structured language in the 10-K

– Most-read part of the 10-K


Sample Selection

Sample selection criteria for fraudulent 10-Ks

Companies identified as fraudulent by searching through AAERs               141
Disqualified because fraud did not involve 10-Ks                            (20)
Disqualified because 10-K was not available from the EDGAR DB               (10)
Disqualified because 10-K did not contain management discussion section     (10)
Final count of qualifying fraudulent 10-Ks used in the sample               101


Sample Selection—Types of Fraud

Type of Fraud                                                   Companies
Overstatement of revenues                                       44
Combination of overstating revenue and understating expenses    25
Disclosure issue                                                10
Overstatement of inventory                                      6
Other income increasing effects                                 6
Understatement of provisions for loan-loss reserves             5
Other                                                           5


Sample Selection – Non-Fraudulent sample

• 101 Matching Non-Fraudulent 10-Ks were identified


Results


Lexical Bundle Identification

• 560 Lexical Bundles were identified


Creative Accounting

Lexical Bundle                          Fraud Bundles        NonFraud Bundles     % difference
                                        Per Million Words    Per Million Words
in process research and development     199                  76                   160%
goodwill and other intangible assets    121                  82                   47%


Big Bath Charges

• Wholesale aggressive restructuring to improve cost and expense structure for the future

– Disposition of long-lived assets

Lexical Bundle                          Fraud Bundles        NonFraud Bundles     % difference
                                        Per Million Words    Per Million Words
disposition of long lived assets and    49                   21                   139%


Fair Value Accounting

• Subjective method for assigning value to an asset

– Change value of assets

– Understate debt obligations

– Misrepresent foreign currency exchange adjustments

Lexical Bundle                          Fraud Bundles        NonFraud Bundles     % difference
                                        Per Million Words    Per Million Words
the fair value of                       257                  171                  50%
in foreign currency exchange            41                   21                   97%


Lexical Bundles used more Frequently in Non-Fraudulent MD&As

• Conservative language for accounting practices

Lexical Bundle                          Fraud Bundles        NonFraud Bundles     % difference
                                        Per Million Words    Per Million Words
to continue as a going concern          15                   91                   513%
disclosures about market risk           85                   115                  36%
material impact on the                  38                   52                   35%
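The arithmetic behind these rate tables can be sketched as follows. The slides do not define "% difference"; the sketch assumes it is the amount by which the higher rate exceeds the lower rate, relative to the lower rate. The function names and the 250,000-word example are hypothetical. Recomputing from the rounded rates gives figures close to the percentages shown but not always identical, presumably because the originals were computed from unrounded rates.

```python
def per_million(count, total_words):
    """Normalize a raw bundle count to occurrences per million words."""
    return 1_000_000 * count / total_words

def pct_difference(rate_a, rate_b):
    """Percent by which the higher rate exceeds the lower rate (assumed definition)."""
    low, high = sorted((rate_a, rate_b))
    return 100.0 * (high - low) / low

# Hypothetical counts: 5 occurrences in 250,000 words is 20 per million.
print(per_million(5, 250_000))

# Rounded rates for "the fair value of" from the Fair Value Accounting table.
fair_value_diff = pct_difference(257, 171)
print(round(fair_value_diff))
```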


Principal Component Analysis

• Variable reduction procedure

– Combines correlated variables into principal components

• Principal components

– First component accounts for maximum amount of total variance in the observed variables

– Components are uncorrelated

• Components are made up of correlated variables

– Overlapping lexical bundles are combined

Correlated bundles transformed into one principal component

4-word bundles          Overlapping 5-word sequences    6-word component
there can be no         there can be no assurance       there can be no assurance that
can be no assurance     can be no assurance that
be no assurance that
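The way overlapping bundles read as one longer phrase can be sketched with a small string routine. This covers only the display step of chaining n-grams that share all but one word; it is not PCA, which groups bundles by how their frequencies correlate across documents. The function name `merge_overlapping` is hypothetical, and the sketch assumes every bundle has at least two words.

```python
def merge_overlapping(bundles):
    """Greedily chain n-grams that overlap by n-1 words into longer phrases."""
    remaining = [bundle.split() for bundle in bundles]
    merged = []
    while remaining:
        phrase = remaining.pop(0)
        extended = True
        while extended:
            extended = False
            for other in remaining:
                k = len(other) - 1  # required overlap length
                if phrase[-k:] == other[:k]:
                    phrase = phrase + other[k:]   # other continues the phrase
                elif other[-k:] == phrase[:k]:
                    phrase = other[:1] + phrase   # other precedes the phrase
                else:
                    continue
                remaining.remove(other)
                extended = True
                break  # restart the scan with the longer phrase
        merged.append(" ".join(phrase))
    return merged

four_word = ["there can be no", "can be no assurance", "be no assurance that"]
print(merge_overlapping(four_word))
```

A bundle that shares no three-word overlap with the others, such as "of one or more", is returned unchanged as its own phrase.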


Principal Component Analysis

• 560 Lexical Bundles were reduced to 88 principal components
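To illustrate how correlated bundle frequencies collapse into a component, the sketch below approximates only the first principal component using power iteration on the covariance matrix. It is a toy under stated assumptions: the per-document counts are hypothetical, the function name `first_principal_component` is invented for illustration, and the slides do not say which software or settings produced the 88 components.

```python
import math

def first_principal_component(rows, iterations=200):
    """Approximate the first principal component of rows (documents x variables).

    Returns (loadings, share of total variance explained).
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    vec = [1.0] * p  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the variance captured along the unit vector.
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    return vec, eigenvalue / total_variance

# Hypothetical counts per document: columns 0 and 1 are overlapping bundles
# that rise and fall together; column 2 is a bundle that varies differently.
counts = [
    [4, 4, 1],
    [0, 0, 2],
    [6, 5, 1],
    [1, 1, 3],
    [3, 3, 0],
    [2, 2, 2],
]
loadings, share = first_principal_component(counts)
print([round(x, 2) for x in loadings], round(share, 2))
```

In this toy data the two overlapping bundles carry the largest loadings with the same sign, which is the pattern that lets a single component stand in for both.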


Component 1

principles generally accepted in

accounting principles generally accepted

generally accepted in the

accepted in the united

with accounting principles generally

affect the reported amounts

reported amounts of assets

that affect the reported

to make estimates and

factors that could cause

actual results to differ

results to differ materially

of assets and liabilities

actual results may differ

to differ materially from

differ materially from those

forward looking statements this

in the united states

allowance for doubtful accounts

are expected to be

company believes that the


Component 1

with accounting principles generally accepted in the united states

that affect the reported amounts of assets and liabilities

are expected to be

company believes that the

to make estimates and

factors that could cause

forward looking statements this

allowance for doubtful accounts

actual results to

actual results may differ materially from those

“GAAP and expected results”


Component 2

have a material adverse

material adverse effect on

a material adverse effect

adverse effect on the

business financial condition and

could have a material

effect on the company's

can be no assurance

be no assurance that

there can be no

assurance that the company

of one or more

the company will be

no assurance that the

of the company's products

that the company will

and will continue to


Component 2

could have a material adverse effect on the company's

there can be no assurance that the company will be

business financial condition and

of one or more

of the company's products

and will continue to

“Could be bad”


Classification Results

• Discriminant Analysis

– 71% of cross-validated cases were correctly classified

Discriminating factor (PC)                    Beta
Impact and exposure                           .464
Material difference                           -.421
Common stock and adverse effects              .412
Going concerns                                .363
New product introductions                     .339
Price and offsets                             .335
COGS and change in accounting principle       .330
Fair market value                             .313
Exercise of stock options                     .298
Number of Factors                             -.287


Confusion Matrix

                            Predicted Class
Actual Class                Fraudulent    Non-Fraudulent
Fraudulent                  70            31
Non-Fraudulent              28            73


Confusion Matrix Results

Precision = TP / (TP + FP) = .714

True Positive Rate (TPR) = TP / (TP + FN) = .693

False Positive Rate (FPR) = FP / (FP + TN) = .277

Accuracy = (TP + TN) / (TP + TN + FP + FN) = .708

                            Predicted Class
Actual Class                Fraudulent    Non-Fraudulent
Fraudulent                  70 (TP)       31 (FN)
Non-Fraudulent              28 (FP)       73 (TN)
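The four reported metrics follow directly from the confusion matrix cells. A short sketch, using the cell counts from the slide and the standard definitions; the function name `confusion_metrics` is illustrative:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Compute standard binary classification metrics from confusion matrix cells."""
    return {
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Cell counts from the confusion matrix above (fraudulent is the positive class).
metrics = confusion_metrics(tp=70, fn=31, fp=28, tn=73)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, the values match the slide: .714, .693, .277, and .708.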


Conclusion

• Lexical bundles have more contextual meaning than unigrams

– Results are easier to interpret

• Lexical bundles may be used to classify documents

• Lexical bundle analysis can be used in any type of textual dataset

• This process and other text mining processes can be integrated into

continuous assurance solutions

– Rapid identification of suspicious documents

Date post:	18-Aug-2015