Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection Andrew Fast, Lisa...


Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection

Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, and David Jensen

Knowledge Discovery Laboratory

Department of Computer Science

University of Massachusetts Amherst

Henry G. Goldberg and John Komoroske

National Association of Securities Dealers

Financial Industry Regulatory Authority

Reps are required to file disclosures for incidents ranging in severity from customer complaints to criminal charges.

Individual characteristics are not distinctive,…

…it is the collection of related entities that sets this rep apart.

Age

Years in Industry

Overall Goal

Identify new methods and improve existing methods for detecting securities fraud that consider the relationships among reps, branches, and firms.

Goals of This Talk

1) Describe the pre-processing steps needed to prepare the data for knowledge discovery.

2) Demonstrate that pre-processing is both necessary and beneficial for knowledge discovery in relational domains.

The Knowledge Discovery Process

Data Pre-Processing

Challenge 1: Inputs

Challenge 2: Class Labels

Challenge 1: Preparing Inputs

Small-scale social structure between reps is important.

Branch Associations

Tribes

NOT IN RAW DATA

(Cortes, Pregibon, and Volinsky 2001) (Neville et al. 2005)

Identifying Branch Associations

110 Wall Street , NEW YORK, NY, 10005

311 S. WACKER DRIVE , CHICAGO, IL,

1400 World Trade Center, St. Paul, MN, 55101

30 East 7th Street Suite 1400, St. Paul, MN, 55101

110 Wall Street , NY, NY, 10005

110 Wall Street , MANHATTAN, NY, 10005

110 Wall Street , 22ND FLOOR, NEW YORK, NY, 10005

30 East 7th Street Suite 1400, St. Paul, MN, 55101

311 S. WACKER DRIVE , CHICARGO, OH,

1400 World Trade Center, St. Paul, MN, 55101

Self-Reported Addresses are messy!

Use String Matching Algorithm

Unmatched 30% (~1.43 million)

Inferred Branches

Matched 70% (~3.35 million)
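The slides do not say which string matching algorithm was used. As a minimal sketch only, the idea of grouping messy self-reported addresses into inferred branches could look like the following, assuming a simple normalization step and the standard-library `difflib.SequenceMatcher` similarity. The function names, the greedy grouping strategy, and the 0.85 threshold are all illustrative assumptions, not the authors' method.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Uppercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def infer_branches(addresses, threshold=0.85):
    """Greedy grouping (an assumption): attach each address to the first
    group whose first member is similar enough, else start a new group."""
    groups = []
    for addr in addresses:
        for group in groups:
            if similarity(addr, group[0]) >= threshold:
                group.append(addr)
                break
        else:
            groups.append([addr])
    return groups


branches = infer_branches([
    "110 Wall Street , NEW YORK, NY, 10005",
    "110 Wall Street , NY, NY, 10005",
    "30 East 7th Street Suite 1400, St. Paul, MN, 55101",
])
```

A purely character-based similarity like this one handles abbreviations and punctuation noise, but variants that substitute a different name for the same place (MANHATTAN for NEW YORK) or a different address for the same building would likely need extra rules or a lower threshold.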

Identifying Tribes

Why did they move?

1. Looking for a better location to commit fraud.

2. Friends inviting friends to better jobs.

3. Geographic limitations.

4. Branches merged or acquired.

Anomalous Movement

Background Movement

Identifying Tribes

Tribe - |trīb| noun

A group of reps that works together at many statistically unlikely branches.

• Reps in tribes found by our algorithm:
– move between branches in more zip-codes than the rest of the population.
– have more disclosures than the rest of the population.
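The tribe-finding algorithm itself is covered in the separate talk, so the following is only a rough sketch of the first step such a method might take: counting how many distinct branches each pair of reps has in common. The record format `(rep_id, branch_id)`, the function names, and the `min_shared` cutoff are hypothetical. The sketch ignores whether the reps were at a branch at the same time, and it does not test whether the shared branches are statistically unlikely.

```python
from collections import defaultdict
from itertools import combinations


def shared_branch_counts(employment):
    """employment: iterable of (rep_id, branch_id) records.
    Returns {(rep_a, rep_b): number of distinct branches both worked at}."""
    reps_at_branch = defaultdict(set)
    for rep, branch in employment:
        reps_at_branch[branch].add(rep)

    counts = defaultdict(int)
    for reps in reps_at_branch.values():
        # Sort so each pair is always keyed in the same order.
        for pair in combinations(sorted(reps), 2):
            counts[pair] += 1
    return dict(counts)


def candidate_pairs(employment, min_shared=3):
    """Keep only pairs that share at least min_shared branches (hypothetical cutoff)."""
    counts = shared_branch_counts(employment)
    return {pair: n for pair, n in counts.items() if n >= min_shared}


records = [
    ("a", "b1"), ("c", "b1"), ("d", "b1"),
    ("a", "b2"), ("c", "b2"),
    ("a", "b3"), ("c", "b3"),
]
pairs = candidate_pairs(records)
```

A raw count like this would over-weight large branches, where many reps overlap by chance; that is the background movement the definition's "statistically unlikely" qualifier is meant to discount.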

For more details on how to find tribes please see talk by Lisa Friedland:

Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns

R11: Pattern Discovery (II), Tuesday 10:30 am ~ 11:50 am, Regency 2

Challenge 2: Class Label

Whether a rep is planning to commit fraud is unknown …

… instead we use a surrogate class label called a risk score based on disclosure history.

With NASD guidance, we assigned a weight to each disclosure type based on its severity.

For reps, sum the weights over disclosures to determine the risk score for a given year.

For branches, compute the average risk score over reps.
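As a sketch of the arithmetic described above, the rep score is a weighted sum of disclosures per year and the branch score is the mean over its reps. The weight values below are made up for illustration, since the NASD-guided weights are not given in the slides. Counting reps with no disclosures as zero in the branch average is also an assumption.

```python
from collections import defaultdict

# Hypothetical severity weights; the real values are not given in the talk.
WEIGHTS = {
    "customer_complaint": 1.0,
    "bankruptcy": 2.0,
    "regulatory_action": 3.0,
}


def rep_risk_scores(disclosures, weights=WEIGHTS):
    """disclosures: iterable of (rep_id, year, disclosure_type).
    Returns {(rep_id, year): summed severity weight}."""
    scores = defaultdict(float)
    for rep, year, dtype in disclosures:
        scores[(rep, year)] += weights[dtype]
    return dict(scores)


def branch_risk_score(rep_ids, year, rep_scores):
    """Average rep score at a branch for one year.
    Reps without disclosures contribute 0 (an assumption)."""
    if not rep_ids:
        return 0.0
    total = sum(rep_scores.get((rep, year), 0.0) for rep in rep_ids)
    return total / len(rep_ids)


scores = rep_risk_scores([
    ("r1", 2005, "customer_complaint"),
    ("r1", 2005, "regulatory_action"),
    ("r2", 2005, "bankruptcy"),
])
branch_score = branch_risk_score(["r1", "r2", "r3"], 2005, scores)
```

Because the branch score is an average, it depends on how many reps the branch has, which is one reason branch scores are not directly comparable across branch sizes.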

Risk Scores to Class Labels

Simple Approach

1) Sort by risk score

2) Choose Top N

• But risk scores are not distributed evenly.

• Scores vary by:
– Geography
– Year
– Firm Demographics

• We want high-risk entities, but relative to similar branches or reps.

Creating a Normalized Class Label


Small Branch, Small Firm

Small Branch, Large Firm

Medium Branch, Small Firm

Medium Branch, Large Firm

Large Branch, Large Firm

Stratify by year, zip-code, and firm demographics.

For each bin, choose the top 5% of all scores, as long as they are also above the median of non-zero scores.
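The labeling rule above can be sketched as follows. The slide leaves several details open, so this sketch makes assumptions: the median of non-zero scores is computed within each bin, "above" means strictly greater, the top 5% is taken by rank with the count rounded up, and ties at the cutoff are broken by input order. The function name and tuple layout are illustrative.

```python
import math
from collections import defaultdict
from statistics import median


def normalized_labels(entities, top_fraction=0.05):
    """entities: list of (entity_id, bin_key, risk_score).
    Returns the set of entity_ids labeled high-risk."""
    by_bin = defaultdict(list)
    for entity_id, bin_key, score in entities:
        by_bin[bin_key].append((score, entity_id))

    positives = set()
    for members in by_bin.values():
        nonzero = [score for score, _ in members if score > 0]
        if not nonzero:
            continue  # no disclosures in this bin, so nothing to label
        floor = median(nonzero)  # per-bin median (an assumption)
        k = math.ceil(top_fraction * len(members))
        ranked = sorted(members, key=lambda m: m[0], reverse=True)
        positives.update(eid for score, eid in ranked[:k] if score > floor)
    return positives


data = [
    ("a1", "A", 10), ("a2", "A", 1), ("a3", "A", 0), ("a4", "A", 0),
    ("b1", "B", 2), ("b2", "B", 2), ("b3", "B", 0), ("b4", "B", 0),
]
high_risk = normalized_labels(data, top_fraction=0.25)
```

In this toy input, bin A yields one positive, while bin B yields none because its top score does not exceed the bin's non-zero median. The median condition keeps a bin with uniformly low scores from contributing positives just because something has to rank first.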

kdl.cs.umass.edu/proximity

Learning Models of Risk

• Two approaches for learning models:
– Model highest-risk entities from each bin with a single model.
– Model each bin independently (stratified).

• Two possible class labels:
– Normalized Class Label
– Global Class Label (no normalization).

Evaluation

Random = 0.5

Non-Normalized = 0.72

Normalized = 0.77

Stratified Norm = 0.62

Stratified Non-Norm = 0.62

Evaluation

Random = 0.5

Non-Normalized = 0.79

Normalized = 0.80

Stratified Norm = 0.69

Stratified Non-Norm = 0.70

Relational Probability Tree

(Neville et al. 2003)

Challenge 1: Scorecard

Branch Associations

REQUIRED

• Enabled all other analyses.
• Branch features appeared many times in learned trees.

Tribes

• Useful as a local pattern.
• Not wide-spread enough to be useful as a global pattern despite desirable characteristics.

Normalized Class Label

• Allows improved branch models.
– Risk score on branches is an aggregate over reps.
– Normalization accounts for discrepancies in sizes.
• No improvement in rep models.
– No aggregation is required for reps.
– Normalization has no effect.

Stratified Models

• Stratified models do not improve performance.
• Small number of positive instances in each bin leads to lack of generalization.

Summary

• Finding small-scale social structure is critical for relational domains.
– Branch matching is essential; all analyses required branches.
– Tribes are an interesting local pattern, but not widely spread enough for global modeling.
– Continuing to explore other techniques to capitalize on the dynamic structure of complex domains.

• Under the right circumstances, normalization can be useful when creating a class label for relational domains.
– Helpful when aggregating over lower-level entities.

Andrew Fast

afast@cs.umass.edu

kdl.cs.umass.edu/~afast

kdl.cs.umass.edu/proximity

Poster #35 tonight.

Questions?

Consolidation and Link Formation

Consolidation - identify entities of interest

Link Formation - construct structured relations between consolidated entities.

(Goldberg and Senator 1995)

– Raw data don’t always explicitly contain entities and structure of interest.

– These are critical for knowledge discovery in relational data.

– In the securities industry, social structure among reps is useful for detecting misconduct.

(Neville et al. 2005)


How did we do?
– Reps in tribes move between branches in more zip-codes than the general population of reps.
– Reps in tribes are almost 8 times more likely to be at high risk of fraud.

Identifying Tribes

• Rather than limiting ourselves to static relations, we can consider groups of reps who coordinate their movement from branch to branch.

• Not all group movement is coordinated. Consider:
– Two firms in a small town. (Geography)
– Branch is sold to another firm. (Acquisitions and Mergers)
– Branch is closed. (Branch Closings)

• Not all coordinated movement is nefarious.
– Friends inviting friends to better jobs.

Tribe - |trīb| noun

A group of reps that coordinate their movement between statistically unlikely branches.

• Normalized Class Label
– Allows better models for branches, not for reps.

• Risk score on branches is an aggregate over reps.
– Normalization accounts for discrepancies in sizes.

• No aggregation for reps is needed.
– A high-scoring rep is high-risk no matter which bin they belong to.

• Stratified Models
– Stratified models perform worse than combined models.
– The number of positive instances per bin is small, so the models do not generalize well.

Challenge 2: Class Label

• “Will commit fraud in future” flag is not given.
– Create a surrogate class label or risk score from the collection of disclosures on reps.

• Risk is not uniformly distributed across the data.
– Market conditions vary over time. (Temporal)
– Laws vary from state to state. (Geographic)
– A small firm may have different market pressures than a large firm. (Demographic)

Weighted Risk Score

• Assign each disclosure type a score based on its severity.
– Customer complaints
– Bankruptcy
– Regulatory Action

• Sum over disclosure types for each rep for each year.

• Average over reps for branches.

• Who is high-risk?
– Order entire population by risk score, choose top n.
– But not uniformly distributed.

Check out your broker: http://brokercheck.finra.org/

http://www.nasdbrokercheck.com

Normalizing Risk

• Stratify data into bins with uniform risk score.
– Consider each year independently.
– Segment USA by zip-code.
– Create 5 categories of branches based on branch and firm size:
  • Small branch, Small firm.
  • Small branch, Large firm.
  • Medium branch, Small firm.
  • Medium branch, Large firm.
  • Large branch, Large firm.

• For each year, 5 branch types × 10 zip-code regions = 50 bins.

• Who is high-risk?
– Order each of the bins by risk score, choose top.
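A minimal sketch of how a branch might be assigned to one of the 50 bins per year follows. Several details are assumptions, not taken from the slides: the 10 zip-code regions are taken to be the first digit of the zip code, the size cutoffs are invented, and any large branch is placed in "Large branch, Large firm" because the slide lists no "Large branch, Small firm" category.

```python
def branch_type(branch_size, firm_size,
                small_branch_max=10, medium_branch_max=100,
                small_firm_max=1000):
    """Map branch and firm head counts to one of the five categories.
    All three cutoffs are hypothetical."""
    firm = "Small firm" if firm_size <= small_firm_max else "Large firm"
    if branch_size <= small_branch_max:
        return f"Small branch, {firm}"
    if branch_size <= medium_branch_max:
        return f"Medium branch, {firm}"
    # Only one large-branch category appears on the slide.
    return "Large branch, Large firm"


def bin_key(year, zip_code, branch_size, firm_size):
    """Return (year, zip region, branch type).
    zip_code must be a string so that leading zeros survive."""
    region = int(zip_code.strip()[0])  # first digit, 0-9 (an assumption)
    return (year, region, branch_type(branch_size, firm_size))


wall_street = bin_key(2005, "10005", 5, 20000)
st_paul = bin_key(2005, "55101", 500, 20000)
```

Keys like these could then be used to group entities before applying the per-bin ranking.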
