Post on 16-Dec-2015
Relational Data Pre-Processing Techniques for Improved Securities Fraud Detection
Andrew Fast, Lisa Friedland, Marc Maier, Brian Taylor, and David Jensen
Knowledge Discovery Laboratory
Department of Computer Science
University of Massachusetts Amherst
Henry G. Goldberg and John Komoroske
National Association of Securities Dealers
Financial Industry Regulatory Authority
Reps are required to file disclosures for incidents ranging in severity from customer complaints to criminal charges.
Individual characteristics are not distinctive,…
…it is the collection of related entities that sets this rep apart.
Age
Years in Industry
…
Goals
Overall Goal
Identify new methods and improve existing methods for detecting securities fraud that consider the relationships among reps, branches, and firms.
Goals of This Talk
1) Describe the pre-processing steps needed to prepare the data for knowledge discovery.
2) Demonstrate that pre-processing is both necessary and beneficial for knowledge discovery in relational domains.
The Knowledge Discovery Process
Data Pre-Processing
Challenge 1: Inputs
Challenge 2: Class Labels
Challenge 1: Preparing Inputs
Small-scale social structure between reps is important.
Branch Associations
Tribes
NOT IN RAW DATA
(Cortes, Pregibon, and Volinsky 2001) (Neville et al. 2005)
Identifying Branch Associations
110 Wall Street , NEW YORK, NY, 10005
311 S. WACKER DRIVE , CHICAGO, IL, 60606
1400 World Trade Center, St. Paul, MN, 55101
30 East 7th Street Suite 1400, St. Paul, MN, 55101
110 Wall Street , NY, NY, 10005
110 Wall Street , MANHATTAN, NY, 10005
110 Wall Street , 22ND FLOOR, NEW YORK, NY, 10005
30 East 7th Street Suite 1400, St. Paul, MN, 55101
311 S. WACKER DRIVE , CHICARGO, OH, 60606
1400 World Trade Center, St. Paul, MN, 55101
Self-Reported Addresses are messy!
Use String Matching Algorithm
Inferred Branches
Matched 70% (~3.35 million)
Unmatched 30% (~1.43 million)
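The slides do not name the string matching algorithm that was used. As a rough illustration only, the sketch below groups self-reported addresses with Python's standard `difflib.SequenceMatcher`. The `normalize` step, the greedy grouping, and the 0.8 threshold are all assumptions made for this example, not the authors' method.

```python
import re
from difflib import SequenceMatcher


def normalize(address: str) -> str:
    """Upper-case, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized addresses."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_addresses(addresses, threshold=0.8):
    """Greedy grouping: attach each address to the first group whose
    first member is similar enough, otherwise start a new group.
    The threshold is an illustrative value, not one from the slides."""
    groups = []
    for addr in addresses:
        for group in groups:
            if similarity(addr, group[0]) >= threshold:
                group.append(addr)
                break
        else:
            groups.append([addr])
    return groups
```

Each resulting group would stand in for one inferred branch. A real run over millions of addresses would need blocking (for example by zip code) to avoid comparing every pair, and the threshold would need tuning against hand-checked matches.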
Identifying Tribes
Why did they move?
1. Looking for a better location to commit fraud.
2. Friends inviting friends to better jobs.
3. Geographic Limitations.
4. Branches merged or acquired.
Anomalous Movement
Background Movement
Identifying Tribes
Tribe - |trīb| noun
A group of reps that works together at many statistically unlikely branches.
• Reps in tribes found by our algorithm:
– move between branches in more zip-codes than the rest of the population.
– have more disclosures than the rest of the population.
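The tribe-finding algorithm itself is covered in the separate talk and is not specified in these slides. Purely to make the idea of "many statistically unlikely branches" concrete, here is a hypothetical scoring sketch: each branch shared by a pair of reps adds a weight that is larger when fewer reps have worked there. The weighting formula and the input layout are assumptions for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def pair_scores(employment):
    """employment: dict mapping rep id -> set of branch ids worked at.
    Returns a dict mapping (rep_a, rep_b) -> score for every pair of
    reps that share at least one branch."""
    n_reps = len(employment)

    # Invert the mapping: which reps have worked at each branch?
    reps_at = defaultdict(set)
    for rep, branches in employment.items():
        for branch in branches:
            reps_at[branch].add(rep)

    scores = defaultdict(float)
    for branch, reps in reps_at.items():
        # Assumed weight: sharing a branch that few reps passed through
        # is more surprising than sharing a very common one.
        weight = -math.log(len(reps) / n_reps)
        for a, b in combinations(sorted(reps), 2):
            scores[(a, b)] += weight
    return dict(scores)
```

High-scoring pairs would be candidates for tribe membership. This sketch ignores time overlap and the background movement causes listed earlier (geography, mergers), which a real method would have to account for.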
For more details on how to find tribes please see talk by Lisa Friedland:
Finding Tribes: Identifying Close-Knit Individuals from Employment Patterns
R11: Pattern Discovery (II) , Tuesday 10:30 am ~ 11:50 am, Regency 2
The Knowledge Discovery Process
Challenge 2: Class Label
Whether a rep is planning to commit fraud is unknown …
… instead we use a surrogate class label called a risk score based on disclosure history.
With NASD guidance, we assigned a weight to each disclosure type based on its severity.
For reps, sum over disclosures to determine risk score for a given year.
For branches, compute average risk score over reps.
[Figure: disclosure weights summed into a rep risk score; rep risk scores averaged into a branch risk score]
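The two aggregation steps (sum over disclosures for a rep, average over reps for a branch) can be sketched as below. The weight values are hypothetical placeholders, since the weights assigned with NASD guidance are not given in the slides. Counting reps without disclosures as zero in the branch average is also an assumption.

```python
from collections import defaultdict

# Hypothetical severity weights; the real values are not in the slides.
SEVERITY_WEIGHTS = {
    "customer_complaint": 1.0,
    "bankruptcy": 2.0,
    "regulatory_action": 3.0,
}


def rep_risk_scores(disclosures):
    """disclosures: iterable of (rep_id, year, disclosure_type).
    Returns {(rep_id, year): summed weight of that rep's disclosures}."""
    scores = defaultdict(float)
    for rep_id, year, disclosure_type in disclosures:
        scores[(rep_id, year)] += SEVERITY_WEIGHTS[disclosure_type]
    return dict(scores)


def branch_risk_score(rep_ids, year, rep_scores):
    """Average rep risk score over all reps at a branch in a given year.
    Reps with no disclosures contribute 0 (an assumption)."""
    if not rep_ids:
        return 0.0
    total = sum(rep_scores.get((rep, year), 0.0) for rep in rep_ids)
    return total / len(rep_ids)
```

For example, a branch of three reps whose yearly scores are 4.0, 2.0, and 0.0 gets a branch score of 2.0 under these assumptions.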
Risk Scores to Class Labels
Simple Approach
1) Sort by risk score
2) Choose Top N
• But risk scores are not distributed evenly.
• Scores vary by:
– Geography
– Year
– Firm Demographics
• Want high-risk, but relative to similar branches or reps.
Creating a Normalized Class Label
[Figure: bins for one year (2005), with zip-code regions 0 to 9 across and branch types down]
Small Branch, Small Firm
Small Branch, Large Firm
Medium Branch, Small Firm
Medium Branch, Large Firm
Large Branch, Large Firm
Stratify by year, zip-code, and firm demographics.
For each bin, choose top 5% of all scores, as long as also above median of non-zero scores.
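A minimal sketch of the per-bin labeling rule follows. It assumes the median is taken over the non-zero scores within the same bin and that the top 5% count is rounded up. The slides state neither detail, so both are labeled assumptions.

```python
import math
from statistics import median


def label_bin(scores, top_fraction=0.05):
    """scores: dict mapping entity id -> risk score, for ONE bin.
    Returns the set of entities labeled high-risk in that bin."""
    nonzero = [s for s in scores.values() if s > 0]
    if not nonzero:
        return set()  # nobody in this bin has any disclosures

    # Assumption: median of non-zero scores within this bin.
    floor = median(nonzero)
    # Assumption: round the top-5% count up so small bins can still
    # yield a candidate.
    n_top = math.ceil(top_fraction * len(scores))

    ranked = sorted(scores, key=scores.get, reverse=True)
    # Keep a top-ranked entity only if it is strictly above the floor.
    return {e for e in ranked[:n_top] if scores[e] > floor}
```

Applying `label_bin` separately to every (year, zip-code region, branch type) bin and taking the union of the results would give a normalized class label of the kind described here.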
The Knowledge Discovery Process
kdl.cs.umass.edu/proximity
Learning Models of Risk
• Two approaches for learning models:
– Model highest-risk entities from each bin with a single model.
– Model each bin independently (stratified).
• Two possible class labels:
– Normalized Class Label
– Global Class Label (no normalization).
Evaluation
Random = 0.5
Non-Normalized = 0.72
Normalized = 0.77
Stratified Norm = 0.62
Stratified Non-Norm = 0.62
Evaluation
Random = 0.5
Non-Normalized = 0.79
Normalized = 0.80
Stratified Norm = 0.69
Stratified Non-Norm = 0.70
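The slides do not name the evaluation metric. A random baseline of 0.5 is what area under the ROC curve (AUC) gives, so the sketch below assumes AUC purely for illustration. It uses the pairwise definition: the fraction of positive/negative pairs in which the positive instance receives the higher score.

```python
def auc(labels, scores):
    """Pairwise AUC: probability that a randomly chosen positive is
    scored above a randomly chosen negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The double loop is quadratic in the number of instances, which is fine for a small illustration. A rank-based implementation would be preferable at the scale of the data discussed here.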
Relational Probability Tree
(Neville et al. 2003)
Challenge 1: Scorecard
Branch Associations: REQUIRED
• Enabled all other analyses.
• Branch features appeared many times in learned trees.
Tribes
• Useful as a local pattern.
• Not widespread enough to be useful as a global pattern despite desirable characteristics.
Challenge 2: Scorecard
Normalized Class Label
• Allows improved branch models.
• No improvement in rep models.
• Risk score on branches is aggregate over reps. Normalization accounts for discrepancies in sizes.
• No aggregation is required for reps. Normalization has no effect.
Stratified Models
• Stratified models do not improve performance.
• Small number of positive instances in each bin leads to lack of generalization.
Summary
• Finding small-scale social structure is critical for relational domains.
– Branch matching essential, all analyses required branches.
– Tribes - interesting local pattern, not widely spread enough for global modeling.
• Continuing to explore other techniques to capitalize on dynamic structure of complex domains.
• Under the right circumstances, normalization can be useful when creating a class label for relational domains.
– Helpful when aggregating over lower level entities.
Andrew Fast
afast@cs.umass.edu
kdl.cs.umass.edu/~afast
kdl.cs.umass.edu/proximity
Poster #35 tonight.
Questions?
Consolidation and Link Formation
Consolidation - identify entities of interest
Link Formation - construct structured relations between consolidated entities.
(Goldberg and Senator 1995)
– Raw data don’t always explicitly contain entities and structure of interest.
– These are critical for knowledge discovery in relational data.
– In the securities industry, social structure among reps is useful for detecting misconduct.
(Neville et al. 2005)
How did we do?
– Reps in tribes move between branches in more zip-codes than the general population of reps.
– Reps in tribes are almost 8 times more likely to be at high risk of fraud.
Identifying Tribes
• Rather than limiting ourselves to static relations, we can consider groups of reps who coordinate their movement from branch to branch.
• Not all group movement is coordinated. Consider:
– Two firms in a small town. (Geography)
– Branch is sold to another firm. (Acquisitions and Mergers)
– Branch is closed. (Branch Closings)
• Not all coordinated movement is nefarious.
– Friends inviting friends to better jobs.
Tribe - |trīb| noun
A group of reps that coordinate their movement between statistically unlikely branches.
Challenge 2: Scorecard
• Normalized Class Label
– Allows better models for branches, not for reps.
• Risk score on branches is aggregate over reps.
– Normalization accounts for discrepancies in sizes.
• No aggregation for reps is needed.
– A high-scoring rep is high-risk no matter which bin they belong to.
• Stratified Models
– Stratified models perform worse than combined models.
– Number of positive instances per bin is small, so models do not generalize well.
Challenge 2: Class Label
• “Will commit fraud in future” flag is not given.
– Create a surrogate class label or risk score from collection of disclosures on reps.
• Risk is not uniformly distributed across the data.
– Market conditions vary over time. (Temporal)
– Laws vary from state to state. (Geographic)
– A small firm may have different market pressures than a large firm. (Demographic)
Weighted Risk Score
• Assign each disclosure type a score based on its severity.
– Customer complaints
– Bankruptcy
– Regulatory Action
• Sum over disclosure types for each rep for each year.
• Average over reps for branches.
• Who is high-risk?
– Order entire population by risk score, choose top n.
– But not uniformly distributed.
Check out your broker: http://brokercheck.finra.org/
http://www.nasdbrokercheck.com
Normalizing Risk
• Stratify data into bins with uniform risk score.
– Consider each year independently.
– Segment USA by zip-code.
– Create 5 categories of branches based on branch and firm size.
  • Small branch, small firm.
  • Small branch, large firm.
  • Medium branch, small firm.
  • Medium branch, large firm.
  • Large branch, large firm.
• For each year, 5 branch types × 10 zip-code regions = 50 bins.
• Who is high-risk?
– Order each of the bins by risk score, choose top.
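As a sketch of how an entity might be assigned to one of the 50 bins per year, the code below builds a bin key from year, zip-code region, and branch type. Using the first digit of the zip code as the region is an assumption (it gives 10 regions), and every size cut-off is a hypothetical placeholder, since the slides give no thresholds.

```python
# The five branch types from the slide, as (branch size, firm size).
BRANCH_TYPES = [
    ("small", "small"),
    ("small", "large"),
    ("medium", "small"),
    ("medium", "large"),
    ("large", "large"),
]


def zip_region(zip_code: str) -> int:
    """Assumed region scheme: first digit of the zip code (0-9)."""
    return int(zip_code.strip()[0])


def branch_type(branch_reps, firm_reps, small_branch_max=10,
                medium_branch_max=50, small_firm_max=500):
    """Map rep counts to a branch type. All cut-offs are hypothetical."""
    if branch_reps <= small_branch_max:
        branch = "small"
    elif branch_reps <= medium_branch_max:
        branch = "medium"
    else:
        branch = "large"
    firm = "small" if firm_reps <= small_firm_max else "large"
    if (branch, firm) not in BRANCH_TYPES:
        # The slide lists no "large branch, small firm" category.
        raise ValueError(f"no bin for {branch} branch in {firm} firm")
    return (branch, firm)


def bin_key(year, zip_code, branch_reps, firm_reps):
    """Key identifying the bin an entity falls into."""
    return (year, zip_region(zip_code), branch_type(branch_reps, firm_reps))
```

Ranking and selecting the top scores is then done within each bin key, not over the entire population.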