Date post: | 01-Dec-2014 |
Category: |
Technology |
Upload: | dirk-fahland |
Process Mining for ERP Systems
Erik Nooijen,
Boudewijn v. Dongen, Dirk Fahland
PAGE 1
Process Discovery
event log → process discovery algorithm → process model
c1: A B C D E
c2: A C B D E
c3: A F D E
…
assumptions
• case = sequence of events of this case
• cases are isolated:
event A in c1 happens only in c1 (and not in c2)
• cases of the same process
• one unique case id,
• each event associated to exactly one case id
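A minimal sketch of what these assumptions buy: when every event belongs to exactly one case, the log is simply a map from case id to a sequence of activities, and relations such as "directly follows" can be counted per case. The directly-follows count is used here only as a familiar first step of many discovery algorithms; the slides do not name a specific algorithm, and the function name `directly_follows` is illustrative.

```python
from collections import Counter

# Example log from the slide: one trace (sequence of activities) per case id.
log = {
    "c1": ["A", "B", "C", "D", "E"],
    "c2": ["A", "C", "B", "D", "E"],
    "c3": ["A", "F", "D", "E"],
}

def directly_follows(log):
    """Count how often activity x is directly followed by activity y within one case."""
    counts = Counter()
    for trace in log.values():
        # Pairs of neighbouring events; cases are isolated, so pairs never span two cases.
        counts.update(zip(trace, trace[1:]))
    return counts

dfg = directly_follows(log)
for (x, y), n in sorted(dfg.items()):
    print(f"{x} -> {y}: {n}")
```

The counting only makes sense because no event is shared between cases, which is exactly the assumption that breaks in a relational ERP database.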
PAGE 2
Typical Process in an ERP System
Build to Order
• Alice orders product X (Material A, Material B)
• Bob orders product Y (Material B, Material C)
• Manufacturer orders materials: Material B (for both orders) from ACME Inc., Material A and Material C from Mega Corp.
PAGE 3
n-to-m relations database
ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18

OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

MaterialOrder
moID  suppl.  …  completed    sent         received
mo3   ACME    …  30-08 13:15  30-08 14:15  01-09 9:05
mo4   MEGA    …  30-08 13:17  30-08 16:12  01-09 10:13

Customer
cust.  address  …
Alice  …        …
Bob    …        …

kinds of columns: id attributes, time-stamp attributes, relations, data attributes
schema of the example database:
MaterialOrder: moID, supplier, completed, sent, received
OrderedMat.: poID, moID, type, added
Customer: cust, …
ProductOrder: poID, cust, created, processed, built, shipped
PAGE 4
Process Discovery for ERP Systems
reality: data in a relational DB
• events stored as time-stamped attributes in tables
• multiple primary keys → multiple notions of case
• tables are related → one event related to multiple cases
relation multiplicities in the schema diagram: 1 to 0..*, 1 to 1..*, 1 to 1..*
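To make "one event related to multiple cases" concrete, the following sketch loads a reduced version of the example tables into an in-memory SQLite database and asks which product orders each material order belongs to. Only a few columns are kept, the timestamps are stored as plain text exactly as on the slide, and the variable name `cases_per_sent_event` is illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ProductOrder  (poID TEXT PRIMARY KEY, cust TEXT, created TEXT);
CREATE TABLE MaterialOrder (moID TEXT PRIMARY KEY, supplier TEXT, sent TEXT);
CREATE TABLE OrderedMaterial (
    poID TEXT REFERENCES ProductOrder(poID),
    moID TEXT REFERENCES MaterialOrder(moID),
    type TEXT, added TEXT);
INSERT INTO ProductOrder  VALUES ('po1','Alice','30-08 9:22'), ('po2','Bob','30-08 10:15');
INSERT INTO MaterialOrder VALUES ('mo3','ACME','30-08 14:15'), ('mo4','MEGA','30-08 16:12');
INSERT INTO OrderedMaterial VALUES
    ('po1','mo3','B','30-08 13:13'), ('po1','mo4','A','30-08 13:14'),
    ('po2','mo3','B','30-08 13:15'), ('po2','mo4','C','30-08 13:16');
""")

# Each MaterialOrder row carries one 'sent' event; join it to the product orders it serves.
rows = con.execute("""
    SELECT m.moID, o.poID
    FROM MaterialOrder m JOIN OrderedMaterial o ON o.moID = m.moID
    ORDER BY m.moID, o.poID
""").fetchall()

cases_per_sent_event = {}
for mo_id, po_id in rows:
    cases_per_sent_event.setdefault(mo_id, []).append(po_id)

print(cases_per_sent_event)
```

With poID as the case notion, the single "sent" event of mo3 would have to appear in both po1 and po2, which violates the isolation assumption of classical process discovery.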
PAGE 6
Outline
decompose by primary keys: log f. PO, log f. MO
discovery (per log): model f. PO, model f. MO
process model: the discovered models, related by primary/foreign-key relations
PAGE 7
Find Artifact Schemas
documented schema vs. actual schema
identify:
• column types (esp. time-stamped columns)
• primary keys
• foreign keys
various (non-trivial) techniques available
key discovery is NP-complete in the size of the table(s)
result: the discovered database schema
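As an illustration of why key discovery is expensive, here is a brute-force sketch that tests column combinations for uniqueness and checks a foreign-key candidate by value inclusion. It is not the technique used by the authors (the slides only say that various techniques exist); the function names and the `max_size` bound are assumptions made for the example.

```python
from itertools import combinations

def is_unique(rows, cols):
    """True if no two rows agree on all of the given columns."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

def candidate_keys(rows, columns, max_size=2):
    """Minimal column combinations (up to max_size columns) with unique values."""
    keys = []
    for size in range(1, max_size + 1):
        for cols in combinations(columns, size):
            if any(set(k) <= set(cols) for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if is_unique(rows, cols):
                keys.append(cols)
    return keys

def is_foreign_key_candidate(child_rows, child_col, parent_rows, parent_col):
    """Inclusion check: every value of the child column occurs in the parent column."""
    parent_values = {r[parent_col] for r in parent_rows}
    return all(r[child_col] in parent_values for r in child_rows)

ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B"},
    {"poID": "po1", "moID": "mo4", "type": "A"},
    {"poID": "po2", "moID": "mo3", "type": "B"},
    {"poID": "po2", "moID": "mo4", "type": "C"},
]
product_order = [{"poID": "po1"}, {"poID": "po2"}]

keys = candidate_keys(ordered_material, ["poID", "moID", "type"])
fk_ok = is_foreign_key_candidate(ordered_material, "poID", product_order, "poID")
print(keys, fk_ok)
```

On this tiny sample both (poID, moID) and (poID, type) come out as keys. The second is likely accidental, which hints at why purely data-driven key discovery benefits from domain knowledge, and the number of combinations to test grows quickly with the number of columns.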
PAGE 8
Step 0: discover database schema
= schema summarization
PAGE 9
Step 1: decompose schema into processes
find:
1. sets of corresponding tables (e.g. ProductOrder, MaterialOrder)
2. links between those
Automatic Schema Summarization = group similar tables through clustering
define a distance between any 2 tables
• by relations
• by information content
tables that are close to each other → same cluster
# of clusters: user input
PAGE 10
Automatic Schema Summarization
1. structural distance
between tables
fanout ~ avg. # of child
records related to the
same parent record
PAGE 11
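example: parent table with column A and records 1, 2
• child table (A B) with records (1 X), (2 Y): fanout: 1
• child table (A B) with records (1 X), (1 Y), (2 Z), (2 U): fanout: 2
• child table (A B) with records (1 X), (1 Y): fanout: 1 = (2+0)/2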
matched fraction ~ 1 / (fraction of records in parent with matching child record)
PAGE 12
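example: parent table with column A and records 1, 2
• child table (A B) with records (1 X), (2 Y): fanout: 1, m.fr: 1
• child table (A B) with records (1 X), (1 Y), (2 Z), (2 U): fanout: 2, m.fr: 1
• child table (A B) with records (1 X), (1 Y): fanout: 1, m.fr: 2 = 1 / (1/2)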
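The two structural measures can be reproduced on the three example child tables with a few lines. This is a sketch that follows the slide's arithmetic (fanout averaged over all parent records, as in (2+0)/2); the function name is illustrative, and it assumes at least one parent record has a matching child.

```python
from collections import Counter

def fanout_and_matched_fraction(parent_ids, child_refs):
    """parent_ids: ids of the parent records.
    child_refs: for each child record, the id of the parent it refers to."""
    children_per_parent = Counter(child_refs)
    # Average number of child records per parent record (parents without children count as 0).
    fanout = sum(children_per_parent[p] for p in parent_ids) / len(parent_ids)
    # 1 / (fraction of parent records that have at least one matching child record).
    matched = sum(1 for p in parent_ids if children_per_parent[p] > 0)
    matched_fraction = len(parent_ids) / matched
    return fanout, matched_fraction

parents = [1, 2]
print(fanout_and_matched_fraction(parents, [1, 2]))        # child records (1 X), (2 Y)
print(fanout_and_matched_fraction(parents, [1, 1, 2, 2]))  # (1 X), (1 Y), (2 Z), (2 U)
print(fanout_and_matched_fraction(parents, [1, 1]))        # (1 X), (1 Y)
```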
Grouping by Clustering
1. structural distance
2. information distance
   importance of each table = entropy (is maximal if all records are different)
   distance: 2 tables with high entropies → large distance
3. weighted distance by structure + information
4. k-means clustering: k clusters based on weighted distance
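The entropy-based importance can be sketched as the Shannon entropy of the distribution of distinct records in a table. Treating whole records as the unit is an assumption; the slides only state that the entropy is maximal if all records are different. How structural and information distance are weighted is not given in the slides, so it is left out here.

```python
import math
from collections import Counter

def table_entropy(rows):
    """Shannon entropy (in bits) of the distribution of distinct records in a table."""
    counts = Counter(rows)
    n = len(rows)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

all_different = [("po1", "mo3"), ("po1", "mo4"), ("po2", "mo3"), ("po2", "mo4")]
all_equal = [("x", "y")] * 4

print(table_entropy(all_different))  # maximal for 4 records: log2(4)
print(table_entropy(all_equal))      # no information: 0
```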
PAGE 13
most important table of cluster = table with least distance to all other tables of the cluster
its primary key → key attribute of the cluster
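Picking the most important table amounts to picking the table with the smallest total distance to the others in its cluster. The sketch below assumes the pairwise distances have already been computed; the distance values are made up for illustration and do not come from the slides.

```python
def most_important_table(cluster, distance):
    """Return the table with the least summed distance to all other tables in the cluster."""
    def total_distance(table):
        return sum(distance[frozenset((table, other))] for other in cluster if other != table)
    return min(cluster, key=total_distance)

cluster = ["ProductOrder", "OrderedMaterial", "Customer"]
# Hypothetical symmetric distances between tables (illustrative values only).
distance = {
    frozenset(("ProductOrder", "OrderedMaterial")): 0.2,
    frozenset(("ProductOrder", "Customer")): 0.3,
    frozenset(("OrderedMaterial", "Customer")): 0.6,
}

print(most_important_table(cluster, distance))
```

With these values ProductOrder is selected, so its primary key poID would serve as the key attribute of the cluster.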
PAGE 14
Artifact Schema → Artifact Log
OrderedMaterial
poID  moID  type  added
po1   mo3   B     30-08 13:13
po1   mo4   A     30-08 13:14
po2   mo3   B     30-08 13:15
po2   mo4   C     30-08 13:16

ProductOrder
poID  cust.  …  created      processed    built        shipped
po1   Alice  …  30-08 9:22   30-08 13:12  01-09 15:12  03-09 10:15
po2   Bob    …  30-08 10:15  30-08 13:14  01-09 16:13  03-09 17:18
PAGE 18
Log Extraction
log f. PO
cluster = set of related tables + primary key of most important table → case id
time-stamped attribute → event
related attributes → event attributes
po1:
(created, poID=po1, time=30-08 9:22, cust.=Alice, …)
(processed, poID=po1, time=30-08 13:12, …)
PAGE 19
Log Extraction
po1:
(processed, poID=po1, time=30-08 13:12, …)
(added, poID=po1, time=30-08 13:13, moID=mo3, …)
moID=mo3 refers to artifact “MaterialOrder”
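The extraction rule can be sketched as follows: the primary key of the most important table gives the case id, every time-stamped column of a row becomes one event named after the column, and the remaining columns become event attributes. This is an illustrative sketch, not the prototype's code; the function names and the in-memory row format are assumptions, and timestamps are compared as day-month plus clock time because the slides give no year.

```python
def sort_key(ts):
    """Turn 'DD-MM H:MM' into (month, day, hour, minute) for chronological sorting."""
    date, clock = ts.split()
    day, month = date.split("-")
    hour, minute = clock.split(":")
    return int(month), int(day), int(hour), int(minute)

def add_events(log, rows, case_id, timestamp_cols):
    """Each time-stamped column of a row becomes one event of the row's case."""
    for row in rows:
        attrs = {k: v for k, v in row.items() if k not in timestamp_cols}
        for col in timestamp_cols:
            log.setdefault(row[case_id], []).append({"activity": col, "time": row[col], **attrs})

product_order = [
    {"poID": "po1", "cust": "Alice", "created": "30-08 9:22", "processed": "30-08 13:12",
     "built": "01-09 15:12", "shipped": "03-09 10:15"},
    {"poID": "po2", "cust": "Bob", "created": "30-08 10:15", "processed": "30-08 13:14",
     "built": "01-09 16:13", "shipped": "03-09 17:18"},
]
ordered_material = [
    {"poID": "po1", "moID": "mo3", "type": "B", "added": "30-08 13:13"},
    {"poID": "po1", "moID": "mo4", "type": "A", "added": "30-08 13:14"},
    {"poID": "po2", "moID": "mo3", "type": "B", "added": "30-08 13:15"},
    {"poID": "po2", "moID": "mo4", "type": "C", "added": "30-08 13:16"},
]

log = {}
add_events(log, product_order, "poID", ["created", "processed", "built", "shipped"])
add_events(log, ordered_material, "poID", ["added"])
for events in log.values():
    events.sort(key=lambda e: sort_key(e["time"]))

for event in log["po1"]:
    print(event)
```

The "added" events keep their moID attribute, which is the reference to the MaterialOrder artifact that a later composition step can use.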
PAGE 20
Outline
decompose by primary keys: log f. quote, log f. order
discovery (per log): model f. quote, model f. order
process model: compose the discovered models by primary/foreign-key relations
PAGE 21
Resulting Model(s)
Product Order: create, processed, added, built, shipped
Material Order: added, completed, sent, received
Product Order 1..* to 1..* Material Order, linked through events such as (added, poID=po1, …, moID=mo3)
prototype tool
• input: relational database (via JDBC), .csv tables
• steps
− discover database schema (types, keys, relations)
− discover artifact schema: by k-means clustering, or by user picking tables
− extract logs → ProM
PAGE 22
Implementation & Evaluation
PAGE 23
Evaluation: SAP System of Sligro
> 300 tables, > 40 GiB of data
schema extraction
• time-stamp attributes: 15 hrs
• primary keys: 4 hrs
• foreign keys: 5 hrs (single col.) / 6 days (double col.)
clustering
• entropies: 17 hrs
• table distances: 5 hrs
• clustering: a few seconds
• ~20 different artifacts found
• largest: 47 tables, 869 columns
log extraction
• extract 1000 traces of > 246,000 events
• query database: 1 hr
• write log file: 32 hrs
PAGE 24
Sligro: Artikel lifecycle model
performance
• key discovery: NP-complete in R (# of records)
• foreign key discovery: NP-complete in R²
• problem is in the “hard part” of NP
• sampling of data, domain knowledge, semi-automatic
requires good database structure
• proper relations, proper keys
• otherwise wrong clusters are formed
• events don’t get right attributes
• semi-automatic approach
events shared by multiple cases… working on it…
PAGE 25
Open issues