Date posted: 07-Nov-2014
Category: Technology
Uploaded by: wil-van-der-aalst
On the Role of Fitness, Precision, Generalization and Simplicity in Process Discovery
Joos Buijs
Boudewijn van Dongen
Wil van der Aalst
http://www.win.tue.nl/coselog/
Advances in Process Mining
• Many process discovery and conformance checking algorithms and tools are available (cf. the various ProM packages).
• Also commercial software based on these ideas: Disco (Fluxicon), Reflect (Futura/Perceptive), BPMOne (Pallas Athena/Perceptive), ARIS Process Performance Manager (Software AG), Interstage Automated Process Discovery (Fujitsu), QPR ProcessAnalyzer/Analysis (QPR Software), flow (fourspark), Discovery Analyst (StereoLOGIC), etc.
• We applied process mining in over 100 organizations.
PAGE 2
More than 75 people involving more than 50 organizations created the Process Mining Manifesto in the context of the IEEE Task Force on Process Mining.
Available in 13 languages
Example: Process Discovery (Vestia, Dutch housing agency, 208 cases, 5987 events)
PAGE 3
Example: Conformance Checking (WOZ objections, Dutch municipality, 745 objections, 9583 events, f = 0.988)
PAGE 4
Challenge: Four Competing Quality Criteria
PAGE 5
Process discovery has to balance four criteria:
• replay fitness: “able to replay event log”
• simplicity: “Occam’s razor”
• generalization: “not overfitting the log”
• precision: “not underfitting the log”
Example: one log four models
PAGE 6
[Figure: four Petri nets N1 to N4 for the same event log, over the activities a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request. N2 contains only a, c, d, e and h; N4 contains one path per trace variant (all 21 variants seen in the log).]
N1 : fitness = +, precision = +, generalization = +, simplicity = +
N2 : fitness = -, precision = +, generalization = -, simplicity = +
N3 : fitness = +, precision = -, generalization = +, simplicity = +
N4 : fitness = +, precision = +, generalization = -, simplicity = -
Event log (21 trace variants):
#      trace
455    acdeh
191    abdeg
177    adceh
144    abdeh
111    acdeg
82     adceg
56     adbeh
47     acdefdbeh
38     adbeg
33     acdefbdeh
14     acdefbdeg
11     acdefdbeg
9      adcefcdeh
8      adcefdbeh
5      adcefbdeg
3      acdefbdefdbeg
2      adcefdbeg
2      adcefbdefbdeg
1      adcefdbefbdeh
1      adbefbdefdbeg
1      adcefdbefcdefdbeg
1391   (total)
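The event log on this slide can be written down as a multiset of trace variants. The sketch below is an illustration only: the names EVENT_LOG, replayable_fraction, N2_LANGUAGE and N4_LANGUAGE are chosen here and do not come from the slides. It checks the totals and computes a crude, trace-level stand-in for replay fitness, namely the share of traces that a model can replay from start to end. It assumes that N2 is a strict sequence a, c, d, e, h and that N4 allows exactly the 21 observed variants; the metrics used in the paper are more refined than this.

```python
from collections import Counter

# Trace variants and their frequencies as listed on the slide
# (a = register request, ..., h = reject request).
EVENT_LOG = Counter({
    "acdeh": 455, "abdeg": 191, "adceh": 177, "abdeh": 144, "acdeg": 111,
    "adceg": 82, "adbeh": 56, "acdefdbeh": 47, "adbeg": 38, "acdefbdeh": 33,
    "acdefbdeg": 14, "acdefdbeg": 11, "adcefcdeh": 9, "adcefdbeh": 8,
    "adcefbdeg": 5, "acdefbdefdbeg": 3, "adcefdbeg": 2, "adcefbdefbdeg": 2,
    "adcefdbefbdeh": 1, "adbefbdefdbeg": 1, "adcefdbefcdefdbeg": 1,
})


def replayable_fraction(log, allowed_traces):
    """Share of traces in the log that appear in the model's set of traces."""
    total = sum(log.values())
    fitting = sum(count for trace, count in log.items() if trace in allowed_traces)
    return fitting / total


# Assumption: N2 is read as the strict sequence a, c, d, e, h.
N2_LANGUAGE = {"acdeh"}
# Assumption: N4 allows exactly the variants seen in the log.
N4_LANGUAGE = set(EVENT_LOG)

print("traces:", sum(EVENT_LOG.values()), "variants:", len(EVENT_LOG))
print("N2 replays", round(replayable_fraction(EVENT_LOG, N2_LANGUAGE), 3))
print("N4 replays", replayable_fraction(EVENT_LOG, N4_LANGUAGE))
```

Under these assumptions N2 replays only the most frequent variant, which is consistent with its "fitness = -" rating, while N4 replays everything but cannot produce any trace outside the log.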
Model N1
PAGE 7
[Figure: the event log of PAGE 6 (21 trace variants, 1391 traces) next to Petri net N1 over the activities a = register request, b = examine thoroughly, c = examine casually, d = check ticket, e = decide, f = reinitiate request, g = pay compensation, h = reject request]
N1 : fitness = +, precision = +, generalization = +, simplicity = +
Model N2
PAGE 8
[Figure: the event log of PAGE 6 (21 trace variants, 1391 traces) next to Petri net N2, which contains only a = register request, c = examine casually, d = check ticket, e = decide, h = reject request]
N2 : fitness = -, precision = +, generalization = -, simplicity = +
Model N3
PAGE 9
[Figure: the event log of PAGE 6 (21 trace variants, 1391 traces) next to Petri net N3 over all eight activities a to h]
N3 : fitness = +, precision = -, generalization = +, simplicity = +
Model N4
PAGE 10
[Figure: the event log of PAGE 6 (21 trace variants, 1391 traces) next to Petri net N4, which contains one path per trace variant (all 21 variants seen in the log)]
N4 : fitness = +, precision = +, generalization = -, simplicity = -
Another challenge: Huge search space
PAGE 11
[Figure: the search space, drawn as a large number of candidate models M]
… with just a few interesting candidates
PAGE 12
[Figure: the same search space of candidate models M, of which only a few are interesting]
Two requirements
1. It should be possible to seamlessly balance the different quality criteria based on user-defined preferences.
2. The algorithm should always return a "correct" process model and not waste time on models that have deadlocks and other anomalies.
PAGE 13
Proposal: Evolutionary Tree Miner (ETM)
• Process trees as representation (= limit search space to "good" models).
• Genetic approach (= very flexible)
• Fitness function uses all four criteria (= seamlessly balance the different "forces")
PAGE 14
Representational Bias: Process Trees
• Always sound because of the block structure
• Also Loop and OR operator
A (BA)*
PAGE 15
[Figure: example process tree with the operators →, ×, ∧ and the leaves A, B, C, D, E]
Petri Net Semantics (used for comparison and conformance checking only)
PAGE 16
[Figure: Petri net fragments over activities A and B for each process tree operator: Sequence, Exclusive Choice, Loop, Parallelism, Or Choice]
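To make the operator semantics concrete, here is a minimal sketch of a process tree as nested tuples, with a function that enumerates the traces of a loop-free tree. The encoding, the operator names "seq", "xor" and "and", and the example tree are assumptions made for illustration; they are not the ETM data structures. Loop and OR are left out to keep the language finite, and activity labels are assumed to be single characters.

```python
from itertools import product

# A process tree is either an activity label (a one-character string)
# or a tuple (operator, child, child, ...).


def interleavings(left, right):
    """All merges of two traces that keep the order inside each trace."""
    if not left:
        return {right}
    if not right:
        return {left}
    return ({left[0] + rest for rest in interleavings(left[1:], right)}
            | {right[0] + rest for rest in interleavings(left, right[1:])})


def language(tree):
    """Set of traces that a loop-free process tree can produce."""
    if isinstance(tree, str):
        return {tree}
    operator, *children = tree
    child_languages = [language(child) for child in children]
    if operator == "xor":   # exclusive choice: exactly one child runs
        return set().union(*child_languages)
    if operator == "seq":   # sequence: children run left to right
        return {"".join(parts) for parts in product(*child_languages)}
    if operator == "and":   # parallelism: children run interleaved
        result = {""}
        for lang in child_languages:
            result = {merged for done in result for trace in lang
                      for merged in interleavings(done, trace)}
        return result
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical tree: a, then (b or c) in parallel with d, then e.
tree = ("seq", "a", ("and", ("xor", "b", "c"), "d"), "e")
print(sorted(language(tree)))  # ['abde', 'acde', 'adbe', 'adce']
```

Because every node is a block with one entry and one exit, any tree built this way describes a model without deadlocks, which is the point of the block structure.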
Steps of the Genetic ETM Algorithm
PAGE 17
1. Create initial population.
2. Measure quality.
3. Stop? If yes, return the best individual.
4. If no, change the population and go back to step 2.
Population Change
PAGE 18
[Figure: population i is turned into population i+1. The elite is carried over; selection, crossover, mutation and replacement produce the remaining individuals.]
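The loop and the population change step on these two slides can be sketched as a generic genetic algorithm. This is a minimal sketch, not the ETM implementation: the function evolve, its parameters, the tournament selection, and the toy problem (bit lists whose quality is the share of ones) are all assumptions chosen so that the example runs on its own. In ETM the individuals would be process trees and the quality would combine the four criteria.

```python
import random


def evolve(create, quality, crossover, mutate, *, size=20, elite=4,
           generations=50, target=1.0, rng=None):
    """Generic genetic loop: create, measure, stop or change, return best."""
    if rng is None:
        rng = random.Random(0)
    population = [create(rng) for _ in range(size)]      # create initial population
    for _ in range(generations):
        ranked = sorted(population, key=quality, reverse=True)  # measure quality
        if quality(ranked[0]) >= target:                 # stop?
            break
        next_population = ranked[:elite]                 # elite is carried over
        while len(next_population) < size:
            # selection: the better of two random individuals becomes a parent
            parent_a = max(rng.sample(ranked, 2), key=quality)
            parent_b = max(rng.sample(ranked, 2), key=quality)
            child = mutate(crossover(parent_a, parent_b, rng), rng)
            next_population.append(child)
        population = next_population                     # replace old population
    return max(population, key=quality)                  # return best individual


# Toy problem (assumption): individuals are lists of 12 bits.
def create(rng):
    return [rng.randint(0, 1) for _ in range(12)]


def quality(bits):
    return sum(bits) / len(bits)


def crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


best = evolve(create, quality, crossover, mutate)
print(best, quality(best))
```

Keeping the elite unchanged means the best quality in the population never decreases from one generation to the next.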
Four Metrics (see paper)
PAGE 19
Each of the four criteria (replay fitness, precision, generalization, simplicity) is measured on a scale from 0 to 1: 1 = optimal, 0 = very bad.
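One way to combine four scores on this 0 to 1 scale under user-defined preferences is a weighted average. The sketch below assumes that form; the function name overall_quality and the score values are made up for illustration, and the actual metric definitions are in the paper. The weight of 10 for fitness mirrors the setting mentioned on a later slide.

```python
CRITERIA = ("fitness", "precision", "generalization", "simplicity")


def overall_quality(scores, weights):
    """Weighted average of per-criterion scores, each in [0, 1] (1 = optimal)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    return sum(weights[c] * scores[c] for c in weights) / total_weight


# Made-up scores for one candidate model.
scores = {"fitness": 0.9, "precision": 0.6, "generalization": 0.8, "simplicity": 0.7}

equal = dict.fromkeys(CRITERIA, 1)        # all criteria count equally
fitness_heavy = {**equal, "fitness": 10}  # emphasis on replay fitness

print(round(overall_quality(scores, equal), 3))
print(round(overall_quality(scores, fitness_heavy), 3))
```

Setting a weight to zero removes that criterion from the search, which is how runs that consider only one, two or three of the criteria can be expressed.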
Example
PAGE 20
[Figure: example process model over the activities A to G]
A = send e-mail, B = check credit, C = calculate capacity, D = check system, E = accept, F = reject, G = send e-mail
Conventional Algorithms (1/3) ("best effort" mapping to process trees to allow for comparison)
PAGE 21
alpha miner
ILP miner
language-based region miner
sound
sound
sound
low fitness
low precision
low fitness
… lucky
Conventional Algorithms (2/3)
PAGE 22
heuristic miner
multi-phase miner
unsound
unsound (relaxed sound)
Conventional Algorithms (3/3)
PAGE 23
state-based region miner
genetic miner
unsound
sound
Often unsound result and no mechanism to seamlessly balance the four criteria
PAGE 24
unsound models
ignored criteria
Genetic Mining (ETM) While Considering Only One Criterion
PAGE 25
best value possible for this log
ETM with weight zero for three of the four perspectives.
Considering Replay Fitness and One Other Criterion
PAGE 26
ETM with weight zero for two of the four perspectives.
Considering 3 of 4 Criteria
PAGE 27
Replay fitness needs to have a larger weight.
Considering All Four Criteria with Emphasis on Fitness
PAGE 28
fitness has weight 10
Initial Model Versus Discovered Model
PAGE 29
[Figure: the initial model used to simulate the log (activities A to G) next to the model discovered by ETM]
The discovered model outperforms the initial model with respect to all criteria!
Better than existing algorithms (but patience is needed)!
Real-Life Event Logs
• Event log L0 is the event log used before. L0 contains 100 traces, 590 events and 7 activities.
• Event log L1 contains 105 traces and 743 events in total, with 6 different activities.
• Event log L2 contains 444 traces and 3,269 events in total, with 6 different activities.
• Event log L3 contains 274 traces and 1,582 events in total, with 6 different activities.
PAGE 30
Event logs L1, L2 and L3 are extracted from the information systems of municipalities participating in the CoSeLoG project (http://www.win.tue.nl/coselog/).
Results
PAGE 31
Equal weights for all criteria.
If a model is unsound, its sound behavior is approximated when creating the process tree.
Conclusion
• First algorithm that allows for balancing all four perspectives.
• Genetic algorithm is very flexible, but also very slow.
• Process trees are only used internally (choose your favorite representation).
• Future work:
− Improve speed
− Distribute PM tasks
− Discover configurable process trees
PAGE 35
PAGE 36
www.processmining.org
www.win.tue.nl/ieeetfpm/