Title: Desire Lines in Big Data
Name: Wil M.P. van der Aalst
Affil./Addr.: Eindhoven University of Technology
Department of Mathematics and Computer Science
PO Box 513, NL-5600 MB, Eindhoven, The Netherlands
E-mail: [email protected]
Synonyms
process mining, business process intelligence, distributed process mining, process
discovery
Glossary
Event log: multiset of traces.
Trace: sequence of events.
Event: occurrence of some discrete incident (e.g., completion of an activity).
Process mining: collection of techniques to discover, monitor and improve real
processes by extracting knowledge from event data.
Process discovery: extracting process models from an event log.
Conformance checking: monitoring deviations by comparing model and log.
Definition
Processes leave footprints in information systems just like people leave footprints in
grassy spaces. Desire lines, i.e., the tracks formed by erosion showing where people
really walk, may be very different from the formal pathways. When people deviate
from the official path there is often a good reason and room for improvement. The goal
of process mining is to extract desire lines from event logs, e.g., to automatically infer
a process model from raw events recorded by some information system.
Process mining techniques and tools should be able to deal with huge heteroge-
neous event logs. For example, the increasing ability to record events (cf. sensor data,
internet of things, remote monitoring, and service orientation) may make it infeasible
to store all events over an extended period. Therefore, on-the-fly discovery techniques
have been developed, i.e., techniques to learn process models without storing excessive
amounts of events. Moreover, techniques to distribute process mining over
a network of many computing nodes are being developed. These techniques
exploit modern computing infrastructures and make process mining scalable. This way
it is possible to discover desire lines in Big Data.
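The on-the-fly idea can be illustrated with a minimal sketch, assuming that events arrive one at a time as (case id, activity) pairs. Instead of storing the events, the learner keeps only the last activity seen per case and a counter of directly-follows pairs. The class name `StreamingDirectlyFollows` and its methods are hypothetical and chosen for illustration; this is not a specific published algorithm.

```python
from collections import Counter


class StreamingDirectlyFollows:
    """Hypothetical helper: counts directly-follows pairs without storing events."""

    def __init__(self):
        self.last_activity = {}       # case id -> most recent activity of that case
        self.pair_counts = Counter()  # (a, b) -> how often b directly followed a

    def observe(self, case_id, activity):
        """Process one event and then forget it."""
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self.pair_counts[(previous, activity)] += 1
        self.last_activity[case_id] = activity

    def close_case(self, case_id):
        """Release the per-case state once a case is known to be finished."""
        self.last_activity.pop(case_id, None)


# Two interleaved cases, as they might arrive from a running system.
stream = [(1, "A"), (2, "A"), (1, "B"), (2, "C"),
          (1, "C"), (2, "B"), (1, "D"), (2, "D")]

learner = StreamingDirectlyFollows()
for case_id, activity in stream:
    learner.observe(case_id, activity)
for case_id in (1, 2):
    learner.close_case(case_id)
```

The memory needed grows with the number of open cases and distinct activity pairs, not with the number of events seen.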
Introduction
Desire lines refer to tracks worn across grassy spaces – where people naturally walk
– regardless of formal pathways (see Figure 1). A desire line emerges through erosion
caused by the footsteps of humans (or animals), and the width and degree of erosion
of the path indicate how frequently the path is used. Typically, the desire line follows
the shortest or most convenient path between two points. Moreover, as the path
emerges, more people are encouraged to use it, thus stimulating further erosion. Dwight
Eisenhower is often mentioned as one of the people who noted this emerging group
behavior. Before becoming the 34th president of the United States, he was the president
of Columbia University. When he was asked how the university should arrange
the sidewalks to best interconnect the campus buildings, he suggested letting the grass
grow between buildings and delaying the creation of sidewalks. After some time the desire
lines revealed themselves. The places where the grass was most worn by people’s
footsteps were turned into sidewalks.
[Figure labels: “normative or expected path” and “desire line”.]
Fig. 1: Desire lines reveal the actual and not the assumed behavior of people, machines,
and organizations.
The term “desire line” has been used for decades in urban planning. A desire
line shows where people naturally walk. The width and degree of erosion of such an
informal path indicate how frequently the path is used. Often the desire line is very
different from the formal pathway. Therefore, some planners simply let erosion tell where
the paths need to be. For example, the paths across Central Park in New York were
reconstructed using this approach [24, 26].
Good information systems do not show signs of erosion. Nevertheless, they often
contain a wealth of event data providing clues about the paths followed by the users
of the system. Therefore, it is possible to determine desire lines in organizations, sys-
tems, and products. Besides visualizing such desire lines, we can also investigate how
these desire lines change over time, characterize the people following a particular desire
line, etc. There may also be desire lines that are “undesirable” (unsafe, inefficient,
unfair, etc.). Uncovering such phenomena is a prerequisite for process and product
improvement.
The potential value of desire lines in “big data” (say event logs containing millions
of events) is enormous. Such information can be used to
redesign procedures and systems (“reconstructing the formal pathways”), to recommend
the right path to people (“adding signposts where needed”), or to build in
safeguards (“building fences to avoid dangerous situations”).
More and more information about (business) processes is recorded by informa-
tion systems in the form of so-called “event logs”. IT systems are becoming more and
more intertwined with these processes, resulting in an “explosion” of available data that
can be used for analysis purposes. Today’s information systems already log enormous
amounts of events. Classical workflow management systems (e.g. FileNet, TIBCO iPro-
cess Suite, Global 360), ERP systems (e.g. SAP, Oracle), case handling systems (e.g.
BPM|one), PDM systems (e.g. Windchill), CRM systems (e.g. Microsoft Dynamics
CRM, SalesForce), middleware (e.g., IBM’s WebSphere, Cordys), hospital information
systems (e.g., Chipsoft, Siemens Soarian), etc. provide very detailed information about
the activities that have been executed. It is not just information systems that record data: many
physical devices are connected to the Internet, and objects (products and resources) are
tagged and monitored. Providers of high-tech systems (ASML, Philips Healthcare, etc.)
are recording terabytes of data on a daily basis. In fact, according to MGI, nearly all
sectors in the US economy have at least an average of 200 terabytes of stored data
per company (for companies with more than 1,000 employees) and many sectors have
more than 1 petabyte in mean stored data per company [21]. Until 2000 most data
was still stored in analog form (books, photos, etc.). Since 2000 data storage has grown
spectacularly, shifting markedly from analog to digital [18].
Data will continue to grow at a spectacular rate. Moreover, the digital universe
and the physical universe are becoming more and more aligned, e.g., money has become
a predominantly digital entity. When booking a flight over the Internet, the customer is
interacting with many organizations (airline, travel agency, bank, and various brokers),
often without actually realizing it. If the booking is successful, the customer receives
an e-ticket. Note that an e-ticket is basically a number, thus illustrating the tight
coupling between the digital and physical universe. When the SAP system of a large
manufacturer indicates that a particular product is out of stock, it is impossible to sell
or ship the product even when it is available in physical form. Technologies such as
RFID (Radio Frequency Identification), GPS (Global Positioning System), and sensor
networks will stimulate a further alignment of data and reality, e.g., RFID tags make
it possible to track and trace individual items. Hence, there will be more and more
high-quality data that can be used to reveal desire lines in any industry.
Since we are interested in analyzing processes based on the data recorded, we
focus on events that can be linked to relevant activities. The order of such events is
important for deriving the actual process. Fortunately, most events have a timestamp
or can be linked to a particular date. Hence, the event data needed for process mining
are omnipresent.
Consider for example Philips Healthcare, a provider of medical systems that are
often connected to the Internet to enable logging, maintenance, and remote diagnostics.
For example, more than 1500 Cardio Vascular (CV) systems (i.e., X-ray machines) are
monitored by Philips. On average each CV system produces 15,000 events per day,
resulting in 22.5 million events per day for just their CV systems. The events are
stored for about three years and have many attributes. The error logs of ASML’s
lithography systems have similar characteristics and also contain about 15,000 events
per machine per day. These numbers illustrate the fact that many organizations are
storing terabytes of event data. Earlier applications of process mining in organizations
such as Philips and ASML show that there are various challenges with respect to
performance (response times), capacity (storage space), and interpretation (discovered
process models may be composed of thousands of activities).
Many organizations are using so-called Business Intelligence (BI) software, e.g.,
Business Objects (SAP), Cognos (IBM), Hyperion (Oracle), etc. Common functions
offered by these BI tools are reporting, online analytical processing, data mining, busi-
ness performance management, benchmarks, and predictive analysis. However, these
tools assume that the process is known and they typically look at data-related aspects
(e.g., correlations) or view the process at an aggregate level (e.g., a dashboard showing
the average response time). BI tools typically provide some form of data mining and
there are dedicated data mining tools such as Weka, SPSS Clementine, RapidMiner,
etc. Typical techniques supported are classification, clustering, association rules, etc.
However, these systems do not allow for the discovery of processes based on event
logs. In fact, an explicit process notion is missing. This led to the formation of a new
research domain: process mining.
Key Points
The spectacular growth of event data is providing opportunities and challenges for
process mining. Process discovery and conformance checking can be used to analyze and
improve operational business processes in any sector. However, as event logs are growing
in size it may be impossible to store, manage, and analyze event data using traditional
algorithms and tools. Moreover, process mining is increasingly used in online settings
where processes need to be analyzed on-the-fly. Process mining algorithms and tools
need to be adapted to this new reality.
case id  event id  properties
                   timestamp         activity  resource  cost  ...
1        35654423  30-12-2011:11.02  A         John      300   ...
1        35654424  30-12-2011:11.06  B         John      400   ...
1        35654425  30-12-2011:11.12  C         John      100   ...
1        35654426  30-12-2011:11.18  D         John      400   ...
2        35655526  30-12-2011:16.10  A         Ann       300   ...
2        35655527  30-12-2011:16.14  C         John      450   ...
2        35655528  30-12-2011:16.26  B         Pete      350   ...
2        35655529  30-12-2011:16.36  D         Ann       300   ...
...      ...       ...               ...       ...       ...   ...
Table 1: A fragment of some event log: each line corresponds to an event.
Process Mining
In this section, we first introduce process mining using a small example. Then we
elaborate on ways to deal with huge event sets.
Process mining techniques attempt to extract non-trivial and useful information
from event logs [1, 19]. One aspect of process mining is control-flow discovery, i.e., au-
tomatically constructing a process model (e.g., a Petri net or BPMN model) describing
the causal dependencies between activities [7, 9, 29]. The basic idea of control-flow
discovery is very simple: given an event log containing a set of traces, automatically
construct a suitable process model “describing the behavior” seen in the log. Such dis-
covered processes have proven to be very useful for the understanding, redesign, and
continuous improvement of business processes [1].
To illustrate the notion of process discovery, consider Table 1. The table shows a
small fragment of some larger event log. Only two traces are shown, both containing 4
events. Each event has a unique id and several properties. For example, event 35654423
is an instance of activity A that occurred on December 30th, 2011 at 11.02, was executed
by John, and cost 300 euros. The second trace starts with event 35655526 and also
refers to an instance of activity A. Note that each trace corresponds to a case, i.e., a
completed process instance.
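The step from individual events to traces can be sketched as follows, assuming the rows of Table 1 are available as tuples. The function names `parse_timestamp` and `build_traces` are hypothetical and serve only as illustration; events are grouped by case id and ordered by timestamp.

```python
from collections import defaultdict
from datetime import datetime

# Rows of Table 1: (case id, event id, timestamp, activity, resource, cost).
events = [
    (1, 35654423, "30-12-2011:11.02", "A", "John", 300),
    (1, 35654424, "30-12-2011:11.06", "B", "John", 400),
    (1, 35654425, "30-12-2011:11.12", "C", "John", 100),
    (1, 35654426, "30-12-2011:11.18", "D", "John", 400),
    (2, 35655526, "30-12-2011:16.10", "A", "Ann", 300),
    (2, 35655527, "30-12-2011:16.14", "C", "John", 450),
    (2, 35655528, "30-12-2011:16.26", "B", "Pete", 350),
    (2, 35655529, "30-12-2011:16.36", "D", "Ann", 300),
]


def parse_timestamp(text):
    """Parse the day-month-year:hour.minute notation used in Table 1."""
    return datetime.strptime(text, "%d-%m-%Y:%H.%M")


def build_traces(rows):
    """Group events per case and return the activities in timestamp order."""
    cases = defaultdict(list)
    for case_id, _event_id, timestamp, activity, _resource, _cost in rows:
        cases[case_id].append((parse_timestamp(timestamp), activity))
    return {
        case_id: [activity for _, activity in sorted(entries)]
        for case_id, entries in cases.items()
    }


traces = build_traces(events)
```

Only the case id, the timestamp, and the activity name are needed for control-flow discovery; the resource and cost attributes matter for the other perspectives.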
1 〈A02, B06, C12, D18〉
2 〈A10, C14, B26, D36〉
3 〈A12, E22, D56〉
4 〈A15, B19, C22, D28〉
5 〈A18, B22, C26, D32〉
6 〈A19, E28, D59〉
7 〈A20, C25, B36, D44〉
Table 2: A simplified event log. Each line corresponds to a trace represented as a
sequence of activities with timestamps.
The information depicted in Table 1 is the typical event data that can be ex-
tracted from today’s information systems. To make the example more manageable, we
now focus on the activities and their timestamps only. Table 2 shows another view of
the same event log. Now each line corresponds to a process instance, e.g., the first trace
〈A02, B06, C12, D18〉 refers to a process instance where activity A was executed at time
2, activity B was executed at time 6, activity C was executed at time 12, and activity
D was executed at time 18. Note that the first two traces in Table 2 correspond to the
fragment shown in Table 1 (using simplified timestamps).
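Following the glossary, an event log is a multiset of traces. A minimal sketch of this view, assuming Table 2 is encoded as lists of (activity, time) pairs, drops the timestamps and counts how often each activity sequence occurs. The variable names are illustrative.

```python
from collections import Counter

# Table 2: case id -> sequence of (activity, simplified timestamp).
table2 = {
    1: [("A", 2), ("B", 6), ("C", 12), ("D", 18)],
    2: [("A", 10), ("C", 14), ("B", 26), ("D", 36)],
    3: [("A", 12), ("E", 22), ("D", 56)],
    4: [("A", 15), ("B", 19), ("C", 22), ("D", 28)],
    5: [("A", 18), ("B", 22), ("C", 26), ("D", 32)],
    6: [("A", 19), ("E", 28), ("D", 59)],
    7: [("A", 20), ("C", 25), ("B", 36), ("D", 44)],
}

# The event log as a multiset: activity sequence -> number of cases.
event_log = Counter(
    tuple(activity for activity, _time in trace) for trace in table2.values()
)
```

The seven cases collapse into three distinct activity sequences, which is all that control-flow discovery needs as input.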
Using existing process mining techniques it is possible to extract a process model
from Table 2. For example, by applying the α algorithm [9] we obtain the process model
shown in Fig. 2. This simple Petri net model [25] describes the process that starts with
A and ends with D. In-between A and D, either E is executed or B and C are executed (in any
order).
[Figure: Petri net with transitions A, B, C, D, E and places start, p1, p2, p3, p4, complete.]
Fig. 2: A process model discovered from Table 2 using the α algorithm.
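The first step of the α algorithm derives ordering relations between activities from the log: b directly follows a if this happens in some trace; a and b are causally related if b directly follows a but never the other way around; and they are parallel if both orders occur. The sketch below covers only this step for the traces of Table 2. The construction of places from these relations is not shown, and the function names are illustrative.

```python
from itertools import product

# Activity sequences of the seven cases in Table 2.
log = [
    list("ABCD"), list("ACBD"), list("AED"), list("ABCD"),
    list("ABCD"), list("AED"), list("ACBD"),
]


def directly_follows(traces):
    """All pairs (a, b) such that b directly follows a in some trace."""
    return {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}


def footprint(traces):
    """Classify each ordered activity pair as causal, parallel, or unrelated."""
    follows = directly_follows(traces)
    activities = sorted({a for trace in traces for a in trace})
    causal, parallel, unrelated = set(), set(), set()
    for a, b in product(activities, repeat=2):
        forward, backward = (a, b) in follows, (b, a) in follows
        if forward and backward:
            parallel.add((a, b))
        elif forward:
            causal.add((a, b))
        elif not backward:
            unrelated.add((a, b))
    return causal, parallel, unrelated


causal, parallel, unrelated = footprint(log)
```

For this log, B and C come out as parallel, while E is unrelated to both B and C, which matches the choice between E and the concurrent execution of B and C in Fig. 2.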
Clearly, process mining – in particular control-flow discovery – is related to
the classical work on inductive inference. However, there are also notable differences
because, unlike most of the classical work, process mining focuses on higher order
representations which explicitly model concurrency (e.g., Petri nets, UML ADs, EPCs,
BPMN, etc.) rather than lower level representations (e.g., Markov chains, finite state
machines, or regular expressions). Moreover, we do not assume negative examples (i.e.,
there are no events stating that an activity cannot happen) and deal with issues such
as incompleteness (i.e., if something did not happen, it may still be possible) and
exceptional behavior. See [1] for an overview of existing process discovery approaches.
Process mining is not limited to control-flow discovery [1]. First of all, besides
the control-flow perspective (“How?”), other perspectives such as the organizational
perspective (“Who?”) and the case/data perspective (“What?”) may be considered.
Second, process mining is not restricted to discovery. Typically three basic types of
process mining are considered: (a) discovery, (b) conformance, and (c) enhancement
[1]. In this article we will focus on process discovery, i.e., discovering a model from raw
events. Discovery serves as the starting point for the two other types of process mining.
The second type of process mining is conformance [27, 23]. Here, an existing process
model is compared with an event log of the same process. Conformance checking can be
used to check if reality, as recorded in the log, conforms to the model and vice versa. The
third type of process mining is enhancement [8]. Here, the idea is to extend or improve
an existing process model using information about the actual process recorded in some
event log. Whereas conformance checking measures the alignment between model and
reality, this third type of process mining aims at changing or extending the a-priori
model. For instance, by using timestamps in the event log one can extend the model
to show bottlenecks, service levels, throughput times, and frequencies.
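A deliberately simplified sketch of conformance checking follows. It assumes the model of Fig. 2 is represented by the set of activity sequences it allows, which is feasible only for tiny models without loops; practical techniques work on the model itself. The trace 〈A, B, D〉 is a hypothetical deviation added for illustration and does not occur in Table 2; the function name `trace_fitness` is likewise illustrative.

```python
from collections import Counter

# Activity sequences allowed by the model of Fig. 2.
model_language = {
    ("A", "B", "C", "D"),
    ("A", "C", "B", "D"),
    ("A", "E", "D"),
}


def trace_fitness(log, allowed):
    """Return the fraction of cases that fit the model and the deviating variants."""
    total = sum(log.values())
    fitting = sum(count for trace, count in log.items() if trace in allowed)
    deviating = [trace for trace in log if trace not in allowed]
    return fitting / total, deviating


# The log of Table 2 plus one hypothetical case that skips activity C.
observed = Counter({
    ("A", "B", "C", "D"): 3,
    ("A", "C", "B", "D"): 2,
    ("A", "E", "D"): 2,
    ("A", "B", "D"): 1,
})

fitness, deviations = trace_fitness(observed, model_language)
```

Here seven of the eight cases fit the model, and the deviating variant points to the place where reality and model disagree.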
For example, the event log in Table 2 shows timestamps. When replaying the
event log on the process model shown in Fig. 2, we can measure the time spent in the
places in-between the various activities. This can be used to identify bottlenecks and
predict the remaining flow time for running cases [1, 8].
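As a rough illustration of how timestamps reveal bottlenecks, the sketch below computes the mean time between consecutive events in the traces of Table 2. This is a simplification and an assumption on our part: replay on the Petri net of Fig. 2 would measure the time spent in places, which differs from these gaps where B and C run concurrently. The function name `mean_gaps` is illustrative.

```python
from collections import defaultdict
from statistics import mean

# Table 2: each trace is a sequence of (activity, simplified timestamp).
table2 = [
    [("A", 2), ("B", 6), ("C", 12), ("D", 18)],
    [("A", 10), ("C", 14), ("B", 26), ("D", 36)],
    [("A", 12), ("E", 22), ("D", 56)],
    [("A", 15), ("B", 19), ("C", 22), ("D", 28)],
    [("A", 18), ("B", 22), ("C", 26), ("D", 32)],
    [("A", 19), ("E", 28), ("D", 59)],
    [("A", 20), ("C", 25), ("B", 36), ("D", 44)],
]


def mean_gaps(traces):
    """Mean elapsed time for every pair of consecutive activities."""
    gaps = defaultdict(list)
    for trace in traces:
        for (a, t_a), (b, t_b) in zip(trace, trace[1:]):
            gaps[(a, b)].append(t_b - t_a)
    return {pair: mean(values) for pair, values in gaps.items()}


gaps = mean_gaps(table2)
# The pair with the largest mean gap is a candidate bottleneck.
slowest = max(gaps, key=gaps.get)
```

In this small log the longest average delay lies between E and D, so cases taking the E branch wait longest before D is executed.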
[Figure: large discovered process models drawn as graphs, with nodes labeled by activity names and frequencies and arcs labeled by numeric values. Panel (a): hospital. A second panel begins with the activity “010 Registreren huuropzegging”.]
35
0,85721
0,86
540 Worden er bonussen/ kosten toegekend ?(complete)
167
0,994167
550 Vastleggen bonussen / kosten(complete)
48
0,97848
560 Opstellen eindnota(complete)
169
0,977100
0,92329
210 Aanmaken leegmelding en exporteren (WMS)(complete)
46
0,87510
260 Is contract getekend en geld ontvangen ?(complete)
166
0,993166
300 Wijzigen status WMS (definitief geaccepteerd)(complete)
94
0,98194
290 Definitief maken Huurovereenkomst(complete)
165
0,989165
270 Verwijderen voorlopige huurovereenkomst(complete)
1
0,51
310 Aanpassen factureerafspraak(complete)
162
0,80886
570 Archiveren huuropzegging(complete)
167
0,994167
0,91134
0,993159
305 Vastleggen huishoudgrootte en inkomen(complete)
3
0,753
330 Archiveren nieuwe verhuring(complete)
162
0,98162
320 After sales(complete)
162
0,991162
0,97631
0,90931
0,8576
058 Aanmaken bevest.brief huuropzegging(b/g/bso/p)(complete)
11
0,8339
075 Bepalen leegstandssoort bedr/gar/berg/park/op(complete)
12
0,911
0,91711
220 Aanbieden zelfstandige woning (WMS)(complete)
45
0,97645
230 Registreren/controleren kandidaat (WMS)(complete)
45
0,97645
0,96634
0,95833
200 Toewijzen woning/bedr.ruimte/gar/berg/park/ops(complete)
35
0,96834
0,87513
340 Zijn er nieuwe of niet herstelde gebreken ?(complete)
34
0,97134
400 Beoordelen/wijzigen leegstandsoort(complete)
34
0,96834
410 Is opleveringsformulier ondertekend ?(complete)
34
0,96934
430 Aanmaken werkopdracht(complete)
34
0,96434
440 Worden er bonussen/ kosten toegekend ?(complete)
34
0,97134
450 Vastleggen bonussen / kosten(complete)
4
0,754
0,93311
065 Aanmaken bevest.brief huuropzegging(b/g/bso/p)(complete)
1
0,51
460 Opstellen eindnota(complete)
34
0,753
220 Is contract getekend en geld ontvangen ?(complete)
34
0,9734
240 Definitief maken Huurovereenkomst(complete)
33
0,96633
230 Verwijderen voorlopige huurovereenkomst(complete)
1
0,51
250 Aanpassen factureerafspraak(complete)
32
0,96832
260 After sales(complete)
32
0,96326
270 Archiveren nieuwe verhuring(complete)
32
0,8576
0,60626
0,92315
470 Archiveren huuropzegging(complete)
34
0,97134
0,9092
0,8899
470 Wijzigen einddatum huurovereenkomst(complete)
8
0,86
480 Is de 2e eindinspectie uitgevoerd ?(complete)
8
0,8335
0,8576
490 Versturen brief Niet thuis(complete)
2
0,52
0,52
0,753
0,6673
0,51
430 Herplannen eindinspectie(complete)
3
0,753
0,6672
0,51
090 Herplannen 1e inspectie(complete)
12
0,92312
0,9238
0,51
0,51
0,51
(b) housing agency
Fig. 3: Two process models discovered using conventional process discovery techniques.
As input we assume an event log in XES format. In 2010, the IEEE Task Force
on Process Mining standardized XES (www.xes-standard.org), a standard logging
format that is extensible and supported by the OpenXES library (www.openxes.org)
and by tools such as ProM, XESame, Disco, and Nitro. XES is the successor of
the MXML format, and we will also support this older format.
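To make the input format concrete, the following minimal sketch reads the activity names per trace from a small XES document using only the Python standard library. The XES fragment is hand-written for illustration (it is not taken from any real log), and the helper names `local_name` and `read_traces` are hypothetical; only the element names (`log`, `trace`, `event`) and the standard attribute keys `concept:name` and `lifecycle:transition` come from XES itself.

```python
import xml.etree.ElementTree as ET

# Hand-written illustrative XES fragment (not a real log).
XES_SNIPPET = """<log xes.version="1.0" xmlns="http://www.xes-standard.org/">
  <trace>
    <string key="concept:name" value="case-1"/>
    <event>
      <string key="concept:name" value="A"/>
      <string key="lifecycle:transition" value="complete"/>
    </event>
    <event>
      <string key="concept:name" value="B"/>
      <string key="lifecycle:transition" value="complete"/>
    </event>
  </trace>
  <trace>
    <string key="concept:name" value="case-2"/>
    <event>
      <string key="concept:name" value="A"/>
      <string key="lifecycle:transition" value="complete"/>
    </event>
  </trace>
</log>"""


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}trace'."""
    return tag.rsplit("}", 1)[-1]


def read_traces(xes_text):
    """Return one list of activity names (concept:name) per trace."""
    root = ET.fromstring(xes_text)
    traces = []
    for trace in root:
        if local_name(trace.tag) != "trace":
            continue
        activities = []
        for event in trace:
            if local_name(event.tag) != "event":
                continue
            for attribute in event:
                if attribute.get("key") == "concept:name":
                    activities.append(attribute.get("value"))
        traces.append(activities)
    return traces


print(read_traces(XES_SNIPPET))
```

For large logs a streaming parser would be preferable to loading the whole document; the sketch only shows how traces and events are nested.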
Fig. 3 shows two example models discovered using ProM’s heuristic miner
[1, 28]. The model in Fig. 3a was discovered based on event data of a group of 627
gynecological oncology patients treated in the AMC hospital in Amsterdam. All diag-
nostic and treatment activities have been recorded for these patients. The event log
contains 24331 events referring to 376 different activities. The process model shows all
376 activities and the paths followed by patients. The model looks Spaghetti-like, but
can be simplified by looking at homogeneous groups of patients and/or by focusing
on the frequent activities. The model in Fig. 3b was discovered using an event log
extracted from the database of a large Dutch housing agency. The event log contains
5987 events relating to 208 cases and 74 activity names. Each case corresponds to a
housing unit (accommodation such as a house or an apartment). The process starts
when the tenant leasing the unit wants to stop renting it. The process ends when a
new tenant moves into the unit after handling all formalities.
Process Mining Challenges and Evaluation Criteria
Traditional process discovery techniques suffer from the following limitations:
• Process discovery is done offline, i.e., it is assumed that there is a representative
event log. In some applications this assumption is unrealistic because it is impossible
or too costly to store all event data. Recently, process mining techniques
have been developed for predictions and recommendations. However, these
techniques also do not discover process models on-the-fly.
• It is impossible to discover process models for extremely large event logs (i.e.,
terabyte logs or logs with thousands of different activities). Algorithmic tech-
niques such as heuristic mining [28], fuzzy mining [17], and the α-algorithm [9]
are fast, but as data sets continue to grow even these techniques will not be able
to keep up. Region-based techniques [7, 12, 29] are more precise but also
time-consuming. Genetic process mining algorithms [22] can be distributed easily,
but are extremely inefficient.
• Most process discovery techniques assume the process to be in steady-state. It is
assumed to be irrelevant whether a case occurs at the beginning of the log or
towards the end. As a result, these techniques do not capture concept drift [14].
Processes may exhibit seasonal patterns (e.g., due to the increasing workload in
December some checks are skipped), sudden changes (e.g., a disaster or
a new law), or gradual changes (e.g., an increasing market share).
• The same process may exist within different organizations or different parts of
the same organization. Within a process there may be homogeneous groups of
cases that share common characteristics. Several authors proposed techniques to
cluster similar cases [13, 16]. These techniques focus on producing simple models
for subsets of cases. However, the resulting process models are not related and
cannot be folded easily into an overall configurable process model.
To evaluate process models discovered using process mining, we need to align
event log and model. Suppose that an event log contains cases that can be char-
acterized by the following three traces: σ1 = 〈A,B,C,D〉, σ2 = 〈A,C,D〉, and
σ3 = 〈A,C,D,B,D〉. Example alignments for these three traces are (based on Fig. 2):
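\[
\gamma_1 = \begin{array}{|c|c|c|c|} \hline A & B & C & D \\ \hline A & B & C & D \\ \hline \end{array}
\qquad
\gamma_2 = \begin{array}{|c|c|c|c|} \hline A & C & \gg & D \\ \hline A & C & B & D \\ \hline \end{array}
\]
\[
\gamma_3 = \begin{array}{|c|c|c|c|c|} \hline A & C & D & B & D \\ \hline A & C & \gg & B & D \\ \hline \end{array}
\qquad
\gamma_4 = \begin{array}{|c|c|c|c|c|c|} \hline A & C & \gg & D & B & D \\ \hline A & C & B & D & \gg & \gg \\ \hline \end{array}
\]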
The top row of each alignment corresponds to “moves in the log” and the bottom row
corresponds to “moves in the model”. If a move in the log cannot be mimicked by a
move in the model, then a “≫” (“no move”) appears in the bottom row. If a move in
the model cannot be mimicked by a move in the log, then a “≫” (“no move”) appears
in the top row. For example, in γ1 the trace in the log (σ1) and the model (Fig. 2) are
aligned perfectly as every move in the log is mimicked by a move in the model and vice
versa. In γ2, trace σ2 is aligned with Fig. 2. Since C is followed by D and no B occurred,
the model makes a B move without a corresponding move in the log. In γ3, trace σ3
is aligned with Fig. 2. Now the log makes a D move without a corresponding move
in the model. Given a trace in the event log, there may be many possible alignments.
The goal is to find the alignment with the least number of ≫ elements, e.g., γ3 is
better than γ4. Finding an optimal alignment can be viewed as an optimization problem
as shown in [5, 10].
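The search for an alignment with the fewest “no move” elements can be illustrated with a small dynamic program. This is only a sketch under a strong assumption: the model is represented by a finite, hypothetical set of accepted traces (here 〈A,B,C,D〉 and 〈A,C,B,D〉, chosen to be consistent with the alignments above), whereas the techniques in [5, 10] work directly on the model's state space. The names `MODEL_TRACES`, `alignment_cost`, and `optimal_cost` are illustrative.

```python
# Hypothetical stand-in for the model: the traces it is assumed to accept.
MODEL_TRACES = ["ABCD", "ACBD"]


def alignment_cost(trace, model_trace):
    """Minimum number of 'no move' elements needed to align two sequences.

    A log-only move and a model-only move each cost 1; a synchronous
    move (same activity in log and model) costs 0.
    """
    n, m = len(trace), len(model_trace)
    # cost[i][j] = cheapest alignment of trace[:i] with model_trace[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i  # only log moves remain
    for j in range(1, m + 1):
        cost[0][j] = j  # only model moves remain
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(cost[i - 1][j], cost[i][j - 1]) + 1
            if trace[i - 1] == model_trace[j - 1]:
                best = min(best, cost[i - 1][j - 1])
            cost[i][j] = best
    return cost[n][m]


def optimal_cost(trace, model_traces):
    """Cost of the best alignment over all traces the model accepts."""
    return min(alignment_cost(trace, m) for m in model_traces)


for sigma in ["ABCD", "ACD", "ACDBD"]:
    print(sigma, optimal_cost(sigma, MODEL_TRACES))
```

Under this assumed model the three traces have costs 0, 1, and 1, matching the number of “no move” elements in γ1, γ2, and γ3; γ4 uses three such elements and is therefore not optimal.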
The number of ≫ elements can be used to quantify fitness. Model and log have
a perfect fitness if all traces in the log can be replayed by the model from beginning
to end. Fitness is just one of the four basic conformance dimensions defined in [1].
Other quality dimensions for comparing model and log are simplicity, precision, and
generalization.
The simplest model that can explain the behavior seen in the log is the best
model. This principle is known as Occam’s Razor. There are various metrics to quantify
the complexity of a model (e.g., size, density, etc.).
The precision dimension is related to the desire to avoid “underfitting”. It is
very easy to construct an extremely simple Petri net (“flower model”) that is able to
replay all traces in an event log (but also any other event log referring to the same set
of activities). See [5, 23, 27] for metrics quantifying this dimension.
The generalization dimension is related to the desire to avoid “overfitting” [1, 5].
In general it is undesirable to have a model that only allows for the exact behavior
seen in the event log. Remember that the log contains only example behavior and that
many traces that are possible may not have been seen yet.
Conformance checking can be done for various reasons, e.g., to evaluate the
results of process discovery. However, it may also be used to audit processes to see
whether reality conforms to some normative or descriptive model [6]. Deviations may
point to fraud, inefficiencies, and poorly designed or outdated procedures.
Dealing With Big Data
Figure 4 shows an overall approach for dealing with “big event data” in a comprehensive
manner. The starting point is a set of event logs that may be huge (millions of events).
Events may come from different data sources that change over time. The goal is to
be able to discover reliable models under these difficult circumstances. It should be
possible to discover processes while storing a minimal amount of information. More-
over, for performance reasons, it should be possible to utilize a network of computers
by distributing challenging process mining tasks. Processes may change over time and
may vary from one organization to the other. Moreover, groups of cases may exhibit
different behaviors. Therefore, it is vital to find out when and how a process changes,
and how different variants of the process can be discovered and compared.
[Figure 4 labels: “big” event data (cases over time; input data per org./group over time); sample, aggregate, discover, merge; on-the-fly process discovery; distributed process discovery; concept drift analysis; configurable process models]
Fig. 4: Towards a more comprehensive approach to process mining supporting on-
the-fly and/or distributed process mining while considering concept drift and process
variability.
One can consider two basic approaches for on-the-fly process discovery: sampling
and aggregation (see Fig. 4). For sampling we retain a representative subset of cases,
e.g., based on a time window. Techniques based on aggregation do not store cases, but
only aggregate information, e.g., the frequency of direct successions (with smoothing to
give more weight to recent observations). The challenge is to apply the best approach
given characteristics of the log and desirable quality levels. For example, there are
various tradeoffs between saving storage space and preserving model quality [11, 15].
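The aggregation idea can be sketched as follows: instead of storing cases, keep a table of direct-succession frequencies and multiply all entries by a decay factor whenever a new event arrives, so that recent observations weigh more. This is an illustrative sketch, not the method of [15]; the class name `DecayingDirectSuccessions`, the `decay` parameter, and the choice of exponential decay are assumptions. Note that the sketch still remembers the last activity of each open case, which a real streaming technique would also need to bound.

```python
from collections import defaultdict


class DecayingDirectSuccessions:
    """Direct-succession frequencies with exponential smoothing."""

    def __init__(self, decay=0.99):
        self.decay = decay                # assumed smoothing factor in (0, 1]
        self.counts = defaultdict(float)  # (a, b) -> smoothed frequency
        self.last_activity = {}           # case id -> most recent activity

    def observe(self, case_id, activity):
        # Age all existing observations so that older ones count less.
        for pair in self.counts:
            self.counts[pair] *= self.decay
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self.counts[(previous, activity)] += 1.0
        self.last_activity[case_id] = activity


stream = [(1, "A"), (1, "B"), (2, "A"), (2, "B")]
aggregate = DecayingDirectSuccessions(decay=0.5)
for case_id, activity in stream:
    aggregate.observe(case_id, activity)
print(dict(aggregate.counts))
```

With `decay=0.5` the first observation of A followed by B has been aged twice by the time the second one arrives, so the smoothed frequency is 0.25 + 1 = 1.25 rather than 2.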
Today, there are many different types of distributed systems, i.e., systems com-
posed of multiple autonomous computational entities communicating through a net-
work. Grid computing, multicore CPU systems, manycore GPU systems, cluster com-
puting, and cloud computing all refer to systems where different resources are used
concurrently to improve performance and scalability. We consider three basic types of
distribution [4]. This classification is based on the way the log is partitioned.
• Replication. If the process mining algorithm is non-deterministic (e.g., a genetic
algorithm), then the same task can be executed on all nodes and in the end the
best result can be taken. In this case, the event log can be simply replicated,
i.e., all nodes have a copy of the whole event log.
• Vertical partitioning. Event logs are composed of cases. There may be thousands
or even millions of cases. These can be distributed over the nodes in the network,
i.e., each case is assigned to one computing node. All nodes work on a subset of
the whole log and in the end the results need to be merged.
• Horizontal partitioning. Cases are composed of multiple events. Therefore, we
can also partition cases, i.e., part of a case is analyzed on one node whereas
another part of the same case is analyzed on another node. In principle, each
node needs to consider all cases. However, the attention of one computing node
is limited to a particular subset of events per case.
Process mining algorithms are typically linear in the size of the log and exponential
in the number of activities. Using a vertical partitioning it is easy to achieve a linear
speedup. A horizontal partitioning may be used to achieve a superlinear speedup,
because the time needed to solve “many smaller problems” tends to be less than the
time needed to solve “one big problem” [2, 3]. This is only possible if the set of activities
can be partitioned into localized process fragments. In this case, decomposition can (most
likely) be used to speed up process mining algorithms even if the smaller problems are
solved sequentially on just one computing node.
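The difference between vertical and horizontal partitioning can be sketched on a toy log. The example below is an illustration under simplifying assumptions: the log is a list of traces, cases are assigned to nodes round-robin, the per-node result is a table of direct-succession counts, and the activity sets used for the horizontal split are chosen by hand (the approaches in [2, 3] derive such fragments from the model structure). All function names are hypothetical.

```python
from collections import Counter


def direct_successions(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def vertical_partition(traces, n_nodes):
    """Assign each case (trace) to exactly one node, round-robin."""
    parts = [[] for _ in range(n_nodes)]
    for index, trace in enumerate(traces):
        parts[index % n_nodes].append(trace)
    return parts


def horizontal_partition(traces, activity_sets):
    """Give every node all cases, projected onto its own activity set."""
    return [
        [[a for a in trace if a in activities] for trace in traces]
        for activities in activity_sets
    ]


LOG = [list("ABCD"), list("ACD"), list("ACDBD")]

# Vertical: each node counts on its own cases, then results are merged.
merged = sum(
    (direct_successions(part) for part in vertical_partition(LOG, 2)),
    Counter(),
)
print(merged == direct_successions(LOG))

# Horizontal: each node sees every case, but only some of the activities.
sublogs = horizontal_partition(LOG, [{"A", "B"}, {"C", "D"}])
print(sublogs)
```

Merging is trivial here because direct-succession counts are additive over cases; for other discovery results the merge step may be more involved.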
Processes often change while being analyzed. Therefore, concept drift is mentioned
as one of the challenges in the Process Mining Manifesto [19]. Concept drift
has been investigated in the context of various data mining problems [20, 30]. In [14]
the problem is investigated in the context of process mining, producing some
initial results. However, many challenges remain. For example, classical conformance
notions such as fitness, generalization, and precision cannot be applied to processes
that change [1, 5]. One needs to judge the result with respect to a moving time window
of suitable length.
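Judging a model with respect to a moving time window can be sketched as follows. The sketch makes two simplifying assumptions that are not part of the text above: the model is given as a hypothetical finite set of accepted traces, and fitness is reduced to the fraction of recent traces that the model can replay completely. The function name `windowed_fitness` and the window size are illustrative.

```python
from collections import deque


def windowed_fitness(traces, model_language, window):
    """Fraction of fitting traces among the last `window` traces seen."""
    recent = deque(maxlen=window)  # oldest entries drop out automatically
    result = []
    for trace in traces:
        recent.append(tuple(trace) in model_language)
        result.append(sum(recent) / len(recent))
    return result


# Hypothetical model language and a stream whose behavior drifts.
MODEL = {("A", "B", "C", "D"), ("A", "C", "B", "D")}
stream = ["ABCD", "ACBD", "ACD", "ACD", "ACD"]
print(windowed_fitness(stream, MODEL, window=2))
```

A sustained drop of the windowed value, as in this toy stream, hints that the process has changed and that the model may need to be rediscovered.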
Key Applications
Process mining can be used to improve processes in a wide variety of organizations. The
following are a few examples of industries where process mining has been applied.
• The healthcare industry includes hospitals and other care organizations. Most
events are being recorded (blood tests, MRI scans, appointments, etc.) and cor-
relation is easy because each event refers to a particular patient. The closer
processes get to the medical profession, the less structured they become. For
instance, most diagnosis and treatment processes tend to be rather Spaghetti-
like. Medical guidelines typically have little to do with the actual processes. On
the one hand, this suggests that these processes can be improved by structur-
ing them. On the other hand, the variability of medical processes is caused by
the different characteristics of patients, their problems, and unanticipated com-
plications. Patients are saved by doctors deviating from standard procedures.
However, some deviations also cost lives. Clearly, hospitals need to get a better
understanding of care processes to be able to improve them. Process mining can
help as event data is readily available.
• Governments range from small municipalities to large organizations operating
at the national level, e.g., institutions managing processes related to unemploy-
ment, customs, taxes, and traffic offences. Both local and national government
agencies can be seen as “administrative factories” as they execute regulations
and the “products” are mainly informational or financial. Processes in larger
government agencies are characterized by a high degree of automation. Con-
sider, for example, tax departments that need to deal with millions of tax dec-
larations. Processes in smaller government agencies (e.g., small municipalities)
are typically not automated and managed by office workers rather than BPM
systems. However, due to the legal requirements, all main events are recorded in
a systematic manner. Typical use cases for process mining in governments (local
or non-local) are flow time reduction (e.g., shorten the time to get a building
permit), improved efficiency, and compliance. Given the role of governments in
society, compliance is of the utmost importance.
• Banking and insurance are two industries where BPM technology has been
most effective. Processes are often automated and all events are recorded in
a systematic and secure manner. Examples are the processing of loans, claims
management, handling insurance applications, credit card payments, and mortgage
payments. Most processes in banking and insurance are Lasagna processes,
i.e., highly structured. Hence, all of the techniques presented in [1] can be
applied. Process discovery is less relevant for these organizations as most processes
are known and documented. Typical use cases in these industries involve
conformance checking, performance analysis, and operational support.
• The transportation industry is also recording more and more information about
the movement of people and products. Through tracking and tracing function-
ality the whereabouts of a particular parcel can be monitored by both sender
and receiver. Although controversial, smartcards providing access to buildings
and transportation systems can be used to monitor the movement of people. For
example, the Dutch “ov-chipkaart” can be used to travel by train, subway, and
bus. The traveler pays based on the distance between the entry point and exit
point. The recorded information can be used to analyze traveling behavior. The
booking of a flight via the Internet also generates lots of event data. In fact,
the booking process involves only electronic activities. Note that the traveler
interacts with one organization that contacts all kinds of other organizations in
the background (airlines, insurance companies, car rental agencies, etc.). All of
these events are being recorded, thus enabling process mining.
These examples illustrate that there are numerous opportunities for process mining in
various industries. Moreover, in all of these industries the volumes of event data will
grow exponentially and there is a need to present analysis results instantly. Hence,
there is a need for distributed and on-the-fly process mining.
Future Directions
Despite the applicability of process mining there are many interesting challenges; these
illustrate that process mining is a young discipline. Process discovery is probably the
most important and most visible intellectual challenge related to process mining: it is
far from trivial to construct a process model based on event logs that are incomplete
and noisy. Extensive research is still needed to improve existing techniques or to come
up with completely new techniques. Moreover, extensive research is needed to deal with
“Big Data” challenges, i.e., handling event logs with millions of cases, billions of events,
and thousands of different activities.
Cross References
• Data Mining
• Evolution of Social Networks
• Network Representations of Complex Data
• Role Discovery
• Service Discovery
• Temporal Networks
• Web Log Analysis
Acknowledgements
The author would like to thank all involved in the development of the process mining
tool ProM and related techniques (processmining.org) and all members of the IEEE
Task Force on Process Mining (www.win.tue.nl/ieeetfpm/).
References
1. W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business
Processes. Springer-Verlag, Berlin, 2011.
2. W.M.P. van der Aalst. Decomposing Petri Nets for Process Mining: A Generic Approach. BPM
Center Report BPM-12-20, BPMcenter.org, 2012.
3. W.M.P. van der Aalst. Decomposing Process Mining Problems Using Passages. In S. Haddad and
L. Pomello, editors, Applications and Theory of Petri Nets 2012, volume 7347 of Lecture Notes in
Computer Science, pages 72–91. Springer-Verlag, Berlin, 2012.
4. W.M.P. van der Aalst. Distributed Process Discovery and Conformance Checking. In J. de Lara
and A. Zisman, editors, International Conference on Fundamental Approaches to Software Engi-
neering (FASE 2012), volume 7212 of Lecture Notes in Computer Science, pages 1–25. Springer-
Verlag, Berlin, 2012.
5. W.M.P. van der Aalst, A. Adriansyah, and B. van Dongen. Replaying History on Process Mod-
els for Conformance Checking and Performance Analysis. WIREs Data Mining and Knowledge
Discovery, 2(2):182–192, 2012.
6. W.M.P. van der Aalst, K.M. van Hee, J.M. van der Werf, and M. Verdonk. Auditing 2.0: Using
Process Mining to Support Tomorrow’s Auditor. IEEE Computer, 43(3):90–93, 2010.
7. W.M.P. van der Aalst, V. Rubin, H.M.W. Verbeek, B.F. van Dongen, E. Kindler, and C.W.
Gunther. Process Mining: A Two-Step Approach to Balance Between Underfitting and Overfitting.
Software and Systems Modeling, 9(1):87–111, 2010.
8. W.M.P. van der Aalst, M.H. Schonenberg, and M. Song. Time Prediction Based on Process
Mining. Information Systems, 36(2):450–475, 2011.
9. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process
Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–
1142, 2004.
10. A. Adriansyah, B. van Dongen, and W.M.P. van der Aalst. Conformance Checking using Cost-
Based Fitness Analysis. In C.H. Chi and P. Johnson, editors, IEEE International Enterprise
Computing Conference (EDOC 2011), pages 55–64. IEEE Computer Society, 2011.
11. C. Aggarwal. Data Streams: Models and Algorithms, volume 31 of Advances in Database Systems.
Springer-Verlag, Berlin, 2007.
12. R. Bergenthum, J. Desel, R. Lorenz, and S. Mauser. Process Mining Based on Regions of Lan-
guages. In G. Alonso, P. Dadam, and M. Rosemann, editors, International Conference on Business
Process Management (BPM 2007), volume 4714 of Lecture Notes in Computer Science, pages 375–
383. Springer-Verlag, Berlin, 2007.
13. R.P. Jagadeesh Chandra Bose and W.M.P. van der Aalst. Trace Clustering Based on Conserved
Patterns: Towards Achieving Better Process Models. In S. Rinderle-Ma, S. Sadiq, and F. Leymann,
editors, BPM 2009 Workshops, Proceedings of the Fifth Workshop on Business Process Intelli-
gence (BPI’09), volume 43 of Lecture Notes in Business Information Processing, pages 170–181.
Springer-Verlag, Berlin, 2010.
14. R.P. Jagadeesh Chandra Bose, W.M.P. van der Aalst, I. Zliobaite, and M. Pechenizkiy. Handling
Concept Drift in Process Mining. In H. Mouratidis and C. Rolland, editors, International Conference
on Advanced Information Systems Engineering (CAiSE 2011), volume 6741 of Lecture Notes
in Computer Science, pages 391–405. Springer-Verlag, Berlin, 2011.
15. A. Burattin, A. Sperduti, and W.M.P. van der Aalst. Heuristics Miners for Streaming Event Data.
CoRR, abs/1212.6383, 2012.
16. G. Greco, A. Guzzo, L. Pontieri, and D. Sacca. Discovering Expressive Process Models by Clustering
Log Traces. IEEE Transactions on Knowledge and Data Engineering, 18(8):1010–1027,
2006.
17. C.W. Gunther and W.M.P. van der Aalst. Fuzzy Mining: Adaptive Process Simplification Based
on Multi-perspective Metrics. In G. Alonso, P. Dadam, and M. Rosemann, editors, International
Conference on Business Process Management (BPM 2007), volume 4714 of Lecture Notes in
Computer Science, pages 328–343. Springer-Verlag, Berlin, 2007.
18. M. Hilbert and P. Lopez. The World’s Technological Capacity to Store, Communicate, and Com-
pute Information. Science, 332(6025):60–65, 2011.
19. IEEE Task Force on Process Mining. Process Mining Manifesto. In F. Daniel, K. Barkaoui, and
S. Dustdar, editors, Business Process Management Workshops, volume 99 of Lecture Notes in
Business Information Processing, pages 169–194. Springer-Verlag, Berlin, 2012.
20. M. van Leeuwen and A. Siebes. StreamKrimp: Detecting Change in Data Streams. In Machine
Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer
Science, pages 672–687. Springer-Verlag, Berlin, 2008.
21. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big Data: The
Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011.
22. A.K. Alves de Medeiros, A.J.M.M. Weijters, and W.M.P. van der Aalst. Genetic Process Mining:
An Experimental Evaluation. Data Mining and Knowledge Discovery, 14(2):245–304, 2007.
23. J. Munoz-Gama and J. Carmona. Enhancing Precision in Process Conformance: Stability, Con-
fidence and Severity. In N. Chawla, I. King, and A. Sperduti, editors, IEEE Symposium on
Computational Intelligence and Data Mining (CIDM 2011), pages 184–191, Paris, France, April
2011. IEEE.
24. C. Myhill. Commercial Success by Looking for Desire Lines. In Computer Human Interaction,
volume 3101 of Lecture Notes in Computer Science, pages 293–304. Springer-Verlag, Berlin, 2004.
25. W. Reisig and G. Rozenberg, editors. Lectures on Petri Nets I: Basic Models, volume 1491 of
Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1998.
26. E. Barlow Rogers. Rebuilding Central Park: A Management and Restoration Plan. MIT Press,
1987.
27. A. Rozinat and W.M.P. van der Aalst. Conformance Checking of Processes Based on Monitoring
Real Behavior. Information Systems, 33(1):64–95, 2008.
28. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based
Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162, 2003.
29. J.M.E.M. van der Werf, B.F. van Dongen, C.A.J. Hurkens, and A. Serebrenik. Process Discovery
using Integer Linear Programming. Fundamenta Informaticae, 94:387–412, 2010.
30. G. Widmer and M. Kubat. Learning in the Presence of Concept Drift and Hidden Contexts.
Machine Learning, 23:69–101, 1996.
Recommended Reading
To get started with process mining, the reader is advised to read the book “Process
Mining: Discovery, Conformance and Enhancement of Business Processes” [1] and the
Process Mining Manifesto [19].