Title: Desire Lines in Big Data

Name: Wil M.P. van der Aalst

Affil./Addr.: Eindhoven University of Technology

Department of Mathematics and Computer Science

PO Box 513, NL-5600 MB, Eindhoven, The Netherlands

E-mail: [email protected]

Desire Lines in Big Data

Synonyms

process mining, business process intelligence, distributed process mining, process discovery

Glossary

Event log: multiset of traces.

Trace: sequence of events.

Event: occurrence of some discrete incident (e.g., completion of an activity).

Process mining: collection of techniques to discover, monitor and improve real processes by extracting knowledge from event data.

Process discovery: extracting process models from an event log.

Conformance checking: monitoring deviations by comparing model and log.

Definition

Processes leave footprints in information systems just like people leave footprints in

grassy spaces. Desire lines, i.e., the tracks formed by erosion showing where people

really walk, may be very different from the formal pathways. When people deviate

from the official path there is often a good reason and room for improvement. The goal

of process mining is to extract desire lines from event logs, e.g., to automatically infer

a process model from raw events recorded by some information system.

Process mining techniques and tools should be able to deal with huge, heterogeneous event logs. For example, the increasing ability to record events (cf. sensor data, internet of things, remote monitoring, and service orientation) may make it infeasible to store all events over an extended period. Therefore, on-the-fly discovery techniques have been developed, i.e., techniques that learn process models without storing excessive amounts of events. Moreover, techniques that distribute process mining computations over a network of many computing nodes are being developed. These techniques exploit modern computing infrastructures and make process mining scalable. This way it is possible to discover desire lines in Big Data.
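The idea of learning from a stream without storing it can be illustrated with a minimal sketch. It is not one of the published on-the-fly discovery techniques; it only assumes that events arrive as hypothetical `(case_id, activity)` pairs and keeps, per running case, nothing but the most recent activity. From that it maintains counts of which activity directly follows which, a relation that many discovery algorithms take as input.

```python
from collections import Counter


class StreamingDirectlyFollows:
    """Counts directly-follows pairs from an event stream without storing events."""

    def __init__(self):
        self.last_activity = {}       # case id -> most recent activity seen
        self.pair_counts = Counter()  # (a, b) -> how often b directly followed a

    def observe(self, case_id, activity):
        """Process one event and then forget it."""
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self.pair_counts[(previous, activity)] += 1
        self.last_activity[case_id] = activity

    def close_case(self, case_id):
        """Free the per-case state once a case is known to be finished."""
        self.last_activity.pop(case_id, None)


# Hypothetical interleaved stream of two cases.
stream = [(1, "A"), (2, "A"), (1, "B"), (2, "C"),
          (1, "C"), (2, "B"), (1, "D"), (2, "D")]

miner = StreamingDirectlyFollows()
for case_id, activity in stream:
    miner.observe(case_id, activity)
```

Memory use grows with the number of open cases and distinct activity pairs, not with the number of events seen.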

Introduction

Desire lines refer to tracks worn across grassy spaces, where people naturally walk, regardless of formal pathways (see Figure 1). A desire line emerges through erosion caused by the footsteps of humans (or animals), and the width and degree of erosion of the path indicate how frequently the path is used. Typically, the desire line follows the shortest or most convenient path between two points. Moreover, as the path emerges, more people are encouraged to use it, thus stimulating further erosion. Dwight Eisenhower is often mentioned as one of the people who noted this emerging group behavior. Before becoming the 34th president of the United States, he was the president of Columbia University. When he was asked how the university should arrange the sidewalks to best interconnect the campus buildings, he suggested letting the grass grow between buildings and delaying the creation of sidewalks. After some time the desire lines revealed themselves. The places where the grass was most worn by people’s footsteps were turned into sidewalks.

[Figure 1 labels: normative or expected path; desire line]

Fig. 1: Desire lines reveal the actual and not the assumed behavior of people, machines, and organizations.

The term “desire line” has been used for decades in urban planning. A desire line shows where people naturally walk. The width and degree of erosion of such an informal path indicate how frequently the path is used. Often the desire line is very different from the formal pathway. Therefore, some planners simply let erosion tell where the paths need to be. For example, the paths across Central Park in New York were reconstructed using this approach [24, 26].

Good information systems do not show signs of erosion. Nevertheless, they often contain a wealth of event data providing clues about the paths followed by the users of the system. Therefore, it is possible to determine desire lines in organizations, systems, and products. Besides visualizing such desire lines, we can also investigate how these desire lines change over time, characterize the people following a particular desire line, etc. There may also be desire lines that are “undesirable” (unsafe, inefficient, unfair, etc.). Uncovering such phenomena is a prerequisite for process and product improvement.

The potential value of desire lines in “big data” (say event logs containing millions of events) is enormous. Such information can be used to redesign procedures and systems (“reconstructing the formal pathways”), to recommend the right path to people (“adding signposts where needed”), or to build in safeguards (“building fences to avoid dangerous situations”).

More and more information about (business) processes is recorded by information systems in the form of so-called “event logs”. IT systems are becoming more and more intertwined with these processes, resulting in an “explosion” of available data that can be used for analysis purposes. Today’s information systems already log enormous amounts of events. Classical workflow management systems (e.g., FileNet, TIBCO iProcess Suite, Global 360), ERP systems (e.g., SAP, Oracle), case handling systems (e.g., BPM|one), PDM systems (e.g., Windchill), CRM systems (e.g., Microsoft Dynamics CRM, SalesForce), middleware (e.g., IBM’s WebSphere, Cordys), hospital information systems (e.g., Chipsoft, Siemens Soarian), etc. provide very detailed information about the activities that have been executed. It is not only information systems that record data; many physical devices are connected to the Internet, and objects (products and resources) are tagged and monitored. Providers of high-tech systems (ASML, Philips Healthcare, etc.) are recording terabytes of data on a daily basis. In fact, according to MGI, nearly all sectors in the US economy have at least an average of 200 terabytes of stored data per company (for companies with more than 1,000 employees) and many sectors have more than 1 petabyte in mean stored data per company [21]. Until 2000 most data was still stored in analog form (books, photos, etc.). Since 2000 data storage has grown spectacularly, shifting markedly from analog to digital [18].

Data will continue to grow at a spectacular rate. Moreover, the digital universe

and the physical universe are becoming more and more aligned, e.g., money has become

a predominantly digital entity. When booking a flight over the Internet, the customer is

interacting with many organizations (airline, travel agency, bank, and various brokers),

often without actually realizing it. If the booking is successful, the customer receives

an e-ticket. Note that an e-ticket is basically a number, thus illustrating the tight

coupling between the digital and physical universe. When the SAP system of a large

manufacturer indicates that a particular product is out of stock, it is impossible to sell

or ship the product even when it is available in physical form. Technologies such as

RFID (Radio Frequency Identification), GPS (Global Positioning System), and sensor

networks will stimulate a further alignment of data and reality, e.g., RFID tags make

it possible to track and trace individual items. Hence, there will be more and more

high-quality data that can be used to reveal desire lines in any industry.

Since we are interested in analyzing processes based on the data recorded, we

focus on events that can be linked to relevant activities. The order of such events is

important for deriving the actual process. Fortunately, most events have a timestamp

or can be linked to a particular date. Hence, the event data needed for process mining

are omnipresent.

Consider, for example, Philips Healthcare, a provider of medical systems that are often connected to the Internet to enable logging, maintenance, and remote diagnostics. More than 1500 Cardio Vascular (CV) systems (i.e., X-ray machines) are monitored by Philips. On average each CV system produces 15,000 events per day, resulting in 22.5 million events per day for just their CV systems. The events are stored for about three years and have many attributes. The error logs of ASML’s lithography systems have similar characteristics and also contain about 15,000 events per machine per day. These numbers illustrate the fact that many organizations are storing terabytes of event data. Earlier applications of process mining in organizations such as Philips and ASML show that there are various challenges with respect to performance (response times), capacity (storage space), and interpretation (discovered process models may be composed of thousands of activities).
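The event volumes quoted for the CV systems can be checked with a few lines of arithmetic. The sketch below uses the rounded figures from the text (1500 systems, 15,000 events per system per day, three years of retention) and assumes, for illustration only, 365 days per year. The retained total is therefore a rough order-of-magnitude estimate, not a reported number.

```python
# Figures taken from the text; "more than 1500" and "about three years"
# are treated as exact values purely for this estimate.
systems = 1500
events_per_system_per_day = 15_000
retention_years = 3
days_per_year = 365  # assumption for illustration

events_per_day = systems * events_per_system_per_day
events_retained = events_per_day * days_per_year * retention_years

print(f"{events_per_day:,} events per day")
print(f"{events_retained:,} events retained over {retention_years} years")
```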

Many organizations are using so-called Business Intelligence (BI) software, e.g.,

Business Objects (SAP), Cognos (IBM), Hyperion (Oracle), etc. Common functions

offered by these BI tools are reporting, online analytical processing, data mining, business performance management, benchmarks, and predictive analysis. However, these

tools assume that the process is known and they typically look at data-related aspects

(e.g., correlations) or view the process at an aggregate level (e.g., a dashboard showing

the average response time). BI tools typically provide some form of data mining and

there are dedicated data mining tools such as Weka, SPSS Clementine, RapidMiner,

etc. Typical techniques supported are classification, clustering, association rules, etc.

However, these systems do not allow for the discovery of processes based on event

logs. In fact, an explicit process notion is missing. This led to the formation of a new

research domain: process mining.

Key Points

The spectacular growth of event data is providing opportunities and challenges for process mining. Process discovery and conformance checking can be used to analyze and improve operational business processes in any sector. However, as event logs grow in size, it may become impossible to store, manage, and analyze event data using traditional algorithms and tools. Moreover, process mining is increasingly used in online settings where processes need to be analyzed on-the-fly. Process mining algorithms and tools need to be adapted to this new reality.
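One way to picture how an analysis step can be spread over several computing nodes is the sketch below. It is an illustration under simplifying assumptions, not a technique taken from this article: the log is a hypothetical list of activity sequences, worker threads stand in for computing nodes, and the only thing computed is how often one activity directly follows another. Because each trace is counted independently, the partial counts can simply be added up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_directly_follows(traces):
    """Count (a, b) pairs where b directly follows a within a trace."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def partition(traces, n_parts):
    """Split the log by case so that no trace is cut in two."""
    return [traces[i::n_parts] for i in range(n_parts)]


def distributed_count(traces, n_parts=3):
    """Count each partition separately (threads stand in for nodes), then merge."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        for partial in pool.map(count_directly_follows, partition(traces, n_parts)):
            total.update(partial)
    return total


# Hypothetical log of seven cases.
log = [list("ABCD"), list("ACBD"), list("AED"), list("ABCD"),
       list("ABCD"), list("AED"), list("ACBD")]

print(distributed_count(log))
```

Partitioning by case is the key design choice here: the merge step stays a plain sum only as long as all events of a case end up on the same node.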

case id   event id   properties
                     timestamp          activity   resource   cost   ...
1         35654423   30-12-2011:11.02   A          John       300    ...
1         35654424   30-12-2011:11.06   B          John       400    ...
1         35654425   30-12-2011:11.12   C          John       100    ...
1         35654426   30-12-2011:11.18   D          John       400    ...
2         35655526   30-12-2011:16.10   A          Ann        300    ...
2         35655527   30-12-2011:16.14   C          John       450    ...
2         35655528   30-12-2011:16.26   B          Pete       350    ...
2         35655529   30-12-2011:16.36   D          Ann        300    ...
...       ...        ...                ...        ...        ...    ...

Table 1: A fragment of some event log: each line corresponds to an event.

Process Mining

In this section, we first introduce process mining using a small example. Then we

elaborate on ways to deal with huge event sets.

Process mining techniques attempt to extract non-trivial and useful information from event logs [1, 19]. One aspect of process mining is control-flow discovery, i.e., automatically constructing a process model (e.g., a Petri net or BPMN model) describing the causal dependencies between activities [7, 9, 29]. The basic idea of control-flow discovery is very simple: given an event log containing a set of traces, automatically construct a suitable process model “describing the behavior” seen in the log. Such discovered processes have proven to be very useful for the understanding, redesign, and continuous improvement of business processes [1].

To illustrate the notion of process discovery, consider Table 1. The table shows a small fragment of some larger event log. Only two traces are shown, both containing four events. Each event has a unique id and several properties. For example, event 35654423 is an instance of activity A that occurred on December 30th at 11.02, was executed by John, and costs 300 euros. The second trace starts with event 35655526 and also refers to an instance of activity A. Note that each trace corresponds to a case, i.e., a completed process instance.

1 〈A02, B06, C12, D18〉

2 〈A10, C14, B26, D36〉

3 〈A12, E22, D56〉

4 〈A15, B19, C22, D28〉

5 〈A18, B22, C26, D32〉

6 〈A19, E28, D59〉

7 〈A20, C25, B36, D44〉

Table 2: A simplified event log. Each line corresponds to a trace represented as a

sequence of activities with timestamps.

The information depicted in Table 1 is the typical event data that can be extracted from today’s information systems. To make the example more manageable, we

now focus on the activities and their timestamps only. Table 2 shows another view on

the same event log. Now each line corresponds to a process instance, e.g., the first trace

〈A02, B06, C12, D18〉 refers to a process instance where activity A was executed at time

2, activity B was executed at time 6, activity C was executed at time 12, and activity

D was executed at time 18. Note that the first two traces in Table 2 correspond to the

fragment shown in Table 1 (using simplified timestamps).
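The step from the event table to the trace view can be sketched in a few lines. The tuple layout and the function name `to_traces` are choices made for this illustration; the data are the eight events of Table 1. Events are grouped by case id and ordered by timestamp to form traces, and the event log is then represented as a multiset of traces, as in the glossary.

```python
from collections import Counter, defaultdict
from datetime import datetime

# (case id, event id, timestamp, activity, resource, cost), as in Table 1
events = [
    (1, 35654423, "30-12-2011:11.02", "A", "John", 300),
    (1, 35654424, "30-12-2011:11.06", "B", "John", 400),
    (1, 35654425, "30-12-2011:11.12", "C", "John", 100),
    (1, 35654426, "30-12-2011:11.18", "D", "John", 400),
    (2, 35655526, "30-12-2011:16.10", "A", "Ann", 300),
    (2, 35655527, "30-12-2011:16.14", "C", "John", 450),
    (2, 35655528, "30-12-2011:16.26", "B", "Pete", 350),
    (2, 35655529, "30-12-2011:16.36", "D", "Ann", 300),
]


def to_traces(events):
    """Group events by case and order each group by timestamp."""
    by_case = defaultdict(list)
    for case_id, _event_id, timestamp, activity, _resource, _cost in events:
        when = datetime.strptime(timestamp, "%d-%m-%Y:%H.%M")
        by_case[case_id].append((when, activity))
    return {case: tuple(activity for _, activity in sorted(entries))
            for case, entries in by_case.items()}


traces = to_traces(events)
event_log = Counter(traces.values())  # multiset of traces

print(traces)
```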

Using existing process mining techniques it is possible to extract a process model

from Table 2. For example, by applying the α algorithm [9] we obtain the process model

shown in Fig. 2. This simple Petri net model [25] describes the process that starts with A and ends with D. In-between A and D either E or B and C are executed (in any order).

[Figure 2: Petri net with transitions A, B, C, D, and E and places start, p1, p2, p3, p4, and complete.]

Fig. 2: A process model discovered from Table 2 using the α algorithm.
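The sketch below computes only the ordering relations that, as the α algorithm is usually presented in the literature, form its starting point: causality (`->`), parallelism (`||`), and choice (`#`), all derived from the directly-follows relation. It is a partial illustration, not the complete algorithm of [9], and it does not construct places. The log consists of the activity sequences of Table 2 with timestamps dropped.

```python
from itertools import product

# Activity sequences of the seven traces in Table 2.
log = ["ABCD", "ACBD", "AED", "ABCD", "ABCD", "AED", "ACBD"]


def footprint(log):
    """Classify every ordered pair of activities by its ordering relation."""
    activities = sorted({a for trace in log for a in trace})
    follows = {(a, b) for trace in log for a, b in zip(trace, trace[1:])}
    relation = {}
    for a, b in product(activities, repeat=2):
        forward, backward = (a, b) in follows, (b, a) in follows
        if forward and backward:
            relation[(a, b)] = "||"   # seen in both orders: parallel
        elif forward:
            relation[(a, b)] = "->"   # a is followed by b, never the reverse
        elif backward:
            relation[(a, b)] = "<-"
        else:
            relation[(a, b)] = "#"    # never directly adjacent: choice
    return relation


fp = footprint(log)
for pair in [("A", "B"), ("B", "C"), ("B", "E"), ("E", "D")]:
    print(pair, fp[pair])
```

The relations match the description of Fig. 2: B and C are parallel, E excludes both B and C, and D follows each of them.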

Clearly, process mining – in particular control-flow discovery – is related to

the classical work on inductive inference. However, there are also notable differences

because, unlike most of the classical work, process mining focuses on higher order

representations which explicitly model concurrency (e.g., Petri nets, UML ADs, EPCs,

BPMN, etc.) rather than lower level representations (e.g., Markov chains, finite state

machines, or regular expressions). Moreover, we do not assume negative examples (i.e.,

there are no events stating that an activity cannot happen) and deal with issues such

as incompleteness (i.e., if something did not happen, it may still be possible) and

exceptional behavior. See [1] for an overview of existing process discovery approaches.

Process mining is not limited to control-flow discovery [1]. First of all, besides

the control-flow perspective (“How?”), other perspectives such as the organizational

perspective (“Who?”) and the case/data perspective (“What?”) may be considered.

Second, process mining is not restricted to discovery. Typically three basic types of

process mining are considered: (a) discovery, (b) conformance, and (c) enhancement

[1]. In this article we will focus on process discovery, i.e., discovering a model from raw

events. Discovery serves as the starting point for the two other types of process mining.

The second type of process mining is conformance [27, 23]. Here, an existing process

model is compared with an event log of the same process. Conformance checking can be

used to check if reality, as recorded in the log, conforms to the model and vice versa. The

third type of process mining is enhancement [8]. Here, the idea is to extend or improve

an existing process model using information about the actual process recorded in some

event log. Whereas conformance checking measures the alignment between model and

reality, this third type of process mining aims at changing or extending the a-priori

model. For instance, by using timestamps in the event log one can extend the model

to show bottlenecks, service levels, throughput times, and frequencies.
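A very simple form of conformance checking is to replay each trace on the model and see whether it can be executed from start to end. The sketch below does this for a Petri net like the one in Fig. 2. The wiring of the places p1 to p4 is an assumption chosen to match the textual description (A first, then either E or both B and C in any order, then D). The check is a plain yes/no per trace, which is much cruder than the conformance measures of [27, 23].

```python
from collections import Counter

# transition -> (input places, output places); this wiring is an assumption
NET = {
    "A": (["start"], ["p1", "p2"]),
    "B": (["p1"], ["p3"]),
    "C": (["p2"], ["p4"]),
    "E": (["p1", "p2"], ["p3", "p4"]),
    "D": (["p3", "p4"], ["complete"]),
}


def fits(trace):
    """Return True if the trace can be replayed from 'start' to 'complete'."""
    marking = Counter({"start": 1})
    for activity in trace:
        if activity not in NET:
            return False
        inputs, outputs = NET[activity]
        if any(marking[place] < 1 for place in inputs):
            return False              # transition is not enabled
        for place in inputs:
            marking[place] -= 1
        for place in outputs:
            marking[place] += 1
    # Unary plus drops empty places before comparing with the final marking.
    return +marking == Counter({"complete": 1})


# Traces as in Table 2, plus one hypothetical deviating trace "ABD".
log = ["ABCD", "ACBD", "AED", "ABCD", "ABCD", "AED", "ACBD", "ABD"]
fitting_fraction = sum(fits(trace) for trace in log) / len(log)
print(fitting_fraction)
```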

For example, the event log in Table 2 shows timestamps. When replaying the

event log on the process model shown in Fig. 2, we can measure the time spent in the

places in-between the various activities. This can be used to identify bottlenecks and

predict the remaining flow time for running cases [1, 8].
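A rough impression of such timing analysis can be obtained without a full replay. The sketch below uses the timestamps of Table 2 and measures, per case, the flow time from first to last event and, per pair of consecutive activities, the mean elapsed time. Measuring gaps between consecutive events only approximates the time spent in the places of the Petri net, so the result should be read as an indication rather than as the replay-based analysis of [1, 8].

```python
from collections import defaultdict
from statistics import mean

# Table 2: case id -> list of (activity, timestamp)
log = {
    1: [("A", 2), ("B", 6), ("C", 12), ("D", 18)],
    2: [("A", 10), ("C", 14), ("B", 26), ("D", 36)],
    3: [("A", 12), ("E", 22), ("D", 56)],
    4: [("A", 15), ("B", 19), ("C", 22), ("D", 28)],
    5: [("A", 18), ("B", 22), ("C", 26), ("D", 32)],
    6: [("A", 19), ("E", 28), ("D", 59)],
    7: [("A", 20), ("C", 25), ("B", 36), ("D", 44)],
}


def flow_times(log):
    """Time between the first and the last event of each case."""
    return {case: trace[-1][1] - trace[0][1] for case, trace in log.items()}


def mean_gaps(log):
    """Mean elapsed time between each pair of consecutive activities."""
    gaps = defaultdict(list)
    for trace in log.values():
        for (a, t_a), (b, t_b) in zip(trace, trace[1:]):
            gaps[(a, b)].append(t_b - t_a)
    return {pair: mean(values) for pair, values in gaps.items()}


gaps = mean_gaps(log)
slowest = max(gaps, key=gaps.get)  # candidate bottleneck
print(flow_times(log))
print(slowest, gaps[slowest])
```

On this log the longest mean gap is between E and D, which suggests where a bottleneck analysis would look first.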

[Figure: process model diagram, panel (a) hospital. The extracted node labels (activity names with counts) and arc annotations are omitted.]

057 Plannen eindinspectie bedryfsr/gar/ber/park/op(complete)

9

0,753

050 Inplannen afspraak 1e inspectie(complete)

33

0,94430

040 Vastleggen toekomstig adres medehuurder(complete)

32

0,8576

070 Is 1e inspectie uitgevoerd ?(complete)

204

0,875103

055 Plannen eindinspectie bedryfsr/gar/ber/park/op(complete)

1

0,51

060 Aanmaken bevestigingbrief / huuropzeggingform.(complete)

196

0,992163

0,99593

100 Gereedmelden 1e insp. / Voorcalculatie maken(complete)

192

0,993192

120 Plannen eindinspectie(complete)

192

0,966192

110 Bepalen leegstandsoort(complete)

192

0,944192

080 Versturen brief 'Niet thuis'(complete)

12

0,92312

400 Is eindinspectie uitgevoerd ?(complete)

171

0,929123

0,981129

300 Is eindinspectie uitgevoerd ?(complete)

34

0,93820

440 Zijn er nieuwe of niet herstelde gebreken ?(complete)

168

0,994168

410 Versturen brief 'niet thuis'(complete)

3

0,6673

450 Krijgt de huurder tijd om te herstellen ?(complete)

27

0,96427

500 Beoordelen/wijzigen leegstandsoort(complete)

168

0,991141

0,94120

460 (Her)plannen 2e eindinspectie(complete)

8

0,8576

420 Wijzigen einddatum huurovereenkomst(complete)

3

0,51

510 Is opleveringsformulier ondertekend ?(complete)

168

0,994168

130 Is het opleveringsformulier ondertekend ?(complete)

192

0,957166

520 Aanmaken 2e in gebreke stelling(complete)

7

0,8337

530 Aanmaken werkopdracht(complete)

167

0,993161

140 Aanmaken 1e in gebreke stelling(complete)

12

0,90912

150 Is er sprake van ZAV ?(complete)

192

0,994180

0,91712

180 Aanpassen woningwaardering(complete)

191

0,994168

170 Aanpassen plattegrond(complete)

191

0,92914

160 Registreren ZAV(complete)

9

0,8899

190 Harmoniseren huurprijs(complete)

169

0,993158

190 Actualiseren huurprijs(complete)

34

0,96633

0,97332

205 Bepalen kandidaat huurder(complete)

124

0,95102

240 Registreren voorl. huurovereenkomst +afdrukken(complete)

166

0,98398

210 Registreren voorl. huurovereenkomst +afdrukken(complete)

35

0,85721

0,86

540 Worden er bonussen/ kosten toegekend ?(complete)

167

0,994167

550 Vastleggen bonussen / kosten(complete)

48

0,97848

560 Opstellen eindnota(complete)

169

0,977100

0,92329

210 Aanmaken leegmelding en exporteren (WMS)(complete)

46

0,87510

260 Is contract getekend en geld ontvangen ?(complete)

166

0,993166

300 Wijzigen status WMS (definitief geaccepteerd)(complete)

94

0,98194

290 Definitief maken Huurovereenkomst(complete)

165

0,989165

270 Verwijderen voorlopige huurovereenkomst(complete)

1

0,51

310 Aanpassen factureerafspraak(complete)

162

0,80886

570 Archiveren huuropzegging(complete)

167

0,994167

0,91134

0,993159

305 Vastleggen huishoudgrootte en inkomen(complete)

3

0,753

330 Archiveren nieuwe verhuring(complete)

162

0,98162

320 After sales(complete)

162

0,991162

0,97631

0,90931

0,8576

058 Aanmaken bevest.brief huuropzegging(b/g/bso/p)(complete)

11

0,8339

075 Bepalen leegstandssoort bedr/gar/berg/park/op(complete)

12

0,911

0,91711

220 Aanbieden zelfstandige woning (WMS)(complete)

45

0,97645

230 Registreren/controleren kandidaat (WMS)(complete)

45

0,97645

0,96634

0,95833

200 Toewijzen woning/bedr.ruimte/gar/berg/park/ops(complete)

35

0,96834

0,87513

340 Zijn er nieuwe of niet herstelde gebreken ?(complete)

34

0,97134

400 Beoordelen/wijzigen leegstandsoort(complete)

34

0,96834

410 Is opleveringsformulier ondertekend ?(complete)

34

0,96934

430 Aanmaken werkopdracht(complete)

34

0,96434

440 Worden er bonussen/ kosten toegekend ?(complete)

34

0,97134

450 Vastleggen bonussen / kosten(complete)

4

0,754

0,93311

065 Aanmaken bevest.brief huuropzegging(b/g/bso/p)(complete)

1

0,51

460 Opstellen eindnota(complete)

34

0,753

220 Is contract getekend en geld ontvangen ?(complete)

34

0,9734

240 Definitief maken Huurovereenkomst(complete)

33

0,96633

230 Verwijderen voorlopige huurovereenkomst(complete)

1

0,51

250 Aanpassen factureerafspraak(complete)

32

0,96832

260 After sales(complete)

32

0,96326

270 Archiveren nieuwe verhuring(complete)

32

0,8576

0,60626

0,92315

470 Archiveren huuropzegging(complete)

34

0,97134

0,9092

0,8899

470 Wijzigen einddatum huurovereenkomst(complete)

8

0,86

480 Is de 2e eindinspectie uitgevoerd ?(complete)

8

0,8335

0,8576

490 Versturen brief Niet thuis(complete)

2

0,52

0,52

0,753

0,6673

0,51

430 Herplannen eindinspectie(complete)

3

0,753

0,6672

0,51

090 Herplannen 1e inspectie(complete)

12

0,92312

0,9238

0,51

0,51

0,51

(b) housing agency

Fig. 3: Two process models discovered using conventional process discovery techniques.

As input we assume an event log in XES format. In 2010, the IEEE Task Force

on Process Mining standardized XES (www.xes-standard.org), a standard logging

format that is extensible and supported by the OpenXES library (www.openxes.org)

and by tools such as ProM, XESame, Disco, Nitro, etc. XES is the successor of

the MXML format and we will also support this older format.
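To make the input format concrete, the following minimal sketch reads traces from an XES document using only the Python standard library. It assumes the common XES convention that each event carries a string attribute with key "concept:name" holding the activity name. The inline sample log and the names `XES_SAMPLE`, `local_name`, and `read_traces` are illustrative assumptions and not part of XES or of any tool mentioned above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XES fragment: three cases over the activities A, B, C, D.
XES_SAMPLE = """<log xes.version="1.0">
  <trace>
    <string key="concept:name" value="case1"/>
    <event><string key="concept:name" value="A"/></event>
    <event><string key="concept:name" value="B"/></event>
    <event><string key="concept:name" value="C"/></event>
    <event><string key="concept:name" value="D"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case2"/>
    <event><string key="concept:name" value="A"/></event>
    <event><string key="concept:name" value="C"/></event>
    <event><string key="concept:name" value="D"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case3"/>
    <event><string key="concept:name" value="A"/></event>
    <event><string key="concept:name" value="B"/></event>
    <event><string key="concept:name" value="C"/></event>
    <event><string key="concept:name" value="D"/></event>
  </trace>
</log>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}trace'."""
    return tag.rsplit("}", 1)[-1]


def read_traces(xes_text):
    """Return the event log as a multiset (Counter) of activity-name tuples."""
    root = ET.fromstring(xes_text)
    log = Counter()
    for trace in root:  # traces are direct children of <log>
        if local_name(trace.tag) != "trace":
            continue
        activities = []
        for event in trace:
            if local_name(event.tag) != "event":
                continue
            for attribute in event:
                if attribute.get("key") == "concept:name":
                    activities.append(attribute.get("value"))
        log[tuple(activities)] += 1
    return log


LOG = read_traces(XES_SAMPLE)
```

Representing the log as a `Counter` of tuples mirrors the view of an event log as a multiset of traces: identical traces are stored once together with their frequency.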

Fig. 3 shows two example models discovered using ProM’s heuristic miner

[1, 28]. The model in Fig. 3a was discovered based on event data of a group of 627

gynecological oncology patients treated in the AMC hospital in Amsterdam. All diag-

nostic and treatment activities have been recorded for these patients. The event log

contains 24331 events referring to 376 different activities. The process model shows all

376 activities and the paths followed by patients. The model looks Spaghetti-like, but

can be simplified by looking at homogeneous groups of patients and/or by focusing

on the frequent activities. The model in Fig. 3b was discovered using an event log

extracted from the database of a large Dutch housing agency. The event log contains

5987 events relating to 208 cases and 74 activity names. Each case corresponds to a

housing unit (accommodation such as a house or an apartment). The process starts

when the tenant leasing the unit wants to stop renting it. The process ends when a

new tenant moves into the unit after handling all formalities.

Process Mining Challenges and Evaluation Criteria

Traditional process discovery techniques suffer from the following limitations:

• Process discovery is done offline, i.e., it is assumed that there is a representative event log. In some applications this assumption is unrealistic because it is impossible or too costly to store all event data. Recently, process mining techniques have been developed for predictions and recommendations. However, these techniques also do not discover process models on-the-fly.

• It is impossible to discover process models for extremely large event logs (i.e.,

terabyte logs or logs with thousands of different activities). Algorithmic tech-

niques such as heuristic mining [28], fuzzy mining [17], and the α-algorithm [9]

are fast, but as data sets continue to grow even these techniques will not be able

to keep up. Region-based techniques [7, 12, 29] are more precise but also time

consuming. Genetic process mining algorithms [22] can be distributed easily,

but are extremely inefficient.

• Most process discovery techniques assume the process to be in steady-state. It is

assumed to be irrelevant whether a case occurs at the beginning of the log or

towards the end. As a result, these techniques do not capture concept drift [14].

Processes may exhibit seasonal patterns (e.g., due to the increasing workload in

December some checks are skipped), sudden abrupt changes (e.g., a disaster or

a new law), or gradual changes (e.g., an increasing market share).

• The same process may exist within different organizations or different parts of

the same organization. Within a process there may be homogeneous groups of

cases that share common characteristics. Several authors proposed techniques to

cluster similar cases [13, 16]. These techniques focus on producing simple models

for subsets of cases. However, the resulting process models are not related and

cannot be folded easily into an overall configurable process model.

To evaluate process models discovered using process mining, we need to align

event log and model. Suppose that an event log contains cases that can be char-

acterized by the following three traces: σ1 = 〈A,B,C,D〉, σ2 = 〈A,C,D〉, and

σ3 = 〈A,C,D,B,D〉. Example alignments for these three traces are (based on Fig. 2):

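$$
\gamma_1 = \begin{array}{|c|c|c|c|} \hline A & B & C & D \\ \hline A & B & C & D \\ \hline \end{array}
\qquad
\gamma_2 = \begin{array}{|c|c|c|c|} \hline A & C & \gg & D \\ \hline A & C & B & D \\ \hline \end{array}
$$

$$
\gamma_3 = \begin{array}{|c|c|c|c|c|} \hline A & C & D & B & D \\ \hline A & C & \gg & B & D \\ \hline \end{array}
\qquad
\gamma_4 = \begin{array}{|c|c|c|c|c|c|} \hline A & C & \gg & D & B & D \\ \hline A & C & B & D & \gg & \gg \\ \hline \end{array}
$$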

The top row of each alignment corresponds to “moves in the log” and the bottom row corresponds to “moves in the model”. If a move in the log cannot be mimicked by a move in the model, then a “≫” (“no move”) appears in the bottom row. If a move in the model cannot be mimicked by a move in the log, then a “≫” (“no move”) appears in the top row. For example, in γ1 the trace in the log (σ1) and the model (Fig. 2) are aligned perfectly as every move in the log is mimicked by a move in the model and vice versa. In γ2, trace σ2 is aligned with Fig. 2. Since C is followed by D and no B occurred, the model makes a B move without a corresponding move in the log. In γ3, trace σ3 is aligned with Fig. 2. Now the log makes a D move without a corresponding move in the model. Given a trace in the event log, there may be many possible alignments. The goal is to find the alignment with the fewest ≫ elements, e.g., for σ3, γ3 (one ≫) is better than γ4 (three ≫). Finding an optimal alignment can be viewed as an optimization problem as shown in [5, 10].
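The search for an alignment with the fewest "no move" elements can be illustrated with a small sketch. It is not the technique of [5, 10]: as a simplifying assumption, the model is replaced by an explicit, finite list of complete runs (`MODEL_RUNS`, a hypothetical stand-in for Fig. 2 chosen to be consistent with the alignments above), and each run is aligned with the trace by dynamic programming. All names in the block are illustrative.

```python
NO_MOVE = ">>"  # stands for the "no move" symbol

# Assumption: the runs of the model, written out explicitly (hypothetical).
MODEL_RUNS = ["ABCD", "ACBD"]


def best_alignment(log_trace, model_run):
    """Align a trace with one model run using synchronous moves, log-only
    moves, and model-only moves. Returns (number of no-moves, move pairs)."""
    n, m = len(log_trace), len(model_run)
    # cost[i][j]: fewest no-moves needed to align log_trace[i:] with model_run[j:]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            options = []
            if i < n and j < m and log_trace[i] == model_run[j]:
                options.append(cost[i + 1][j + 1])   # synchronous move
            if i < n:
                options.append(1 + cost[i + 1][j])   # move in log only
            if j < m:
                options.append(1 + cost[i][j + 1])   # move in model only
            cost[i][j] = min(options)
    # Trace back one optimal sequence of moves.
    pairs, i, j = [], 0, 0
    while i < n or j < m:
        if (i < n and j < m and log_trace[i] == model_run[j]
                and cost[i][j] == cost[i + 1][j + 1]):
            pairs.append((log_trace[i], model_run[j]))
            i, j = i + 1, j + 1
        elif i < n and cost[i][j] == 1 + cost[i + 1][j]:
            pairs.append((log_trace[i], NO_MOVE))
            i += 1
        else:
            pairs.append((NO_MOVE, model_run[j]))
            j += 1
    return cost[0][0], pairs


def align_to_model(log_trace, model_runs):
    """Pick the model run whose best alignment has the fewest no-moves."""
    return min((best_alignment(log_trace, run) for run in model_runs),
               key=lambda result: result[0])


COST_3, ALIGNMENT_3 = align_to_model("ACDBD", MODEL_RUNS)
```

Under these assumptions, the trace ⟨A,C,D,B,D⟩ is aligned with a single log-only D move, matching γ3 rather than γ4. Enumerating runs is only feasible for small, acyclic models, which is why the cited work treats the problem as a general optimization problem instead.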

The number of ≫ elements can be used to quantify fitness. Model and log have

a perfect fitness if all traces in the log can be replayed by the model from beginning

to end. Fitness is just one of the four basic conformance dimensions defined in [1].

Other quality dimensions for comparing model and log are simplicity, precision, and

generalization.

The simplest model that can explain the behavior seen in the log is the best

model. This principle is known as Occam’s Razor. There are various metrics to quantify

the complexity of a model (e.g., size, density, etc.).

The precision dimension is related to the desire to avoid “underfitting”. It is

very easy to construct an extremely simple Petri net (“flower model”) that is able to

replay all traces in an event log (but also any other event log referring to the same set

of activities). See [5, 23, 27] for metrics quantifying this dimension.
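The underfitting problem of the flower model can be made tangible with a toy check. The sketch below models the flower model simply as "any sequence over the known activities is accepted"; the activity set, the observed traces, and all names are illustrative assumptions, and no particular precision metric from [5, 23, 27] is implemented.

```python
from itertools import product

ACTIVITIES = {"A", "B", "C", "D"}  # hypothetical activity set
OBSERVED = [
    ("A", "B", "C", "D"),
    ("A", "C", "D"),
    ("A", "C", "D", "B", "D"),
]


def flower_accepts(trace, activities):
    """A flower model allows any activity at any time, so it accepts every
    sequence that only uses its activities."""
    return all(activity in activities for activity in trace)


# Perfect fitness: every observed trace can be replayed.
ALL_FIT = all(flower_accepts(trace, ACTIVITIES) for trace in OBSERVED)

# Poor precision: count the length-4 sequences the model allows
# versus the length-4 sequences that were actually observed.
ALLOWED_LEN4 = sum(
    1 for trace in product(sorted(ACTIVITIES), repeat=4)
    if flower_accepts(trace, ACTIVITIES)
)
OBSERVED_LEN4 = len({trace for trace in OBSERVED if len(trace) == 4})
```

In this toy setting the model allows 256 sequences of length four although only one was observed, which is the kind of gap that precision metrics are meant to expose.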

The generalization dimension is related to the desire to avoid “overfitting” [1, 5].

In general it is undesirable to have a model that only allows for the exact behavior

seen in the event log. Remember that the log contains only example behavior and that

many traces that are possible may not have been seen yet.

Conformance checking can be done for various reasons, e.g., to evaluate the

results of process discovery. However, it may also be used to audit processes to see

whether reality conforms to some normative or descriptive model [6]. Deviations may

point to fraud, inefficiencies, and poorly designed or outdated procedures.

Dealing With Big Data

Figure 4 shows an overall approach for dealing with “big event data” in a comprehensive manner. The starting point consists of event logs that may be huge (millions of events).

Events may come from different data sources that change over time. The goal is to

be able to discover reliable models under these difficult circumstances. It should be

possible to discover processes while storing a minimal amount of information. More-

over, for performance reasons, it should be possible to utilize a network of computers

by distributing challenging process mining tasks. Processes may change over time and

may vary from one organization to the other. Moreover, groups of cases may exhibit

different behaviors. Therefore, it is vital to find out when and how a process changes,

and how different variants of the process can be discovered and compared.

[Figure 4 graphics omitted. Diagram labels: “big” event data (cases, time, input data); sample; aggregate; on-the-fly process discovery; distributed process discovery; merge; discover; concept drift analysis (time); configurable process models (org./group).]

Fig. 4: Towards a more comprehensive approach to process mining supporting on-the-fly and/or distributed process mining while considering concept drift and process variability.

One can consider two basic approaches for on-the-fly process discovery: sampling and aggregation (see Fig. 4). For sampling we retain a representative subset of cases, e.g., based on a time window. Techniques based on aggregation do not store cases, but only aggregate information, e.g., the frequency of direct successions (with smoothing to give more weight to recent observations). The challenge is to apply the best approach given characteristics of the log and desirable quality levels. For example, there are various tradeoffs between saving storage space and preserving model quality [15, 11].
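The aggregation idea can be sketched as follows. The class keeps only decayed direct-succession frequencies plus the last activity seen per open case, never the cases themselves. Exponential decay is one possible smoothing choice and is an assumption here, as are the class name, the `decay` parameter, and the event representation; this is not the specific algorithm of [15].

```python
from collections import defaultdict


class StreamingDirectFollows:
    """Maintains smoothed direct-succession counts over an event stream."""

    def __init__(self, decay=0.99):
        self.decay = decay                 # assumed smoothing factor in (0, 1]
        self.counts = defaultdict(float)   # (a, b) -> decayed frequency of "a directly followed by b"
        self.last_activity = {}            # case id -> most recent activity

    def observe(self, case_id, activity):
        # Age all existing counts so that recent observations weigh more.
        for pair in self.counts:
            self.counts[pair] *= self.decay
        previous = self.last_activity.get(case_id)
        if previous is not None:
            self.counts[(previous, activity)] += 1.0
        self.last_activity[case_id] = activity


STREAM = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "B")]  # hypothetical events
MINER = StreamingDirectFollows(decay=0.5)
for case_id, activity in STREAM:
    MINER.observe(case_id, activity)
```

A production variant would also have to bound `last_activity`, for example by evicting cases that have been idle for a long time; that storage bound is exactly where the tradeoff with model quality arises.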

Today, there are many different types of distributed systems, i.e., systems com-

posed of multiple autonomous computational entities communicating through a net-

work. Grid computing, multicore CPU systems, manycore GPU systems, cluster com-

puting, and cloud computing all refer to systems where different resources are used

concurrently to improve performance and scalability. We consider three basic types of

distribution [4]. This classification is based on the way the log is partitioned.

• Replication. If the process mining algorithm is non-deterministic (e.g., a genetic

algorithm), then the same task can be executed on all nodes and in the end the

best result can be taken. In this case, the event log can be simply replicated,

i.e., all nodes have a copy of the whole event log.

• Vertical partitioning. Event logs are composed of cases. There may be thousands

or even millions of cases. These can be distributed over the nodes in the network,

i.e., each case is assigned to one computing node. All nodes work on a subset of

the whole log and in the end the results need to be merged.

• Horizontal partitioning. Cases are composed of multiple events. Therefore, we

can also partition cases, i.e., part of a case is analyzed on one node whereas

another part of the same case is analyzed on another node. In principle, each

node needs to consider all cases. However, the attention of one computing node

is limited to a particular subset of events per case.
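The two partitioning schemes above can be sketched on a toy log. The log contents, the round-robin assignment of cases to nodes, the activity subsets, and all function names are illustrative assumptions; direct-succession counting stands in for an arbitrary mining step because its results are easy to merge.

```python
from collections import Counter

# Hypothetical event log: case id -> sequence of activities.
LOG = {
    "c1": ["A", "B", "C", "D"],
    "c2": ["A", "C", "D"],
    "c3": ["A", "C", "D", "B", "D"],
    "c4": ["A", "B", "C", "D"],
}


def direct_follows(traces):
    """Count how often one activity is directly followed by another."""
    counts = Counter()
    for trace in traces:
        counts.update(zip(trace, trace[1:]))
    return counts


def vertical_partition(log, n_nodes):
    """Assign each whole case to exactly one node (round-robin)."""
    parts = [dict() for _ in range(n_nodes)]
    for index, case_id in enumerate(sorted(log)):
        parts[index % n_nodes][case_id] = log[case_id]
    return parts


def horizontal_partition(log, activity_sets):
    """Every node sees all cases, but only the events in its activity subset."""
    return [
        {case_id: [a for a in trace if a in activities]
         for case_id, trace in log.items()}
        for activities in activity_sets
    ]


# Vertical: mine each sublog separately, then merge by adding the counts.
VERTICAL_PARTS = vertical_partition(LOG, 2)
MERGED = sum((direct_follows(part.values()) for part in VERTICAL_PARTS), Counter())

# Horizontal: each node gets a projection of every case.
HORIZONTAL_PARTS = horizontal_partition(LOG, [{"A", "B"}, {"C", "D"}])
```

For direct-succession counts the vertical merge is a plain sum because the counts are additive over cases. Merging results of a horizontal partitioning is harder in general, since relations between activities in different subsets are not visible to any single node.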

Process mining algorithms are typically linear in the size of the log and exponential

in the number of activities. Using a vertical partitioning it is easy to achieve a linear

speedup. A horizontal partitioning may be used to achieve a super linear speedup,

because the time needed to solve “many smaller problems” tends to be less than the

time needed to solve “one big problem” [3, 2]. This is only possible if the set of activities

can be partitioned in localized process fragments. In this case, decomposition can (most

likely) be used to speed up process mining algorithms even if the smaller problems are

solved sequentially on just one computing node.
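A small calculation illustrates why decomposition can pay off even on a single node. The cost function below is a deliberately crude assumption (linear in the number of events, exponential in the number of activities, constant factors ignored), and the decomposition is assumed to split both activities and events evenly over non-overlapping fragments, which real decompositions generally do not achieve.

```python
def toy_cost(n_events, n_activities):
    """Assumed cost model: linear in log size, exponential in activities."""
    return n_events * 2 ** n_activities


N_EVENTS, N_ACTIVITIES, N_FRAGMENTS = 1000, 20, 4  # hypothetical sizes

# One big problem: all events, all activities.
WHOLE = toy_cost(N_EVENTS, N_ACTIVITIES)

# Many smaller problems, solved one after another on the same node.
DECOMPOSED = sum(
    toy_cost(N_EVENTS // N_FRAGMENTS, N_ACTIVITIES // N_FRAGMENTS)
    for _ in range(N_FRAGMENTS)
)

SPEEDUP = WHOLE / DECOMPOSED
```

Under this model the four fragments together cost 32,000 units against 1,048,576,000 for the undivided problem, so the gain comes from the smaller exponent and not from parallelism.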

Processes often change while being analyzed. Therefore, concept drift is men-

tioned as one of the challenges in the Process Mining Manifesto [19]. Concept drift

has been investigated in the context of various data mining problems [30, 20]. In [14]

the problem is investigated in the context of process mining thereby producing some

initial results. However, many challenges remain. For example, classical conformance

notions such as fitness, generalization, and precision cannot be applied to processes

that change [1, 5]. One needs to judge the result with respect to a moving time window

of suitable length.
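Judging behavior relative to a moving time window can be sketched as follows. The code compares the set of direct-succession relations in two adjacent windows of cases and reports where the sets differ. This is a simplified illustration and not the approach of [14]; the window size, the toy log, and the function names are assumptions.

```python
def direct_follows_set(traces):
    """Set of (a, b) pairs where a is directly followed by b in some trace."""
    return {pair for trace in traces for pair in zip(trace, trace[1:])}


def window_changes(traces, window):
    """Slide over a time-ordered list of traces and report, per position,
    the relations that appear and disappear between adjacent windows."""
    changes = []
    for start in range(window, len(traces) - window + 1):
        before = direct_follows_set(traces[start - window:start])
        after = direct_follows_set(traces[start:start + window])
        if before != after:
            changes.append((start, after - before, before - after))
    return changes


# Hypothetical log ordered by case start time: the process changes after case 4.
TRACES = ["ABC"] * 4 + ["ACB"] * 4
CHANGES = window_changes(TRACES, window=2)
```

The reported positions cluster around the fourth case, where B and C swap order. With noisy logs a plain set comparison would raise many false alarms, so a practical detector would need a statistical test on top of such window features.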

Key Applications

Process mining can be used to improve processes in a wide variety of organizations. The following are a few examples of industries where process mining has been applied.

• The healthcare industry includes hospitals and other care organizations. Most

events are being recorded (blood tests, MRI scans, appointments, etc.) and cor-

relation is easy because each event refers to a particular patient. The closer

processes get to the medical profession, the less structured they become. For

instance, most diagnosis and treatment processes tend to be rather Spaghetti-

like. Medical guidelines typically have little to do with the actual processes. On

the one hand, this suggests that these processes can be improved by structur-

ing them. On the other hand, the variability of medical processes is caused by

the different characteristics of patients, their problems, and unanticipated com-

plications. Patients are saved by doctors deviating from standard procedures.

However, some deviations also cost lives. Clearly, hospitals need to get a better

understanding of care processes to be able to improve them. Process mining can

help as event data is readily available.

• Governments range from small municipalities to large organizations operating

at the national level, e.g., institutions managing processes related to unemploy-

ment, customs, taxes, and traffic offences. Both local and national government

agencies can be seen as “administrative factories” as they execute regulations

and the “products” are mainly informational or financial. Processes in larger

government agencies are characterized by a high degree of automation. Con-

sider, for example, tax departments that need to deal with millions of tax dec-

larations. Processes in smaller government agencies (e.g., small municipalities)

are typically not automated and managed by office workers rather than BPM

systems. However, due to the legal requirements, all main events are recorded in

a systematic manner. Typical use cases for process mining in governments (local

or non-local) are flow time reduction (e.g., shorten the time to get a building

permit), improved efficiency, and compliance. Given the role of governments in

society, compliance is of the utmost importance.

• Banking and insurance are two industries where BPM technology has been

most effective. Processes are often automated and all events are recorded in

a systematic and secure manner. Examples are the processing of loans, claims

management, handling insurance applications, credit card payments, and mort-

gage payments. Most processes in banking and insurance are Lasagna processes,

i.e., highly structured. Hence, all of the techniques presented in this book can be

applied. Process discovery is less relevant for these organizations as most processes are known and documented. Typical use cases in these industries involve

conformance checking, performance analysis, and operational support.

• The transportation industry is also recording more and more information about

the movement of people and products. Through tracking and tracing function-

ality the whereabouts of a particular parcel can be monitored by both sender

and receiver. Although controversial, smartcards providing access to buildings

and transportation systems can be used to monitor the movement of people. For

example, the Dutch “ov-chipkaart” can be used to travel by train, subway, and

bus. The traveler pays based on the distance between the entry point and exit

point. The recorded information can be used to analyze traveling behavior. The

booking of a flight via the Internet also generates lots of event data. In fact,

the booking process involves only electronic activities. Note that the traveler

interacts with one organization that contacts all kinds of other organizations in

the background (airlines, insurance companies, car rental agencies, etc.). All of

these events are being recorded, thus enabling process mining.

These examples illustrate that there are numerous opportunities for process mining in

various industries. Moreover, in all of these industries the volumes of event data will

grow exponentially and there is a need to present analysis results instantly. Hence, there is a need for distributed and on-the-fly process mining.

Future Directions

Despite the applicability of process mining there are many interesting challenges; these

illustrate that process mining is a young discipline. Process discovery is probably the

most important and most visible intellectual challenge related to process mining: it is

far from trivial to construct a process model based on event logs that are incomplete

and noisy. Extensive research is still needed to improve existing techniques or to come

up with completely new techniques. Moreover, extensive research is needed to deal with

“Big Data” challenges, i.e., handling event logs with millions of cases, billions of events,

and thousands of different activities.

Cross References

• Data Mining

• Evolution of Social Networks

• Network Representations of Complex Data

• Role Discovery

• Service Discovery

• Temporal Networks

• Web Log Analysis

Acknowledgements

The author would like to thank all involved in the development of the process mining

tool ProM and related techniques (processmining.org) and all members of the IEEE

Task Force on Process Mining (www.win.tue.nl/ieeetfpm/).

References

1. W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business

Processes. Springer-Verlag, Berlin, 2011.

2. W.M.P. van der Aalst. Decomposing Petri Nets for Process Mining: A Generic Approach. BPM

Center Report BPM-12-20, BPMcenter.org, 2012.

3. W.M.P. van der Aalst. Decomposing Process Mining Problems Using Passages. In S. Haddad and

L. Pomello, editors, Applications and Theory of Petri Nets 2012, volume 7347 of Lecture Notes in

Computer Science, pages 72–91. Springer-Verlag, Berlin, 2012.

4. W.M.P. van der Aalst. Distributed Process Discovery and Conformance Checking. In J. de Lara

and A. Zisman, editors, International Conference on Fundamental Approaches to Software Engi-

neering (FASE 2012), volume 7212 of Lecture Notes in Computer Science, pages 1–25. Springer-

Verlag, Berlin, 2012.

5. W.M.P. van der Aalst, A. Adriansyah, and B. van Dongen. Replaying History on Process Mod-

els for Conformance Checking and Performance Analysis. WIREs Data Mining and Knowledge

Discovery, 2(2):182–192, 2012.

6. W.M.P. van der Aalst, K.M. van Hee, J.M. van der Werf, and M. Verdonk. Auditing 2.0: Using

Process Mining to Support Tomorrow’s Auditor. IEEE Computer, 43(3):90–93, 2010.

7. W.M.P. van der Aalst, V. Rubin, H.M.W. Verbeek, B.F. van Dongen, E. Kindler, and C.W.

Gunther. Process Mining: A Two-Step Approach to Balance Between Underfitting and Overfitting.

Software and Systems Modeling, 9(1):87–111, 2010.

8. W.M.P. van der Aalst, M.H. Schonenberg, and M. Song. Time Prediction Based on Process

Mining. Information Systems, 36(2):450–475, 2011.

9. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process

Models from Event Logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–

1142, 2004.

10. A. Adriansyah, B. van Dongen, and W.M.P. van der Aalst. Conformance Checking using Cost-

Based Fitness Analysis. In C.H. Chi and P. Johnson, editors, IEEE International Enterprise

Computing Conference (EDOC 2011), pages 55–64. IEEE Computer Society, 2011.

11. C. Aggarwal. Data Streams: Models and Algorithms, volume 31 of Advances in Database Systems.

Springer-Verlag, Berlin, 2007.

12. R. Bergenthum, J. Desel, R. Lorenz, and S. Mauser. Process Mining Based on Regions of Lan-

guages. In G. Alonso, P. Dadam, and M. Rosemann, editors, International Conference on Business

Process Management (BPM 2007), volume 4714 of Lecture Notes in Computer Science, pages 375–

383. Springer-Verlag, Berlin, 2007.

13. R.P. Jagadeesh Chandra Bose and W.M.P. van der Aalst. Trace Clustering Based on Conserved

Patterns: Towards Achieving Better Process Models. In S. Rinderle-Ma, S. Sadiq, and F. Leymann,

editors, BPM 2009 Workshops, Proceedings of the Fifth Workshop on Business Process Intelli-

gence (BPI’09), volume 43 of Lecture Notes in Business Information Processing, pages 170–181.

Springer-Verlag, Berlin, 2010.

14. R.P. Jagadeesh Chandra Bose, W.M.P. van der Aalst, I. Zliobaite, and M. Pechenizkiy. Handling Concept Drift in Process Mining. In H. Mouratidis and C. Rolland, editors, International Conference on Advanced Information Systems Engineering (CAiSE 2011), volume 6741 of Lecture Notes in Computer Science, pages 391–405. Springer-Verlag, Berlin, 2011.

15. A. Burattin, A. Sperduti, and W.M.P. van der Aalst. Heuristics Miners for Streaming Event Data.

CoRR, abs/1212.6383, 2012.

16. G. Greco, A. Guzzo, L. Pontieri, and D. Sacca. Discovering Expressive Process Models by Clustering Log Traces. IEEE Transactions on Knowledge and Data Engineering, 18(8):1010–1027, 2006.

17. C.W. Gunther and W.M.P. van der Aalst. Fuzzy Mining: Adaptive Process Simplification Based

on Multi-perspective Metrics. In G. Alonso, P. Dadam, and M. Rosemann, editors, International

Conference on Business Process Management (BPM 2007), volume 4714 of Lecture Notes in

Computer Science, pages 328–343. Springer-Verlag, Berlin, 2007.

18. M. Hilbert and P. Lopez. The World’s Technological Capacity to Store, Communicate, and Com-

pute Information. Science, 332(6025):60–65, 2011.

19. IEEE Task Force on Process Mining. Process Mining Manifesto. In F. Daniel, K. Barkaoui, and

S. Dustdar, editors, Business Process Management Workshops, volume 99 of Lecture Notes in

Business Information Processing, pages 169–194. Springer-Verlag, Berlin, 2012.

20. M. van Leeuwen and A. Siebes. StreamKrimp: Detecting Change in Data Streams. In Machine

Learning and Knowledge Discovery in Databases, volume 5211 of Lecture Notes in Computer

Science, pages 672–687. Springer-Verlag, Berlin, 2008.

21. J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. Byers. Big Data: The

Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute, 2011.

22. A.K. Alves de Medeiros, A.J.M.M. Weijters, and W.M.P. van der Aalst. Genetic Process Mining:

An Experimental Evaluation. Data Mining and Knowledge Discovery, 14(2):245–304, 2007.

23. J. Munoz-Gama and J. Carmona. Enhancing Precision in Process Conformance: Stability, Con-

fidence and Severity. In N. Chawla, I. King, and A. Sperduti, editors, IEEE Symposium on

Computational Intelligence and Data Mining (CIDM 2011), pages 184–191, Paris, France, April

2011. IEEE.

24. C. Myhill. Commercial Success by Looking for Desire Lines. In Computer Human Interaction,

volume 3101 of Lecture Notes in Computer Science, pages 293–304. Springer-Verlag, Berlin, 2004.

25. W. Reisig and G. Rozenberg, editors. Lectures on Petri Nets I: Basic Models, volume 1491 of

Lecture Notes in Computer Science. Springer-Verlag, Berlin, 1998.

26. E. Barlow Rogers. Rebuilding Central Park: A Management and Restoration Plan. MIT Press,

1987.

27. A. Rozinat and W.M.P. van der Aalst. Conformance Checking of Processes Based on Monitoring

Real Behavior. Information Systems, 33(1):64–95, 2008.

28. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based

Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162, 2003.

29. J.M.E.M. van der Werf, B.F. van Dongen, C.A.J. Hurkens, and A. Serebrenik. Process Discovery

using Integer Linear Programming. Fundamenta Informaticae, 94:387–412, 2010.

30. G. Widmer and M. Kubat. Learning in the Presence of Concept Drift and Hidden Contexts.

Machine Learning, 23:69–101, 1996.

Recommended Reading

To get started with process mining, the reader is advised to read the book “Process

Mining: Discovery, Conformance and Enhancement of Business Processes” [1] and the

Process Mining Manifesto [19].