
Software Tools for Technology Transfer manuscript No. (will be inserted by the editor)

Scientific Workflows for Process Mining: Building Blocks, Scenarios, and Implementation

Alfredo Bolt, Massimiliano de Leoni, Wil M.P. van der Aalst

Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands

Received: date / Revised version: date

Abstract. Over the last decade process mining emerged as a new analytical discipline able to answer a variety of questions based on event data. Event logs have a very particular structure; events have timestamps, refer to activities and resources, and need to be correlated to form process instances. Process mining results tend to be very different from classical data mining results, e.g., process discovery may yield end-to-end process models capturing different perspectives rather than decision trees or frequent patterns. A process-mining tool like ProM provides hundreds of different process mining techniques ranging from discovery and conformance checking to filtering and prediction. Typically, a combination of techniques is needed and, for every step, there are different techniques that may be very sensitive to parameter settings. Moreover, event logs may be huge and may need to be decomposed and distributed for analysis. These aspects make it very cumbersome to analyze event logs manually. Process mining should be repeatable and automated. Therefore, we propose a framework to support the analysis of process mining workflows. Existing scientific workflow systems and data mining tools are not tailored towards process mining and the artifacts used for analysis (process models and event logs). This paper structures the basic building blocks needed for process mining and describes various analysis scenarios. Based on these requirements we implemented RapidProM, a tool supporting scientific workflows for process mining. Examples illustrating the different scenarios are provided to show the feasibility of the approach.

1 Introduction

Scientific Workflow Management (SWFM) systems help users to design, compose, execute, archive, and share workflows that represent some type of analysis or experiment. Scientific workflows are often represented as directed graphs where the nodes represent “work” and the edges represent paths along which data and results can flow between nodes. Next to “classical” SWFM systems such as Taverna [18], Kepler [27], Galaxy [15], ClowdFlows [22], and jABC [33], one can also see the uptake of integrated environments for data mining, predictive analytics, business analytics, machine learning, text mining, reporting, etc. Notable examples are RapidMiner [17] and KNIME [4]. These can be viewed as SWFM systems tailored towards the needs of data scientists.

Traditional data-driven analysis techniques do not consider end-to-end processes. People make process models by hand (e.g., Petri nets, UML activity diagrams, or BPMN models), but this modeled behavior is seldom aligned with real-life event data. Process mining aims to bridge this gap by connecting end-to-end process models to the raw events that have been recorded.

Process-mining techniques enable the analysis of a wide variety of processes using event data. For example, event logs can be used to automatically learn a process model (e.g., a Petri net or BPMN model). Next to the automated discovery of the real underlying process, there are process-mining techniques to analyze bottlenecks, to uncover hidden inefficiencies, to check compliance, to explain deviations, to predict performance, and to guide users towards “better” processes. Hundreds of process-mining techniques are available and their value has been proven in many case studies. See for example the twenty case studies on the webpage of the IEEE Task Force on Process Mining [19]. The open source process mining framework ProM provides hundreds of plug-ins and has been downloaded over 100,000 times. The growing number of commercial process mining tools (Disco, Perceptive Process Mining, Celonis Process Mining, QPR ProcessAnalyzer, Software AG/ARIS PPM, Fujitsu Interstage Automated Process Discovery, etc.) further illustrates the uptake of process mining.

Fig. 1: Overview of the framework to support process mining workflows. The figure relates four analysis scenarios (result (sub-)optimality, parameter sensitivity, large-scale experiments, and repeating questions) to six categories of building blocks: event data extraction (ImportED, GenerED), event data transformation (AddED, FilterED, SplitED, MergED), process model extraction (ImportM, DiscM, SelectM), process model analysis (AnalyzeM, EvaluaM, CompareM, AnalyzeED, GenerR), process model transformations (RepairM, DecompM, MergeM), and process model enhancement (EnrichM, ImproveM); the implementation layer is RapidProM, building on ProM and RapidMiner.

For process mining typically many analysis steps need to be chained together. Existing process mining tools do not support such analysis workflows. As a result, analysis may be tedious and it is easy to make errors. Repeatability and provenance are jeopardized by manually executing more involved process mining workflows.

This paper is motivated by the observation that tool support for process mining workflows is missing. None of the process mining tools (ProM, Disco, Perceptive, Celonis, QPR, etc.) provides a facility to design and execute analysis workflows. None of the scientific workflow management systems, including analytics suites like RapidMiner and KNIME, supports process mining. Yet, process models and event logs are very different from the artifacts typically considered. Therefore, we propose the framework to support process mining workflows depicted in Figure 1.

This paper considers four analysis scenarios where process mining workflows are essential:


– Result (sub-)optimality: Often different process mining techniques can be applied and upfront it is not clear which one is most suitable. By modeling the analysis workflow, one can just perform all candidate techniques on the data, evaluate the different analysis results, and pick the result with the highest quality (e.g., the process model best describing the observed behavior).

– Parameter sensitivity: Different parameter settings and alternative ways of filtering can have unexpected effects. Therefore, it is important to see how sensitive the results are (e.g., leaving out some data or changing a parameter setting a bit should not change the results dramatically). It is important to not simply show the analysis result without having some confidence indications.

– Large-scale experiments: Each year new process mining techniques become available and larger data sets need to be tackled. For example, novel discovery techniques need to be evaluated through massive testing, and larger event logs need to be decomposed to make analysis feasible. Without automated workflow support, these experiments are tedious, error-prone, and time consuming.

– Repeating questions: It is important to lower the threshold for process mining to let non-expert users approach it. Questions are often repetitive, e.g., the same analysis is done for a different period or a different group of cases. Process mining workflows facilitate recurring forms of analysis.

As shown in Figure 1 these scenarios build on process mining building blocks grouped into six categories:

– Event data extraction: Building blocks to extract data from systems or to create synthetic data.

– Event data transformation: Building blocks to preprocess data (e.g., splitting, merging, filtering, and enriching) before analysis.

– Process model extraction: Building blocks to obtain process models, e.g., through discovery or selection.

– Process model analysis: Building blocks to evaluate models, e.g., to check the internal consistency or to check conformance with respect to an event log.

– Process model transformations: Building blocks to repair, merge or decompose process models.

– Process model enhancement: Building blocks to enrich process models with additional perspectives or to suggest process improvements.

Building blocks can be chained together to support specific analysis scenarios. The suggested approach has been implemented building on the process mining framework ProM and the workflow and data mining capabilities of RapidMiner. The resulting tool, called RapidProM, supports process mining workflows. ProM was selected because it is open source and there is no other tool that supports as many process mining building blocks. RapidMiner was selected because it allows for extensions that can be offered through a marketplace. RapidProM is also offered as such an extension, and the infrastructure allows us to mix process mining with traditional data mining approaches, text mining, reporting, and machine learning. Overall, RapidProM offers comprehensive support for any type of analysis involving event data and processes.

The remainder of this paper is organized as follows. Section 2 discusses related work and positions our framework. An initial set of process-mining building blocks is described in Section 3. These building blocks support the four analysis scenarios described in Section 4. The RapidProM implementation is presented in Section 5. Section 6 evaluates the approach by showing concrete examples. Finally, Section 7 concludes the paper.

2 Related Work

Over the last decade, process mining emerged as a new scientific discipline on the interface between process models and event data [36]. Conventional Business Process Management (BPM) [37,51] and Workflow Management (WfM) [25,41] approaches and tools are mostly model-driven with little consideration for event data. Data Mining (DM) [16], Business Intelligence (BI), and Machine Learning (ML) [29] focus on data without considering end-to-end process models. Process mining aims to bridge the gap between BPM and WfM on the one hand and DM, BI, and ML on the other hand. A wealth of process discovery [24,43,50] and conformance checking [1,2,39] techniques has become available. For example, the process mining framework ProM [47] provides hundreds of plug-ins supporting different types of process mining (www.processmining.org).

This paper takes a different perspective on the gap between analytics and BPM/WfM. We propose to use workflow technology for process mining rather than the other way around. To this end, we focus on particular kinds of scientific workflows composed of process mining operators.

Differences between scientific and business workflows have been discussed in several papers [3]. Despite unification attempts (e.g., [31]), both domains have remained quite disparate due to differences in functional requirements, selected priorities, and disjoint communities.

Obviously, the work reported in this paper is closer to scientific workflows than business workflows (i.e., traditional BPM/WfM from the business domain). Numerous Scientific Workflow Management (SWFM) systems have been developed. Examples include Taverna [18], Kepler [27], Galaxy [15], ClowdFlows [22], jABC [33], Vistrails, Pegasus, Swift, e-BioFlow, VIEW, and many others. Some of the SWFM systems (e.g., Kepler and Galaxy) also provide repositories of models. The website myExperiment.org lists over 3500 workflows shared by its members [14]. The diversity of the different approaches illustrates that the field is evolving in many different ways. We refer to the book [34] for an extensive introduction to SWFM.

An approach to mine process models for scientific workflows (including data and control dependencies) was presented in [53]. This approach uses “process mining for scientific workflows” rather than applying scientific workflow technology to process mining. The results in [53] can be used to recommend scientific workflow compositions based on actual usage. To our knowledge, RapidProM is the only approach supporting “scientific workflows for process mining”. The demo paper [28] reported on the first implementation. In the meantime, RapidProM has been refactored based on various practical experiences.

There are many approaches that aim to analyze repositories of scientific workflows. In [52], the authors provide an extensible process library for analyzing jABC workflows empirically. In [12], graph clustering is used to discover subworkflows from a repository of workflows. Other analysis approaches include [13], [26], and [49].

Scientific workflows have been developed and adopted in various disciplines, including physics, astronomy, bioinformatics, neuroscience, earth science, economics, health, and social sciences. Various collections of reusable workflows have been proposed for all of these disciplines. For example, in [35] the authors describe workflows for quantitative data analysis in the social sciences.

The boundary between data analytics tools and scientific workflow management systems is not well-defined. Tools like RapidMiner [17] and KNIME [4] provide graphical workflow modeling and execution capabilities. Even the scripting in R [20] can be viewed as primitive workflow support. In this paper we build on RapidMiner as it allows us to mix process mining with data mining and other types of analytics (e.g., text mining). Earlier we developed extensions of ProM for chaining process mining plug-ins together, but these were merely prototypes. We also realized a prototype using an integration between KNIME and ProM. However, for reasons of usability, we opted for RapidMiner as a platform to expose process mining capabilities.

3 Definition of the process-mining building blocks

To create scientific workflows for process mining we need to define the building blocks, which are then connected with each other to create meaningful analysis scenarios. This section discusses a taxonomy and a repertoire of such building blocks inspired by the so-called “BPM use cases”, which were presented in [37]. The use cases structure the BPM discipline and provide a generic way of describing the usage of BPM techniques. The BPM use cases are characterized by two main aspects. Firstly, they are abstract as they are not linked to any specific technique or algorithm. Secondly, they represent logical units of work, i.e., they cannot be conceptually split while maintaining their generality. This does not imply that concrete techniques that implement BPM use cases cannot be composed of micro-steps, according to the implementation and design that was used.

Similarly, each process-mining building block for creating process-mining workflows represents a logical unit of work. The building blocks are conceptual in the sense that they are independent of the specific implementation and represent atomic operations. The process-mining building blocks can be chained, thus producing process-mining scientific workflows to answer a variety of process-mining questions.

Each process-mining building block takes a number of inputs and produces certain outputs. The input elements represent the set (or sets) of abstract objects required to perform the operation. The process-mining building block component represents the logical unit of work needed to process the inputs and produce the outputs. Inputs and outputs are indicated through circles whereas a process-mining building block is represented by a rectangle. Arcs are used to connect the blocks to the inputs and outputs.

Two process-mining building blocks a and b are chained if one or more outputs of a are used as inputs in b. As mentioned, inputs and outputs are depicted by circles. The letter inside a circle specifies the type of the input or output. The following types of inputs and outputs are considered in this paper:

– Process models, which are a representation of the behavior of a process, are represented by letter “M”. Here we abstract from the notation used; e.g., Petri nets, Heuristics nets, and BPMN models are concrete representation languages.

– Event data sets, which contain the recording of the execution of process instances within the information system(s), regardless of the format. They are represented by letter “E”. MXML and XES are standard formats to store events.

– Information systems, which support the performance of processes at runtime. They are represented by the label “S”. Information systems may generate events used for analysis, and process mining results (e.g., predictions) may influence the information system.

– Sets of parameters to configure the application of process-mining building blocks (e.g., thresholds, weights, ratios, etc.). They are represented by letter “P”.

– Results that are generated as outputs of a process-mining building block. This can be as simple as a number or a more complex structure like a detailed report. In principle, the types enumerated above in this list (e.g., process models) can also be results. However, it is worth differentiating those specific types of outputs from results which are not process-mining specific (like a bar chart). Results are represented by letter “R”.

– Additional Data Sets that can be used as input for certain process-mining building blocks. These are represented by the letter “D”. Such an additional data set can be used to complement event data with context information (e.g., one can use weather or stock-market data to augment the event log with additional data).

Fig. 2: Process-mining building blocks related to event data extraction.
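To make the chaining idea concrete, the following sketch models building blocks as small functions tagged with the input/output types listed above (“E” for event data, “M” for models). All names and the type-checking mechanism are illustrative; this is not the RapidProM API.

```python
# Sketch: chaining typed process-mining building blocks.
# Block names, types, and the chain() helper are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Block:
    name: str
    in_type: str   # e.g. "E" (event data), "M" (model), "P", "R", "D", "S"
    out_type: str
    func: Callable

def chain(a, b, data):
    """Run block a, then feed its output to block b; a and b can be
    chained only if a's output type matches b's input type."""
    assert a.out_type == b.in_type, f"{a.name} cannot be chained to {b.name}"
    return b.func(a.func(data))

# A toy FilterED (E -> E) keeping only traces with more than one event,
# chained to a toy DiscM (E -> M) that "discovers" the set of activities.
filter_ed = Block("FilterED", "E", "E",
                  lambda log: [t for t in log if len(t) > 1])
disc_m = Block("DiscM", "E", "M",
               lambda log: {"activities": sorted({a for t in log for a in t})})

log = [["register", "approve"], ["register"]]
print(chain(filter_ed, disc_m, log))  # {'activities': ['approve', 'register']}
```

The type labels mirror the circles in the paper's diagrams: chaining is legal exactly when an output circle of one block can serve as an input circle of the next.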

The remainder of this section provides a taxonomy of process-mining building blocks grouped in six different categories. For each category, several building blocks are provided. They were selected because of their usefulness for the definition of many process-mining scientific workflows. The taxonomy is not intended to be exhaustive; there will be new process-mining building blocks as the discipline evolves.

3.1 Event data extraction

Event data are the cornerstone of process mining. In order to be used for analysis, event data have to be extracted and made available. All of the process-mining building blocks of this category can extract event data from different sources. Figure 2 shows some process-mining building blocks that belong to this category.

Import event data (ImportED). Information systems store event data in different formats and media, from files on a hard drive to databases in the cloud. This building block represents the functionality of extracting event data from any of these sources. Some parameters can be set to drive the event-data extraction. For example, event data can be extracted from files in standard formats, such as XES¹, or from transactional databases.

¹ XES (Extensible Event Stream) is an XML-based standard for event logs: http://www.xes-standard.org. It provides a standard format for the interchange of event log data between tools and application domains.
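As a minimal illustration of what an ImportED block consumes, the sketch below parses a hand-made XES-like fragment with the Python standard library. Real XES files carry more structure (extensions, classifiers, typed attributes); this only shows the log, trace, and event nesting and is not a full XES parser.

```python
# Toy ImportED: parse a minimal XES-style fragment (illustrative only,
# not a complete implementation of the XES standard).
import xml.etree.ElementTree as ET

XES = """<log>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event>
      <string key="concept:name" value="register"/>
      <date key="time:timestamp" value="2015-01-05T10:00:00"/>
    </event>
    <event>
      <string key="concept:name" value="approve"/>
      <date key="time:timestamp" value="2015-01-06T09:30:00"/>
    </event>
  </trace>
</log>"""

def import_ed(xes_text):
    root = ET.fromstring(xes_text)
    traces = []
    for trace in root.findall("trace"):
        case = trace.find("string[@key='concept:name']").get("value")
        events = [{a.get("key"): a.get("value") for a in ev}
                  for ev in trace.findall("event")]
        traces.append({"case": case, "events": events})
    return traces

log = import_ed(XES)
print(log[0]["case"], [e["concept:name"] for e in log[0]["events"]])
# case-1 ['register', 'approve']
```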

Fig. 3: Process-mining building blocks related to event data transformations

Generate event data from model (GenerED). In a number of cases, one wants to assess whether a certain technique returns the expected or desired output. For this assessment, controlled experiments are necessary where input data (i.e., synthetic event data) is generated in a way that the expected output of the technique is clearly known. Given a process model M, this building block represents the functionality of generating event data that record the possible execution of instances of M. This is an important function for, e.g., testing a new discovery technique. Various simulators have been developed to support the generation of event data.
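A GenerED block can be approximated by a random walk over a small transition system, as in the sketch below. The model, its activities, and the use of a fixed seed are all made up for illustration; real simulators support timing, resources, and richer control flow.

```python
# Toy GenerED: generate synthetic traces by randomly walking a small,
# hypothetical transition system (activity -> possible successors).
import random

MODEL = {
    "start":    ["register"],
    "register": ["check", "approve"],
    "check":    ["approve"],
    "approve":  ["end"],     # 'end' terminates a trace
}

def gener_ed(model, n_traces, seed=42):
    rng = random.Random(seed)   # fixed seed for reproducible experiments
    traces = []
    for _ in range(n_traces):
        trace, current = [], "start"
        while True:
            current = rng.choice(model[current])
            if current == "end":
                break
            trace.append(current)
        traces.append(trace)
    return traces

for t in gener_ed(MODEL, 3):
    print(t)
```

Because every generated trace is produced by the model, the expected output of a discovery technique applied to these traces is known in advance, which is exactly the point of controlled experiments.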

3.2 Event data transformation

Sometimes, event data sets are not sufficiently rich to enable certain process-mining analyses. In addition, certain portions of a data set should be excluded because they are irrelevant, out of the scope of the analysis, or even noise. Therefore, a number of event data transformations may be required before doing further analysis. This category comprises the building blocks that provide functionalities to perform the necessary event data transformations. Figure 3 shows the repertoire of process-mining building blocks that belong to this category.

Add data to event data (AddED). In order to perform a certain analysis or to improve the results, the event data can be augmented with additional data coming from different sources. For instance, if the process involves citizens, the event data can be augmented with data from the municipality data source. If the level of performance of a process is suspected to be influenced by the weather, event data can incorporate weather data coming from a system storing such data. If the event data contain a ZIP code, then other data fields such as country or city can be added to the event data from external data sources. This building block represents the functionality of augmenting event data using external data, represented as a generic data set in the figure.
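The ZIP-code example can be sketched as a simple join of event attributes against an external data set. The ZIP-to-city mapping and attribute names below are illustrative values, not real data.

```python
# Toy AddED: enrich events that carry a ZIP code with city information
# taken from an external data set (values are illustrative).
zip_to_city = {"5612": "Eindhoven", "1011": "Amsterdam"}

def add_ed(event_data, external):
    # Join each event against the external data set on the 'zip' key.
    return [{**e, "city": external.get(e.get("zip"), "unknown")}
            for e in event_data]

log = [{"activity": "register", "zip": "5612"},
       {"activity": "approve",  "zip": "9999"}]
print(add_ed(log, zip_to_city))
# [{'activity': 'register', 'zip': '5612', 'city': 'Eindhoven'},
#  {'activity': 'approve', 'zip': '9999', 'city': 'unknown'}]
```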

Filter event data (FilterED). Several reasons may exist to filter out part of the event data. For instance, the process behavior may exhibit concept drifts over time. In those situations, the analysis needs to focus on certain parts of the event data instead of all of it. One could filter the event data and use only those events that occurred, e.g., in the year 2015. As a second example, the same process may run at different geographical locations. One may want to restrict the scope of the analysis to a specific location by filtering out the event data referring to different locations. This motivates the importance of being able to filter event data in various ways.
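The "events from 2015 only" example can be sketched in a few lines; the ISO-style timestamp attribute is an assumption made for illustration.

```python
# Toy FilterED: keep only events that occurred in a given year,
# assuming an ISO-formatted 'timestamp' attribute (illustrative).
def filter_ed(event_data, year):
    return [e for e in event_data if e["timestamp"].startswith(str(year))]

log = [
    {"activity": "register", "timestamp": "2015-03-01T10:00:00"},
    {"activity": "approve",  "timestamp": "2014-11-20T09:00:00"},
]
print(filter_ed(log, 2015))
# [{'activity': 'register', 'timestamp': '2015-03-01T10:00:00'}]
```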

Split event data (SplitED). Sometimes, the organization generating the event data is interested in comparing the process’ performance for different customers, offices, divisions, involved employees, etc. To perform such a comparison, the event data need to be split according to a certain criterion, e.g., according to organizational structures, and the analysis needs to be iterated over each portion of the event data. Finally, the results can be compared to highlight differences. Alternatively, the splitting of the data may be motivated by the size of the data. It may be intractable to analyze all data without decomposition or distribution. Many process-mining techniques are exponential in the number of different activities and linear in the size of the event log. If data are split in a proper way, the results of applying the techniques to the different portions can be fused into a single result. For instance, [38] discusses how to split event data while preserving the correctness of results. This building block represents the functionality of splitting event data into overlapping or non-overlapping portions.
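Splitting by an organizational attribute amounts to a partition of the event data, as in this sketch (the 'location' attribute is an assumed example):

```python
# Toy SplitED: partition event data into non-overlapping sublogs
# according to an attribute (here, an assumed 'location' field).
from collections import defaultdict

def split_ed(event_data, key):
    parts = defaultdict(list)
    for e in event_data:
        parts[e[key]].append(e)
    return dict(parts)

log = [{"activity": "register", "location": "Eindhoven"},
       {"activity": "register", "location": "Utrecht"},
       {"activity": "approve",  "location": "Eindhoven"}]
sublogs = split_ed(log, "location")
print(sorted(sublogs), [len(sublogs[k]) for k in sorted(sublogs)])
# ['Eindhoven', 'Utrecht'] [2, 1]
```

Each sublog can then be analyzed independently (possibly in parallel), and the per-sublog results fused afterwards.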

Merge event data (MergED). This process-mining building block is the dual of the previous one: data sets from different information systems are merged into a single event data set. This process-mining building block can also tackle the typical problems of data fusion, such as redundancy and inconsistency.
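One simple answer to the redundancy problem is to drop exact duplicates while merging, as sketched below; the choice of (case, activity, timestamp) as the deduplication key is an assumption for illustration.

```python
# Toy MergED: fuse several event data sets into one, removing exact
# duplicates (deduplication key is an illustrative choice).
def merg_ed(*logs):
    seen, merged = set(), []
    for log in logs:
        for e in log:
            key = (e["case"], e["activity"], e["timestamp"])
            if key not in seen:
                seen.add(key)
                merged.append(e)
    return merged

a = [{"case": "1", "activity": "register", "timestamp": "t1"}]
b = [{"case": "1", "activity": "register", "timestamp": "t1"},
     {"case": "1", "activity": "approve",  "timestamp": "t2"}]
print(len(merg_ed(a, b)))  # 2
```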

3.3 Process model extraction

Process mining revolves around process models to represent the behavior of a process. This category is concerned with providing building blocks to mine a process model from event data as well as to select or extract it from a process-model collection. Figure 4 lists a number of process-mining building blocks belonging to this category.

Fig. 4: Process-mining building blocks related to process model extraction

Import process model (ImportM). Process models can be stored in some media for later retrieval to conduct some analyses. This building block represents the functionality of loading a process model from some repository.

Discover process model from event data (DiscM). Process models can be manually designed to provide a normative definition for a process. These models are usually intuitive and understandable, but they might not describe accurately what happens in reality. Event data represent the “real behavior” of the process. Discovery techniques can be used to mine a process model on the basis of the behavior observed in the event data (cf. [36]). Here, we stay independent of the specific notations and algorithms. Examples of algorithms are the Alpha Miner [43], the Heuristics Miner [50], or more recent techniques like the Inductive Miner [24]. This building block represents the functionality of discovering a process model from event data. This block, as many others, can receive a set of parameters as an input to customize the application of the algorithms.
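A directly-follows graph is a common starting point for discovery algorithms and serves here as a minimal stand-in for DiscM; it is not the Alpha, Heuristics, or Inductive Miner, only an illustration of turning traces into a model-like structure.

```python
# Toy DiscM: compute a directly-follows graph (DFG) from traces.
# This is a simplistic stand-in for real discovery algorithms.
from collections import Counter

def disc_m(traces):
    dfg = Counter()
    for trace in traces:
        # Count each pair of directly-following activities.
        for a, b in zip(trace, trace[1:]):
            dfg[(a, b)] += 1
    return dfg

traces = [["register", "check", "approve"],
          ["register", "approve"]]
print(dict(disc_m(traces)))
# {('register', 'check'): 1, ('check', 'approve'): 1, ('register', 'approve'): 1}
```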

Select process model from collection (SelectM). Organizations can be viewed as a collection of processes and resources that are interconnected and form a process ecosystem. This collection of processes can be managed and supported by different approaches, such as ARIS [30] or Apromore [23]. To conduct certain analyses, one needs to use some of these models and not the whole collection. In addition, one can give a criterion to retrieve a subset of the collection. This building block represents the functionality of selecting one or more process models from a process-model collection.

Fig. 5: Process-mining building blocks related to process model analysis

3.4 Process model analysis

Organizations normally use process models for the discussion, configuration and implementation of processes. In recent years, many process mining techniques have also been using process models for analysis. This category groups process-mining building blocks that can analyze process models and provide analysis results. Figure 5 shows some process-mining building blocks that belong to this category.

Analyze process model (AnalyzeM). Process models may contain a number of structural problems. For instance, the model may exhibit undesired deadlocks, activities that are never enabled for execution, variables that are used to drive decisions without previously taking on a value, etc. Several techniques have been designed to verify the soundness of process models against deadlocks and other problems [42]. This building block refers to design-time properties: the process model is analyzed without considering how the process instances are actually being executed. The checking of the conformance of the process model against real event data is covered by the next building block (EvaluaM). Undesired design-time properties occur for models designed by hand but also for models automatically mined from event data. Indeed, several discovery techniques do not guarantee to mine process models without structural problems. This building block provides functionalities for analyzing process models and detecting structural problems.

Evaluate process model using event data (EvaluaM). Besides structural analysis, process models can also be analyzed against event data. Compared with the previous building block (AnalyzeM), this block is not concerned with a design-time analysis. Conversely, it performs an a-posteriori analysis where the adherence of the process model is checked with respect to the event data, namely how the process has actually been executed. In this way, the expected or normative behavior as represented by the process model is checked against the actual behavior as recorded in event data. In the literature, this is referred to as conformance checking (cf. [36]). This can be used, for example, in fraud or anomaly detection. Replaying event data on process models has many possible uses: aligning observed behavior with modeled behavior is key in many applications. For example, after aligning event data and model, one can use the time and resource information contained in the log for performance analysis. This can be used for bottleneck identification or to gather information for simulation analysis or predictive techniques. This building block represents the functionality of analyzing or evaluating process models using event data.

Compare process models (CompareM). Processes are not static as they dynamically evolve and adapt to the business context and requirements. For example, processes can behave differently over different years, or at different locations. Such differences or similarities can be captured through the comparison of the corresponding process models. For example, the degree of similarity can be calculated. Approaches that explicitly represent configuration or variation points [40] directly benefit from such comparisons. Building block CompareM is often used in combination with SplitED, which splits the event data into sublogs, and DiscM, which discovers a model per sublog.

Analyze event data (AnalyzeED). Instead of directly creating a process model from event data, one can also first inspect the data and look at basic statistics. Moreover, it often helps to simply visualize the data. For example, one can create a so-called dotted chart [36] exploiting the temporal dimension of event data. Every event is plotted in a two-dimensional space where one dimension represents the time (absolute or relative) and the other dimension may be based on the case, resource, activity or any other property of the event. The color of the dot can be used as a third dimension. See [21]


Fig. 6: Process-mining building blocks related to process model transformations

for other approaches combining visualization with other analytical techniques.
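The dotted-chart idea described above can be sketched as a small computation of dot coordinates (a hypothetical helper, not the actual ProM visualizer): each event becomes a point whose x-coordinate is the time relative to the first event of its case and whose y-coordinate is the case identifier.

```python
def dotted_chart_points(events):
    """Compute dot coordinates for a dotted chart: x = time relative to
    the first event of the case, y = the case identifier."""
    first = {}
    for case, ts, _activity in sorted(events, key=lambda e: e[1]):
        first.setdefault(case, ts)  # earliest timestamp per case
    return [(ts - first[case], case) for case, ts, _activity in events]

# Events as (case id, timestamp, activity) tuples.
events = [("c1", 0, "a"), ("c1", 5, "b"), ("c2", 3, "a"), ("c2", 4, "b")]
points = dotted_chart_points(events)
# -> [(0, "c1"), (5, "c1"), (0, "c2"), (1, "c2")]
```

Feeding these points to any plotting library yields the relative-time dotted chart; the activity (or any other event property) could drive the dot color as the third dimension.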

Generate report (GenerR). To consolidate process models and other results, one may create a structured report. The goal is not to create new analysis results, but to present the findings in an understandable and predictable manner. Generating standard reports helps to reduce the cognitive load and helps users to focus on the things that matter most.

3.5 Process model transformations

Process models can be designed or, alternatively, discovered from event data. Sometimes, these models need to be adjusted for follow-up analyses. This category groups process-mining building blocks that provide functionality to change the structure of a process model. Figure 6 shows some process-mining building blocks that belong to this category.

Repair process model (RepairM). Process models may need to be repaired in case of consistency or conformance problems. Repairing can be regarded from two perspectives: repairing structural problems and repairing behavioral problems. The first case is related to the fact that models can contain undesired design-time properties such as deadlocks and livelocks (see also the Analyze process model building block discussed in Section 3.4). Repairing involves modifying the model to avoid those properties. Techniques for repairing behavioral problems focus on models that are structurally sound but that allow for undesired behavior or behavior that does not reflect reality. See also the Evaluate process model using event data building block discussed in Section 3.4, which is concerned with discovering the conformance problems.

Fig. 7: Process-mining building blocks related to process model enhancement

This building block provides functionality for both typesof repairing.

Decompose process model (DecompM). Processes running within organizations may be extremely large, in terms of activities, resources, data variables, etc. As mentioned, many techniques are exponential in the number of activities. The computation may be improved by splitting the models into fragments, analogously to what was mentioned for splitting the event log. If the model is split according to certain criteria, the results can be amalgamated and, hence, be meaningful for the entire model seen as a whole. For instance, the work on decomposed conformance checking [38] discusses how to split a process model to make process mining possible with models with hundreds of elements (such as activities, resources, data variables), while preserving the correctness of certain results (e.g., the fraction of deviating cases does not change because of decomposition). This block provides functionality for splitting process models into smaller fragments.

Merge process models (MergeM). Process models may also be created from the intersection (i.e., the common behavior) or union of other models. This building block provides functionality for merging process models into a single process model. When process discovery is decomposed, the resulting models need to be merged into a single model.
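For models represented as sets of directly-follows edges (a simplifying assumption, not the paper's notation-independent MergeM block), union and intersection merging can be sketched as:

```python
def merge_models(models, mode="union"):
    """Merge directly-follows models (edge sets): union keeps any edge
    observed in some model, intersection keeps only the behavior
    common to all models."""
    edge_sets = [set(m) for m in models]
    if mode == "union":
        return set().union(*edge_sets)
    return set.intersection(*edge_sets)

m1 = {("a", "b"), ("b", "c")}
m2 = {("a", "b"), ("b", "d")}
union = merge_models([m1, m2])                        # three edges
common = merge_models([m1, m2], mode="intersection")  # only ("a", "b")
```

Richer notations (Petri nets, process trees) require correspondingly richer merge operators, but the union/intersection distinction carries over.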

3.6 Process model enhancement

Process models just describing the control-flow are usually not the final result of a process mining analysis. Process models can be enriched or improved using additional data in order to provide better insights into the real process behavior that they represent. This category groups process-mining building blocks that are used to enhance process models. Figure 7 shows a summary of the process-mining building blocks that belong to this category.


Enrich process model using event data (EnrichM). The backbone of any process model contains basic structural information relating to control-flow. However, the backbone can be enriched with additional perspectives derived from event data in order to obtain better analysis results. For example, event frequencies can be annotated in a process model in order to identify the most common paths followed by process instances. Timing information can also be used to enrich a process model in order to highlight bottlenecks or long waiting times. This enrichment does not have an effect on the structure of the process model. This building block represents the functionality of enriching process models with additional information contained in event data.

Improve process model (ImproveM). Besides being enriched with data, process models can also be improved. For example, performance data can be used to suggest structural modifications in order to improve the overall process performance. It is possible to automatically improve models using causal dependencies and observed performance. The impact of such modifications could be simulated in "what-if" scenarios using performance data obtained in previous steps. This building block represents the functionality of improving process models using data from other analysis results.

4 Analysis scenarios for process mining

This section reports generic analysis scenarios that are not domain-specific and, hence, can be applied to different contexts. The analysis scenarios compose the basic process-mining building blocks and, hence, remain independent of any specific operationalization of a technique. In fact, as mentioned before, the building blocks may employ different concrete techniques, e.g., there are dozens of process discovery techniques realizing instances of building block DiscM (Figure 4).

As depicted in Figure 1, we consider four analysis scenarios: (a) result (sub-)optimality, (b) parameter sensitivity, (c) large-scale experiments, and (d) repeating questions. These are described in the remainder of this section.

As discussed in this section and validated in Section 6, the same results could also be achieved without using scientific workflows. However, this would require tedious and error-prone work, repeating the same steps ad nauseam.

4.1 Result (sub-)optimality

This subsection discusses how process-mining building blocks can be used to mine an optimal process model according to some optimality criteria. In process discovery, optimality is often difficult (or even impossible) to achieve: sub-optimal results are returned and it is not known what is "optimal".

Consider for example the process discovery task. The quality of a discovered process model is generally defined by four quality metrics [1, 2, 36, 39]:

– Replay fitness quantifies the ability of the process model to reproduce the execution of process instances as recorded in event data.

– Simplicity captures the degree of complexity of a process model, in terms of the number of activities, arcs, variables, gateways, etc.

– Precision quantifies the degree to which the model allows for more behavior than what was observed in the event data.

– Generalization quantifies the degree to which the process model is capable of reproducing behavior that is not observed in the event data but that potentially should be allowed. This is linked to the fact that event data are often incomplete in the sense that only a fraction of the possible behaviors can be observed.

Traditionally, these values are normalized between 0 and 1, where 1 indicates the highest score and 0 the lowest.

The best model within a collection of (discovered) models is the one that mediates best among these criteria. Often, the criteria are competing: a higher score for one criterion may lower the score of another. For instance, in order to obtain a more precise model, it is necessary to sacrifice the less frequent behavior observed in the event data, partly hampering the replay-fitness score.

Later in this paper we will use a scoring criterion that is the geometric average of replay fitness and precision. This is merely an example to illustrate this analysis scenario. The geometric average of replay fitness and precision seems better suited than the arithmetic average since it imposes a strong penalty if one of the criteria is low.
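The penalty effect can be seen with a tiny numeric example (illustrative values, not taken from the paper's experiments):

```python
from math import sqrt

def arithmetic_avg(fitness, precision):
    return (fitness + precision) / 2

def geometric_avg(fitness, precision):
    return sqrt(fitness * precision)

# A model with perfect fitness but very poor precision:
a = arithmetic_avg(1.0, 0.04)  # 0.52 -- still looks acceptable
g = geometric_avg(1.0, 0.04)   # 0.2  -- strongly penalized
```

The geometric average drags the score toward the weaker criterion, which matches the intent stated above.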

Figure 8 shows a suitable scientific workflow for mining a process model from event data that is sub-optimal with respect to the geometric average of fitness and precision. The optimization is done by finding the parameters that return a sub-optimal model.

Event data is loaded from an information system and used n times as input for a discovery technique using different parameter values. The n resulting process models are evaluated using the original event data, and the model with the highest geometric average is returned. Please note that the result is likely to be sub-optimal: n arbitrary parameter values are chosen out of a much larger set of possibilities. If n is sufficiently large, the result is sufficiently close to the optimum. This scientific workflow is still independent of the specific algorithm used for discovery; as such, the parameter settings are also generic.
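The loop just described can be sketched as follows; `discover` and `evaluate` are hypothetical stand-ins for any concrete DiscM and EvaluaM implementation:

```python
from math import sqrt

def best_model(log, parameter_values, discover, evaluate):
    """Try n parameter values, discover a model for each, and keep the
    one with the highest geometric average of replay fitness and
    precision (the sub-optimal search of Figure 8)."""
    scored = []
    for p in parameter_values:
        model = discover(log, p)
        fitness, precision = evaluate(model, log)
        scored.append((sqrt(fitness * precision), model))
    return max(scored, key=lambda s: s[0])
```

Plugging in real discovery and conformance-checking functions turns this sketch into the workflow of Figure 8; the structure itself stays algorithm-independent.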


Fig. 8: Result (sub-)optimality in process model discovery: process-mining scientific workflow for mining an optimal model in terms of the geometric average of replay fitness and precision

Figure 9a illustrates a scientific workflow that tries to account for generalization. For this purpose, a k-fold cross-validation approach is used. In this approach, the process instances recorded in the event data are randomly split into k folds, through building block Split event data (SplitED). In each of the k iterations, a different fold is set aside: the other k − 1 folds are used for discovery and the "elected" fold is used for evaluation through conformance checking. This corresponds to block Fold(i) with 1 ≤ i ≤ k. Finally, through the process-mining building block Select process model from collection (SelectM), the model with the best geometric average is returned as output. Figure 9b zooms into block Fold(i), showing how fold Ei is used for evaluation and folds E1, . . ., Ei−1, Ei+1, . . ., Ek are used for discovery (after being merged).
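The SplitED and Fold(i) steps can be sketched like this (a simplified illustration; the seed and helper names are our own, not part of the paper's blocks):

```python
import random

def k_folds(traces, k, seed=42):
    """Randomly split process instances into k folds (SplitED)."""
    shuffled = traces[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def fold_i(folds, i):
    """For block Fold(i): fold i is set aside for evaluation,
    the remaining k - 1 folds are merged for discovery."""
    test = folds[i]
    train = [t for j, fold in enumerate(folds) for t in fold if j != i]
    return train, test
```

Running `fold_i` for every i and scoring each discovered model with the geometric average reproduces the structure of Figure 9.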

Scientific workflows can also be hierarchically defined: in turn, the discovery process-mining building block (DiscM) in Figure 8 can be an entire scientific sub-workflow. The two scientific workflows shown in Figures 8 and 9 do not exclude each other. Process-mining building block Discover process model from event data (DiscM) can be replaced by the entire workflow in Figure 9a, thus including some generalization aspects in the search for a sub-optimal process model.

4.2 Parameter sensitivity

Parameters are used by techniques to customize their behavior, e.g., adapting to the noise level in the event log. These parameters affect the produced results in different ways, depending on the specific implementation of the technique or algorithm.

(a) Main workflow

(b) Process-mining sub-workflow for macro-block Fold(i)

Fig. 9: Process-mining main scientific workflow based on k-fold cross validation

Fig. 10: Parameter sensitivity in process discovery techniques: process-mining workflow for comparing the effects of different parameter values for a given discovery technique.

Some parameters can have more relevance than others (i.e., they have a more substantial effect on the results). There are many ways to evaluate the sensitivity of a certain parameter for a given algorithm. Figure 10 shows an example of this analysis scenario. The parameter value is varied across its range. For each of the discovered models, the geometric average of precision and generalization is computed. The results are finally plotted on a Cartesian coordinate system where the X-axis is associated with the parameter's potential values and the Y-axis with the geometric average of precision and generalization.

Alternatively, the sensitivity analysis can also focus on the filtering part, while keeping the same configuration of parameter(s) for discovery. In other words, we can study how the discovered model is affected by different filtering, namely different values of the parameter(s) that customize the application of filtering.

Figure 11 shows an example of this analysis scenario in the process mining domain, using process-mining building blocks to analyze the differences and similarities of results obtained by discovery techniques from event data that was filtered using different parameter values. In this example, event data is loaded and filtered several times using different parameter settings, producing several filtered event data sets. Each of these filtered event data sets is input for the same discovery technique using the same configuration of parameter(s).

4.3 Large-scale experiments

Empirical evaluation is often needed (and certainly recommended) when testing new process mining algorithms. In the case of process mining, many experiments need to be conducted in order to prove that these algorithms or techniques can be applied in reality, and that the results

Fig. 11: Parameter sensitivity in event data filtering: process-mining scientific workflow for comparing the effect of different event-data filtering configurations on the discovered model.

are as expected. This is due to the richness of the domain: process models can exhibit a wide variety of routing behaviors, timing behavior, and second-order dynamics (e.g., concept drift), and event logs can be large or small and may or may not contain infrequent behavior (sometimes called noise). Hence, this type of evaluation has to be conducted on a large scale. The execution and evaluation of such large-scale experiments is a tedious and time-consuming task: it requires intensive human assistance for configuring each experiment run and waiting for the results at the end of each run.

This can be greatly improved by using process mining workflows, as only one initial configuration is required. There are many examples of this analysis scenario within the process mining domain. Two of them are presented next.

4.3.1 Assessment of discovery techniques through massive testing

When developing new process discovery techniques, several experiments have to be conducted in order to test the robustness of the approach. As mentioned, many discovery techniques use parameters that can affect the produced result. It is extremely time-consuming and error-prone to assess discovery techniques using several different combinations of parameter values and, at the same time, testing on dozens of different event data sets.

Figure 12 shows the result of a large-scale experiment using n event data sets and m different parameter settings, which produces n × m resulting process models.

Fig. 12: Exhaustive testing of a discovery technique: large-scale experiments using different types of event data and parameter combinations are needed to evaluate a discovery technique

In this example, the same discovery technique with different parameters is used. However, one can consider the discovery algorithm to employ as an additional parameter. Therefore, the m different parameter settings can represent m different discovery algorithms. After mining n × m models, the best model is selected.
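The n × m grid can be sketched as a cross product over data sets and settings (names are illustrative; `discover` stands in for any concrete discovery operator):

```python
from itertools import product

def run_grid(event_data_sets, parameter_settings, discover):
    """Run one discovery per (data set, setting) pair: n x m models,
    keyed by the pair that produced them."""
    return {(d, s): discover(event_data_sets[d], parameter_settings[s])
            for d, s in product(event_data_sets, parameter_settings)}

data_sets = {"L1": [["a", "b"], ["a", "c"]], "L2": [["a", "b"]]}
settings = {"s1": 0.1, "s2": 0.5}
models = run_grid(data_sets, settings,
                  discover=lambda log, p: (len(log), p))  # stub miner
```

Ranking the n × m resulting models with a quality score then yields the "best model" step mentioned above.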

4.3.2 Decomposed process discovery

Existing process mining techniques are often unable to handle "big event data" adequately. Decomposed process mining aims to solve this problem by decomposing the process mining problem into many smaller problems, which can be solved in less time and using fewer resources.

In decomposed process discovery, large event data sets are decomposed into sublogs, each of which refers to a subset of the process' activities. Once an appropriate decomposition is performed, discovery can be applied to each cluster. This results in as many process models as there are clusters; these models are finally merged to obtain a single process model. See for example the decomposed process mining technique described in [48], which presents an approach that clusters the event data, applies discovery techniques to each cluster, and merges the process models.

Figure 13 shows a process-mining workflow that splits the event data into n subsets, then uses a discovery algorithm to discover models for each of these subsets, and finally merges them into a single process model.
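Under the simplifying assumption that models are sets of directly-follows edges (not the paper's general setting), the split/discover/merge pipeline of Figure 13 can be sketched as:

```python
def decomposed_discovery(log, activity_clusters, discover):
    """Project the log onto each activity cluster, discover a model per
    sublog, and merge the fragment models by union of their edges."""
    merged = set()
    for cluster in activity_clusters:
        sublog = [[a for a in trace if a in cluster] for trace in log]
        merged |= set(discover(sublog))
    return merged

# Stub miner: directly-follows edges of a log.
dfg_edges = lambda log: {(a, b) for tr in log for a, b in zip(tr, tr[1:])}
model = decomposed_discovery([["a", "b", "c", "d"]],
                             [{"a", "b"}, {"c", "d"}], dfg_edges)
```

A valid decomposition (as in [48]) must be chosen so that the merged fragments remain meaningful for the whole process.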

4.4 Repeating questions

Whereas the previous scenarios are aimed at (data) scientists, process mining workflows can also be used to lower the threshold for process mining. After the process mining workflow has been created and tested, the same analysis can easily be repeated using different subsets of data and different time-periods. Without workflow support this implies repeating the analysis steps manually or using hardcoded scripts that perform them over some

Fig. 13: Decomposed process discovery: a generic example using event data splitting, model composition and a specified discovery technique.

input data. The use of scientific workflows is clearly beneficial: the same workflow can be replayed many times using different inputs where no further configuration is required.

There are many examples of this analysis scenario within the process-mining domain. Two representative examples are described next.

4.4.1 Periodic benchmarking

Modern organizations make large investments to improve their own processes: better performance in terms of costs, time, or quality. In order to measure these improvements, organizations have to evaluate their performance periodically. This requires them to evaluate the performance of the new time-period and compare it with previous periods. Performance can improve or degrade in different time-periods. Obviously, the returned results require human judgment and, hence, cannot be fully automated by the scientific workflow.

Figure 14 shows an example of this analysis scenario using different process-mining building blocks. Let us assume that we want to compare period τk with period τk−1. For period τk, the entire event data is loaded and then filtered so as to keep only the portion Eτk that refers to period τk. Using Eτk, a process model Mτk is discovered. For period τk−1, the entire event data is loaded and then filtered so as to keep only the portion Eτk−1 that refers to period τk−1.

Fig. 14: Periodic performance benchmark: process mining workflow for comparing the performance of the process in two different time-periods (t and t − 1).

Finally, an evaluation is computed of the conformance between model Mτk and event-data portion Eτk and between Mτk and Eτk−1. Each evaluation returns valuable results, which are compared to find remarkable changes.
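The filter/discover/conform chain of Figure 14 can be sketched as follows; `discover` and `conform` are hypothetical stand-ins for concrete operators:

```python
def benchmark_periods(events, period_k, period_k_minus_1, discover, conform):
    """Filter the event data per period, discover a model on the newest
    period, and check its conformance against both periods."""
    e_k = [e for e in events if e["period"] == period_k]
    e_k1 = [e for e in events if e["period"] == period_k_minus_1]
    model = discover(e_k)
    return conform(model, e_k), conform(model, e_k1)

events = [{"period": 1, "act": "a"},
          {"period": 2, "act": "a"}, {"period": 2, "act": "b"}]
# Stub operators: the model is the set of observed activities, and
# conformance is the fraction of events whose activity the model allows.
discover = lambda es: {e["act"] for e in es}
conform = lambda m, es: sum(e["act"] in m for e in es) / max(len(es), 1)
score_k, score_k1 = benchmark_periods(events, 2, 1, discover, conform)
```

Comparing `score_k` with `score_k1` is the step that still requires human judgment, as noted above.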

4.4.2 Report generation over collections of data sets

Scientific workflows are very handy when generating several reports for different portions of event data, e.g., different groups of patients or customers. Since the steps are the same and the only difference is concerned with using different portions of events, this can easily be automated, even when dozens of subsets need to be taken into consideration.

From this, it follows that this scenario shares common points with large-scale experiments. However, some differences exist. The report-generation scenario is characterized by a stable workflow with a defined set of parameters, whereas in the large-scale experiments scenario, parameters may change significantly in the different iterations. In addition, the input elements used in report-generation scenarios are similar and comparable event data sets. This can be explained by the desire that reports should have the same structure. In the case of large-scale experiments, event data sets may be heterogeneous. It is actually worthwhile repeating the experiments using diverse and dissimilar event data sets as input.

Fig. 15: Report Generation workflow

Figure 15 illustrates a potential scientific workflow to generate reports that contain process-mining results. For the sake of explanation, the process mining workflow is kept simple. The report is assumed to contain only three objects: the result RED of analyzing the input event data, the discovered process model M, and the results RM of the evaluation of such a model against the input event data. Process-mining building block Generate report takes these three objects as input and combines them into a reporting document R.
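A minimal sketch of the Generate report step (our own illustrative format, not RapidProM's) combines the three objects into one plain-text document R:

```python
def generate_report(red, model, rm):
    """Combine the three result objects (RED, M, RM) into one
    reporting document R with a fixed, predictable layout."""
    lines = ["== Process Mining Report ==",
             "-- Event data statistics (RED) --"]
    lines += [f"{k}: {v}" for k, v in red.items()]
    lines += ["-- Discovered model (M) --", str(model),
              "-- Model evaluation (RM) --"]
    lines += [f"{k}: {v}" for k, v in rm.items()]
    return "\n".join(lines)

report = generate_report({"cases": 10, "events": 57},
                         "M (model rendering)", {"fitness": 0.9})
```

Because the layout is fixed, reports generated for different data subsets stay directly comparable, which is the point of this scenario.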

5 Implementation

Our framework to support process mining workflows, shown in Figure 1, is supported by RapidProM. RapidProM was implemented using ProM and RapidMiner. The building blocks defined in Section 3 have been concretely implemented in RapidProM. Most of the building blocks have been realized using RapidMiner-specific wrappers of plug-ins of the ProM Framework [47]. ProM is a framework that allows researchers to implement process mining algorithms in a standardized environment, which provides a number of facilities to support programmers. Nowadays, it has become the de-facto standard for process mining. ProM can be freely downloaded from http://www.promtools.org. The extension of RapidMiner that provides process-mining blocks for scientific workflows using ProM is also freely available. At the time of writing, RapidProM provides 37 process mining operators, including several process-discovery algorithms and filters as well as importers and exporters from/to different process-modeling notations.


The first version of RapidProM was presented during the BPM 2014 demo session [28]. This initial version successfully implemented basic process-mining functionalities and was downloaded 4020 times between its release in July 2014 and April 2015 (on average, over 400 monthly downloads). However, process mining is a relatively new discipline, which is developing and evolving very fast. Therefore, various changes and extensions were needed to keep up with the state of the art. The new version incorporates implementations of various new algorithms, which did not exist in the first version.

The RapidProM extension is hosted both at http://www.rapidprom.org and in the RapidProM extension manager server, which can be directly accessed through the RapidMiner Marketplace. After installation, the RapidMiner operators are available for use in any RapidMiner workflow. Figure 16 shows an example of a process-mining scientific workflow implemented using RapidProM.

Readers are referred to http://www.rapidprom.org for detailed installation, setup, and troubleshooting instructions.

Table 1 shows the ProM import plugins implemented in RapidProM Version 2. These five operators are complemented with RapidMiner native operators to export visual results and data tables, so that most final results of process mining workflows can be exported and saved outside RapidMiner.

– Read Log (path): Imports an event log from a specified path
– Read Log (file): Takes a file object (usually obtained from a "loop files" operator) and transforms it into an event log
– Read PNML: Imports a Petri net from a specified path
– Export Event Log: Exports an event log in different formats
– Export PNML: Exports a Petri net in PNML format

Table 1: Import/Export Operators

Table 2 shows a list of ProM discovery plugins implemented in RapidProM as discovery operators. These nine operators (usually referred to as miners) are the most commonly used discovery techniques for process mining. These discovery operators produce different models using different techniques and parameters to fine-tune the resulting model.

Table 3 shows a list of ProM visualization plugins implemented in RapidProM as visualization operators. These four visualization plugins are accompanied by renderers that allow one to inspect both intermediate and final results during and after the execution of process mining workflows.

– Alpha Miner: Discovers a Petri net. Fast, but results are not always reliable because of overfitting issues
– ILP Miner: Discovers a Petri net by solving ILP problems. Results have perfect fitness but generally poor precision. Slow on large logs
– Genetic Miner: Discovers a Heuristics net using genetic algorithms. Depending on the parameter settings it can be slow or fast
– Evolutionary Tree Miner: Discovers a process tree using a guided genetic algorithm based on model quality dimensions. Guarantees soundness but cannot represent all possible behavior due to its block-structured nature
– Heuristics Miner: Discovers a Heuristics net using a probabilistic approach. Good when dealing with noise. Fast
– Inductive Miner: Discovers a process tree or Petri net. Good when dealing with infrequent behavior and large logs. Soundness is guaranteed
– Social Network Miner: Discovers a social network from the event log resources. Different social networks can be obtained: similar task, handover of work, etc.
– Transition System Miner: Discovers a transition system using parameters to simplify the state-space exploration
– Fuzzy Miner: Discovers a Fuzzy model. Good when dealing with unstructured behavior. Fast

Table 2: Discovery Operators

– Dotted Chart: Shows the temporal distribution of events within traces
– Inductive Visual Miner: Process exploration tool that shows an annotated interactive model for quick exploration of a log
– Animate Log in Fuzzy Instance: Shows an animated replay of a log projected over a Fuzzy instance
– PomPom: Petri net visualizer that emphasizes those parts of the process that correspond to high-frequent events in a given log

Table 3: Visualization Operators

Table 4 shows a list of ProM conversion plugins implemented in RapidProM as conversion operators. These four conversion plugins are intended for converting models into other model formats. This way we improve the chances that a produced model can be used by other operators. For example, if a Heuristics net is discovered


Fig. 16: Process Mining Workflows in RapidMiner through the RapidProM extension

from an event log using the Heuristics Miner, then the Replay Log on Petri Net (Conformance) operator cannot be executed unless a conversion to a Petri net is performed (which is supported).

– Reachability Graph to Petri Net: Converts a reachability graph into a Petri net
– Petri Net to Reachability Graph: Converts a Petri net into a reachability graph
– Heuristics Net to Petri Net: Converts a Heuristics net into a Petri net
– Process Tree to Petri Net: Converts a process tree into a Petri net

Table 4: Conversion Operators

Table 5 shows a list of log processing operators implemented in RapidProM. Some of these eight operators use ProM functionalities to perform their tasks, but others were developed specifically for RapidProM, as the ProM framework generally does not use flat data tables to represent event data. These operators are used to modify an event log by adding attributes or events, or by converting it to data tables, and vice versa.
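The log/table conversion pair (ExampleSet to Event Log and back) can be illustrated with a minimal sketch of our own (real event logs carry timestamps and many attributes; here a trace is just an activity sequence):

```python
def log_to_table(log):
    """Flatten an event log (case id -> trace) into rows of a data table."""
    return [{"case": case_id, "pos": i, "activity": act}
            for case_id, trace in log.items()
            for i, act in enumerate(trace)]

def table_to_log(rows):
    """Group flat rows back into an event log, ordered by position."""
    log = {}
    for row in sorted(rows, key=lambda r: (r["case"], r["pos"])):
        log.setdefault(row["case"], []).append(row["activity"])
    return log

log = {"c1": ["a", "b"], "c2": ["a", "c"]}
assert table_to_log(log_to_table(log)) == log  # lossless round trip
```

The round trip is what lets RapidMiner's table-based operators interoperate with ProM's log-based ones.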

Table 6 shows a list of ProM plugins implemented in RapidProM as analysis operators.

6 Evaluation

This section shows a number of instantiations of scientific workflows in RapidProM, highlighting the benefits of using scientific workflows for process mining. They are specific examples of the analysis scenarios discussed in Section 4.

6.1 Evaluating result optimality

The first experiment is related to result optimality. In this experiment, we implemented a process mining workflow using RapidProM to extract the model that scores highest with respect to the geometric average of precision and replay fitness. For this experiment, we employed the Inductive Miner - Infrequent discovery technique [24] and used different values for the noise threshold parameter. This parameter is defined in a range of values between 0 and 1 and allows for filtering out


Operator Name: Operator Description

Add Table Column to Event Log: Adds a single Data Table column as trace attribute to a given Event Log
Add Trace Attributes to Event Log: Adds all columns of a Data Table (except case id) as trace attributes to a given Event Log
Add Event Attributes to Event Log: Adds all columns of a Data Table (except case id and event id) as event attributes to a given Event Log
Add Events to Event Log: Adds Events to a given Event Log from selected columns of a Data Table
Merge Event Logs: Merges two Event Logs
Add Artificial Start and End Event: Adds an artificial Start Event to the beginning, and an artificial End Event to the ending, of each trace
Event Log to ExampleSet: Converts an Event Log into a Data Table (ExampleSet)
ExampleSet to Event Log: Converts a Data Table (ExampleSet) into an Event Log

Table 5: Log Processing Operators

Operator Name: Operator Description

WOFLAN: Analyzes the soundness of a Petri Net
Select Fuzzy Instance: Selects the best fuzzy instance from a Fuzzy Model
Repair Model: Replays an Event Log in a Petri Net and repairs this net to improve fitness
Reduce Silent Transitions: Reduces a Petri Net by removing invisible transitions (and places) that are not used
Feature Prediction: Produces predictions of business process features using decision trees
Replay Log on Petri Net (Performance): Replays a Log on a Petri Net and generates performance metrics such as throughput time, waiting time, etc.
Replay Log on Petri Net (Conformance): Replays a Log on a Petri Net and generates conformance metrics such as fitness

Table 6: Analysis Operators

This parameter allows for filtering out infrequent behavior contained in the event data in order to produce a simpler model: the lower the value of this parameter (i.e., close to 0), the larger the fraction of behavior observed in the event data that the model allows. To measure fitness and precision, we employ the conformance-checking techniques reported in [1, 2]. All techniques are available as part of the RapidProM extension.

This experiment instantiates the analysis scenario described in Section 4.1 and depicted in Figure 8. The model obtained with the default value of the parameter is compared with the model that (almost) maximizes the geometric average of fitness and precision. To obtain this result, we designed a scientific workflow where several models are discovered with different values of the noise threshold parameter. Finally, the workflow selects the model with the highest value of the geometric average among those discovered. As input, we used an event log that records real-life executions of a process for road-traffic fine management, which is employed by a local police force in Italy [11]. This event data refers to 150,370 process-instance executions and records the execution of around 560,000 activities.
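The selection step of this workflow can be sketched as a parameter sweep that keeps the model maximizing the geometric average. The `discover` and `evaluate` callables below are stand-ins, not the actual RapidProM operators.

```python
from math import sqrt

def geometric_average(fitness, precision):
    """Geometric average of replay fitness and precision."""
    return sqrt(fitness * precision)

def select_best_model(log, discover, evaluate, thresholds):
    """Discover one model per noise-threshold value and keep the model
    whose geometric average of fitness and precision is highest.
    `discover(log, t)` and `evaluate(log, model)` are placeholders for
    the Inductive Miner - Infrequent and the conformance checkers."""
    best_score, best_t, best_model = -1.0, None, None
    for t in thresholds:
        model = discover(log, t)
        fitness, precision = evaluate(log, model)
        score = geometric_average(fitness, precision)
        if score > best_score:
            best_score, best_t, best_model = score, t, model
    return best_score, best_t, best_model
```

As a sanity check against the numbers reported below, a model with replay fitness 0.921 and precision 0.903 has geometric average sqrt(0.921 * 0.903) ≈ 0.912.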

Figure 17b shows the model obtained through our scientific workflow, whereas Figure 17a illustrates the model generated using default parameters.

There are clear differences between the models. For example, in the default model, parallel behavior dominates the beginning of the process, whereas the "optimal" model presents simpler choices. Another example concerns the final part of the model: in the default model, the last process activities can be skipped, but in the optimal model this is not possible. The optimal model has a replay fitness and precision of 0.921 and 0.903 respectively, with geometric average 0.912. It scores better than the model obtained with default parameters, whose replay fitness and precision are 1 and 0.548, respectively, with geometric average 0.708. The optimal model was generated with value 0.7 for the noise threshold parameter.

6.2 Evaluating parameter sensitivity

As a second experiment illustrating the benefits of using scientific workflows for process mining, we conducted an analysis of the sensitivity of the noise threshold parameter of the Inductive Miner - Infrequent. We again used the event data of the road-traffic fine management process from Section 6.1. This experiment operationalizes the analysis scenario discussed in Section 4.2 and depicted in Figure 10. We implemented a process mining workflow using RapidProM to explore the effect of this parameter on the final quality of the produced model. In order to do so, we discovered 41 models using different parameter values between 0 and 1 (i.e., a step size of 0.025) and evaluated their quality through the geometric average of replay fitness and precision used before.
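The sweep just described can be sketched as follows; it produces the (threshold, score) series plotted in Figure 18. Again, `discover` and `evaluate` are hypothetical stand-ins for the RapidProM operators.

```python
from math import sqrt

def sensitivity_scan(log, discover, evaluate, steps=41, step_size=0.025):
    """Evaluate model quality for noise-threshold values 0, 0.025, ..., 1.
    Returns a list of (threshold, geometric average of fitness and
    precision) pairs, one per discovered model."""
    results = []
    for i in range(steps):
        t = round(i * step_size, 3)  # 41 values covering [0, 1]
        model = discover(log, t)
        fitness, precision = evaluate(log, model)
        results.append((t, sqrt(fitness * precision)))
    return results
```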

Figure 18 shows the results of these evaluations, i.e., the variation of the geometric average for different values of the noise threshold parameter.

The graph shows that the models with a higher geometric average are produced when the parameter takes a value between 0.675 and 0.875. The worst model is obtained when the parameter is set to 1.


(a) Model mined using the default value of the noise-threshold parameter, which is 0.2. The geometric average is 0.708.

(b) Model mined using one of the best values of the noise-threshold parameter, which is 0.7. The geometric average is 0.912.

Fig. 17: Comparison of process models that are mined with the default parameters and with the parameters that maximize the geometric average of replay fitness and precision. The process is concerned with road-traffic fine management and the models are represented using the BPMN notation.

Fig. 18: Parameter sensitivity analysis: variation of the geometric average of fitness and precision when varying the value of the noise threshold parameter.

6.3 Performing large scale experiments

As mentioned before, the use of scientific workflows is very beneficial for conducting large-scale experiments with many event logs. When assessing a certain process mining technique, one cannot rely on a single event log to draw conclusions.

For instance, here we want to study how the noise threshold parameter influences the quality of the discovered model, in terms of the geometric average of fitness and precision. In Section 4.2, the experiment was conducted using a single event log, but RapidProM allows us to do this for any number of event logs. To illustrate this, we use 11 real-life event logs and produce the corresponding process models using different parameter settings.
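The nested sweep over logs and thresholds can be sketched as follows, producing a Table-7-style matrix with one row per event log, one cell per parameter value, and a row average. The callables are, as before, hypothetical placeholders for the RapidProM operators.

```python
from math import sqrt

def cross_evaluate(logs, thresholds, discover, evaluate):
    """For every (event log, noise threshold) pair, discover a model and
    score it by the geometric average of replay fitness and precision.
    Returns {log name: {threshold: score, ..., "avg": row average}}."""
    table = {}
    for name, log in logs.items():
        row = {}
        for t in thresholds:
            fitness, precision = evaluate(log, discover(log, t))
            row[t] = sqrt(fitness * precision)
        row["avg"] = sum(row.values()) / len(thresholds)
        table[name] = row
    return table
```

Column averages (one per threshold, across all logs) can be computed from the same matrix afterwards.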

Table 7 shows the results of this evaluation, where each cell shows the geometric average of the replay fitness and the precision of the model obtained using a specific parameter value (column) and event data (row).


Event data        nt=0   nt=0.1 nt=0.2 nt=0.3 nt=0.4 nt=0.5 nt=0.6 nt=0.7 nt=0.8 nt=0.9 nt=1   Average
BPI2012 [45]      0.314  0.730  0.430  0.450  0.508  0.474  0.675  0.683  0.674  0.679  0.644  0.569
BPI2013 [32]      0.847  0.826  0.778  0.863  0.458  0.458  0.458  0.458  0.458  0.458  0.453  0.592
BPI2014 [46]      0.566  0.720  0.708  0.613  0.616  0.654  0.626  0.414  0.530  0.527  0.490  0.588
Hospital [44]     0.153  0.111  0.546  0.473  0.338  0.172  0.280  0.342  0.392  0.515  0.517  0.349
Road Fines [11]   0.689  0.633  0.708  0.721  0.909  0.909  0.744  0.912  0.912  0.710  0.498  0.758
CoSeLoG 1 [5]     0.143  0.366  0.389  0.576  0.687  0.710  0.737  0.668  0.673  0.649  0.594  0.563
CoSeLoG 2 [6]     0.095  0.191  0.146  0.233  0.127  0.167  0.250  0.177  0.218  0.180  0.362  0.195
CoSeLoG 3 [7]     0.182  0.352  0.573  0.640  0.170  0.209  0.628  0.632  0.585  0.732  0.657  0.487
CoSeLoG 4 [8]     0.190  0.448  0.488  0.640  0.623  0.163  0.553  0.621  0.546  0.518  0.670  0.496
CoSeLoG 5 [9]     0.160  0.199  0.445  0.517  0.522  0.628  0.634  0.145  0.246  0.222  0.602  0.393
CoSeLoG R. [10]   0.520  0.860  0.838  0.869  0.859  0.377  0.868  0.868  0.883  0.861  0.656  0.769
Average           0.350  0.494  0.549  0.599  0.528  0.447  0.586  0.538  0.556  0.550  0.558

Table 7: Summary of a few large-scale experimental results: evaluating the geometric average of replay fitness and precision of models discovered with the Inductive Miner using different values of the noise threshold parameter (columns) and different real-life sets of event data (rows). We use nt to indicate the value of the noise threshold parameter of the algorithm.

Every event log used in this experiment is publicly available through the Digital Object Identifiers (DOIs) of the included references. To use some of them for discovery, we had to conduct some pre-processing (depending on the specifics of the event data).

The hospital event data set [44] was extremely unstructured. To provide reasonable results and to allow for conformance checking using alignments, we filtered the event log to retain the 80% most frequent behavior before applying the mining algorithm. The same was done for the five CoSeLoG event logs [5–9].

The actual results in Table 7 are not very relevant for this paper; they merely show that techniques can be evaluated on a large scale by using scientific workflows.

6.4 Automatic report generation

To illustrate the fourth analysis scenario, we used event data related to the study behavior and actual performance of students of the Faculty of Mathematics and Computer Science at Eindhoven University of Technology (TU/e). TU/e provides video lectures for many courses to support students who are unable to attend face-to-face lectures for various reasons. The event data record the views of video lectures and the exam attempts of all TU/e courses.

First of all, students generate events when they watch lectures. It is known how long and when they watch a particular lecture of a particular course. These data can be preprocessed so that low-level events are collapsed into lecture views. Second, students generate events when they take exams, and the result is added to the event.

For each course, we have generated a report that includes the results of the application of various data mining and process mining techniques. This generation is automatic in the sense that the scientific workflow takes a list of courses as input and produces as many reports as there are courses in the list.

The report contains three sections: course information, core statistics, and advanced analysis.

Figure 19 shows a small part of the report generated for the course on Business Information Systems (2II05). In the first section, the report provides information about the course, the bachelor or master programs to which it belongs, as well as information about the overall number of views of the course's video lectures. In the second section (only a small fragment is shown), some basic distributions are calculated. For example, statistics are reported about the division per gender, nationality, and final grade. The third section is devoted to process mining results. It reports the results of applying conformance checking using the event data and the ideal process model in which a student watches every video lecture in the right order, namely he/she watches the ith video lecture only after watching the (i-1)th video lecture. As expected, the results show a positive correlation between higher grades and higher compliance with the normative process just mentioned: the more a student watches all video lectures in the right order, the higher the corresponding grade will be. In addition to showing the conformance information, the report always embeds a dotted chart, which is similar to a Gantt chart (see building block AnalyzeED). The dotted chart shows the distribution of events for the different students over time. This way, one can see the patterns and frequency with which students watch video lectures.
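As a toy illustration of the sequential ideal model described above (lecture i is watched only after lecture i-1), one could score a viewing trace by how much of the ideal sequence 1, 2, ..., n it realizes in order. This is a simplified stand-in for the alignment-based conformance checking actually used; the function and scoring rule are our own illustration, not the paper's technique.

```python
def compliance(trace, n_lectures):
    """Score a viewing trace against the ideal sequential model.
    Returns 1.0 when lectures 1..n were all watched in ascending order;
    otherwise the length of the longest prefix of the ideal sequence
    that appears in order in the trace, divided by n."""
    expected = 1
    for lecture in trace:
        if lecture == expected:
            expected += 1
    return (expected - 1) / n_lectures
```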

Note that reports like the one shown in Figure 19 are very informative for both professors and students. By using RapidProM, we are able to automatically generate reports for all courses (after data conversion and modeling the desired process mining workflow).

7 Conclusions

This paper presented a framework for supporting the design and execution of process mining workflows.


Fig. 19: Fragments of the automatically generated report using RapidProM: the first page of the report, the conformance results, and the dotted chart.

As argued, scientific workflow systems are not tailored towards the analysis of processes based on models and logs. Tools like RapidMiner and KNIME can model analysis workflows but do not provide any process mining capabilities. The focus of these tools is mostly on traditional data mining and reporting capabilities that tend to use tabular data. Also, more classical Scientific Workflow Management (SWFM) systems like Kepler and Taverna do not provide dedicated support for artifacts like process models and event logs. Process mining tools like ProM, Disco, Perceptive, Celonis, QPR, etc. do not provide any workflow support. The inability to model and execute process mining workflows was the primary motivation for developing the framework presented in this paper.

We proposed generic process mining building blocks grouped into six categories. These can be chained together to create process mining workflows. We identified four broader analysis scenarios and provided conceptual workflows for these. The whole approach is supported using RapidProM, which is based on ProM and RapidMiner. RapidProM has been tested in various situations, and in this paper we demonstrated this using concrete instances of the four analysis scenarios. RapidProM is freely available via http://www.rapidprom.org and the RapidMiner Marketplace.

Future work aims at extending the set of process mining building blocks and evaluating RapidProM in various case studies. We continue to apply RapidProM in all four areas described. Moreover, we would like to make standard workflows available via infrastructures like myExperiment and OpenML. We are also interested in further cross-fertilization between process mining and other analysis techniques available in tools like RapidMiner and KNIME (text mining, clustering, predictive analytics, etc.).

Acknowledgements

The authors thank Ronny Mans for his seminal work on the initial version of RapidProM.

References

1. Arya Adriansyah, Jorge Munoz-Gama, Josep Carmona, Boudewijn F. van Dongen, and Wil M. P. van der Aalst. Alignment based precision checking. In Business Process Management Workshops, volume 132, pages 137–149. Springer Berlin Heidelberg, 2013.

2. Arya Adriansyah, Jorge Munoz-Gama, Josep Carmona, Boudewijn F. van Dongen, and Wil M. P. van der Aalst. Measuring precision of modeled behavior. Information Systems and e-Business Management, 13(1):37–67, 2015.


3. Roger Barga and Dennis Gannon. Scientific versus business workflows. In Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields, editors, Workflows for e-Science, pages 9–16. Springer Verlag, Berlin, 2007.

4. Michael R. Berthold, Nicolas Cebron, Fabian Dill, Thomas R. Gabriel, Tobias Koetter, Thorsten Meinl, Peter Ohl, Christoph Sieb, Kilian Thiel, and Bernd Wiswedel. KNIME: The Konstanz Information Miner. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme, and Reinhold Decker, editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 319–326. Springer Berlin Heidelberg, 2008.

5. Joos C. A. M. Buijs. Environmental permit application process (WABO), CoSeLoG project, municipality 1. 10.4121/uuid:c45dcbe9-557b-43ca-b6d0-10561e13dcb5, 2014.

6. Joos C. A. M. Buijs. Environmental permit application process (WABO), CoSeLoG project, municipality 2. 10.4121/uuid:34b4f6f4-dbe0-4857-bf75-5b9e1138eb87, 2014.

7. Joos C. A. M. Buijs. Environmental permit application process (WABO), CoSeLoG project, municipality 3. 10.4121/uuid:a8ed945d-2ad8-480e-8348-cf7f06c933b3, 2014.

8. Joos C. A. M. Buijs. Environmental permit application process (WABO), CoSeLoG project, municipality 4. 10.4121/uuid:e8c3a53d-5301-4afb-9bcd-38e74171ca32, 2014.

9. Joos C. A. M. Buijs. Environmental permit application process (WABO), CoSeLoG project, municipality 5. 10.4121/uuid:c399c768-d995-4086-adda-c0bc72ad02bc, 2014.

10. Joos C. A. M. Buijs. Receipt phase of an environmental permit application process (WABO), CoSeLoG project. 10.4121/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6, 2014.

11. Massimiliano de Leoni and Felix Mannhardt. Road traffic fine management process. 10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5, 2015.

12. Claudia Diamantini, Domenico Potena, and Emanuele Storti. Mining usage patterns from a repository of scientific workflows. In Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC '12, pages 152–157, New York, NY, USA, 2012. ACM.

13. Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, and Carole A. Goble. Common motifs in scientific workflows: An empirical analysis. Future Generation Computer Systems, 36:338–351, 2014.

14. Carole A. Goble, Jiten Bhagat, Sergejs Aleksejevs, Don Cruickshank, Danius Michaelides, David Newman, Mark Borkum, Sean Bechhofer, Marco Roos, Peter Li, and David De Roure. myExperiment: A repository and social network for the sharing of bioinformatics workflows. Nucleic Acids Research, 38(suppl 2):W677–W682, 2010.

15. Jeremy Goecks, Anton Nekrutenko, James Taylor, and The Galaxy Team. Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biology, 11(8):R86, 2010.

16. David J. Hand, Padhraic Smyth, and Heikki Mannila. Principles of Data Mining. MIT Press, Cambridge, MA, USA, 2001.

17. Markus Hofmann and Ralf Klinkenberg. RapidMiner: Data Mining Use Cases and Business Analytics Applications. Chapman and Hall/CRC, 2013.

18. Duncan Hull, Katy Wolstencroft, Robert Stevens, Carole A. Goble, Matthew R. Pocock, Peter Li, and Tom Oinn. Taverna: A tool for building and running workflows of services. Nucleic Acids Research, 34:729–732, 2006.

19. IEEE Task Force on Process Mining. Process Mining Case Studies. http://www.win.tue.nl/ieeetfpm/doku.php?id=shared:process_mining_case_studies, 2013.

20. Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.

21. Daniel Keim, Gennady Andrienko, Jean-Daniel Fekete, Carsten Görg, Jörn Kohlhammer, and Guy Melançon. Visual analytics: Definition, process, and challenges. In Andreas Kerren, John T. Stasko, Jean-Daniel Fekete, and Chris North, editors, Information Visualization, volume 4950 of Lecture Notes in Computer Science, pages 154–175. Springer Berlin Heidelberg, 2008.

22. Janez Kranjc, Vid Podpečan, and Nada Lavrač. ClowdFlows: A cloud based scientific workflow platform. In Peter A. Flach, Tijl De Bie, and Nello Cristianini, editors, Machine Learning and Knowledge Discovery in Databases, volume 7524 of Lecture Notes in Computer Science, pages 816–819. Springer Berlin Heidelberg, 2012.

23. Marcello La Rosa, Hajo A. Reijers, Wil M. P. van der Aalst, Remco M. Dijkman, Jan Mendling, Marlon Dumas, and Luciano García-Bañuelos. Apromore: An advanced process model repository. Expert Systems with Applications, 38(6):7029–7040, 2011.

24. Sander J. J. Leemans, Dirk Fahland, and Wil M. P. van der Aalst. Discovering block-structured process models from event logs: A constructive approach. In José-Manuel Colom and Jörg Desel, editors, Application and Theory of Petri Nets and Concurrency, volume 7927 of Lecture Notes in Computer Science, pages 311–329. Springer Berlin Heidelberg, 2013.

25. Frank Leymann and Dieter Roller. Production Workflow: Concepts and Techniques. Prentice Hall, 2000.

26. Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, and Rebecca Koskela. Trends in use of scientific workflows: Insights from a public repository and recommendations for best practice. IJDC, 7(2):92–100, 2012.

27. Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew B. Jones, Edward A. Lee, Jing Tao, and Yang Zhao. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience, 18(10):1039–1065, 2006.

28. Ronny S. Mans, Wil M. P. van der Aalst, and H. M. W. Verbeek. Supporting process mining workflows with RapidProM. In Lior Limonad and Barbara Weber, editors, Proceedings of the BPM Demo Sessions 2014, co-located with the 12th International Conference on Business Process Management (BPM), volume 1295 of CEUR Workshop Proceedings, pages 56–60. CEUR-WS.org, 2014.

29. Tom M. Mitchell. Machine Learning. McGraw Hill Series in Computer Science. McGraw-Hill, 1997.


30. August-Wilhelm Scheer and Markus Nüttgens. ARIS architecture and reference models for business process management. In Wil M. P. van der Aalst, Jörg Desel, and Andreas Oberweis, editors, Business Process Management, volume 1806 of Lecture Notes in Computer Science, pages 376–389. Springer Berlin Heidelberg, 2000.

31. Mirko Sonntag, Dimka Karastoyanova, and Ewa Deelman. Bridging the gap between business and scientific workflows: Humans in the loop of scientific workflows. In Sixth International Conference on e-Science, e-Science 2010, 7-10 December 2010, Brisbane, QLD, Australia, pages 206–213, 2010.

32. Ward Steeman. BPI Challenge 2013. 10.4121/500573e6-accc-4b0c-9576-aa5468b10cee, 2013.

33. Bernhard Steffen, Tiziana Margaria, Ralf Nagel, Sven Joerges, and Christian Kubczak. Model-driven development with the jABC. In Eyal Bin, Avi Ziv, and Shmuel Ur, editors, Hardware and Software, Verification and Testing, volume 4383 of Lecture Notes in Computer Science, pages 92–108. Springer Berlin Heidelberg, 2007.

34. Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, and Matthew Shields. Workflows for e-Science: Scientific Workflows for Grids. Springer Verlag, Berlin, 2007.

35. Kenneth J. Turner and Paul S. Lambert. Workflows for quantitative data analysis in the social sciences. International Journal on Software Tools for Technology Transfer, pages 1–18, 2014.

36. Wil M. P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer-Verlag Berlin Heidelberg, 1st edition, 2011.

37. Wil M. P. van der Aalst. A decade of business process management conferences: Personal reflections on a developing discipline. In Alistair Barros, Avigdor Gal, and Ekkart Kindler, editors, Business Process Management, volume 7481 of Lecture Notes in Computer Science, pages 1–16. Springer Berlin Heidelberg, 2012.

38. Wil M. P. van der Aalst. Decomposing process mining problems using passages. In Serge Haddad and Lucia Pomello, editors, Application and Theory of Petri Nets, volume 7347 of Lecture Notes in Computer Science, pages 72–91. Springer Berlin Heidelberg, 2012.

39. Wil M. P. van der Aalst, Arya Adriansyah, and Boudewijn F. van Dongen. Replaying history on process models for conformance checking and performance analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(2):182–192, 2012.

40. Wil M. P. van der Aalst, Alexander Dreiling, Florian Gottschalk, Michael Rosemann, and Monique H. Jansen-Vullers. Configurable process models as a basis for reference modeling. In Christoph J. Bussler and Armin Haller, editors, Business Process Management Workshops, volume 3812 of Lecture Notes in Computer Science, pages 512–518. Springer Berlin Heidelberg, 2006.

41. Wil M. P. van der Aalst, Arthur H. M. ter Hofstede, Bartek Kiepuszewski, and Alistair P. Barros. Workflow patterns. Distributed and Parallel Databases, 14(1):5–51, 2003.

42. Wil M. P. van der Aalst, Kees M. van Hee, Arthur H. M. ter Hofstede, Natalia Sidorova, H. M. W. Verbeek, Marc Voorhoeve, and Moe T. Wynn. Soundness of workflow nets: Classification, decidability, and analysis. Formal Aspects of Computing, 23(3):333–363, 2011.

43. Wil M. P. van der Aalst, Anton J. M. M. Weijters, and Laura Maruster. Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering, 16(9):1128–1142, September 2004.

44. Boudewijn F. van Dongen. Real-life event logs - hospital log. 10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54, 2011.

45. Boudewijn F. van Dongen. BPI Challenge 2012. 10.4121/uuid:3926db30-f712-4394-aebc-75976070e91f, 2012.

46. Boudewijn F. van Dongen. BPI Challenge 2014. 10.4121/uuid:d5ccb355-ca67-480f-8739-289b9b593aaf, 2014.

47. Boudewijn F. van Dongen, Ana Karla A. de Medeiros, H. M. W. Verbeek, Anton J. M. M. Weijters, and Wil M. P. van der Aalst. The ProM framework: A new era in process mining tool support. In Gianfranco Ciardo and Philippe Darondeau, editors, Applications and Theory of Petri Nets, volume 3536 of Lecture Notes in Computer Science, pages 444–454. Springer Berlin Heidelberg, 2005.

48. Wil M. P. van der Aalst. Decomposing Petri nets for process mining: A generic approach. Distributed and Parallel Databases, 31(4):471–507, 2013.

49. Ingo H. C. Wassink, Paul E. van der Vet, Katy Wolstencroft, Pieter B. T. Neerincx, Marco Roos, Han Rauwerda, and Timo M. Breit. Analysing scientific workflows: Why workflows not only connect web services. In 2009 IEEE Congress on Services, Part I, SERVICES I 2009, Los Angeles, CA, USA, July 6-10, 2009, pages 314–321, 2009.

50. Anton J. M. M. Weijters and Wil M. P. van der Aalst. Rediscovering workflow models from event-based data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162, 2003.

51. Mathias Weske. Business Process Management: Concepts, Languages, Architectures. Springer, 2nd edition, 2012.

52. Alexander Wickert and Anna-Lena Lamprecht. jABCstats: An extensible process library for the empirical analysis of jABC workflows. In Tiziana Margaria and Bernhard Steffen, editors, Leveraging Applications of Formal Methods, Verification and Validation. Specialized Techniques and Applications, volume 8803 of Lecture Notes in Computer Science, pages 449–463. Springer Berlin Heidelberg, 2014.

53. Reng Zeng, Xudong He, Jiafei Li, Zheng Liu, and Wil M. P. van der Aalst. A method to build and analyze scientific workflows from provenance through process mining. In 3rd Workshop on the Theory and Practice of Provenance, TaPP '11, Heraklion, Crete, Greece, June 20-21, 2011, 2011.

