
TECHNISCHE UNIVERSITÄT MÜNCHEN
DEPARTMENT OF INFORMATICS

Master’s Thesis in Information Systems

Automatic Documentation of Results During Online Architectural Meetings

Automatische Dokumentation von Ergebnissen während Architektur-Meetings

Author: Oleksandra Klymenko
Supervisor: Prof. Dr. Florian Matthes
Advisor: Daniel Braun, M.Sc.
Submission Date: 17th of June, 2019


I confirm that this master’s thesis in information systems is my own work and I have documented all sources and material used.

Munich, 17th of June, 2019 Oleksandra Klymenko


Acknowledgments

First and foremost, I would like to thank my advisors, Daniel Braun and Manoj Mahabaleshwar, who were always there for me throughout this journey. Thank you for your extensive help and friendly support, constant encouragement and professional guidance. I cannot imagine having better advisors for my master’s thesis.

I would also like to express my sincere gratitude to Professor Dr. Florian Matthes for the opportunity to write this thesis at his chair for Software Engineering for Business Information Systems (SEBIS) and grow as a research scientist.

I am also thankful to everyone at the department of Architecture Definition and Management at the Corporate Technology unit at Siemens AG with whom I had the pleasure to cooperate at different stages of my research, including Dr. Andreas Biesdorf, Dr. Christoph Brand, Martin Kramer, Peter Bouda and Srishti Dang, as well as all the experts in various departments of the company, who kindly agreed to participate in the conducted case study.

I also want to thank Klym Shumaiev, who introduced the field of software architecture decision-making to me in our earlier research.

Last, but always first in my heart, I want to thank my family. None of this would be possible if it wasn’t for you. Thank you for your continuous support, endless love and giving me the opportunities that brought me to where I am now. I love you.


Abbreviations and Acronyms

NA    Not Available
ASR   Automatic Speech Recognition
NLU   Natural Language Understanding
NLP   Natural Language Processing
NLG   Natural Language Generation
AI    Artificial Intelligence
SVM   Support Vector Machine
ADD   Architectural Design Decision
RDF   Resource Description Framework
NER   Named Entity Recognition
NIF   NLP Interchange Format
LPC   Linear Predictive Coding
RE    Requirements Engineering
UXD   User Experience Design


Abstract

Decision-making is a very important aspect of software development and architecture, and the need to explicitly document design decisions has been emphasized both in research and industry. The goal of this thesis is to develop a system that supports software development teams by automatically documenting the results of their online meetings, focusing on design decisions as the main meeting results. In order to understand the requirements for such a system, we conduct a set of expert interviews with experienced software architects and developers. Our practical approach consists of converting the audio stream of a meeting into text using the Automatic Speech Recognition (ASR) service Speechmatics and detecting decisions in the generated transcript using Natural Language Understanding (NLU). In order to obtain training data for the decision detection model, we analyze 17 architectural meetings, totalling 620 minutes of conversation. We evaluate the performance of our model, describe the challenges and limitations of our research, and outline possible directions for future work.


Contents

Acknowledgments
Abbreviations and Acronyms
Abstract

1. Introduction
   1.1. Motivation
   1.2. Objectives and Research Questions
   1.3. Thesis Structure

2. Theoretical Background
   2.1. Software Architecture
      2.1.1. Architectural design decisions
      2.1.2. Classification of architectural design decisions
   2.2. Natural Language Processing
      2.2.1. Automatic speech recognition
      2.2.2. Natural language understanding

3. Related Work

4. Task Description and Requirements
   4.1. Task Description
      4.1.1. Circuit
      4.1.2. Virtual meeting assistant
      4.1.3. Architecture
      4.1.4. Services
   4.2. Requirements
      4.2.1. Case study design
      4.2.2. Interview process

5. Case Study Findings
   5.1. The Process and Documentation of Online Meetings (RQ1)
      5.1.1. Experience with Circuit
   5.2. Decision-Making Process (RQ2)
      5.2.1. When and where are decisions made?
      5.2.2. Group decisions or single-person decisions?
      5.2.3. Are decisions documented?
   5.3. Requirements for the System (RQ3)
      5.3.1. Information in the summary
      5.3.2. Intrusiveness of the bot
   5.4. Other Suggestions and General Feedback
      5.4.1. Additional use case
      5.4.2. Ideas from the interviewees
      5.4.3. General feedback
   5.5. Validity of the Case Study

6. Data Corpus
   6.1. Data Collection
   6.2. Data Analysis

7. Implementation
   7.1. Automatic Speech Recognition
   7.2. Decision Detection
      7.2.1. Rasa NLU pipeline
      7.2.2. Decision-detection model
   7.3. Concept Extraction
      7.3.1. Linked data
      7.3.2. DBPedia
      7.3.3. DBPedia annotator
   7.4. Report Generation
   7.5. Integration into Circuit

8. Evaluation and Results
   8.1. Speech Recognition Accuracy
   8.2. Model Performance
   8.3. Challenges and Limitations
      8.3.1. Data scarcity
      8.3.2. Quality of speech recognition
      8.3.3. Challenges of spoken language
      8.3.4. Uncertainty expressions
      8.3.5. Referring expressions
      8.3.6. Identifying context and distinguishing decision types

9. Conclusion
   9.1. Summary
   9.2. Future Work
      9.2.1. Model enhancement
      9.2.2. Virtual assistant development

List of Figures
List of Tables
Listings

A. Appendix A: Interview Discussion Guide
   A.1. Warm-up questions (6 minutes)
   A.2. Challenges with the existing systems (6 minutes)
      A.2.1. When the meeting is in progress (4 minutes)
      A.2.2. Post a meeting (2 minutes)
   A.3. User’s perception of working with futuristic concepts (18 minutes)

B. Appendix B: Report Example

Bibliography

1. Introduction

In the first chapter we explain the motivation for our work, define objectives and research questions, and introduce the structure of this master’s thesis.

1.1. Motivation

Constant communication between team members plays a crucial role in maintaining efficient software development work. In order to coordinate their efforts, software architects and developers must have regular discussions on system architecture, plans and processes [1]. While this poses no problems for co-located teams, such coordination continues to be challenging for large and globally distributed software projects [2].

In large distributed teams, most communication happens via online tools such as email and online messaging, while group meetings are mostly held through teleconferencing [3]. The main idea of this master’s thesis is to develop a system that supports software development teams by automatically documenting the results of their online meetings.

We argue that among the main results of such meetings are the decisions that were made during the meeting. Decision-making is a very important aspect of software development and architecture, and the need to explicitly document design decisions has been emphasized both in research and industry [4, 5]. A survey by Tang, Babar, Gorton, and J. Han [6] showed that 74% of respondents forget the reasons behind design decisions, providing further evidence of the importance of documenting design decisions and their rationale. However, manual documentation takes a lot of time and effort, so automatic detection and documentation of design decisions becomes highly advantageous.

Previous research on extraction and documentation of design decisions in the field of software development and architecture has mostly focused on detecting decisions in issue management systems and source code commits [7, 8]. However, many design decisions are also made in regular online meetings between the members of software development teams. According to the expert survey with experienced software architects, developers and team leads performed by Miesbauer and Weinreich [9], the most frequent form of documentation of architectural design decisions is meeting minutes.


Therefore, in this thesis, in cooperation with an industry partner, a large industrial manufacturing company in Europe with more than 379,000 employees in more than 200 countries and regions worldwide, we intend to develop a solution for the automatic documentation of online conversations in the form of meeting minutes, focusing on decision extraction. This gives software development teams the opportunity to refer back to and review their decisions, saves time on manual documentation, and helps them be more process-compliant.

1.2. Objectives and Research Questions

The main objective of this thesis is to develop a system that supports software architects, developers and team leads by automatically documenting the results of their online meetings. In order to do so, we first need to understand the current situation, the challenges these professionals face during online meetings, and how they currently document such meetings. Therefore, the first research question is:

RQ1: How are online meetings between software development professionals held in practice and how are they documented?

We argue that among the main results of such meetings are the architectural design decisions that are made during the meeting. In order to support this statement and get more information on the process of decision making in distributed teams, we formulate the second research question:

RQ2: What is the process of decision-making in distributed software development teams?

Since we are not provided with a specification for a system that would capture the results of architectural meetings, we need to define the requirements that it must fulfil in order to be considered valuable:

RQ3: What are the requirements for a system that automatically documents online architectural meetings?

Finally, for the technical implementation of such a virtual assistant, we need to explore the approaches, tools and technologies that can be used to achieve our goal:

RQ4: How to identify, extract and document design decisions in online architectural meetings?

1.3. Thesis Structure

The remainder of this thesis is structured as follows.

In Section 2 we present relevant theoretical background in the fields of Architectural Design Decisions (ADDs) and Natural Language Processing (NLP).


In Section 3 we discuss some of the previous works related to understanding the process of decision-making in software architecture, detecting decisions in natural-language communication, and summarizing meetings.

In Section 4 we describe the problem statement and the design of the case study which we use for establishing the requirements for the system to be developed.

In Section 5 we discuss the findings from our interviews that provide answers to RQ1, RQ2 and RQ3.

In Section 6 we present our data corpus, explaining the processes of data collection and analysis.

The implementation process, including the descriptions of the used tools and technologies, is described in Section 7, thus providing the answer to RQ4.

The evaluation of our approach to automatic decision detection, as well as its performance results and limitations, is presented in Section 8.

The thesis concludes with a summary and suggestions for future work in Section 9.


2. Theoretical Background

2.1. Software Architecture

There are many definitions of software architecture. The IEEE 1471 standard, which provides definitions of architecture terms, principles and guidelines, defines software architecture as "the fundamental organization of a system embodied in its components, their relationships to each other, and to the environment, and the principles guiding its design and evolution" [10].

Perry and Wolf [11] were among the first researchers to stress the significance of software architecture in the development of complex software systems. The authors observed that despite its importance, software architecture is "underutilized and underdeveloped", and started to address this issue by proposing a software architecture model that defined architectural elements such as data, processing and connection, the properties of and relationships between the elements, and system constraints. Since then, a lot of research has been carried out in the field of software architecture, and it has developed into a fundamental concept, essential for the successful modelling and implementation of complex software systems.

Bosch [5] was the first to propose shifting from the traditional view of software architecture, where the central concepts are the components and connectors of a system, to a new approach that views software architecture as the composition of a set of architectural design decisions, representing design decisions as first-class entities in software architecture. In his paper published in 2004, the author discussed the problems he saw in the traditional approach to software architecture, namely the lack of a first-class representation of design decisions, the cross-cutting and intertwined nature of these design decisions, the high cost of changing existing design decisions, the violation of design rules and constraints from earlier design decisions, and finally the fact that obsolete design decisions are not removed, eventually resulting in erosion and premature retirement of the software system.

Therefore, in this perspective proposed by Bosch, software architecture can be viewed as the "result of the architectural design decisions made over time" [12].


2.1.1. Architectural design decisions

Jansen and Bosch define an architectural design decision as "a description of the set of architectural additions, subtractions and modifications to the software architecture, the rationale, and the design rules, design constraints and additional requirements that (partially) realize one or more requirements on a given architecture" [12], where:

• Rationale refers to the reasons behind an architectural design decision, i.e. why a certain change is made to the software architecture

• Design rules and design constraints are instructions for further design decisions, where rules are mandatory guidelines and constraints are intended to limit the design

• Design constraints are the opposite of rules in that they describe what is not allowed in the design

• Additional requirements that may arise as a result of a design decision have to be addressed by additional design decisions

The authors propose a conceptual model for architectural design decisions, which consists of the following key elements:

• Problem is the main element of the model, which represents the goal that has to be achieved by an architectural design decision

• Motivation describes "why the problem is a problem" and why it is important

• Solutions represent the suggested solutions to the Problem

• Decision represents a solution that has been picked among other alternatives

• Architectural modification is caused by the implementation of the decision

• Context gets changed as a result of Architectural modification

The links between the elements are illustrated in Figure 2.1.

For a typical proposed solution, the authors define the following elements:

• Description defines the proposed solution, including the description of necessary modifications and their rationale

• Design rules of a potential solution define the partial requirements that have to be fulfilled in order to solve the problem


Figure 2.1.: Model for architectural design decisions. Reprinted from [12].

• Design constraints define limitations and constraints imposed on the further design of architectural entities

• Consequences describe the predicted effects of the proposed solution on the architecture

• Pros outline the expected advantages that the proposed solution will bring to the architecture design, as well as its impact on the requirements

• Cons outline the possible negative effects that the proposed solution may have on the architecture design
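To make the shape of this conceptual model concrete, the following minimal Python sketch encodes Jansen and Bosch's elements as data classes. The field names mirror the model above; the types and structure are our own illustrative assumptions, not part of the original model.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Solution:
    description: str                # proposed modifications and their rationale
    design_rules: List[str]         # partial requirements to fulfil
    design_constraints: List[str]   # limitations imposed on further design
    consequences: List[str]         # predicted effects on the architecture
    pros: List[str]                 # expected advantages
    cons: List[str]                 # possible negative effects

@dataclass
class ArchitecturalDesignDecision:
    problem: str                    # the goal the decision has to achieve
    motivation: str                 # why the problem is a problem
    solutions: List[Solution] = field(default_factory=list)  # candidate solutions
    decision: Optional[Solution] = None  # the solution picked among the alternatives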

In the next section we take a look at different kinds of architectural design decisions.

2.1.2. Classification of architectural design decisions

Kruchten proposed an ontology of architectural design decisions for software-intensive systems [13]. The author identified three major classes of design decisions: 1) existence decisions, 2) property decisions and 3) executive decisions. Existence decisions indicate the presence of a certain element or artifact in the design or implementation of a system. This type of decision is further classified into structural and behavioral decisions. Structural decisions refer to the creation of artifacts in a system, whereas behavioral decisions relate to the way the elements of a system interact with each other. As a subtype of the existence decision category, the author also distinguishes ban (or non-existence) decisions, which state that an element will not appear in the design or implementation of a system. Property decisions refer to the general characteristics or quality of the system and describe its traits. For example, design rules, guidelines and constraints are all considered property decisions. Executive decisions do not relate directly to the elements or qualities of a system, but are rather driven by the business environment and affect the development process, people and the organization itself.

Miesbauer and Weinreich [9] conducted an expert survey with software architects, senior developers and team leads from six different Austrian companies to identify the kinds of architectural decisions that are made in practice. The authors collected 22 different categories of architectural design decisions and mapped them to the taxonomy proposed by Kruchten. The results showed that around 70% of all the mentioned decisions were existence decisions (most of them being structural decisions; bans were not mentioned at all) and around 25% were related to technology decisions. According to the authors, these results mean that participants are "still heavily structure- and technology-minded when thinking about architecture". However, the authors were also able to identify additional kinds of decisions. They noticed that most of the time, the participants of the study classified design decisions according to different levels, which the authors aggregated into four main levels: implementation, architecture, project, and organization.

In this thesis, we do not attempt to distinguish between the different decision types that can be extracted from online conversations between software architects; however, as we discuss in section 9.2.1, this is an interesting task for future work.

2.2. Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence (AI) [14] that enables interaction between humans and computers using natural language. NLP uses insights from the field of linguistics in order to make it possible for computers to make sense of human language, whether it is text or speech, and analyze it to derive meaning and determine important parts.

NLP tasks include, but are not limited to, Part-of-Speech (POS) tagging, Named Entity Recognition (NER), machine translation, sentiment analysis, text generation and summarization, speech-to-text and text-to-speech conversion, and question answering. There are three main approaches that are used to solve NLP tasks [15]:

• The rule-based approach is the oldest approach applied to NLP tasks. It is based on developing a set of rules derived from different linguistic structures, usually focusing on pattern-matching and parsing. The rule-based approach suffers from multiple shortcomings, including the challenging and time-consuming process of rule generation, especially in complex domains, the difficulty of generalizing and regularizing rules, and the often unclear interaction of rules in large rule sets. However, rule-based approaches have been proven to work well and are still widely used for many tasks such as tokenization and stemming. Furthermore, rule-based systems are also applied to various NLP tasks when no good training dataset is available that would allow alternative approaches such as machine learning to be used.

• The machine learning approach is based on statistical methods that enable algorithms to analyze a training dataset and produce their own rules and classifiers without being explicitly programmed (a minimal example follows this list). Traditional machine learning methods include probabilistic modeling, likelihood maximization, and linear classifiers. The main advantage of machine learning methods is that, given available datasets, they can be developed quickly without the need for a knowledge expert to manually define each rule. Among the disadvantages of the approach are the common problem of a lack of training data and the difficulty of debugging: unlike in rule-based approaches, where a relevant rule can easily be edited if necessary, with machine learning algorithms it is hard to determine how the model can be adjusted to fix unsatisfactory output.

• Neural networks are algorithms modelled after the network of neurons in the human brain that are designed to recognize patterns. Deep learning, a subset of machine learning, offers powerful techniques for learning in neural networks, employing multiple processing layers to learn hierarchical representations of data. Deep learning models currently produce state-of-the-art results for many NLP problems [16]. The paper by Young, Hazarika, Poria, and Cambria [16] provides an overview of the most recent trends in deep learning for NLP tasks.
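As the minimal example promised above, the following sketch illustrates the machine learning approach in the setting of this thesis: a linear SVM text classifier that labels utterances as decision-related or not. The tiny inline training set is purely hypothetical and far too small for real use; it only shows the mechanics.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical labeled utterances: "decision" vs. anything else.
train_texts = [
    "We will use PostgreSQL as the main database.",
    "Let's go with REST instead of SOAP for the public API.",
    "Can everyone hear me okay?",
    "Let me share my screen first.",
]
train_labels = ["decision", "decision", "other", "other"]

# TF-IDF features feeding a linear SVM, a common text-classification baseline.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_texts, train_labels)

print(model.predict(["We decided to deploy the service on Kubernetes."]))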

NLP can be divided into two main subfields: NLU (Natural Language Understanding) and NLG (Natural Language Generation). NLU, which is sometimes used interchangeably with the broader term NLP, is the subtopic of the field that is responsible for interpreting human language by turning text into structured data that computers can understand. NLG performs the opposite task: it turns structured data into text that humans are used to, thus allowing computers to communicate with us by writing or saying natural language sentences. In this thesis, we focus on the tasks related to NLU.

2.2.1. Automatic speech recognition

Speech is the most natural form of communication between people; therefore, being able to automatically recognize and process the human voice plays an important role in the development of human-computer interaction.

Automatic Speech Recognition (ASR) is aimed at recognizing human speech and converting it into text for further analysis. ASR is a complex multidisciplinary task that requires knowledge from the fields of signal processing, acoustics, linguistics, computer science, pattern recognition, communication and information theory, physiology and psychology [17], and has to cope with challenges like identifying continuous speech in real time, dealing with word ambiguities, noise and speaker variability (including accents and dialects), and psycho-intellectual aspects, such as the specifics of spoken language and human understanding of speech [18].

Research in automatic speech recognition dates back to the 1950s, when Davis, Biddulph, and Balashek proposed a system for recognizing individual digits spoken by a single speaker at a normal speech rate [19]. Over the years, the field has experienced major advances due to the introduction of speech representations based on Linear Predictive Coding (LPC) analysis, cepstral signal analysis, and statistical methods based on hidden Markov models [20]. However, it is only in the last decade, with the emergence of deep learning and increased computing capabilities, that automatic speech recognition has become accurate enough to enter the marketplace and find application in various services.

Most of the current state-of-the-art automatic speech recognizers are proprietary systems developed by technology giants such as Google1, Microsoft2, IBM3 and Amazon4. In 2017, researchers from Microsoft published a revised paper [21] reporting on the evaluation of the performance of their automatic speech recognition system in comparison to professional transcribers. They measured the human error rate, as well as the Word Error Rate (WER) of their system, reporting that it established a new state of the art, achieving WERs of 5.8% and 11.0% on the Switchboard and CallHome subsets respectively, and performed on par with professional transcribers, thus stating that automatic speech recognition has already achieved human performance.

1 https://cloud.google.com/speech-to-text/
2 https://azure.microsoft.com/en-in/services/cognitive-services/speech-to-text/
3 https://www.ibm.com/watson/services/speech-to-text/
4 https://aws.amazon.com/transcribe/
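For reference, WER is the word-level edit distance between the recognizer output and a reference transcript, normalized by the length of the reference. The following self-contained computation is our own illustration (the two sample sentences are made up), not code from any of the cited systems:

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("we will use kafka as the event bus",
                      "we will use a kafka event bus"))  # 0.375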


Later the same year, the IBM team [22, 23] published several papers presenting their efforts in the area of automatic speech recognition, namely a set of acoustic and language modeling techniques that lowered the WER of their speech recognition system to 5.5% and 10.3% on the Switchboard and CallHome subsets respectively. They also performed multiple measurements of the performance of human transcribers on two conversational tasks and found that human performance can be considerably better than what was reported by Microsoft, thus disproving the claim that human parity had been reached for the task of automatic speech recognition.

In 2017, Biran also tested the leading ASR services, namely the Google Cloud Speech API, IBM Watson and the Microsoft Bing Speech API (now replaced by Azure Speech Services), using 1000 samples of audio data and corresponding transcriptions in English and French, analyzing the accuracy and speed of each provider [24]. The tests were performed in the context of a use case for a simple customer-service chatbot. According to the results, Google Cloud Speech demonstrates the most accurate and consistent results, having the lowest difference in performance between the two languages. On the other hand, Microsoft's and IBM's performance for French is less than half as good as for English. Furthermore, while Microsoft shows average results in terms of response time, IBM's response time is too long for real-time usage. The complete accuracy results for all ASR services in terms of exact match of every word are shown in Table 2.1.

Provider      English   French
Google        40.2%     35.1%
IBM Watson    32.5%      8.8%
Microsoft     25.8%     12.0%

Table 2.1.: Comparison of ASR services [24]

Therefore, the author provides further evidence that ASR has not yet reached human performance, concluding that "unless you are an English native speaker with a perfect British accent, the future is not there yet".

In 2018, researchers at Google presented their attention-based model for sequence-to-sequence speech recognition [25], which works by integrating acoustic, pronunciation, and language models into one neural network without requiring a lexicon or a separate text normalization component. The proposed models showed a state-of-the-art result of 5.6% Word Error Rate (WER) on a voice search task and a WER of 4.1% on a dictation task for datasets extracted from Google traffic. In April 2019, Google also presented a data augmentation method for speech recognition called SpecAugment that is applied directly to the feature inputs of a neural network [26]. The results showed that SpecAugment significantly improves the performance of ASR networks, outperforming the previous results of hybrid systems with a WER of 6.8% and 5.8% on the LibriSpeech task without and with the use of a language model respectively.

In this thesis, we use the automatic speech recognition service Speechmatics, which is presented in more detail in section 7.1.

2.2.2. Natural language understanding

Natural Language Understanding (NLU) is a subfield of Natural Language Processing. NLU is defined by the Gartner IT Glossary as "the comprehension by computers of the structure and meaning of human language (e.g., English, Spanish, Japanese), allowing users to interact with the computer using natural sentences" [27]. In other words, as described earlier, it helps computers to understand what people are saying.

Figure 2.2 shows some of the tasks of NLU as a subset of NLP problems.

NLU is applied in many widely used tools and services such as personal voice assistants (e.g. Siri or Alexa), natural-language search (e.g. Google), natural-language translators (e.g. Google Translate) and many others. One of the most common applications of NLU is chatbots.

Currently, there exists a wide range of NLU services, and choosing the right one can be a complicated task. Braun, Hernandez-Mendez, Matthes, and Langen [29] conducted an evaluation of the four most popular NLU services at the time of writing: LUIS1, Watson Conversation (now Watson Assistant2), API.ai (now Dialogflow3), and RASA4. In the overall results of the study, LUIS performed best with an F-score of 0.916, RASA showed the second best result with an F-score of 0.821, followed by Watson Conversation and API.ai with F-scores of 0.752 and 0.687 respectively. Although Microsoft's LUIS showed the best performance, the open source RASA was able to achieve similar results. The authors point out that given the advantages of open source products such as adaptability, after some customization it might be possible to achieve even better results with RASA. Moreover, the paper accurately pointed out that the services, and therefore their performance, will change over time; indeed, in less than two years there have been major changes in all of the evaluated NLU tools, including their names. Among the four tools, Rasa may well have experienced the biggest rise since then. In December 2017, the Rasa team introduced two separate tools, Rasa NLU and Rasa Core, for the tasks of natural language understanding and dialog management respectively, aiming to "bridge the gap between research and application" and make the technologies accessible to non-specialist software developers [30]. Since then, the Rasa open source community has been growing at an ever-increasing speed, and Rasa products have established themselves as among the definite leaders in NLP tooling.

1 https://www.luis.ai/home
2 https://www.ibm.com/cloud/watson-assistant/
3 https://dialogflow.com/
4 https://www.rasa.com/docs/nlu/

Figure 2.2.: NLP and NLU Tasks. Adapted from [28].

In a more recent paper from March 2019 [31], Liu, Eshghi, Swietojanski, and Rieser also conducted a comparative evaluation of the popular NLU services, including Rasa, Dialogflow, LUIS and Watson, performing separate significance tests for intent and entity detection. The results showed that Watson performed best in terms of F1 score for the task of intent detection (0.882), while there was no significant difference between Dialogflow, LUIS and Rasa (0.864, 0.855 and 0.863 respectively). However, for entities, Watson achieved a significantly lower F1 score (0.488), while the other tools again performed equally well (0.743, 0.777 and 0.768 for Dialogflow, LUIS and Rasa respectively). According to the combined overall scores, Rasa showed the best performance in terms of precision (0.862) and F-score (0.822), while the best recall was achieved by Watson (0.838). The complete results for intent and entity type classification obtained in the study are shown in Table 2.2. The presented results are calculated by summing up the individual TP, FP, and FN scores of all intent and entity classes over 10-fold cross-validation. The overall performance of the NLU services is presented in Table 2.3.
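Summing the per-class counts before computing the metrics corresponds to the standard micro-averaged scores (a textbook formulation, not taken from [31]):

\[
P_{\mathit{micro}} = \frac{\sum_c \mathit{TP}_c}{\sum_c (\mathit{TP}_c + \mathit{FP}_c)}, \qquad
R_{\mathit{micro}} = \frac{\sum_c \mathit{TP}_c}{\sum_c (\mathit{TP}_c + \mathit{FN}_c)}, \qquad
F_1 = \frac{2\,P_{\mathit{micro}}\,R_{\mathit{micro}}}{P_{\mathit{micro}} + R_{\mathit{micro}}}
\]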

              Intent                        Entity
            Precision   Recall   F1       Precision   Recall   F1
Rasa          0.863     0.863   0.863      0.859      0.694   0.768
Dialogflow    0.870     0.859   0.864      0.782      0.709   0.743
LUIS          0.855     0.855   0.855      0.837      0.725   0.777
Watson        0.884     0.881   0.882      0.354      0.787   0.488

Table 2.2.: Overall scores for intent and entity [31]

            Precision   Recall   F1
Rasa          0.862     0.787   0.822
Dialogflow    0.832     0.791   0.811
LUIS          0.848     0.796   0.821
Watson        0.540     0.838   0.657

Table 2.3.: Combined overall scores [31]

Based on the above evaluations, in this thesis we have decided to use the open source NLU service Rasa, due to its high performance, customizability, and the further advantages outlined in section 7.2.
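For orientation, the sketch below shows how an intent-style classifier could be trained and queried with the rasa_nlu Python API of that period (version 0.x). The file names, pipeline choice, and example utterance are our own assumptions, and later Rasa releases changed this API.

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# Hypothetical paths: utterances labeled e.g. "decision"/"other",
# and a pipeline configuration (e.g. an SVM-based pipeline).
training_data = load_data("data/nlu_train.md")
trainer = Trainer(config.load("nlu_config.yml"))
interpreter = trainer.train(training_data)

result = interpreter.parse("So we agree to switch the message bus to Kafka.")
print(result["intent"])  # e.g. {'name': 'decision', 'confidence': ...}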


3. Related Work

This chapter describes some of the other works that concern the detection of decisions and action items in natural-language communication, the summarization of meetings, understanding the process of decision-making in software architecture, and the extraction of architectural design decisions.

Most of the previous research concerning the analysis of spoken meetings has used the AMI Meeting Corpus1 as training data. The AMI Meeting Corpus consists of 100 hours of meeting recordings that were recorded in English, mostly by non-native speakers, in three separate rooms with different acoustic settings. The corpus also offers manual transcripts with detailed annotations, including word-level timings, dialogue acts, named entities, topic segmentation, extractive and abstractive summaries, hand and head gestures, gaze direction, emotional states and movements around the room. Around two-thirds of the meetings in the corpus are simulated, i.e. according to a given scenario, participants play different roles in a fabricated design project, imitating the lifecycle of the project from its kick-off to completion over the day. The participants’ roles include project manager, marketing expert, interface designer and industrial designer. Due to economic and logistical difficulties, the participating people are neither professionally trained nor experienced in the role they are playing. Non-scenario meetings include students and professional colleagues from various fields discussing different topics, ranging from speech research and astronomy issues to the selection of films to show at a fictional movie club [32].

1 http://groups.inf.ed.ac.uk/ami/corpus/

Hsueh and Moore were among the first to address the problem of automatic detection of decisions in conversational speech. In 2007, they conducted two studies [33, 34] aiming to develop models for the automatic detection of decision segments in audio recordings using a set of 50 scenario-driven meetings from the AMI Corpus. The authors propose a model that detects decision-related dialogue acts and show that in order for the model to achieve higher precision, it has to combine all the available features extracted from different knowledge sources such as lexical, prosodic, dialog act related and topical class.

Murray and Renals also used the AMI Meeting Corpus for analyzing meetings; however, their work was devoted to the detection of action items in spontaneous meeting speech [35]. The performed analysis of the meetings showed that dialog acts that contain action items are usually longer in duration, have more words, and have a longer pause before them and a shorter pause afterwards. Furthermore, it was found that action items tend to be spoken by people who are in general more dominant throughout the meeting. The supervised approach proposed by the authors incorporates these prosodic, lexical and structural features, detecting action items with a high degree of accuracy.

Purver, Dowding, Niekrasz, et al. have also approached the task of detecting action items in conversational speech [36] and successfully applied their findings to the task of decision detection in their later work [37]. In their work on decision detection, they proposed a decision annotation scheme that takes into account the different roles that utterances play in the decision-making process, distinguishing between three main decision dialog act classes: issue, resolution and agreement. The results of their research showed that such an approach not only allows more detailed information to be extracted, but also outperforms decision detection systems that are based on flat annotations.

In a different work featuring the same authors [38], the researchers identified relevant phrases to summarize decisions in spoken meetings and compared two different approaches for the identification and summarization of decisions made in meetings: a parse-based approach and a word-based approach. The results showed that while the parse-based approach results in higher precision, the word-based approach yields higher recall and F-score.

Wang and Cardie also addressed the problem of summarizing decisions in spoken meetings by producing decision abstracts for each decision made in the meeting [39]. The authors experiment with token-level and dialogue act (DA) level approaches to automatic summarization using unsupervised and supervised learning frameworks. For each meeting used, they compared the automatically generated summaries to the manually generated decision abstracts and evaluated performance using the ROUGE-1 [40] text summarization evaluation metric. The results showed that token-level summaries that use discourse context can outperform DA-level summarization when true clusterings of decision-related dialogue acts are available. Otherwise, DA-level summarization methods show better performance.

Closely related to our work is a system for meeting recognition and understanding presented by Tur, Stolcke, Voss, et al., which the authors call the CALO Meeting Assistant [41]. The system, which is part of a larger personal assistant system called CALO1, captures, annotates, and performs automatic transcription and semantic analysis of multiparty meetings. The speech understanding component implemented in CALO includes dialog act segmentation and tagging, topic identification and segmentation, question-answer pair identification, action item recognition, decision extraction, and summarization. The authors emphasize the potential improvements in human productivity that such an assistant can bring to many professional environments and outline further meeting information that can be processed to improve their system, such as topics, participants and action items.

1 http://www.ai.sri.com/project/CALO

The works described above are mostly based on the AMI corpus and deal with the identification of decisions in non-technical meeting scenarios. However, decisions made in the field of software architecture and engineering are very specific, with many typical terms, concepts and characteristics that significantly distinguish them from decisions made in other fields. Previous research that concerned detecting architectural decisions specifically has mostly focused on detecting them in issue management systems and source code commits.

In the work of Ven and Bosch [7], the authors explore the hypothesis that architectural decisions can be derived from open source projects in version management systems, namely from system commits. They presented 100 pre-chosen commits taken from the Gemfile (a file describing components and dependencies for programs written in Ruby) to six experienced Ruby software developers, software architects and software architecture researchers, asking them to provide feedback on whether the commits contained design decisions, the rationale for a decision, and relevant information about alternatives to a decision. The results showed that at least 60% of the commits on Gemfiles describe a design decision, thus demonstrating that project commit messages contain decisions and that these can be automatically derived using the approach presented by the authors.

Bhat, Shumaiev, Biesdorf, et al. [8] propose a machine learning approach for the automatic extraction of design decisions from issue management systems. The authors analyzed and labeled more than 1,500 issues from two large open source repositories and used this dataset to generate a machine learning model, applying different classifiers under different configurations. With the use of linear Support Vector Machines (SVM), which outperformed other classifiers such as logistic regression, one-vs-rest, decision trees and naive Bayes, they managed to automatically extract design decisions from issues with an outstanding accuracy of 91.29%.

In the work of Pedraza-García, Astudillo, and Correal, the authors studied how software architects make design decisions in design meetings and proposed a technique for identifying design decisions in such meetings, called Design Verbal Interventions Analysis (DVIA) [42]. Using a set of manually transcribed meetings, they divided them into small units called interventions, which are codified, classified, allocated to a decision topic, and finally mapped to a decision element according to the proposed decision-making model. The results of their study provide empirical evidence of the possibility of identifying design decisions from speech recordings, as well as describing the specifics of the decision-making process for software architects.


In this thesis, we will continue the research on the detection of architectural design decisions, attempting to extract them from online conversational meetings that are held by software development teams.


4. Task Description and Requirements

In this chapter we formulate the task that has to be performed in the scope of this thesis and describe the process that was carried out to obtain the exact requirements for our solution.

4.1. Task Description

The practical task of this thesis is to implement a solution for the automatic extraction of decisions made during online meetings of software development teams. The decisions, and possibly other relevant meeting information, have to be presented in the form of a PDF document and sent to the participants at the end of the conversation. The exact requirements for what has to be included in the meeting report are to be defined through a set of interviews with software development and architecture experts who participate in online meetings on a daily basis. The proposed solution has to be integrated into the existing system that is being developed by the industry partner in the scope of a bigger project aimed at developing a virtual meeting assistant (i.e. a bot) supporting teams in their online communication via software called Circuit.

4.1.1. Circuit

Circuit1 is a tool for online communication and collaboration that supports audio and video calls, screen sharing, messaging and content sharing. It can also be further extended and customized through various extensions, as well as the public API [43], which can be used to integrate other services.

1 https://www.circuit.com/

4.1.2. Virtual meeting assistant

The existing system, which has to be extended in this thesis, serves as a virtual meeting assistant for participants of online meetings held in Circuit. The system collects audio from a Circuit conversation in which the bot is present as a participant, transcribes it using an Automatic Speech Recognition technology and, once the call is finished, sends a text file with the transcription to the chat of the conversation. Therefore, in this thesis, we have to take the result of this previously implemented functionality, i.e. the generated transcription of the call, and analyze it using Natural Language Understanding tools to extract decisions and other important information.

4.1.3. Architecture

The existing system follows an event-driven architecture, in which services publish event notifications to and consume them from a message bus and act accordingly. The services do not know about other services and only react to event notifications. As the event bus, the distributed streaming platform Apache Kafka1 is used. Kafka offers the possibility to publish and subscribe to streams of records, as well as queuing, consumption of past events, and scalability features. The output events of one component can be consumed by more than one component. For example, the audio is consumed by the automatic speech recognizer as well as by the audio-to-file recorder.

1 https://kafka.apache.org/

The overall architecture of the existing system can be seen in Figure 4.1. The audio stream is collected from Circuit and passed to the audio and event collector (1). The recording state and audio then go to a websocket endpoint (2), which publishes the received messages to the corresponding topic in Kafka (3). Audio data from Kafka then goes to the voice-to-text module (4), where it is processed by the automatic speech recognizer (5, 6), and the corresponding transcript is published to Kafka (7). When the transcription is complete and a corresponding message is published to the Kafka topic, the chat message generator service, which is subscribed to this topic, gets the transcription text from the topic message and sends it to the Circuit chat in a text file (9). Alternatively, for test purposes, the existing system provides the possibility to perform this process with a pre-recorded audio file instead of the audio stream from Circuit.

Figure 4.1.: Overall architecture
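To illustrate this publish/subscribe style, the following minimal sketch shows what one such service could look like with the kafka-python client. The topic names, the JSON message format, and the detect_decisions() helper are our own hypothetical placeholders, not the actual configuration of the system described above.

import json

from kafka import KafkaConsumer, KafkaProducer

# Consume finished transcripts and publish detected decisions back to the bus.
consumer = KafkaConsumer(
    "meeting.transcripts",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

def detect_decisions(text):
    """Placeholder for the NLU-based decision detector."""
    return [s for s in text.split(".") if "decide" in s.lower()]

for message in consumer:  # the service only reacts to transcript events
    for decision in detect_decisions(message.value["text"]):
        producer.send("meeting.decisions", {"decision": decision.strip()})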

4.1.4. Services

The process described above is performed by five main services that are implemented in the system:

• circuit-event-collector: Takes care of recording audio from a conversation where the bot is included and publishes the recording state and audio to a websocket endpoint.



• ws-2-kafka: A websocket server which publishes every message it receives to the right topic in Kafka.

• circuit-msg-sender: Consumes messages with the complete transcription from a Kafka topic, extracts the transcription text, and posts it to Circuit as a text file.

• kafka-2-audio: Assembles a file from the audio chunk messages which have previously been published to Kafka.

• kafka-asr: Consumes audio data from Kafka, processes it with ASR, and publishes transcripts to Kafka.


4.2. Requirements

In order to get feedback on the initial idea and understand the requirements for a virtual meeting assistant that would capture the results of architectural meetings, we designed and conducted an exploratory case study involving 10 software architecture professionals who participate in online meetings on a daily basis.

4.2.1. Case study design

A case study is a qualitative analysis method that can be applied to gain real-life knowledge and insights into processes and to generate new ideas. An important aspect of case studies is that they can be done remotely, which is a crucial factor when participants are distributed around the world.

Since there were no requirements provided for the system, it was decided to conduct a case study to understand the processes and challenges in online meetings, introduce the idea of a virtual meeting assistant and, finally, explore the requirements that it must fulfil in order to be considered valuable.

To collect the data necessary for the investigation, in cooperation with the User Experience Design (UXD) and Requirements Engineering (RE) departments of the industry partner, we designed a semi-structured interview with 24 questions, divided into 5 sections: (1) questions about expertise and team organization, (2) questions about the decision-making process in the team, (3) questions about current challenges during online meetings, (4) suggestions and preferences for the proposed virtual assistant, and (5) personal feedback on the idea. In order to avoid possible influence on the experts’ answers, the question catalog was not provided to them in advance. The planned time of the interviews was 30 minutes.

4.2.2. Interview process

The interviews were conducted remotely using the online telecommunications application Circuit, in the period from 9th of December, 2018 until 21st of January, 2019. The average duration of the interviews was 35 minutes; the shortest interview took 21 minutes and the longest 55 minutes. This excluded the time for the basic introduction provided by the interviewers at the beginning of the call to explain the objective of the interview to the interviewees and obtain permission to record the interview. The question catalog, presented in Appendix A, ensured that all important questions were answered; however, the semi-structured type of the interviews kept the conversations very natural and open, without restricting the participants from freely talking about what they considered relevant.


The criteria for selecting the interviewees were that they work as software architects or developers and are involved in decision-making processes in their teams. Multiple software architecture experts from a big multinational company were contacted via e-mail and invited to an interview, out of whom 10 were available and willing to participate in the case study. In total, we therefore interviewed 10 experts, 9 males and 1 female. Although all interviewees are employees of the same company, they represent different projects and teams. The interviewed experts have more than 12 years of professional experience on average and almost all are members of big, geographically dispersed teams. Detailed information about the participants of the interviews can be seen in Table 4.1.

No.  Position                         Years of     Team   Team
                                      Experience   size1  distribution

1    Software Architect               1            10     India and Germany
2    Senior Software Architect        20           50-60  Different cities in Germany
3    Senior Software Architect        22           30     Germany, Austria, China,
                                                          India, US and Switzerland
4    Senior Software Architect        20           25     Different cities in Germany
5    Software Architect Consultant    7            7      Germany and Switzerland
6    Senior Software Architect        10           11     Germany, India and China
7    Senior Software Architect        19           30     Different cities in Germany
8    Head of Architecture Department  22           14     Different cities in Germany
9    Software Architect Consultant    6            15     Different cities in Germany
10   Software Engineer, Researcher    2            1      Munich
     in the field of Software
     Architecture

Table 4.1.: Interviewees

1 Number of people in the main team. Many interviewees are members of several teams.


Neither the questions nor the approach changed throughout the course of the interviews; however, slight variations in the order or the wording of the questions did occur. The questions were not provided to the interviewees in advance or during the interview. All interviews were recorded for the purpose of further analysis.


5. Case Study Findings

In this chapter we summarize the feedback that was obtained during the expert interviews and describe our findings. We also discuss the validity of our results.

The information gathered through the case study was primarily analyzed by means of a detailed analysis of the interview transcripts, which were manually transcribed after all interviews were conducted. The transcripts were printed out and analyzed question by question through color-coding and summarizing the main findings for each of the discussed topics. We structured the results of our case study into four sections according to the research questions stated in section 1.2.

5.1. The Process and Documentation of Online Meetings (RQ1)

5.1.1. Experience with Circuit

The aim of the first part of our interviews was to find out more about the user experience and current challenges that people face with the communication software Circuit, into which the automatic summarizer was to be integrated.

Circuit is a standard tool, used throughout the entire organization of the industry partner, at which all the interviewed software architects and developers are employed. According to their answers, the interviewees seem to be used to Circuit and the majority is quite satisfied with it, although some still have "mixed feelings". As the interviewed users shared, in general Circuit shows good performance, the connection is stable and basic functionality such as audio communication and screen sharing works well. The interviewees admitted that, since Circuit is still a developing product, there were problems in the beginning, including a few critical areas. Some functions were difficult to use, but over the last years Circuit has shown significant improvement and many technical issues, such as problems with screen sharing, have been solved. However, some problems still remain. While some of them, like problems caused by the hardware or the network quality, are not problems of Circuit itself, some are directly related to the specifics of the Circuit implementation. One of the most frequent complaints concerned the confusing structure of the chat list implemented in Circuit. The problem is that the overview of the chats is "polluted" by the meeting invitations, which makes it difficult to distinguish whether the chat in the


list is actually someone directly addressing the person or just one of the meeting invitations. As a result, many important messages are missed by the recipient. Another feature that people are lacking is the possibility to control the screen of a colleague during the meeting, for cases when several people want to interact with the screen at the same time.

An important part of the feedback that we received from several people when talking about their experience with Circuit is that no matter how good online communication software is, it is still not the same as having a face-to-face meeting, mostly due to the loss of nonverbal cues. Below are quotes from two of our interviewees, who were asked to provide their general feedback on using Circuit for online communication with their team members:

"Yeah, I mean it’s still different being in a same meeting room than [in a] meeting overCircuit, so you do not see the people, you do not see what mood the people are really in, so wecan only hear what they are saying and of course you lose some information".

"It’s better than having nothing. But the thing is that with this kind of communicationalso today you miss a lot of information. From my experience, a lot of communication is donewith non-verbal way, so not speaking words, so face, gestural things and whatever. And thisis lost by such things like Circuit. But you can’t make every meeting as a one-to-one meeting,on-site meeting. So, you have to decide where it makes sense and where not, so Circuit fulfillsthe minimal needs".

So, as one of the interviewees summarized it: "It's getting better and better but it's not perfect".

5.2. Decision-Making Process (RQ2)

After asking the interviewees about their experience with Circuit, we introduced them to our idea of providing an automatic summary at the end of the meeting with all the decisions that were made in it. In order to better understand how and when decisions are usually made, we asked the experts to tell us a little bit about the process of decision making in their teams.

5.2.1. When and where are decisions made?

Since the teams in which our interviewees work are mostly very geographically distributed, most of the communication is done online, via email and Circuit. As


the essence of our idea was to detect decisions in online meetings, we asked the interviewees whether their team decisions are actually made in Circuit or whether they are only discussed there but finalized somewhere else. We were able to gather evidence that it is in fact common practice to finalize decisions in Circuit; for most of the interviewees, this is the case in their teams. The quotes below are the answers of some of our interviewees to the question of where decisions are made in their team:

"In the Circuit meeting. So, when everybody is discussing and has the chance to disagree".

"Usually it’s finalized over Circuit, so for the projects I’m working in, it’s maybe a special casebecause I’m working in Munich and the rest of my team is working elsewhere, so I don’t have a di-rect contact to them, I cannot talk to them like in coffee breaks and then we make a decision there".

As the author of the last quote mentions, his team is distributed and thus does not have the possibility to meet face-to-face in a meeting or during a coffee break; therefore, the team has no other option but to make its decisions online, via online communication tools. However, for teams that are at least partly located in the same building, it is often the case that many important decisions are actually made not in the official meetings, but rather spontaneously in more informal situations. One of the interviewed experts made a strong case for this:

"My gut feeling is that most decisions are not done in meetings, they are announced inmeetings. The most decisions are done in a coffee corner, on a way to lunch, on whatever butnot in a meeting. I’m really convinced by this".

The same was mentioned by another interviewee, who thinks that even though decisions may be discussed in Circuit, the final decision is nevertheless made in person:

"There are lot of situations when we discuss something and I think we try to come to asolution or discussion but then two hours later I hear "So after our meeting we went to drink acoffee and then we decided to go another way". Therefore, I think most of the decisions are madein person".

5.2.2. Group decisions or single-person decisions?

In order to get more insights into how decisions are usually made and who is more likely to express them during an online meeting, we asked our interviewees how it works in their teams: whether decisions are made collaboratively or whether there is one responsible person who usually makes the final decision. The majority of the participants claimed


that most of the decisions made in their teams are group decisions. Usually, there are multiple experts and stakeholders that need to be involved in a decision, who present their arguments during the discussion process and usually reach a consensus decision in the end. However, there are also situations when some of the team members might not be involved in a specific problem or know little about a certain topic; then the issue can be discussed only with the lead architect and the decision is also made just with him. In rare cases, it can also happen that the chief architect or the project owner makes a certain decision individually and just informs the rest of the team. So, although most of the decisions are team decisions, both decision-making scenarios occur.

5.2.3. Are decisions documented?

We then asked the interviewed experts if and how the decisions their team makes are documented. The interviewees claimed that, at least for the important decisions, they try to keep some sort of a decision log. Depending on the team, decisions are usually documented either in Confluence, Wiki, Jira or in slides. It also happens that after certain decisions are verbally made in Circuit, someone writes them into the meeting minutes and sends them via email to the participants of the meeting and other involved people. However, we observed that many teams do not have an official procedure for documenting their meetings and decisions. As one of the interviewed experts with extensive professional experience shared with us, this is the case for most of the teams: "There are at least 50 to 60 or even 70 percent of the meetings where you don't have this discipline for creating such a proper summary at the end and if there would be an easier way to create these summaries, it would be more used and in this case, it would bring benefit, that's what I think".

5.3. Requirements for the System (RQ3)

5.3.1. Information in the summary

Next, we described our use case to the interviewees: providing summaries of the meetings which would include automatically detected decisions and other useful information.

We asked the interviewed experts what kind of information (apart from decisions) they think would be useful to include in the summary and what they would personally want to see in such a summary. Below we list the answers that were given by different interviewees:

• Action items / TODOs


• Decisions

• The person to whom a task has been assigned

• Who brought something up

• The person who made the decision

• Deadline

• Open topics (things that need follow-up)

• Catch words (Keywords)

• Come-in / Drop-out times

• Information / News (e.g. news from management)

• Participants’ telephone numbers

In the scope of this thesis, we decided to include in the report the information that was most often identified by the interviewees as important, namely decisions / action items, the related person, the deadline and keywords.

5.3.2. Intrusiveness of the bot

We explained to the interviewees our idea of how we would confirm and refine automatically detected decisions: whenever a potential decision is detected, the bot would say something like "It seems that you've made this decision, do you confirm this?" or "Would you want to provide some extra reasoning?" etc. We asked the interviewees when they think the bot should ask these questions: as soon as a potential decision is detected, even though it might be just the beginning of the conversation, or towards the end of the discussion, going through all the decisions that were presumably made during it.

Some interviewees said that it would be better to ask straight away, while the team is still on the subject, because usually multiple topics are discussed during a typical meeting and it might be difficult to remember the decisions of the first topics after a long discussion on the other subjects. As one of the experts formulated it: "After we finish this discussion [on a certain topic], then it would make sense to have this bot interaction and then start the next topic, that's what I think". One of the interviewees also mentioned that in his opinion confirming the decision as soon as it is expressed is the better way of moderating the meeting: "I think as a moderator, if I was a moderator of the meeting, I would ask immediately after the decision, so "Do we now have a decision?", so that I can document


this. And probably the bot should behave similar or in the same way, so ask immediately, confirm immediately after a decision was taken that there was a decision".

On the other hand, some interviewees were of a different opinion, fearing that asking questions in the middle of the meeting would distract people. In this connection, people also emphasized that it is very important that the bot is simple to interact with and does not require a lot of time and effort from the participants of an online meeting: "Yeah, I mean I think it's likely to be a little bit distracting... It shouldn't be too intrusive, I think it happens a lot during such online meetings that people sort of only half pay attention to the meeting, while also doing something else, I think that's very often a problem and I think a chatbot there would make that problem worse. It really depends how intrusive it is and how complex it is to interact with".

So, both options have their pros and cons and different people prefer different approaches, as mentioned by one of the interviewees: "Both have advantages and disadvantages and it's hard to judge before. I even would not say that all people like the same idea. I mean it could happen that some people prefer the immediate interruption, others would like the collected list at the end". However, one of the experts suggested a combination of both approaches. The proposed idea is that as soon as a potential decision is detected, the participants receive a notification from the bot, asking for a short confirmation (yes/no) of whether the identified sentence is in fact a decision. The approved decisions would be maintained in a list, and at the end of the meeting the participants would be able to edit or provide additional information for each of the decisions. In our opinion, this is an optimal combination, since it will not cause much disturbance during the meeting, but at the same time people will know straight away that the decision was recorded by the system, which will make them more confident in relying on the bot. As one of the experts mentioned: "If there is a process implemented where you get immediate feedback about detected decision, then you have already some confidence that this information is not lost, because the bot already asked you the question and you replied".

5.4. Other Suggestions and General Feedback

5.4.1. Additional use case

Apart from the main use case presented in this thesis, for future work purposes we also introduced to the interviewed experts another scenario in which an assistive bot could help the participants of a meeting, namely providing them with suggestions and recommendations related to the topic being discussed. For example, if a certain technology is being discussed, it can provide short textual information about the technology from Wikipedia, or fetch relevant documents from the user's computer.

First of all, we asked people what they currently do in cases when something is


unclear or they require additional information. For most of the interviewees, naturally, the first thing to do in such a case is simply to "pause" the discussion and ask a question. In case none of the participants can provide an answer straight away, sometimes people do a Google search or search for a certain email in their Outlook or a document in the file structure that might contain the necessary information. In order to avoid situations when everybody is searching individually and the meeting thus gets "disconnected", people often perform such a search via screen sharing, where one person is browsing and the others are helping. However, people mentioned that they usually search for additional information only in cases when they know that it will not take much time. In case it takes too long to find the additional information, teams usually decide to take the issue offline and discuss the question in a separate, follow-up meeting.

We then asked our interviewees if they face any challenges while trying to find the missing information manually during a meeting. Based on their answers, in the case when people need some additional internal information that they have seen somewhere before, one of the first challenges they usually face is trying to remember where exactly they can find it. Thus, to the question of what the main difficulties in the manual search for information during a meeting are, one of the interviewees answered that for him, it is his memory: "Yeah, my memory of course because sometimes I don't know whether it was email or browser or something [where I saw the information]. I have no means to search across all the media that might contain the information that I'm looking for". However, probably the main challenge, mentioned by most of the interviewees, is trying to still follow the conversation while searching for the information. As one of the interviewees put it: "Of course, the discussion in the meeting is ongoing while I concentrate on searching, so I have kind of a problem of splitting my brain in two parts: one searching, one following the discussion".

Finally, after discussing the process and difficulties connected to the manual search for information during online meetings, we asked the experts if they would like to receive automatic recommendations and information suggestions from the bot during their meetings. The interviewees had different opinions on this use case. While some see it as potentially helpful and a good idea, many fear that such a functionality could be more annoying and disturbing than helpful. Quite a few of the interviewees were rather pessimistic about the idea, not being sure how it would be possible to make sure that the bot provides the information when it is really needed and does not just "spam" with information that everyone already knows anyway. As one of the interviewees formulated it: "The question is, would the bot somehow detect that the people are now really requiring information? If we are now talking about a topic where we both know what we are talking about, we would probably not want to get let's say spammed by additional information by this bot. If the bot is really smart enough to detect that now people are looking for something and now they would start googling around or something


like that, then it could be helpful". One of the experts also pointed out that in most cases it seems to be easier just to quickly search for the information personally: "If it's something I do not have any idea about, I'm not sure if it can really help... And for things that I can really quickly find a solution within the meeting, I'm not sure if just googling is just enough for me". The key point, mentioned by everyone, is that it is crucial that the recommendations the bot provides are truly helpful and are not just some "same fancy sentences" that would only distract. Below are some further answers from the experts, who were asked to share their opinion on such a use case:

"Of course, it depends on the quality of the recommendations you get. First of all, it gen-erates some noise at first, so you get some information in the chat stream you maybe did not askfor, so it might be annoying if the information is not really helpful".

"So, I think it could be useful if the recommendations are useful. That’s I think the mostimportant part. [...] So, I think really the most important thing for such assistant is that therecommendations are really really good and helpful. So that everybody is excited about that thereis a new recommendation and not "Awww, not again".

5.4.2. Ideas from the interviewees

We also asked the interviewees whether they have any other ideas on where such an assistive bot could be helpful to them, for example, in helping with some of the problems or difficulties they might have had in the past during online meetings. Although most of the participants did not have anything in mind that they could share straight away, several interviewees had some ideas.

Since the interviews were semi-structured and the order of the questions sometimes changed, some of the interviewees were asked the question about their personal ideas before we introduced our pre-defined use cases. It was interesting to see that some of the experts' proposals happened to correspond to our ideas, which served as additional proof of the relevance of our work. Thus, one of the interviewees responded that he would like to have assistance in the area of collecting meeting minutes, action items and decisions. Another expert also mentioned the possible usefulness of being provided with additional information related to the topic being discussed during an online meeting: "Sometimes, when people are talking about certain technologies or methodologies, then you start immediately googling it. Maybe, that could be something which would be interesting, if he [bot] knows that it's a certain special term or a certain product, then he could provide some links to it, so you don't have to do it manually".


5.4.3. General feedback

Finally, we asked the interviewees about their general impression: whether they think an integration of automatic meeting summaries into their online conversations would be useful, whether it would bring benefits to them and their teams and possibly solve some of the problems they may have had before.

We received a lot of positive feedback on the practical relevance and potential usefulness of such a system. Most of the interviewees agree that it would be very convenient to get an automatic, independent summary of the meeting and that it would help their teams to be more process-compliant. As one of the experts stated, it would be especially relevant for teams that do not have a strictly defined process for documenting their meetings: "I mean I can imagine it will bring benefits, especially for teams who don't have that much of communication discipline. I mean in some projects this already works quite well because you have good project managers who are used to have such kind of summaries and they create the summaries usually at the end. But there are at least 50 to 60 or even 70 percent of the meetings where you don't have this discipline to create such a proper summary at the end." This statement was also confirmed by the rest of the interviews. Based on the interviewees' answers, it is clear that currently in most of their teams there is no common way of noting down decisions, action items and other important information mentioned during online conversations. If there is a project manager, scrum master or other responsible person participating in the call, he or she may take notes; otherwise everyone just takes their own notes if necessary, and sometimes someone might volunteer to send them across to other participants after the meeting. One of the interviewees described this as follows: "I would say clearly it would help us in being more process-compliant, document our stuff better. It's like this typical meeting hygiene that you have to take, so you have meeting, you need a summary, you need TODOs, you need action items, you need decisions out of it and often it's just not done. And if you have it for every meeting without major effort, this would be a huge step forward I would say". Another expert had a very similar opinion, also pointing out that the introduction of such a system would bring another important advantage: people would get used to the fact that there are notes that can always be referred to if necessary: "So, it [the process of documenting meetings] is quite diverse and if you could streamline this to a commonly agreed upon way of doing this and this even happens automatically, so I do not have to spend time on it myself or at least spend less time on taking notes myself, then it would be helpful in a ways of saving time and also making people get used to the fact that there are notes".

Among the ten interviewed experts, there was only one person who said that their team has a very organized procedure for taking minutes of the meetings and documenting the decisions afterwards. However, even though the interviewee does not feel a need for such functionality for their team meetings, the person agreed that


for bigger meetings it can be useful: "Yeah, I like this. I mean meetings that I run, it feels like are well taken care of, but sometimes in bigger meetings where there is like the whole department or some other discussion or something and I'm just participating or listening and then sometimes it feels like not everybody feels responsible kind of and some information might get lost, so I think there it would be fantastic if something like this would be offered".

In order to understand whether people would be willing to use this functionality and trust the provided information, we asked the interviewees whether they would rely on automatically generated summaries or whether they would continue making their own notes. Most answered that it would depend on the quality of the summaries. If the system proved to work well and to be incorrect only rarely, it would definitely be used. As one of the experts said: "If I have confidence over time that this is exactly what I would have written down [myself], this is the information needed by everybody, I would get used to it so much that I would simply trust it". Another interviewee also mentioned that in his opinion it is very important that the quality of the generated summary is good from the very beginning; otherwise people who try it once and are not satisfied with the result will be reluctant to use it again later, even if the quality improves over time. As the expert put it: "Well, I think that depends on how reliably this thing works, when you first try it out and then it doesn't work well, then I could imagine that in the future everybody, not only me, stays with this [old] process, takes own notes".

5.5. Validity of the Case Study

The evaluation of the validity of qualitative research findings includes an assessment of the relevance of the used research methods and the credibility of the final conclusions. Guba proposes four criteria that must be considered to ensure the trustworthiness of a study [44]:

• Internal Validity (Credibility) - the degree to which alternative explanations for a finding can be eliminated, i.e. the study measures what is actually intended

• External Validity (Transferability) - the degree to which the results may be applicable in other contexts and to other subjects

• Reliability (Dependability) - refers to the question of whether researchers would obtain consistent results if the study was repeated with the same or similar subjects in the same or similar context

• Objectivity (Confirmability) - the degree to which study results depend solely on the nature of the studied problem and the subjects, and are not affected by the biases, motivations and interests of the researcher.


Shenton discusses a range of strategies that may be used to satisfy the above criteria. Based on his work [45], we explain the specifics of our research aimed at reducing the threats to the validity of our findings.

In order to ensure internal validity, we made sure to use an appropriate research method that is well established in qualitative investigation, namely a case study. Before conducting the interviews, we paid preliminary visits to the organization whose employees were the subjects of the study, to get an understanding of the organizational culture and create a trustworthy relationship between the parties. In order to ensure honesty among the interviewees, we only interviewed people who showed genuine interest and a wish to participate in the study by responding to our e-mail request. Furthermore, the independent role of the researcher and the anonymization of the feedback were emphasized to make sure that participants talked freely about their viewpoints and experiences. The wide range of participants provided the possibility to compare individual opinions and experiences, checking the information provided by different people and thus getting a richer picture of the researched topic. Finally, the feedback offered to the researcher by the advisors and colleagues helped to bring in a fresh perspective and challenge the assumptions made by the investigators.
In order to further increase the internal validity of the findings, interviews can be carried out with experts from other organizations, so as to reduce the effects of factors particular to one specific company on the results of the study. Additionally, further data collection methods such as observation and focus groups can be used.

Providing external validity is a complicated task because the findings of any qualitative research are usually specific to a particular setting and subjects, and it is hardly possible to prove that the results of a study are applicable to other contexts. However, we provide an extensive description of the setting of our research, including the used data collection method, the number of participants that were involved in the study and their expertise, the length of the interviews and the time period over which the interviews were conducted. This detailed information should allow readers to judge the extent to which the presented findings can be transferred to other situations they might be interested in.
In order to increase external validity and get a wider view on the topic of the study, further case studies in multiple organizations and environments can be performed.

The detailed description of the case study setting also supports reliability, namely, it enables future researchers to accurately repeat our study. We describe what was planned and what was executed in our study, providing details on the data collection process and evaluating the effectiveness and validity of our results in the current section.

Ensuring complete objectivity in qualitative research is highly difficult since, according to Patton, the influence of the researcher's bias is inevitable [46]. In order to reduce the


threat of researcher predisposition, the interviews discussed in this work were carried out by two people with different professional backgrounds, and the questionnaire was prepared with the support of an experienced software architect to help reduce the risk of misunderstandings in the interviews.


6. Data Corpus

This chapter describes the approaches that were used to collect training data for our decision detection model, the process of analyzing this data and the corresponding findings.

Collecting and annotating new corpora is a difficult, expensive and time-consuming task. Due to the limited time frame of this thesis, we considered and evaluated several possible sources of existing training data. One of the first considered options was the AMI Corpus1, which consists of 100 hours of meeting recordings. The data in the corpus was gathered partially through a set of simulated meetings, in which the participants played different roles in a design team, and partially consists of naturally occurring meetings in a range of domains. However, decisions made in the field of software architecture and engineering are very specific, with many typical terms, concepts and characteristics that significantly distinguish them from decisions made in other fields. Moreover, it was very important for us to gather real-world data, including examples of real decisions made in software architecture. For these reasons, using the AMI Corpus would not be appropriate for our purposes.

We also considered using design decisions extracted from Jira issues, based on the work of Bhat, Shumaiev, Biesdorf, et al. [8]. However, during their analysis and comparison to online spoken discussions, it became apparent that the way people express decisions in written form and verbally differs fundamentally; thus, training data obtained from Jira issues would not be relevant for identifying decisions in spoken meetings.

Another idea for gathering training data included conducting a set of interviews with software architects and engineers, asking them how they usually express decisions, and then generating more training examples using the tool Chatito2, which allows creating or extending datasets for NLU models by generating unique combinations of words and phrases according to a predefined structure, thus producing new training examples. However, this approach would also not give us real data, since the way experts think they express decisions might not correspond to the way they formulate them in actual meetings. Moreover, using Chatito for the generation of

1 http://groups.inf.ed.ac.uk/ami/corpus/
2 https://rodrigopivi.github.io/Chatito/


more training data based on the available examples may result in overfitting.
Therefore, for the training of our model we decided to use real-world architectural

meetings exclusively. The data collection process is described in the next section.

6.1. Data Collection

In order to collect real-world training data for the decision-detection model, we used recordings from 17 architectural meetings, held by different software development teams within the industry partner company in the period from 28th of September, 2018 to 13th of March, 2019. The total duration of the collected meetings is more than 620 minutes, with an average of 36 minutes per meeting. All meetings were held in English by non-native speakers and included 3 participants on average. The meetings were automatically transcribed using the Automatic Speech Recognition service Speechmatics1, the same service that is used in the bot implementation for transcribing the meetings in real time (see section 7.1). The meeting transcriptions were then used to analyze the way software architects and developers conduct group discussions and formulate the decisions they take.

1 https://www.speechmatics.com/

6.2. Data Analysis

The author of the thesis analyzed the meetings by listening to them and reading and coding the corresponding transcriptions in order to find cases where the participants of the meetings express decisions. In two out of the 17 meetings, we were not able to find any decision instances. The remaining 15 meetings were all manually annotated by the author; in the 12 meetings that were used for training, we collected a total of 129 examples of decision statements. It has to be noted that apart from the final decisions, these statements also include possible solutions to the problem, as introduced in the conceptual model for architectural design decisions in section 2.1.1. We identified that the most common ways in which people express the fact that something needs to be done, i.e. a decision or a task, involve expressions like "have to", "should", "must", "let's" and "will", combined with verbs that are likely to describe architectural changes, such as "add", "delete", "remove", "update", "fix", "implement", "create" etc.
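To illustrate this observation, the following minimal Python sketch matches the listed cue words; it only demonstrates the observed lexical pattern and is not the trained model used in this thesis.

import re

# Cue words observed in the annotated transcripts; a simple heuristic matcher.
MODALS = r"(?:have to|should|must|let's|will)"
ACTIONS = r"(?:add|delete|remove|update|fix|implement|create)"
DECISION_CUE = re.compile(rf"\b{MODALS}\b.*?\b{ACTIONS}\b", re.IGNORECASE)

def has_decision_cue(sentence: str) -> bool:
    # True if a modal cue is followed by an architectural action verb.
    return DECISION_CUE.search(sentence) is not None

print(has_decision_cue("I think we have to implement caching here."))  # True
print(has_decision_cue("The weather was nice last week."))             # False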

During the analysis of the meeting transcripts, we also identified other key elements of the conceptual model for architectural design decisions. As an example, let's discuss a scenario from one of the analyzed meetings in which architects were discussing


a system that maintains configurations of services by third-party systems (Context). There was a concern because there are scenarios when the services or configurations of services change at runtime, and if one configuration changes or is deleted, then it is necessary that all the services also get deleted (Motivation). Therefore, the architects were faced with the question of how to implement cascade delete (Problem). They discussed two possible approaches. One option is to perform cascade delete based on foreign key constraints (Solution), while another way is to maintain a boolean field to indicate the deletion, without actually deleting the records (Solution). In the discussion it was pointed out that resolving this problem would take a lot of time and effort, but there is no end consumer asking for such a feature. Therefore, it was agreed to postpone fixing this issue and revisit the topic when there is a real business need for it (Decision).


7. Implementation

This chapter describes the approaches, tools and technologies that were used to implement the technical part of this thesis, as formulated in section 4.1, thus providing the answer to the technical research question 4 stated in section 1.2.

7.1. Automatic Speech Recognition

To convert meeting recordings into text, we use the Automatic Speech Recognition service Speechmatics1. Speechmatics produces transcripts in real time, with the capability to filter out background noise, recognize punctuation and speaker changes, and understand accents. It is speaker-independent and supports multiple languages, including English, German, French, Russian, Arabic, Hindi, Mandarin, Korean and more than 20 others. Speechmatics also provides the possibility of private, on-premise deployment of the service, which is especially important when the company using the technology wants to keep its confidential data secure.

7.2. Decision Detection

After the transcript of the conversation is generated, the next step is to "understand" this text and identify decisions in it. For these purposes, we use the Natural Language Understanding tool Rasa NLU2, which, according to the evaluation performed by Braun, Hernandez-Mendez, Matthes, and Langen [29], is one of the best-performing NLU services currently available. Among the advantages of Rasa is that it is open source, highly customizable and GDPR-compliant. Rasa is also easy to use: it can be run directly from Python code or as a simple HTTP server. Another important advantage of Rasa is that it can be hosted on own servers or on-premise, which means that, unlike with other popular services, no data has to be passed to big companies such as Google or Amazon. As with Speechmatics, this is a decisive factor when it comes to providing data confidentiality.

1 https://www.speechmatics.com/
2 https://rasa.com/docs/nlu/


7.2.1. Rasa NLU pipeline

Rasa offers two main predefined pipelines: spaCy and Tensorflow [47]. The main difference between the two options is that spaCy uses pretrained word vectors, while the Tensorflow pipeline does not use any pretrained word vectors but fits to the particular dataset instead. While the fact that Tensorflow can be customized for a specific domain is an obvious advantage, it can only do so if there is a sufficient amount of labeled examples. For cases when there is not much training data, using spaCy is more advantageous, since the loaded language models are already pretrained to detect similar words. As a rule of thumb, the Rasa team recommends using spaCy if there are fewer than 1000 training examples in total and Tensorflow otherwise. Since our training dataset is not that big, we decided to use the spaCy pipeline.

The spaCy pipeline consists of several types of built-in components that are executed one after another:

• Model Initializer: The model initializer nlp_spacy is the first component of the pipeline and initializes the spaCy structures. All other components rely on the model initializer.

• Tokenizer: The tokenizer spacy_tokenizer breaks the input text up into tokens (i.e. words), which are later used to vectorize words and extract entities.

• Featurizer: The featurizer intent_featurizer_spacy creates features that serve as input for the intent classification component.

• Named Entity Recognizer (NER): The NER extracts and classifies named entities (e.g. names, organizations, dates) from the text input. The spaCy pipeline uses the conditional random field entity extractor ner_crf, which analyzes the position and features of words, such as capitalization or part-of-speech tags, to calculate the probability of a word belonging to a certain entity class. The pipeline also includes the component ner_synonyms, which maps entity values to their known synonyms.

• Intent Classifier: intent_classifier_sklearn, which is used in the spaCy pipeline, takes the features created by the featurizer and classifies the intent of the input using a Support Vector Machine (SVM) optimized by grid search. As output, it provides the name and confidence of the most probable intent, as well as those of the other trained intents.


In addition to the predefined pipeline components, it is also possible to add other available components to the pipeline or even create custom components.
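For illustration, the following sketch shows how a model with the spaCy pipeline can be trained and persisted using the Rasa NLU Python API (version 0.x, current at the time of writing); the file and project names are assumptions.

from rasa_nlu import config
from rasa_nlu.model import Trainer
from rasa_nlu.training_data import load_data

# "config_spacy.yml" is assumed to select the pretrained spaCy pipeline, e.g.:
#   language: "en"
#   pipeline: "spacy_sklearn"
training_data = load_data("data/decision_examples.json")  # hypothetical file name
trainer = Trainer(config.load("config_spacy.yml"))
trainer.train(training_data)
# Persist the trained model so it can later be loaded by an Interpreter.
model_directory = trainer.persist("./models", project_name="decision_detector")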

7.2.2. Decision-detection model

For the training of our decision-detection model, we use the 129 examples of decision statements found during the analysis of the transcripts of online meetings in software development teams. One training example represents a sentence or a part of a sentence, separated by delimiters, which contains an expression of a decision. Since sentences in the transcription can get very long due to the specifics of spoken language, for each training example we further label the entities d and s, which contain the words that in fact express decisions or suggestions respectively. Such an approach allows us to keep the natural structure of commonly expressed sentences, while distinguishing decision-denoting cues. Furthermore, by introducing two separate entities, we differentiate between the more certain decisions and the less certain suggestions expressed by the participants. An illustration of a training example in JSON format is presented in Listing 7.1.

Listing 7.1: Model training example

{"intent": "decision","entities": [{"start": 113,"end": 177,"value": "we can experiment with the whole of the countrycredibility part","entity": "s"

},{"start": 314,"end": 364,"value": "we have to think of a strategy of integrating that","entity": "d"

}],"text": "And if we go to the next you did with the next big in herethe first one I think to begin with I would still say we canexperiment with the whole of the country credibility part as well

41

Page 51: TECHNISCHE UNIVERSITÄT MÜNCHEN - TUM · In Section 6 we present our data corpus, explaining the processes of data collection and analysis. The implementation process, including

7. Implementation

with the existing dashboard as a service that is being developed inthe sense that For me the complexity still seems to be that we haveto think of a strategy of integrating that."

}

7.3. Concept Extraction

In order to generate the list of keywords for the meeting report, we perform concept extraction, which works by identifying and disambiguating named entities mentioned in the transcription text and cross-linking them to DBpedia and Linked Data entities.

7.3.1. Linked data

The main idea of Linked Data is to use the architecture of the World Wide Web [48] for the task of linking and sharing structured data on a global scale [49]. It allows connecting a potentially endless amount of data, distributed across different sources around the Web, and navigating between them through Resource Description Framework (RDF) links [50].

Berners-Lee introduced the four design principles of Linked Data [51]:

• "Use URIs as names for things."

• "Use HTTP URIs so that people can look up those names."

• "When someone looks up a URI, provide useful information, using the standards(RDF*, SPARQL)."

• "Include links to other URIs, so that they can discover more things."

7.3.2. DBPedia

DBPedia [52] is a community project which provides a publicly available data corpus composed of the structured data extracted from Wikipedia articles - one of the greatest sources of information on the Web. The first publicly available dataset was published in 2007 and contained information about more than 1.95 million things. The latest release of DBpedia, from October 2016 [53], consists of 23 billion pieces of information (RDF triples), including 9 billion triples from the NLP Interchange Format (NIF) datasets for each language edition. The DBpedia RDF dataset is hosted and published using the multi-model database management system OpenLink Virtuoso1, which

1 https://virtuoso.openlinksw.com/


provides access to the RDF data through a SPARQL [54] endpoint, as well as HTTP support for RDF or HTML representations of DBpedia resources [55]. The current DBpedia data provision architecture is illustrated in Figure 7.1.

Figure 7.1.: DBPedia Architecture. Reprinted from [55].
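As an illustration of this data provision, the following sketch queries the public DBpedia SPARQL endpoint for the English abstract of a resource, using the SPARQLWrapper Python library; the chosen resource is only an example.

from SPARQLWrapper import SPARQLWrapper, JSON

# Fetch the English abstract of an example resource from the public endpoint.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?abstract WHERE {
        <http://dbpedia.org/resource/Apache_Kafka> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["abstract"]["value"])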

7.3.3. DBPedia annotator

For the task of extracting concepts in the DBpedia ontology, we use a publicly available API function1 which was presented as a part of the research of Bhat, Shumaiev, Biesdorf, et al. [56], who adopted the work of Daiber, Jakob, Hokamp, and Mendes [57] on the project DBpedia Spotlight [58].
DBpedia Spotlight is an open-source system used for the automatic identification of natural language mentions of DBpedia resources and the annotation of text documents with corresponding DBpedia URIs. The DBpedia Spotlight approach works in four stages: spotting, candidate selection, disambiguation and configuration. In the Spotting stage,

1 https://spotlight.in.tum.de/processDocument


the possible candidates for annotation are generated using the string-matching algorithm proposed by Aho and Corasick [59]. In the Candidate selection stage, the best candidates are selected from these pre-selected phrases, i.e. only phrases with a similarity score above a specified threshold are included in the result. In the Disambiguation stage, for terms that have several meanings, the system identifies the most likely meaning of the word based on the context around it, using the generative probabilistic model [60] proposed by X. Han and Sun. Finally, the Configuration stage allows users to customize the annotation to their needs by configuring various parameters such as resource prominence, topic pertinence, contextual ambiguity etc.
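For illustration, the following sketch annotates a sentence using the public DBpedia Spotlight web service; the endpoint used in this thesis (see above) may expect a different request format, so the parameters shown here are based on the public API.

import requests

# Annotate a short text; "confidence" filters out low-scoring candidates.
response = requests.post(
    "https://api.dbpedia-spotlight.org/en/annotate",
    data={"text": "We decided to store the transcripts in Apache Kafka.",
          "confidence": 0.5},
    headers={"Accept": "application/json"},
)
for resource in response.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])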

7.4. Report Generation

The report generation component used to produce the final report was written using the newest release of the Python language at the time of writing this thesis, Python 3.7.3. The script consists of three main methods:

• make_sentence(text): This method contains the main functionality, which generates the list of decisions and suggestions that are included in the report. First of all, we load the metadata of our decision-detection model and instantiate an interpreter. Then, we take the text received from the argument and break it down into sentences and parts of sentences by delimiters representing the period (.), question mark (?), exclamation point (!), comma (,), semicolon (;) and colon (:). For each sentence in the parsed text, we remove word duplicates, which are a common problem because people often repeat the same words several times while formulating a sentence, e.g. "We have to to change this". Then, we parse each sentence using the loaded interpreter, getting a JSON response as shown in Figure 7.2. If the model identifies a sentence as having the intent decision with a confidence above the set threshold, we continue with reviewing the entities identified within the sentence. As introduced in section 7.2.2, we additionally check whether the sentence contains the entities d and s, which correspond to expressions of a decision and a suggestion respectively. If a decision sentence contains one of these entities, it gets appended to the corresponding list of decisions or suggestions. We also look for the entities DATE and PERSON or name, whose values are used to provide the information about the deadline and the person related to a certain identified decision. If any of these entities have been identified by the model, their values also get added to a corresponding list for further representation in the meeting report.


Figure 7.2.: Rasa response

• make_keywords(text): To identify keywords for our report, we make a POST request with the parsed transcript text as the content, receiving an array of concepts in the JSON response, as demonstrated in Figure 7.3. We then extract the value of the "token" key and append it to the list of keywords.

• create_pdf_report(text): This method generates a PDF document using the PyFPDF1 library, according to the defined style of the document and the custom HTML template. From this method we call the methods make_sentence and make_keywords and write the results into the corresponding places in the template. We then call the PyFPDF method write_html, which parses the HTML and converts it to PDF, and the method output, which writes the PDF file with the defined name to the designated destination.

1 https://pyfpdf.readthedocs.io/


Figure 7.3.: Concept extraction

The report generation implementation described above is integrated into the existing system as a separate component, as demonstrated in Figure 7.4.
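The following simplified sketch illustrates the core of the decision extraction logic in make_sentence; the model path, the confidence threshold and some details are assumptions and do not mirror the exact implementation.

import re

from rasa_nlu.model import Interpreter

CONFIDENCE_THRESHOLD = 0.7  # illustrative value; the actual threshold may differ

# Hypothetical model path; Interpreter.load expects a persisted model directory.
interpreter = Interpreter.load("./models/decision_detector/model_20190601")

def extract_decisions(transcript: str):
    decisions, suggestions = [], []
    # Split the transcript on the same delimiters as the report generator.
    for sentence in re.split(r"[.?!,;:]", transcript):
        result = interpreter.parse(sentence)
        if (result["intent"]["name"] == "decision"
                and result["intent"]["confidence"] > CONFIDENCE_THRESHOLD):
            for entity in result["entities"]:
                if entity["entity"] == "d":
                    decisions.append(entity["value"])
                elif entity["entity"] == "s":
                    suggestions.append(entity["value"])
    return decisions, suggestions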

7.5. Integration into Circuit

As described in section 4.1.3, the existing system into which we had to integrate the report generation functionality follows an event-driven architecture using the stream-processing software platform Apache Kafka. Therefore, the first step in the integration process was adding a new event type TRANSCRIPTION_WITHREPORT to the already existing Kafka topic transcription.full within the kafka-asr service. We define a class TranscriptionFullWithReport, which is used to send the PDF file with the meeting report to Circuit; its method to_dict transforms the Python object into a Kafka message. To be able to transform a Kafka message back into a Python object, we introduce the from_message method.
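A minimal sketch of such an event class is shown below; the field names and payload format are assumptions based on the description above, not the exact implementation.

import json

class TranscriptionFullWithReport:
    # Hypothetical fields inferred from the surrounding description.
    EVENT_TYPE = "TRANSCRIPTION_WITHREPORT"

    def __init__(self, conversation_id: str, pdf_filepath: str):
        self.conversation_id = conversation_id
        self.pdf_filepath = pdf_filepath  # path inside the shared Docker volume

    def to_dict(self) -> dict:
        # Serialized into the Kafka message payload.
        return {"type": self.EVENT_TYPE,
                "conversation_id": self.conversation_id,
                "pdf_filepath": self.pdf_filepath}

    @classmethod
    def from_message(cls, message: bytes) -> "TranscriptionFullWithReport":
        # Reconstructs the Python object from a consumed Kafka message.
        payload = json.loads(message.decode("utf-8"))
        return cls(payload["conversation_id"], payload["pdf_filepath"])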


Figure 7.4.: Modified architecture

When Kafka receives a TRANSCRIPTION_FULL event, we call the method create_pdf_report described in section 7.4 with the text of the full transcription as the argument and get the path pdf_filepath of the generated PDF document, which is stored in the shared Docker volume. The file itself is not sent via Kafka because Kafka messages are normally very small and sending larger files over Kafka is not recommended.

Finally, when Kafka receives an event of type TRANSCRIPTION_WITHREPORT, it gets the file path from the event and sends the corresponding PDF file to Circuit. The sketch below summarizes this event handling.
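A hypothetical event loop summarizing this flow; consumer, producer and the Circuit upload helper are illustrative names under assumed semantics, not the actual service code:

for message in consumer:  # e.g. a kafka.KafkaConsumer on topic transcription.full
    event = message.value
    if event["type"] == "TRANSCRIPTION_FULL":
        # Generate the report and publish only its path, not the file itself
        pdf_filepath = create_pdf_report(event["text"])
        report = TranscriptionFullWithReport(event["conversation_id"], pdf_filepath)
        producer.send("transcription.full", report.to_dict())
    elif event["type"] == "TRANSCRIPTION_WITHREPORT":
        report = TranscriptionFullWithReport.from_message(event)
        send_pdf_to_circuit(report.pdf_filepath)  # assumed Circuit upload helper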

This process is illustrated in the sequence diagram in Figure 7.5. The user interface and the work of the bot are presented in Figure 7.6. A sample report is presented in Appendix B.


Figure 7.5.: Sequence diagram


Figure 7.6.: Bot showing a generated report in the Circuit web interface


8. Evaluation and Results

In this chapter, we describe the evaluation of our solution and discuss the corresponding results and challenges.

8.1. Speech Recognition Accuracy

In order to understand if and how the quality of speech recognition affects the performance of the model, one of the meetings was manually transcribed and compared to the transcription that was generated automatically by Speechmatics. The meeting was transcribed by the author of the thesis, who is a computer science graduate and thus can understand and recognize technical vocabulary and context, but who did not participate in the meeting itself and is not part of the team. This means that the transcriber was hearing the meeting for the first time and had no previous knowledge of the discussion topic that could affect the quality of the resulting transcript.

The accuracy of the speech recognition was evaluated using the word error rate (WER) metric [61], which is defined as:

WER = (S + D + I) / N = (S + D + I) / (S + D + C),

where:

• S is the number of substitutions, i.e. words that were wrongly transcribed as different words

• D is the number of deletions, i.e. words that were completely omitted in the transcript

• I is the number of insertions, i.e. words added to the transcription that were in fact never said

• C is the number of correctly transcribed words

• N is the total number of words originally spoken (N = S + D + C)

The results are summarized in Table 8.1.
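Plugging the values from Table 8.1 into the definition above reproduces the reported error rate: WER = (208 + 110 + 27) / 1333 = 345 / 1333 ≈ 0.259, i.e. 26%.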


ASR tool                    Speechmatics v.2.0.1
Substitutions (S)           208
Deletions (D)               110
Insertions (I)              27
Correct words (C)           1014
Total number of words (N)   1333
Word Error Rate (WER)       26%

Table 8.1.: Evaluation of ASR performance

8.2. Model Performance

We evaluated our decision detector with a five-fold validation procedure using the set of 15 architectural online meetings in which we were able to detect decision expressions. We trained the model that classifies the decision intent on a subset of 12 meetings; the trained model was then tested on the remaining 3 meetings that were not used in the training phase.

The performance of the model on the three test meetings was evaluated by performing a sentence-based calculation of its efficiency in terms of Precision (P), Recall (R), and their harmonic mean, the F1 score:

R = TP / (TP + FN),  P = TP / (TP + FP),  F1 = 2 · P · R / (P + R),

where TP (true positive) is the number of correctly identified decision sentences, FN (false negative) is the number of decision sentences not identified by the algorithm, and FP (false positive) is the number of sentences that are incorrectly identified by the algorithm as containing a decision.

The results are summarized in Table 8.2. We additionally provide the number of True Negatives (TN), i.e. the number of sentences that were correctly identified as not containing a decision.
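As a consistency check, the values reported for Meeting 2 in Table 8.2 follow directly from these definitions: P = 6 / (6 + 23) ≈ 0.207, R = 6 / (6 + 11) ≈ 0.353 and F1 = 2 · 0.207 · 0.353 / (0.207 + 0.353) ≈ 0.261.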

8.3. Challenges and Limitations

8.3.1. Data scarcity

For the reasons described in Chapter 6, we decided to train the model exclusively on real-world architectural meetings. Collecting and annotating datasets for new tasks is a very difficult and time-consuming process.


            Meeting 1   Meeting 2   Meeting 3
TP          2           6           8
FP          13          23          22
FN          2           11          8
TN          91          202         121
Precision   0.133       0.207       0.267
Recall      0.5         0.353       0.5
F1 score    0.207       0.261       0.35

Table 8.2.: Evaluation of the decision detector model

Within the thesis period, given the available resources, we recorded and transcribed 17 meetings that lasted for more than 10 hours in total. From the transcriptions, we were able to collect a total of 129 training examples with expressions of decisions. However, this amount of data is not sufficient to train a well-performing model. We believe that this lack of data is the main reason for the underperformance of the model. In section 9.2 we discuss possible approaches for tackling the problem of data scarcity.

8.3.2. Quality of speech recognition

Another crucial problem we faced concerns the underperformance of the speech recognizer. As introduced in section 8.1, the ASR service used yields around a 26% word error rate when applied to the evaluated recordings of online meetings in software development teams. Such an error rate results in significant parts of online conversations being transcribed incorrectly, thus losing the conveyed meaning and missing potential decisions.

8.3.3. Challenges of spoken language

The specifics of spoken language introduce further challenges to the task of identifying decisions in online meetings. Unlike in task management systems, where decisions and tasks are explicitly formulated, in online meetings they are often "hidden" within the discussion. Our evaluation of meeting transcriptions showed that people do not formulate their thoughts as well when speaking as they do in writing. For example, they tend to repeat the same words several times in the process of formulating a thought, e.g. "We have to to change this". Another problem that we observed concerns the pauses that people often make between phrases. The ASR recognizes them as separate sentences, for example "We should create. Separate excel file", thus preventing the model from identifying the decision.


8.3.4. Uncertainty expressions

In recent years, the linguistic phenomenon of uncertainty and its detection has been receiving more and more attention in the NLP community [62]. Letier, Stefan, and Barr claim that uncertainty is inevitable in software engineering [63]. It complicates architecture decision making and may expose software projects to a significant risk. The authors state that software architects lack support for assessing uncertainty, its impact on risk, and the value of reducing uncertainty before making critical decisions.

In the explorative study by Shumaiev, Bhat, Klymenko, et al. [64], the authors conducted a set of case studies, examining the discussions in task management systems for cases of uncertainty expressions, and interviewed software architects of the respective projects to see how these uncertainties are comprehended by their authors and colleagues. Based on the feedback from the interviewed experts, one of the main findings of this work was that often, uncertainty expressions such as "I think", "Maybe" or "In my opinion" are not used to express uncertainty itself, but rather as a figure of speech, a feedback trigger, or a polite expression of preference or reassurance. The authors mention that the disambiguation of uncertainty expressions, i.e. understanding whether an expression actually conveyed uncertainty or not, required a significant cognitive effort from the interviewees, and in some cases they were still unable to give a confident answer.

In this work, we have also encountered multiple uncertainty expressions that are confusing for the model, in the sense that it is unclear whether sentences containing such uncertainty cues should be identified as decisions or not:

• "I think we should follow up with Martin"

• "Maybe we should have another meeting on Wednesday

8.3.5. Referring expressions

People can use different words and expressions to refer to the same thing or person. For example, if someone is talking about their colleague, they can refer to this person as "my colleague", by the person's name or position, or using the pronoun he or she. These are all examples of referring expressions. A referring expression is thus any expression that is used to refer to something or someone [65]. The task of automatically resolving referring expressions can become extremely difficult depending on the type of expression. In the easiest case, an object or person is referred to by its direct name:

• "We will wait for Christoph in the next meeting room"

• "That’s why we came up with this Amelie system"


However, referring expressions can be noun phrases of any structure:

• "Let’s just keep that part the way it is"

• "So these are the two things that we need to implement"

Among the most challenging referring expressions to resolve are referential pronouns such as he, she, they and broad referring expressions such as this and that [66]:

• "I will ask her to document it"

• "I’m supposed to send it to him"

• "We definitely have to implement it"

• "This could also be scaled up"

Moreover, the same expression can be a referring expression or not, depending on the context. A common example is the non-referential "it", which can be used as a subject in a sentence:

• "Let’s take on the architecture meeting, it is about time anyway"

8.3.6. Identifying context and distinguishing decision types

Another challenge is that currently the model is unable to distinguish between architecturally significant decisions and other tasks and decisions. Consider the following three examples from the collected meetings:

• "I will create the API"

• "I will set up a meeting on Monday"

• "I will share my screen"

Although all three sentences have a very similar structure, only the first two are examples of decisions. While the first one is an example of an architectural design decision, the second one is rather an organizational decision; at the moment, we do not have any means of distinguishing between them. In the scope of this thesis, we do not deal with this problem; however, identifying different types of decisions, as introduced in section 2.1.2, is an interesting task for future work.

A more critical problem, which caused many false positives in our evaluation, is represented by the third example above. Syntactically, this sentence is almost identical to the previous two; however, it does not express a decision but describes the action the speaker is about to perform. Dealing with this challenge requires further work on identifying the context of the conversation.


9. Conclusion

The final chapter of this thesis summarizes the work that has been performed in order to reach the goal of the thesis and outlines possible directions for future work.

9.1. Summary

The goal of this thesis was to develop a system that supports software architects, developers and team leads by automatically documenting the results of their online conversational meetings.

At the beginning of our research, we conducted a thorough literature study, gathering theoretical knowledge in the fields of software architecture decision making and natural language processing. We described the concept of architectural design decisions and their types and the main approaches that are used to solve NLP tasks, and presented an evaluation of different speech recognition and natural language understanding services.

We then presented an overview of related work that deals with the detection of decisions in natural-language communication, the summarization of meetings, understanding the process of decision-making in software architecture, and the extraction of architectural design decisions. We discovered that previous research on the extraction and documentation of design decisions in the field of software development and architecture has mostly focused on detecting decisions in issue management systems and source code commits.

In order to answer the formulated research questions, we conducted a set of expert interviews, gathering feedback on the way online meetings are held and documented in practice (RQ1), the process of decision-making in distributed teams (RQ2) and the expectations of a system that automatically documents online architectural meetings (RQ3). In particular, we confirmed our hypothesis that many decisions are made during online meetings and that they are usually not captured anywhere afterwards, since software development teams often lack an official procedure for documenting their meetings and decisions. In this connection, we obtained a lot of positive feedback on the practical relevance and potential usefulness of the automatic summarization of online meetings, which would help software development teams be more process-compliant and save a lot of time.


Through the interviews, we also identified the requirements for a system that provides automatic summaries of online meetings. We determined the main elements of such a summary that the participants find useful, namely decisions / action items, the related person, the deadline, keywords, open topics and the times the participants joined and left the call. We also explored the ways in which participants would prefer to interact with the virtual meeting assistant (i.e. a bot) and further ideas on how the bot could be used to support software architects and developers in their daily online meetings.

We provided a detailed evaluation of the validity of our findings, describing the ways in which we attempted to maximize the internal and external validity, reliability and objectivity of our results. We also discussed other means by which the validity can be further increased.

Considering architectural design decisions to be the main results of architectural meetings, we implemented a solution that focuses on the automatic extraction of decisions made during online meetings of software development teams. Our approach includes converting meeting speech into text using the ASR service Speechmatics and detecting decisions in the generated transcript using the Rasa NLU tool. In order to obtain training data for the decision detection model, we analyzed 17 architectural meetings, totalling 620 minutes of conversation, held by different software development teams. During the analysis of the meetings, we collected 129 examples of decision statements, identified the most common ways in which software architects and developers express their decisions, and detected examples of the key elements of architectural design decisions.

We evaluated our decision-detection model with a five-fold validation procedure, obtaining an average F1 score of 0.273, and presented a detailed overview of the challenges and limitations that negatively affected the performance of the model. We consider data scarcity and the 26% word error rate of the automatic speech recognizer to be the main obstacles to better performance. Among the other challenges, we identified uncertainty expressions, referring expressions, the identification of context and the distinction between decision types. In section 9.2.1 we discuss possible approaches to overcoming these problems in future work.

9.2. Future Work

We divide the possible future work into two main parts: work on the enhancement of the presented decision-detection model and further development of the virtual meeting assistant in general.


9.2.1. Model enhancement

Solving Data Scarcity

Data scarcity is one of the main challenges in achieving better performance of the model. In order to improve its quality, it is necessary to gather more training data on the basis of real-world architectural online meetings. Apart from data collection, there are different strategies that can be considered in order to deal with the problem of data scarcity:

• Semi-supervised Learning
Semi-supervised learning is a combination of supervised and unsupervised machine learning methods, the core idea of which is to learn from a small amount of labeled data and to apply the knowledge to a larger amount of unlabeled data (raw text) in order to label it. As a result, a potentially better model can be trained.

• Active Learning
The main hypothesis of active learning is that a machine learning algorithm can potentially perform better even with less training data if it can choose the most informative examples that it wants to learn from. Active learning has proved to help in many NLP tasks by significantly reducing the amount of required labeled data [67].

• Data Augmentation
Data augmentation is another way of artificially increasing the dataset by creating altered copies of training examples. It is mostly used in computer vision, by flipping, cropping, rotating or mirroring training images. For NLP this strategy is potentially less effective, since changing one word can change the entire meaning of the sentence. However, there exist text editing techniques that can be used to improve model performance on small datasets. For example, non-stop words can be substituted with their synonyms, randomly deleted, or swapped with other words within the sentence [68] (see the sketch after this list).

• Transfer Learning
Transfer learning is a technique that involves using prior knowledge, i.e. a model used for solving a different but related task, and applying it to the target problem. Previously, transfer learning has mostly been used in computer vision; lately, however, it has also found application in NLP. For example, M. E. Peters, Neumann, Iyyer, et al. used unlabeled text data to create a pre-training task, producing generic text representations that can be added to existing models [69]. The results showed that such an approach can significantly improve the state-of-the-art results for multiple NLP tasks.


• Domain Adaptation
Domain adaptation is closely related to transfer learning and involves reusing related data that is available in other domains and applying it to the target problem. Although in this thesis we intentionally decided to use only data from architectural meetings due to the specific nature of the field, in future it might be interesting to see whether data available in other domains can improve our model.
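As an illustration of the text-editing techniques from [68] mentioned under Data Augmentation, the following minimal sketch produces altered copies of a training sentence via random deletion and random swap. The probabilities and the stop-word list are placeholders, and synonym replacement would additionally require a thesaurus such as WordNet:

import random

def augment(sentence, stop_words, p_delete=0.1, p_swap=0.1):
    words = sentence.split()
    # Random deletion: drop each non-stop word with probability p_delete
    words = [w for w in words
             if w.lower() in stop_words or random.random() > p_delete]
    # Random swap: exchange two word positions with probability p_swap
    if len(words) > 1 and random.random() < p_swap:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

stop_words = {"we", "to", "the", "a", "of"}
print(augment("We decided to move the implementation to the Linux server", stop_words))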

Improving ASR performance

Our analysis shows that the insufficient quality of the ASR tool negatively affects the performance of the decision-detection model. Although we are using the commercial software Speechmatics for speech-to-text transcription, there is a way to possibly improve its performance, namely by expanding and customizing the speech recognition vocabulary. Last year, Speechmatics announced the launch of a new feature called Custom Dictionary1 that allows customers to easily add context-specific words to Speechmatics' dictionary simply via text. For example, in our evaluation we identified that Speechmatics fails to recognize many names and field-specific terms. Thus, for transcribing online conversations in software development teams, relevant IT-related concepts, as well as the names of the team members, can be added to the dictionary, which enables Speechmatics to identify and spell them correctly while dynamically adapting to the given context. As a result, we can achieve more accurate transcripts, which should also positively affect the accuracy of the decision-detection model. A sketch of such a configuration is given below.
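A hedged sketch of such a configuration: the additional_vocab field and its sub-keys reflect Speechmatics' public documentation at the time of writing and should be verified against the current API version, and the vocabulary entries are examples only:

# Transcription job configuration with a custom vocabulary
job_config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "additional_vocab": [
            {"content": "Kafka"},
            {"content": "Circuit"},
            # "sounds_like" helps with terms the recognizer tends to mishear
            {"content": "Speechmatics", "sounds_like": ["speech matics"]},
        ],
    },
}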

Another way in which the performance of automatic speech recognition can be improved involves preprocessing the audio before passing it to Speechmatics. One of the main approaches to ASR preprocessing is noise reduction. Noise from the surrounding environment or in the channel impairs the quality of the speech signal, causing information loss and negatively affecting the performance of ASR systems. Garg and Jain present a study of different noise reduction techniques [70], showing significant advantages of Wiener filters [71] and gammatone filters [72] over other approaches; a minimal filtering sketch is given below. Other ASR preprocessing techniques, including voice activity detection, pre-emphasis, framing and windowing, are presented in the work of Ibrahim, Odiketa, and Ibiyemi [73].
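As a minimal illustration of such a preprocessing step, the following sketch applies SciPy's Wiener filter to a recording before transcription; the file names and the filter window size are placeholders:

import numpy as np
from scipy.io import wavfile
from scipy.signal import wiener

rate, samples = wavfile.read("meeting.wav")  # placeholder file name
samples = samples.astype(np.float64)

# Wiener filtering [71] suppresses stationary background noise
denoised = wiener(samples, mysize=29)

wavfile.write("meeting_denoised.wav", rate, denoised.astype(np.int16))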

Resolving referring expressions

Resolving referring expressions would solve the problem of missing subjects in identified decisions. There have been a number of scientific works that deal with resolving referring expressions.

1https://www.speechmatics.com/product/custom-dictionary/


McShane and Babkin propose a domain-independent system called CROSS that can identify difficult referring expressions expressed by pronouns (it, this), personal pronouns (he, her) and definite descriptions (this class, that method) [66]. Celikyilmaz, Feizollahi, Hakkani-Tur, and Sarikaya presented a domain-independent framework for recognizing referring expressions in a user's speech during human-computer conversations. Their model detects whether the user is referring to a certain item currently displayed on the screen and determines which item is referred to [74]. Regan, Rastogi, Mathias, et al. approach the reference resolution task by modelling it as a user query reformulation task given the dialog history [75].

Adding further intents

As discussed in section 8.3.6, our model currently does not distinguish between different types of decisions that are made during a meeting, having only one common decision intent. In future, we consider it relevant to have separate intents for different kinds of architectural decisions, as presented in section 2.1.2, as well as for organizational and other types of decisions.

Furthermore, as described in section 5.3.1, apart from decisions, software architects feel the need for other important information, such as action items, open topics and news, to be included in the meeting report. Therefore, in future, the model can be extended to detect these types of information in meeting transcripts.

9.2.2. Virtual assistant development

The next step in advancing the process of automatic documentation of online meetings would be the implementation of summary refinement, where architects would review automatically detected decisions and optionally add additional useful information. In section 5.3.2 we discuss the experts' feedback on how, in their opinion, this should be put into practice. We believe that this additional step before the generation of the final meeting report would help to refine the list of decisions and make the summary more detailed and customizable. Moreover, the participants' reflection on the decisions made can help to challenge the thinking behind design reasoning [76].

In section 5.4.1 we present a further use case that can be implemented to assist software architects and developers during a meeting, namely providing them with suggestions and recommendations related to the topic being discussed. For instance, if they are discussing a certain technology, the bot would provide short textual information about the technology from Wikipedia.

Furthermore, additional case studies can be conducted with experts in the field of software engineering and architecture to get more empirical evidence about the current challenges they face during online meetings and in the process of documenting their decisions. Based on such case studies, more possible use cases for the virtual meeting assistant can be derived.


List of Figures

2.1. Model for architectural design decisions. Reprinted from [12].
2.2. NLP and NLU Tasks. Adapted from [28].

4.1. Overall architecture

7.1. DBPedia Architecture. Reprinted from [55].
7.2. Rasa response
7.3. Concept extraction
7.4. Modified architecture
7.5. Sequence diagram
7.6. Bot showing a generated report in the Circuit web interface

B.1. Meeting report


List of Tables

2.1. Comparison of ASR services [24]
2.2. Overall scores for intent and entity [31]
2.3. Combined overall scores [31]

4.1. Interviewees

8.1. Evaluation of ASR performance
8.2. Evaluation of the decision detector model


Listings

7.1. Model training example


A. Appendix A: Interview Discussion Guide

Thank you for taking time out for a quick discussion.

The reason why we wanted a 30-minute slot for this discussion with you is to know and understand your needs and expectations for our current project, where we are trying to capture your thoughts on a virtual meeting environment. With respect to this, we would like to capture your experience in a virtual meeting environment.

We would like to know from you:

• Your idea of an Assistive Bot during a Virtual meeting Scenario

• Challenges with the existing systems

• Your expectations from an Assistive Bot

Please feel free to discuss anything that you feel is relevant.

A.1. Warm-up questions (6 minutes)

• What is your role / position in the project / company?

• How many years of experience in IT industry do you have?

• How big is your team? Where are the team members located?

• Which means of communication are used in your team?

• You would be using Circuit for conducting remote meetings; what is your experience of using Circuit as a virtual meeting platform? (Focusing more on interaction with the attendees, not necessarily directed to connection issues, mute / unmute.)

– How often do you host meetings and what are your typical activities while hosting the meeting?


– What are your typical activities while participating in a meeting?

– On average, how many people are involved in these remote meetings? (Regular meetings or workshop-type meetings. If long meetings, are there any challenges in conducting them via Circuit?)

A.2. Challenges with the existing systems (6 minutes)

A.2.1. When the meeting is in progress (4 minutes)

• What kind of information do you capture during meetings? (Think of the last meeting you had where there was a need to do this)

• What do you do when you are having a discussion over Circuit and want more information / recommendations related to your topic or want to refer to some of the documents?

– Do you face any challenges while trying to fetch the details?

A.2.2. Post a meeting (2 minutes)

• How often do you have to refer back to the information which you discussed during the meeting on Circuit?

– Could you let us know where and how you look for this information?

A.3. User’s perception of working with futuristic concepts (18 minutes)

• In your opinion, when are most of the decisions made in the project? (stand-up meetings, architecture review meetings, daily peer-to-peer offline communication, online communication over email (messaging platforms), Jira issues, your option)

• In your opinion, how are most of the decisions made in your team? (group decisions / by a single person)

• How do you find the idea of interacting with an assistive bot during your virtual meetings and why?

• Where do you think an assistive bot could help you and why?

– How would you like to interact with the bot?


– How would you like the bot to send useful suggestions / recommendations related to the discussion during the meeting?

• Apart from decisions, what other information (results) would it be useful to include in the summary? e.g. new available information, status updates

• Reflection and evaluation of the automatic summary: what questions should the bot ask in order to refine the summary? e.g. Do you confirm that this decision was indeed made in the meeting? Can you add a short reasoning for this decision? Are there alternative solutions?

• How do you find the idea of a bot participating in the meeting?

– How much are you ready to rely on the information given by an AI at your workplace?

• What benefits do you think such a tool would bring to your project? What potential problems could be avoided?

Is there anything else you want to add?

Thank you for your time!


B. Appendix B: Report Example

Summary of the meeting from 26.05.2019

MEETING SUMMARY

Topic: Demo
Date: 26-05-2019 19:19
Keywords: Linux

Participant               Joined   Left
Oleksandra Klymenko       19:17    19:19
Manoj Mahabaleshwar       -        -
Echo Meeting Assistant    19:17    19:18

Decisions made:

• So this speech gets transcribed in real time and what I will do now is i will just tweet out some short extracts from the real meetings that we recorded previously to demonstrate how the body works

• let me just share my screen Okay so one thing we have to do is we have to think about how to resolve the DNS lookup service

• Let me just share my screen Okay so one thing we have to do is we have to think about how to resolve the dns lookup service

• Also we must move the implementation to our linux server then we slowly need to prioritize what is important and what are urgent for us

• But let 's talk about this part in the architecture meeting we should have another meeting on wednesday
  Deadline: Wednesday

Suggestions made:

• maybe we should also discuss it with michael
  Related person: Michael

Complete meeting transcript:

So my knowledge is not picking up but the board is already here and it's actually listening to me already. We can see it in the darker logs so we get events on call and recording audio junk. Here and if we look at the container we can see that we're getting transcription hypothesis as I speak. So this speech gets transcribed in real time and what I will do now is I will just tweet out some short extracts from the real meetings that we recorded previously to demonstrate how the body works. Hi. Hello everyone. So a lot of things happened in there and there are a lot of tasks for the next week. Let me just share my screen Okay so one thing we have to do is we have to think about how to resolve the DNS lookup service. Maybe we should also discuss it with Michael. Also we must move the implementation to our Linux server then we slowly need to prioritize what is important and what are urgent for us. But let's talk about this part in the architecture meeting we should have another meeting on Wednesday. I hope it is OK for everyone. Great thanks. Bye bye everyone. So this was it. And now I will leave the conversation

Figure B.1.: Meeting report


Bibliography

[1] J. D. Herbsleb and R. E. Grinter. “Architectures, coordination, and distance: Conway’s law and beyond.” In: IEEE software 16.5 (1999), pp. 63–70.

[2] J. A. Espinosa, S. A. Slaughter, R. E. Kraut, and J. D. Herbsleb. “Team knowledge and coordination in geographically distributed software development.” In: Journal of management information systems 24.1 (2007), pp. 135–169.

[3] H. P. Andres. “A comparison of face-to-face and virtual software development teams.” In: Team Performance Management: An International Journal 8.1/2 (2002), pp. 39–48.

[4] L. Bass, P. Clements, and R. Kazman. Software architecture in practice. Addison-Wesley Professional, 2003.

[5] J. Bosch. “Software architecture: The next step.” In: European Workshop on Software Architecture. Springer. 2004, pp. 194–199.

[6] A. Tang, M. A. Babar, I. Gorton, and J. Han. “A survey of architecture design rationale.” In: Journal of systems and software 79.12 (2006), pp. 1792–1804.

[7] J. S. van der Ven and J. Bosch. “Making the right decision: supporting architects with design decision data.” In: European Conference on Software Architecture. Springer. 2013, pp. 176–183.

[8] M. Bhat, K. Shumaiev, A. Biesdorf, U. Hohenstein, and F. Matthes. “Automatic extraction of design decisions from issue management systems: a machine learning based approach.” In: European Conference on Software Architecture. Springer. 2017, pp. 138–154.

[9] C. Miesbauer and R. Weinreich. “Classification of design decisions–an expert survey in practice.” In: European Conference on Software Architecture. Springer. 2013, pp. 130–145.

[10] L.-r. Jen and Y.-j. Lee. “Working Group. IEEE recommended practice for architectural description of software-intensive systems.” In: IEEE Architecture. Citeseer. 2000.

[11] D. E. Perry and A. L. Wolf. “Foundations for the study of software architecture.” In: ACM SIGSOFT Software engineering notes 17.4 (1992), pp. 40–52.


[12] A. Jansen and J. Bosch. “Software architecture as a set of architectural design decisions.” In: 5th Working IEEE/IFIP Conference on Software Architecture (WICSA’05). IEEE. 2005, pp. 109–120.

[13] P. Kruchten. “An ontology of architectural design decisions in software intensive systems.” In: 2nd Groningen workshop on software variability. Citeseer. 2004, pp. 54–61.

[14] N. Donges. Introduction to NLP. https://towardsdatascience.com/introduction-to-nlp-5bff2b2a7170. 2018 (accessed June 10, 2019).

[15] M. Mayo. The Main Approaches to Natural Language Processing Tasks. https://www.kdnuggets.com/2018/10/main-approaches-natural-language-processing-tasks.html. 2019 (accessed June 7, 2019).

[16] T. Young, D. Hazarika, S. Poria, and E. Cambria. “Recent trends in deep learning based natural language processing.” In: IEEE Computational Intelligence Magazine 13.3 (2018), pp. 55–75.

[17] L. R. Rabiner, B.-H. Juang, and J. C. Rutledge. Fundamentals of speech recognition. Vol. 14. PTR Prentice Hall Englewood Cliffs, 1993.

[18] H. Petkar. “A review of challenges in automatic speech recognition.” In: International Journal of Computer Applications 151.3 (2016).

[19] K. Davis, R. Biddulph, and S. Balashek. “Automatic recognition of spoken digits.” In: The Journal of the Acoustical Society of America 24.6 (1952), pp. 637–642.

[20] B.-H. Juang and L. R. Rabiner. “Automatic speech recognition–a brief history of the technology development.” In: Georgia Institute of Technology. Atlanta Rutgers University and the University of California. Santa Barbara 1 (2005), p. 67.

[21] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. “Achieving human parity in conversational speech recognition.” In: arXiv preprint arXiv:1610.05256 (2016).

[22] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, et al. “English conversational telephone speech recognition by humans and machines.” In: arXiv preprint arXiv:1703.02136 (2017).

[23] G. Kurata, B. Ramabhadran, G. Saon, and A. Sethy. “Language modeling with highway LSTM.” In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2017, pp. 244–251.

[24] O. Biran. You Shall Not Speak: Benchmarking Famous Speech Recognition APIs for Chatbots. https://cai.tools.sap/blog/benchmarking-speech-recognition-api/. 2017 (accessed June 8, 2019).


[25] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina, et al. “State-of-the-art speech recognition with sequence-to-sequence models.” In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2018, pp. 4774–4778.

[26] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. “Specaugment: A simple data augmentation method for automatic speech recognition.” In: arXiv preprint arXiv:1904.08779 (2019).

[27] Gartner. Gartner IT Glossary: Natural Language Understanding. https://www.gartner.com/it-glossary/nlu-natural-language-understanding. 2019 (accessed April 21, 2019).

[28] How to Build an NLP Engine that Won’t Screw up. https://labs.eleks.com/2018/02/how-to-build-nlp-engine-that-wont-screw-up.html. 2018 (accessed June 9, 2019).

[29] D. Braun, A. Hernandez-Mendez, F. Matthes, and M. Langen. “Evaluating natural language understanding services for conversational question answering systems.” In: Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue. 2017, pp. 174–185.

[30] T. Bocklisch, J. Faulkner, N. Pawlowski, and A. Nichol. “Rasa: Open source language understanding and dialogue management.” In: arXiv preprint arXiv:1712.05181 (2017).

[31] X. Liu, A. Eshghi, P. Swietojanski, and V. Rieser. “Benchmarking Natural Language Understanding Services for building Conversational Agents.” In: arXiv preprint arXiv:1903.05566 (2019).

[32] J. Carletta. “Announcing the AMI meeting corpus.” In: The ELRA Newsletter 11.1 (2006), pp. 3–5.

[33] P.-Y. Hsueh and J. D. Moore. “What decisions have you made?: Automatic decision detection in meeting conversations.” In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference. 2007, pp. 25–32.

[34] P.-Y. Hsueh and J. D. Moore. “Automatic decision detection in meeting speech.” In: International Workshop on Machine Learning for Multimodal Interaction. Springer. 2007, pp. 168–179.

[35] G. Murray and S. Renals. “Detecting action items in meetings.” In: International Workshop on Machine Learning for Multimodal Interaction. Springer. 2008, pp. 208–213.


[36] M. Purver, J. Dowding, J. Niekrasz, P. Ehlen, S. Noorbaloochi, and S. Peters. “Detecting and summarizing action items in multi-party dialogue.” In: Proceedings of the 8th SIGdial Workshop on Discourse and Dialogue. 2007, pp. 200–211.

[37] R. Fernández, M. Frampton, P. Ehlen, M. Purver, and S. Peters. “Modelling and detecting decisions in multi-party dialogue.” In: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue. Association for Computational Linguistics. 2008, pp. 156–163.

[38] R. Fernández, M. Frampton, J. Dowding, A. Adukuzhiyil, P. Ehlen, and S. Peters. “Identifying relevant phrases to summarize decisions in spoken meetings.” In: Ninth Annual Conference of the International Speech Communication Association. 2008.

[39] L. Wang and C. Cardie. “Summarizing decisions in spoken meetings.” In: Proceedings of the Workshop on Automatic Summarization for Different Genres, Media, and Languages. Association for Computational Linguistics. 2011, pp. 16–24.

[40] C.-Y. Lin and E. Hovy. “Automatic evaluation of summaries using n-gram co-occurrence statistics.” In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 2003.

[41] G. Tur, A. Stolcke, L. Voss, S. Peters, D. Hakkani-Tur, J. Dowding, B. Favre, R. Fernández, M. Frampton, M. Frandsen, et al. “The CALO meeting assistant system.” In: IEEE Transactions on Audio, Speech, and Language Processing 18.6 (2010), pp. 1601–1611.

[42] G. Pedraza-García, H. Astudillo, and D. Correal. “Dvia: Understanding how software architects make decisions in design meetings.” In: Proceedings of the 2015 European Conference on Software Architecture Workshops. ACM. 2015, p. 51.

[43] Unify. Circuit API Examples. https://circuit.github.io/. 2019 (accessed April 23, 2019).

[44] E. G. Guba. “Criteria for assessing the trustworthiness of naturalistic inquiries.” In: Ectj 29.2 (1981), p. 75.

[45] A. K. Shenton. “Strategies for ensuring trustworthiness in qualitative research projects.” In: Education for information 22.2 (2004), pp. 63–75.

[46] M. Q. Patton. “Qualitative research.” In: Encyclopedia of statistics in behavioral science (2005).

[47] R. T. GmbH. Choosing a Rasa NLU Pipeline. https://rasa.com/docs/nlu/choosing_pipeline/. 2019 (accessed May 12, 2019).

[48] I. Jacobs. “Architecture of the world wide web, volume one.” In: https://www.w3.org/TR/webarch/ (2004).


[49] T. Heath and C. Bizer. “Linked data: Evolving the web into a global data space.” In: Synthesis lectures on the semantic web: theory and technology 1.1 (2011), pp. 1–136.

[50] G. Klyne and J. J. Carroll. “Resource description framework (RDF): Concepts and abstract syntax.” In: https://www.w3.org/TR/rdf-concepts/ (2006).

[51] T. Berners-Lee. “Linked data-design issues.” In: http://www.w3.org/DesignIssues/LinkedData.html (2006).

[52] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. “Dbpedia: A nucleus for a web of open data.” In: The semantic web. Springer, 2007, pp. 722–735.

[53] New DBPedia Release - 2016-10. https://wiki.dbpedia.org/blog/new-dbpedia-release-–-2016-10. 2014 (accessed May 2, 2019).

[54] SPARQL Query Language for RDF. https://www.w3.org/TR/rdf-sparql-query/. 2019 (accessed June 6, 2019).

[55] Architecture | DBPedia. https://wiki.dbpedia.org/about/architecture. 2019 (accessed June 6, 2019).

[56] M. Bhat, K. Shumaiev, A. Biesdorf, U. Hohenstein, M. Hassel, and F. Matthes. “An ontology-based approach for software architecture recommendations.” In: (2017).

[57] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. “Improving efficiency and accuracy in multilingual entity extraction.” In: Proceedings of the 9th International Conference on Semantic Systems. ACM. 2013, pp. 121–124.

[58] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. “DBpedia spotlight: shedding light on the web of documents.” In: Proceedings of the 7th international conference on semantic systems. ACM. 2011, pp. 1–8.

[59] A. V. Aho and M. J. Corasick. “Efficient string matching: an aid to bibliographic search.” In: Communications of the ACM 18.6 (1975), pp. 333–340.

[60] X. Han and L. Sun. “A generative entity-mention model for linking entities with knowledge base.” In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics. 2011, pp. 945–954.

[61] Wikipedia. Word error rate. https://en.wikipedia.org/wiki/Word_error_rate. 2019 (accessed May 23, 2019).

[62] R. Farkas, V. Vincze, G. Móra, J. Csirik, and G. Szarvas. “The CoNLL-2010 shared task: learning to detect hedges and their scope in natural language text.” In: Proceedings of the Fourteenth Conference on Computational Natural Language Learning—Shared Task. Association for Computational Linguistics. 2010, pp. 1–12.


[63] E. Letier, D. Stefan, and E. T. Barr. “Uncertainty, risk, and information value in software requirements and architecture.” In: Proceedings of the 36th International Conference on Software Engineering. ACM. 2014, pp. 883–894.

[64] K. Shumaiev, M. Bhat, O. Klymenko, A. Biesdorf, U. Hohenstein, and F. Matthes. “Uncertainty expressions in software architecture group decision making: explorative study.” In: Proceedings of the 12th European Conference on Software Architecture: Companion Proceedings. ACM. 2018, p. 42.

[65] J. R. Hurford, B. Heasley, and M. B. Smith. Semantics: a coursebook. Cambridge University Press, 2007.

[66] M. McShane and P. Babkin. “Resolving difficult referring expressions.” In: Advances in Cognitive Systems 4 (2016), pp. 247–263.

[67] S. Arora and S. Agarwal. “Active learning for natural language processing.” In: Language Technologies Institute School of Computer Science Carnegie Mellon University (2007).

[68] J. Wei. These are the Easiest Data Augmentation Techniques in Natural Language Processing you can think of - and they work. https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610. 2019 (accessed May 21, 2019).

[69] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer. “Deep contextualized word representations.” In: arXiv preprint arXiv:1802.05365 (2018).

[70] K. Garg and G. Jain. “A comparative study of noise reduction techniques for automatic speech recognition systems.” In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). IEEE. 2016, pp. 2098–2103.

[71] B. Widrow, J. R. Glover, J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, J. E. Dong, and R. C. Goodlin. “Adaptive noise cancelling: Principles and applications.” In: Proceedings of the IEEE 63.12 (1975), pp. 1692–1716.

[72] K. Mehta and R. Anand. “Robust front-end and back-end processing for feature extraction for Hindi speech recognition.” In: 2010 IEEE International Conference on Computational Intelligence and Computing Research. IEEE. 2010, pp. 1–4.

[73] Y. A. Ibrahim, J. C. Odiketa, and T. S. Ibiyemi. “Preprocessing technique in automatic speech recognition for human computer interaction: an overview.” In: Annals. Computer Science Series 15.1 (2017).


[74] A. Celikyilmaz, Z. Feizollahi, D. Hakkani-Tur, and R. Sarikaya. “Resolving referring expressions in conversational dialogs for natural user interfaces.” In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014, pp. 2094–2104.

[75] M. Regan, P. Rastogi, A. G. Mathias, et al. “A dataset for resolving referring expressions in spoken dialogue via contextual query rewrites (CQR).” In: arXiv preprint arXiv:1903.11783 (2019).

[76] M. Razavian, A. Tang, R. Capilla, and P. Lago. “Reflective approach for software design decision making.” In: 2016 Qualitative Reasoning about Software Architectures (QRASA). IEEE. 2016, pp. 19–26.

[77] L. Lamport. LaTeX: A Document Preparation System User’s Guide and Reference Manual. Addison-Wesley Professional, 1994.

[78] J. Levis and R. Suvorov. “Automatic speech recognition.” In: The encyclopedia of applied linguistics (2012).

[79] Speechmatics. Speechmatics - Automatic speech recognition technology. https://www.speechmatics.com/. 2006 (accessed April 16, 2019).

[80] R. T. GmbH. Rasa NLU: Language Understanding for chatbots and AI assistants. https://rasa.com/products/rasa-nlu/. 2006 (accessed April 16, 2019).

[81] Apache. Apache Kafka. https://kafka.apache.org/. 2017 (accessed May 1, 2019).

[82] F. P. Brooks Jr. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India, 1995.

[83] Unify. Collaboration and communication software by Unify. https://www.circuit.com/. 2019 (accessed April 23, 2019).

[84] D. Association. DBPedia. https://wiki.dbpedia.org/about. 2014 (accessed May 2, 2019).
